Cilkview Graph

Cilkview Graph

I having some trouble understanding the graph cilkview generates. For an application computing FFT of random numbers I have the following cilkview stats for recursive FFT:1) Parallelism Profile Work : 38860702188 instructions Span : 485590022 instructions Burdened span : 485590022 instructions Parallelism : 80.03 Burdened parallelism : 80.03 Number of spawns/syncs: 108003329 Average instructions / strand : 119 Strands along span : 504 Average instructions / strand on span : 963472 Total number of atomic instructions : 92844 Frame count : 2527068122) Speedup Estimate2 procs: 1.96 - 2.004 procs: 3.76 - 4.008 procs: 6.96 - 8.0016 procs: 12.13 - 16.0032 procs: 19.29 - 32.00The graph being generated isNow i don't understand why trials are so weird and don't follow a straight line for 8 cores, even though parallelism is around 80? Also how significant is upper bound and lower bound and how do i get it in the graph?

5 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

The upper and lower bounds track parallelism. The bounds you have indicate that you have plenty of parallelism, and the burden from scheduling is not too bad. Parallelism is not the limiting factor in your application.

The poor trial results probably indicate a problem with memory bandwidth limitationsor cache contention. Often memory bandwidth limitations show up when you are working on a large chunk of memory (a long array, or a multi-dimensional array). Cache contention tends to occur when there is locking and atomic operations (true sharing), or when different threads are frequently modifying variables that happen to live on the same cache line (false sharing).

Do either of these sound like possibilities in your code?

Memory bandwidth limitations can sometimes be solved with cache-oblivious algorithms. If there is an algorithm that can do what you are doing, but divides it differently so that all (or most) accesses to a particular address are done in quick succession, then there are fewer cache misses and less queuing on memory. I.e., one algorithm might loop over a whole array n times, but a more cache-friendly one might chunk the array and do alln iterations on one chunk before it moves onto the next.

Unfortunately, some problems are inherently bad for bandwidth. memcpy(), for example, will probably never benefit from parallelism on large arrays.

True sharing is often more difficult than dealing with memory bandwidth. If there's a global lock, try to see if there's a way to use a reducer, instead. Same with atomic operations. If the algorithm simply requires many workers pounding on the same address, you may have to rewrite your algorithm.

False sharing can often be solved by padding a data structure. If adjacent addresses are being written by different workers, and it isn't too much of a space hog, you can pad your structures by cache line (64 bytes on x86 and x86-64) so that no two are sharing a line.

The input signal in form of an array is actually huge.2097152 objects of the form:class complex_no{private: float real_; float img_;Reducing the input signal to a size of 65536, I see no improvement. The numbers are:1) Parallelism Profile Work : 968326172 instructions Span : 15221178 instructions Burdened span : 15221178 instructions Parallelism : 63.62 Burdened parallelism : 63.62 Number of spawns/syncs: 2555905 Average instructions / strand : 126 Strands along span : 304 Average instructions / strand on span : 50069 Total number of atomic instructions : 2798 Frame count : 60948442) Speedup Estimate2 procs: 1.95 - 2.004 procs: 3.70 - 4.008 procs: 6.74 - 8.0016 procs: 11.42 - 16.0032 procs: 17.50 - 32.00and the graph becomes:I can see no improvement. Parallelism is also reducing as the size of the array goes down.I also have a reducer, but the reduce function is default one, i.e. I kept mine empty.The basic recursive FFT method divides the input in odd and even elements. This process goes on recursively till we end with just one element in the array. The two elements are then used for butterfly operation.At no time, however, two separate workers can work on the same element.I am still testing against all the parameters you listed especially cache, I'll share if I find something interesting.

That sounds like a memory bottleneck to me. It isn't merely that the array is big (65,000 8-byte elements is still pretty big). The problem, if I understand correctly, is the way the algorithmaccessesthe array generates a lot of cache misses. You go across the array many times, and by the time you come back to the beginning, the first element is no longer in cache and has to be refetched.

You might look up a cache-oblivious FFT algorithm and implement that. Matteo Frigo has one. I don't know if (or howwell)it parallelizes, but my guess is that it will outperform a cache-unfriendly algorithm, even serially.

Barry pointed out to me that some good references.

Leave a Comment

Please sign in to add a comment. Not a member? Join today