Visualizing Parallel Speedup with Cilkview

Posted by Will Leiserson originally on www.cilk.com on Tue, Jun 30, 2009

Along with our upcoming release of Cilk++ v.1.1, we are including a new tool to help you visualize application performance: Cilkview. Cilkview runs an application binary and generates performance data on the sections you specify. It combines this data with performance estimates generated by the work/span calculator and produces a graph of the results.

Consider the following parallel algorithm for calculating Fibonacci numbers:


The Directed Acyclic Graph (DAG) that represents the call tree of fib(35) is a binary tree of parallel nodes, so there is parallelism aplenty. Furthermore, a look at the function reveals that it is basically CPU-bound: the only memory accesses are into the frame, and they have the highest possible locality. Naturally, this is a terrible way to calculate Fibonacci numbers, but it is a simple example of the class of algorithms that are easily parallelized. To make Cilkview predict speedup, insert the following code near the call site:


start() and stop() mark the interesting sections of code for Cilkview. dump() associates the collected data with a label that Cilkview will use when generating a graph. Compile your program and run Cilkview to generate a graph:

  % cilkview ./fib


Cilkview runs your binary on a single core with instrumentation that calculates the work and span of the sections of code you specify. In this case, Cilkview made the same prediction that we did. The steeper line indicates perfect linear speedup up until the theoretical parallelism of the algorithm. The theoretical parallelism does not appear on the graph because it is too large. The more gently sloped line indicates Cilkview's prediction of more realistic performance, burdened speedup, based on the overhead of the Cilk++ runtime. Since the two lines are so close (nearly indistinguishable in this plot), Cilkview predicts that actual trials of this application should scale with nearly perfect linearity.

To test this, Cilkview can also run the trials itself: specify the "-trials all" option at the command line. No additional code is necessary:

 
  % cilkview -trials all ./fib

 


The red stars indicate trial data. By default, Cilkview runs one trial for each of 1..n Cilk workers, where n is the number of cores. The plot above was generated on a computer with 8 cores. As predicted, the trial performances scaled with near-perfect linearity. Slight variations can be attributed to the fact that only one trial was performed for each configuration.

But what about more common algorithms such as sorting? The following is a simple algorithm for parallel quicksort:


At first glance, this algorithm looks like it has plenty of parallelism, too. Even though there are lots of memory accesses, the quicksort algorithm has good cache properties. But consider that the DAG would have a substantial sequential node at the root for the first partition(). Likewise, after the recursive call, each of the first parallel children has to run partition() on its portion of the array before it can do anything in parallel. In fact, most of the work in quicksort is done in the partition() algorithm -- serially! Cilkview's analysis bears this out:


The plot represents qsort run on an array of 10,000,000 integers. In this graph you can see that the ideal parallelism (the horizontal line) is small enough to appear on the graph (for this or any reasonable size of input). Plotting the trials gives data that is consistent with this:


The trials fit well within the predicted margin.

Now consider the following code for parallel memcpy:


On the surface, this has lots of parallelism. All of the work is in the cilk_for, and the automatic grain size should handle the overhead of the implicit cilk_spawns relative to the trivial work done at each iteration. Conceptually, parallel_memcpy() is similar to fib() in terms of parallelism: the DAG looks like that of fib(), except the tree is balanced. Unfortunately, the prediction generated from the work/span data is totally inconsistent with the actual performance data from the trials (parallel_memcpy() run on an array 400 million bytes long):


Why does the trial data not match the prediction? Cilkview takes the work and span into account and estimates the overhead of the runtime to arrive at a reasonable estimate of parallelism. But here the limitation was not the amount of parallelism. A discrepancy between the prediction and the trial data indicates other factors at work. In this case, memory bandwidth became saturated, and the speedup ceased to scale as the processors began to queue for access to main memory. The computer on which this was run has two memory controllers, so the scaling would have been even worse on a typical desktop machine. Cilkview's parallelism prediction is useful for understanding whether your algorithm has enough parallelism to scale effectively; a graph in which the trial data falls below the lower bound indicates that the algorithm is running up against another limitation.

A final, happier example is the parallel matrix multiplication algorithm developed by Matteo Frigo and Matthew Steele. This algorithm was intended as an alternative to more intuitive parallel matrix multiplication algorithms that tend not to use cache effectively:


Note that the algorithm does not actually get super-linear speedup as it might appear in the 8-worker case. Remember that a single trial was run at each level, so slight variations in runtime can cause visible changes in the graph.

Cilkview will ship with Cilk++ v.1.1 for both Windows and Linux. There is more functionality included with the tool (e.g., you can instrument multiple sections of code, generate .csv files, set the number of cores on which to predict performance, and more). The hope is that Cilkview will help users confirm or deny their intuitions about parallel algorithms. Also, it makes pretty pictures.

COMMENTS

Can you comment on the overhead of the cilkview tool? How much longer than the standalone application does cilkview run? The pictures produced are very pretty indeed. If they're produced in a reasonable amount of time this could be an incredibly useful tool.

posted @ Wednesday, July 01, 2009 6:31 PM by Timmie Smith


We have some experience with how users interact with it: a preview of Cilkview was given to the people who signed up for the course we did a few weeks ago, and they were generating plots, tweaking algorithms, and generating new plots again. It didn't seem to be detrimental to the workflow.

For this blog post, I used one program to generate data for all of the algorithms. The time for Cilkview vs. the program run on a single worker was about 13x. Testing individual functions came out between 10x and 20x overhead, which should be fairly typical. This is significantly slower than the program on its own, but it was sufficiently speedy for my purposes. If I had found that a function was too slow, I could have given it a smaller input.

Naturally, trial data is generated at the speed of your application (run on 1..n cores). The overhead is a call to get the system time in start() and stop(), and there's some file IO in dump(). 

posted @ Thursday, July 02, 2009 8:10 AM by Will Leiserson


You say that "Cilkview runs your binary on a single core with instrumentation that calculates the work and span of the sections of code you specify." Does this mean that you run the program once on a single-core machine and then predict the parallelism of the program? The work/span model is very helpful; however, could you please tell us more about how you predict parallelism so that you can achieve the "impressive" parallelism of 732,777.999 in the Fibonacci example?

Thank you for your time and consideration.

posted @ Wednesday, August 12, 2009 5:01 AM by Le


Yes, cilkview runs the program once on a single core instrumented with Intel's PIN tool. As the program runs PIN tracks the total number of instructions and the number of instructions along the longest path. If you imagine the spawn/call tree of fib(), the span is not very long. In fib(35) it is a stack 35 frames deep. If all of the rest of the work were stolen away, that worker would have only to return 35 times, aggregating results along the way. That is a very short span for a lot of work. 

Don't misunderstand, however: you won't get 732,777.99x speedup if you have that many cores. That number is an ideal that you won't exceed (except for things like cache effects in some algorithms). The idea is to have much more parallelism than you have cores. For p cores, if your algorithm has, say, parallelism = 10 * p, you should get linear speedup.

posted @ Wednesday, August 12, 2009 6:25 AM by Will Leiserson


Will: I wish I knew that before I bought my 732778-core server.

posted @ Wednesday, August 12, 2009 6:31 AM by Dr. Kerslappity


No, no! No buyer's remorse. Just run fib(40) and you should be okay.

posted @ Wednesday, August 12, 2009 7:58 AM by Will Leiserson


Thank you so much Will. 

Sorry for misunderstanding your estimation of parallelism.  

Regarding how you measure parallelism, I am pondering how you can count the number of instructions along the longest path. As PIN is a binary instrumentation tool, how do you create a DAG from the binary and then use PIN to count the number of instructions along the longest path? I guess that your Cilk DAG is a DAG whose nodes are strands (called tasks) and whose arcs are the precedences between strands. Each strand may correspond to a basic block in PIN. Therefore, while using PIN to trace basic blocks, you implicitly create a DAG and then calculate the longest path of that DAG. But this raises questions about how you can map between a strand and a basic block, and how you relate the precedence among basic blocks -- because a strand is not merely a function call but also a block of instructions without parallel control structures. It would be interesting if we could rebuild a DAG from the binary rather than the source code.

Thank you ! 

posted @ Monday, August 24, 2009 4:42 AM by Le


Hi Le, 

That's a topic that requires a blog post unto itself. There are some technical papers on the algorithms used by the Nondeterminator tool for the MIT Cilk project (see the bottom of the linked page). The short story is that our tools are similar so those papers should give a pretty good understanding.

posted @ Wednesday, August 26, 2009 6:01 AM by William Leiserson

 
