Picturing Parallelism: The Cilk Performance Analyzer

Posted by Steve Lewin-Berlin originally on www.cilk.com on Fri, Apr 03, 2009


Today I’d like to highlight a new Cilk++ tool. The Parallel Performance Analyzer, introduced in Cilk++ 1.0.2, is designed to help you evaluate the parallel performance of your Cilk++ program. This article explains how to interpret the performance graphs to find potential problems and predict how well your program might scale.

Running the Parallel Analyzer

The Parallel Analyzer can be run from the Linux or Windows command line [cilkscreen -w program], or within a Cilk++ Visual Studio project by selecting [Run Cilkscreen Parallel Performance Analyzer] from the [Tools] menu.

A Good Parallel Program

I’ve written a very simple test driver to illustrate the parallel analysis. The first run uses cilk_spawn in the outer loop. Of course, I could have used a cilk_for loop instead of the

for (...) { cilk_spawn xxx(); }

pattern, but I’ve chosen to spawn the work explicitly to illustrate how performance varies.

```
#define OUTER 1000
#define INNER 10000
#define ASIZE (INNER * OUTER)

double results[ASIZE];          // array of 10,000,000 elements

// Initialize one array element to a value that depends on the array index.
// Assume Work(x) is an inexpensive function.
//
void SetArray(int x) {
    results[x] = Work(x);
}

// Fill one 10,000-element chunk of the array.
//
void InnerLoop(int i)
{
    for (int j = 0; j < INNER; ++j)
        SetArray((i * INNER) + j);
}

void OuterLoop()
{
    for (int i = 0; i < OUTER; ++i)
        cilk_spawn InnerLoop(i);    // spawn the inner loop 1000 times
}
```

Here’s the graph generated in Visual Studio by the Parallel Analyzer:

This graph suggests good parallel performance. The Upper and Lower performance bounds are fairly close together, and both rise as the number of cores goes up. The analyzer predicts that this program will show roughly linear speedup as more cores are added.

Now let’s make a small change to the program:

```
void InnerLoop(int i)
{
    for (int j = 0; j < INNER; ++j)
        cilk_spawn SetArray((i * INNER) + j);
}

void OuterLoop()
{
    for (int i = 0; i < OUTER; ++i)
        cilk_spawn InnerLoop(i);
}
```

The only difference is that I’ve moved the cilk_spawn statement from the Outer loop into the Inner loop. The resulting performance graph is very different:

Here we see two problems. First, the upper bound on the application's parallelism is about 20, which suggests that the program will stop speeding up once more than 15-20 cores are added. More troubling is the lower performance bound: it is less than 1, suggesting that this program may actually slow down on multiple cores. In fact, this second program takes 3-5 times longer on multiple cores than it does on one.

The reason is very simple: by moving the cilk_spawn into the inner loop, each spawn now encloses only a single call to SetArray, which is not enough work to amortize the cost of creating parallel work.

Analyzing QuickSort

Here’s one more graph that tells yet another story. This graph estimates the performance of our QuickSort program:

```
void sample_qsort(int * begin, int * end)
{
    if (begin != end) {
        --end;  // Exclude last element (pivot) from partition
        int * middle = std::partition(begin, end,
                           std::bind2nd(std::less<int>(), *end));
        using std::swap;
        swap(*end, *middle);    // move pivot to middle
        cilk_spawn sample_qsort(begin, middle);
        sample_qsort(++middle, ++end);  // Exclude pivot and restore end
        cilk_sync;
    }
}
```

Here we see a different pattern. There appears to be enough work to generate reasonable parallelism on 1-10 cores, but no more. The upper bound is limited to about 10.5x speedup, telling us that there is not enough parallelism to show increasing performance beyond 10 cores. The lower bound suggests that we may not get the full benefit of multiple cores, though we should see some speedup. In fact, on one of our 16-core systems, this program shows about 7.5x speedup, just about the center of the predicted range.

The reason that the parallelism is capped is that the "partition" code is run serially before the first recursive call to qsort. This serial process limits the overall performance gain that we can get from the entire program.

We could improve the lower bound by stopping the parallel recursion at some larger number of elements rather than spawning all the way down to a single element.
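That coarsening might look something like the sketch below. It is shown in plain serial C++ so it compiles anywhere (a lambda replaces the deprecated std::bind2nd, and the cutoff of 500 elements is an arbitrary illustrative choice); in the Cilk++ version, the first recursive call in the else-branch would carry the cilk_spawn, so no spawn would ever cover fewer than CUTOFF elements.

```cpp
#include <algorithm>

// Quicksort with a coarsening cutoff: ranges smaller than CUTOFF are
// sorted serially instead of recursing (and spawning) further.
// CUTOFF = 500 is an arbitrary illustrative grain size.
const long CUTOFF = 500;

void sample_qsort_cutoff(int * begin, int * end)
{
    if (end - begin < CUTOFF) {
        std::sort(begin, end);  // small range: sort serially, no spawns
    } else {
        --end;  // exclude last element (pivot) from partition
        int pivot = *end;
        int * middle = std::partition(begin, end,
                           [pivot](int a) { return a < pivot; });
        std::swap(*end, *middle);  // move pivot into place
        sample_qsort_cutoff(begin, middle);    // cilk_spawn here in Cilk++
        sample_qsort_cutoff(++middle, ++end);  // exclude pivot, restore end
        // cilk_sync here in Cilk++
    }
}
```

The cutoff trades a little parallelism (the span grows by the cutoff size) for much lower spawn overhead, which is exactly what lifts the lower bound.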

Conclusion

In addition to the performance prediction graphs shown here, the Parallel Analyzer generates a variety of other low-level information that you can use to evaluate and improve the parallel efficiency of your program. The tool provides a statistical analysis of your Cilk++ program designed to help you understand how your parallel programs will scale to more cores, and helps identify potential problems that can have a dramatic impact on parallel performance.

We would love to hear from you: How do YOU assess and trouble-shoot performance of your multithreaded apps?

Re: The analyzer predicts that this program will show roughly linear speedup as more cores are added...

How does the tool *predict* scalability? Is a simple linear approximation used? I.e., the tool probes performance on 1 core, then on 2, and so on up to the number of available cores. Or are more advanced techniques used? Is there any paper on this?

posted @ Friday, April 03, 2009 11:22 AM by Dmitriy V'jukov

Re: The reason that the parallelism is capped is that the "partition" code is run serially before the first recursive call to qsort.

Hmmm... but this relates only to the very first partition, and all other partitions will run in parallel. So I would expect that quicksort in the general case will have nearly perfect linear scalability. What am I missing?

posted @ Friday, April 03, 2009 11:26 AM by Dmitriy V'jukov

Re: How do YOU assess and trouble-shoot performance of your multithreaded apps?

Well, I probably won't say anything new or extraordinary... I just run the app first bound to a single core (SetProcessAffinityMask(GetCurrentProcess(), 1) in main()), then bound to cores 1 and 2, then bound to cores 1 and 3 (on a Q6600, pairs of cores share the L2$, so this test gives different results than the previous one), then to cores 1, 2, 3 and 4. Then I calculate the speed-up.
If scalability is not linear, I run under a profiler to see which parts of the app slow down as the number of cores increases.

posted @ Friday, April 03, 2009 11:33 AM by Dmitriy V'jukov

Dmitriy,
1) The prediction is calculated based on measuring the work and span in a dynamic run of the program. That gives a theoretical bound, which we modify to take into account the run-time overhead of work stealing. See the ebook for an explanation of work and span.
2) If the first partition takes about 10% of the total run time, then you can never get more than 10x speedup, even on an infinite number of processors.
3) Note that you can measure but not predict speedup using your technique. Consider a program that does exactly 4 things at once. It may show perfect speedup on 2,3,4 processors, but no additional improvement as more processors are added.
steve

posted @ Friday, April 03, 2009 1:16 PM by Steve Lewin-Berlin

It would be nice to have the speed up graphs for different phases of a program.

posted @ Monday, April 06, 2009 10:52 PM by Shin Yee

The graphs that show the potential for parallel performance based on work and span are nice, but are they actionable? I don't see how they tell you where to tune your program. We devised an actionable approach that can pinpoint where your Cilk code needs adjustment if it is underperforming, and quantify the cost of parallel overhead and idleness. See Effective performance measurement and analysis of multithreaded applications, PPoPP 2009. http://doi.acm.org/10.1145/1504176.1504210

posted @ Thursday, April 09, 2009 11:34 PM by John Mellor-Crummey

Hi John. You are absolutely correct that actionable metrics are the most useful, and we are ourselves experimenting with actionable variants of our critical-path analyzer.

I saw in your paper that you modified the MIT Cilk system to output metadata that identifies the spawn and sync constructs. You may be interested in knowing that our Cilk++ system already outputs this information as part of normal Cilk++ binaries. (We use it for detecting races and for computing the critical path.) If you are still pursuing this line of research and want to use Cilk++ as a basis, we'll be happy to send you some documentation about the metadata.

Do you think that there is any hope that your methods can produce reasonable estimates of the critical path and parallelism of a program? In theory, the Cilk scheduler spends O(P T_infinity) time stealing (or idling, in your terminology), where T_infinity is the span of the program. Thus, as you observe in your paper, there is a relation between your idleness metric and the span, but only as an inequality (i.e., idleness can provide a lower bound on the critical path but not an upper bound).

posted @ Saturday, April 11, 2009 8:05 AM by Matteo Frigo

We could improve the lower bound by stopping the parallel recursion at some larger number of elements rather than spawning all the way down to a single element.

Let's see this diagram too! So it can be compared to the unoptimized one.

posted @ Wednesday, April 15, 2009 6:43 AM by Stefano Rossi
