| Published On : | October 28, 2009 1:00 AM PDT |
Rate |
|
Picturing Parallelism
Today I’d like to highlight a new Cilk++ tool. The Parallel Performance Analyzer, introduced in Cilk++ 1.0.2, is designed to help you evaluate the parallel performance of your Cilk++ program. This article explains how to interpret the performance graphs to find potential problems and predict how well your program might scale.
Running the Parallel Analyzer
The Parallel Analyzer can be run from the Linux or Windows command line [cilkscreen –w program], or within a Cilk++ Visual Studio project by selecting [Run Cilkscreen Parallel Performance Analyzer] from the [Tools] menu.
A Good Parallel Program
I’ve written a very simple test driver to illustrate the parallel analysis. The first run uses cilk_spawn in the outer loop. Of course, I could have used a cilk_for loop instead of the
for (...) { cilk_spawn xxx() }
pattern, but I’ve chosen to spawn the work explicitly to illustrate how performance varies.
#define OUTER 1000
#define INNER 10000
#define ASIZE (INNER * OUTER)
double results[ASIZE]; // array of 10,000,000 elements
// Initialize an array to values that depends on the array index.
// Assume Work(x) is an inexpensive function.
//
void SetArray(int x)
{
results[x] = Work(x);
}
// Initialize an array to values that depends on the array index.
// Assume Work(x) is an inexpensive function.
//
void InnerLoop(int i)
{
for (int j=0; j<INNER; ++j)
SetArray((i * INNER) + j);
}
void OuterLoop()
{
for (int i=0; i<OUTER; ++i)
cilk_spawn InnerLoop(i); // spawn the inner loop 1000 times
}
Here’s the graph generated in Visual Studio by the Parallel Analyzer:
This graph suggests good parallel performance. The Upper and Lower performance bounds are fairly close together, and both rise as the number of cores goes up. The analyzer predicts that this program will show roughly linear speedup as more cores are added.
Now a "Bad" Parallel Program
Now let’s make a small change to the program:
void InnerLoop(int i)
{
for (int j=0; j<INNER; ++j)
cilk_spawn SetArray((i * INNER) + j);
}
void OuterLoop()
{
for (int i=0; i<OUTER; ++i)
cilk_spawn InnerLoop(i);
}
The only difference is that I’ve moved the cilk_spawn statement from the Outer loop into the Inner loop. The resulting performance graph is very different:
Here we see two problems. First, the upper bound of the application parallelism is about 20. This suggests that the program will not continue to speed up as more than 15-20 cores are added. More troubling is the Lower performance bound. We see that the lower bound is less than 1, suggesting that this program may actually slow down on multiple cores. In fact, this second program takes 3-5 times longer on multiple cores.
The reason is very simple – by moving the cilk_spawn to the inner loop, we are just not spawning enough work to amortize the cost of creating parallel work.
Analyzing QuickSort
Here’s one more graph that tells yet another story. This graph estimates the performance of our QuickSort program:
void sample_qsort(int * begin, int * end)
{
if (begin != end) {
--end; // Exclude last element (pivot) from partition
int * middle = std::partition(begin, end,
std::bind2nd(std::less>int<(), *end));
using std::swap;
swap(*end, *middle); // move pivot to middle
cilk_spawn sample_qsort(begin, middle);
sample_qsort(++middle, ++end); // Exclude pivot and restore end
cilk_sync;
}
}
Here we see a different pattern. There appears to be enough work to generate reasonable parallelism from 1-10 cores, but no more. The upper bound is limited at 10.5x speedup, telling us that there is not enough parallelism to show increasing performance beyond 10 cores. The lower bound suggests that we may not get the full benefit from multiple cores, though we should see some speedup. In fact, on one of our 16-core system, this program shows about 7.5x speedup, just about the center of the predicted range.
The reason that the parallelism is capped is that the "partition" code is run serially before the first recursive call to qsort. This serial process limits the overall performance gain that we can get from the entire program.
We could improve the lower bound by stopping the parallel recursion at some larger number of elements rather than spawning all the way down to a single element.
Conclusion
In addition to the performance prediction graphs shown here, the Parallel Analyzer generates a variety of other low-level information that you can use to evaluate and improve the parallel efficiency of your program. The tool provides a statistical analysis of your Cilk++ program designed to help you understand how your parallel programs will scale to more cores, and helps identify potential problems that can have a dramatic impact on parallel performance.
We would love to hear from you: How do YOU assess and trouble-shoot performance of your multithreaded apps?
posted @ Friday, April 03, 2009 11:22 AM by Dmitriy V'jukov
posted @ Friday, April 03, 2009 11:26 AM by Dmitriy V'jukov
posted @ Friday, April 03, 2009 11:33 AM by Dmitriy V'jukov
posted @ Friday, April 03, 2009 1:16 PM by Steve Lewin-Berlin
posted @ Monday, April 06, 2009 10:52 PM by Shin Yee
posted @ Thursday, April 09, 2009 11:34 PM by John Mellor-Crummey
posted @ Saturday, April 11, 2009 8:05 AM by Matteo Frigo
posted @ Wednesday, April 15, 2009 6:43 AM by Stefano Rossi
