Finding Performance Bottlenecks & Data Races

Submit New Article

Published On :   October 29, 2009 1:00 AM PDT
Rate
 


by Ilya Mirman

At this point, we have several dozen organizations worldwide exploring Cilk++. When embarking on a multithreading project, the first question many folks ask is, "Where do I start?"

 

Because some applications have as little as 1,000 lines of code while others have 500,000 or more, there is a good deal of variation in the calendar time involved. The project can take as little as a couple days, or stretch to a couple months. But the workflow itself - illustrated below - is fairly consistent from project to project: start with the serial code and verify its correctness (steps 1 and 2); identify and address performance bottlenecks by adding Cilk++ keywords (steps 3 and 4); verify serial and parallel correctness, potentially employing hyperobjects to resolve races (steps 5 through 7); and potentially repeating this process, guided by performance targets.

Because the initial projects typically target existing serial applications, the early Cilk++ projects typically jump to step 3, and the first question is often, "Where are the performance bottlenecks?" And shortly thereafter, "Did exposing the parallelism create a race condition?" Fortunately, there's some great tools available to answer both questions. Let's take a look at a recent example.

Identifying performance bottlenecks

As an example of zeroing in on a hotspot, consider the case of Dan Mirman, a UConn Psychology Postdoctoral Fellow investigating how word meanings are organized and structured in the brain. As part of his research, Dan uses the MikeNet Neural Network Simulator library.

Dan's goal was to accelerate the research feedback loop. Running each simulation takes days, and dozens of neural network simulations are required for each research project. Speeding up simulations from days to hours could have a great impact on the research feedback loop. Software engineers at Persistent Ltd (a Cilk Arts partner interested in providing multicore enablement services to their customers) used Intel's VTune Performance Analyzer software to zero in on the hotspots in the code.

VTune helps you identify and characterize performance issues by collecting performance data from your application, organizing and displaying the data in a variety of interactive views (from system-wide down to source code or processor instruction perspective), identifying potential performance issues and suggesting improvements.

The sampling source view displays source code annotated with performance data. It turned out that three functions were responsible for 97% of the compute time:


Exposing Parallelism with Cilk++

Again with VTune's help, we can quickly drill down to the very lines responsible for the delay.

In this case, it was the outer for loops in the three functions. It was a trivial effort to multithread them by replacing the serial for loops with cilk_for loops:

// parallelized

void mikenet_matrix_vec_mult_t_p (Real * outvec,int nout,Real *invec,

int nin,Real **mat) {

cilk_for (int i = 0; i < nout; ++i) {

for (int j = 0; j < nin; ++j) {

outvec[i] += mat[j][i] * invec[j];

}

}

}

The following illustrates the performance gains achieved by multithreading MikeNet (4-core 1.7GHz system with 2GB of RAM and 512 KB cache/core).

3.5X improvement on 4 cores is pretty darn good...but can we do better? It turns out we can: in one of the functions every memory access misses the cache, and there is thus an opportunity for a more cache-friendly algorithm.

Swapping the inner and outer loop results in a more cache-friendly algorithm:

// inverted, parallelized (with a race on outvec).

void mikenet_matrix_vec_mult_t_p (Real * outvec,int nout,Real *invec,

int nin,Real **mat) {

cilk_for (int j = 0; j < nin; ++j) {

for (int i = 0; i < nout; ++i) {

outvec[i] += mat[j][i] * invec[j];

}

}

}

But when we run the Cilkscreen race detector, we discover that there is now a race on outvec.

Fortunately, Cilk++ reducer hyperobjects are ideally suited for eliminating this race condition without requiring a lock on outvec. Here's the new code:

 

// inverted, parallelized with reducer.

void mikenet_matrix_vec_mult_t_p (Real * outvec,int nout,Real *invec,

int nin,Real **mat) {

array_reducer_t art(nout, outvec);

cilk::hyperobject<array_reducer_t> rvec(art);

cilk_for (int j = 0; j < nin; ++j) {

Real *array = rvec().array;

for (int i = 0; i < nout; ++i) {

array[i] += mat[j][i] * invec[j];

}

}

}

The cache-friendly algorithm picked up a factor 3X relative to the serial implementation on a single processor, and on 4 cores reached 8X relative to the original serial version (for direct comparison, the data from the previous chart is included in the chart below). As usual, improvement to a serial algorithm paid dividends when multicore-enabling it.

 


Conclusion

A key part of a multithreading project is figuring out early on where the performance bottlenecks are, exposing parallelism, and then verifying that the parallelism has not created a race condition. The combination of familiar profiling tools such as VTune, and the Cilkscreen race detector, take a lot of guesswork out of multithreading.