Parallel Building Blocks - Eliminate Data Races, THEN Add Vector + Thread Parallelism with Cilk

In the last 5 blogs, I’ve explained various ways you can have the compiler generate vectorized code for you. If you understand and master each of the different ways Array Notations can vectorize your code, there is one last* recommended step before you start optimizing the “meat” of your ‘for’ loops – eliminate possible data races with the Cilk reducer. This syntax creates a “lock down” on the variable in question and eliminates the data race. If you don't eliminate the data races before you add the vectorization, your debugging time will be unnecessarily compounded. Rule out issues before they become an issue. Let me explain with some code:

int x = 0;
cilk_for(int i = 0; i < N; i++)
x++;


Remember this trivial example from the previous blog?

Here this cilk_for increments a variable. Just a review, the keywords indicates to the Cilk Plus runtime that it can split the work of this for loop into chunks of work that can be run on different CPU cores at the same time. There is a potential data race problem on the increment of x, as one thread might read x but not increment and write it back to memory before another thread can load the value of x and increment it.

#include
cilk::reducer_opadd x;
x.set_value(0);
cilk_for(int i = 0; i < N; i++)
x++;
// Use x.get_value() to access x afterward


That data race can be resolved with a Cilk Plus reducer which provides an efficient, lock-free way of eliminating data races. Simply declaring a variable with the provided Cilk Plus template class type allows you to safely increment this shared variable while still maintaining scalable performance. You’ll need to include the cilk\reducer_opadd header file as well. The set_value and get_value reducer member functions allow you to access the data in a reducer outside of the parallel region. For a full list of supported reducers and information on how to create your own customized reducers, see the compiler documentation.

Now let’s take a look at vector + thread parallelism with the cilk_for. I wrote a function in Array Notations that handles a size-m “chunk” of data. I then call the function multiple chunks of data, in parallel, using multi-threading:

void saxpy_vec(int m, float a, float x[m], float y[m]) {
y[:] += a * x[:]; // Vector code for size m data
}
void main(void) {
int a[2048], b[2048];
cilk_for (int i = 0; i < 2048; i +=256) {
// Call function on size-256 chunks in parallel
saxpy_vec(256, 2.0, &a[i], &b[i]);
}
}


Vectorization has a limit. For example, 4 floats on SSE, 8 floats on AVX. If we want to get more parallelism, we need to use thread-level parallelism. Here, we have a saxpy_vec function that uses vectors to compute y += a * x. We can give this function vectors of size 256, and use threads to execute multiple function calls in parallel. The 256 value needs to be tuned for the OS and hardware. There is overhead involved in creating a thread, we need to make sure that each thread has enough work to do. You can use Parallel Studio to drill down farther for that.

*Last, meaning for beginners in parallel programming just starting to learn Cilk. I recommend others with more experience use Parallel Inspector to find and eliminate data races. We have a host of materials on ISN to show you how to do that. It’s a little more complex than just adding the Cilk keyword, but it has the lowest learning curve out of the analysis tools I’ve tried.

For more complete information about compiler optimizations, see our Optimization Notice.