Potential bug in cilk_for

Potential bug in cilk_for


I am reporting a potential bug in "cilk_for":

I have a following program:
Pochoir_Iterator iter0(b);
Pochoir_Iterator iter1(b);
const int l_stride_b_1 = b.stride(1), l_stride_b_0 = a.stride(0);
const int gap_b_1 = l_stride_b_1 + (1-(N_SIZE-1)) * l_stride_b_0;

for (int t = 0; t < T_SIZE; ++t) {
iter0.set(t+1, 1, 1);
iter1.set(t, 1-1, 1+1);
cilk_for (int i = 1; i < N_SIZE-1; ++i) {
for (int j = 1; j < N_SIZE-1; ++j,
++iter0, ++iter1) {
iter0 = 0.01 + iter1;
} }

Here, b is a 2D array. Pochoir_Iterator is a class to wrap over the array, and provide an STL style iterator to the array. If I replace the 'cilk_for' by normal 'for'. The result is correct, and the running time of this piece of code for array size 555x555 and 555 T_SIZE (time step) is 153 milisecond on my desktop. But if I use 'cilk_for' instead of common 'for', first, the result will be wrong, and second, the running time drops down dramatically from 153 milisecond to 6248 milisecond. Would you give me some hints why?

Best regards!


2 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.
Best Reply

It looks like you have a race condition on iter0 and iter1. A race is when two or more threads access the same location in memory and at least one of them performs a write.

Since (theoretically) all iterations of the cilk_for loop run in parallel, and since all of them are writing to iter0 and iter1, the iterations are all clobbering one another. Iterations should either be independent, or if a reducer can be leveraged to cover the race condition, you can wrap the racing object.

More info on race conditionsand race handling in chapter 12 of the Cilk++ manual:

The reason that it is going slower (I suspect) is that since all of the workers are accessing the same (or adjacent) locations in memory,the program has high cache contention.This means that the processor cores are constantly invalidating cache lines in each other as they perform writes. Thus, the cores are doing many more accesses to higher (slower)levels of memory than they would in a serial execution.

Ultimately, I think the answer is a different algorithm. Is this some kind of a stencil? There are good, cache-oblivious stencil algorithms (and cache-oblivious algorithms tend to have an affinity to parallelism). See the paper: "Cache Oblivious Stencil Computations" by Matteo Frigo and Volker Strumpen (Google it -- it's available online).

Leave a Comment

Please sign in to add a comment. Not a member? Join today