Parallel Building Blocks – Cilk ‘for’ Primer

So you’ve made the decision to go with a language extension and finish the parallel coding quickly. With Cilk you don’t have to spend a lot of time thinking about your algorithm and problem, and it’s not meant for people who do want to spend that time. For this PBB model you can simply learn the syntax and usage of the Cilk ‘for’ and quickly augment your serial ‘for’ loops to get task parallelism across cores. After that you can protect against data races and add vectorization (I will talk about both in subsequent blogs/videos).

First things first: if you already have the latest Intel compiler, it’s in there. Otherwise, there are a variety of ways to get it. It’s available in the Intel® Parallel Composer product, part of Parallel Studio or Parallel Studio XE 2011. Composer contains much more than just the compiler, just as Intel Parallel Studio contains much more than just Composer (it has four large components). At the very least I would recommend checking out the 90-day trial version of Intel Parallel Studio and seeing everything that’s in there. Most people I have talked to outside of Intel find at least one part of the whole “all-in-one toolkit for parallel programming” that they didn’t know existed and were planning to buy as a separate program (e.g. Amplifier), or they start using a component they had never considered (e.g. inspecting for memory errors) because it had seemed too complicated and too time consuming to use.

Let’s take a high level look at the Cilk™ Plus syntax. It’s only three keywords for what they call “fine grained” task parallelism:

cilk_spawn
cilk_sync
cilk_for


There’s also a class of custom data types called reducers that will create what I call a “lock down” on a variable to prevent a data race.
So this is pretty straightforward; it’s just like a regular ‘for’:

cilk_for (int x = 0; x < 1000000; ++x) { … }


If you have taken a look at any Cilk training on ISN, you’ll have seen the following caveats to using it:

- Any or all iterations may execute in parallel with one another.
- All iterations complete before the program continues.
- Constraints:
  - Limited to a single control variable.
  - Must be able to jump to the start of any iteration at random.
  - Iterations should be independent of one another.

So how does this ‘for’ loop compare with the ones offered by TBB and ArBB? This one gives you the least control: you’re relying on the runtime to make the threading decisions for you, and you don’t have much say over distribution across cores. It’s not a substitute for a finely tuned ‘for’ implementation or an expert TBB implementation, but it still does a great job.

We used the following example in our PBB lab held at both IDFs:
int x = 0;

cilk_for(int i = 0; i < N; i++)
x++;


If it looks too simple, that’s because it is meant to be. In this example, the cilk_for loop simply increments a variable. Using cilk_for indicates to the runtime that it can split the work of this loop into chunks that can run on different CPU cores at the same time. Note that there’s a potential data race on the increment of x: one thread might read x, but before it increments the value and writes it back to memory, another thread loads the same stale value and increments that, losing an update. We will discuss how to resolve that data race with a reducer in the next blog.

For more complete information about compiler optimizations, see our Optimization Notice.