Thread Parallelism Using Cilk Notation for C/C++

Getting top performance out of modern processors requires both SIMD and thread parallelism. Intel® Cilk Plus is an easy way to express both. My first blog covered the SIMD part. This blog explains the thread part.


As outlined in the first blog, fork-join and SIMD parallelism can be combined to solve a computational problem:

    • fork-join parallelism can specify which subproblems can be solved in parallel by different hardware threads.

    • SIMD parallelism can be applied to each subproblem, which helps the compiler exploit SIMD instructions.

In fork-join parallelism, control flow forks into separate flows, and later the flows join back together.  Recursive fork-join can be particularly effective.  For example, 10 levels of two-way splits create 1024-way potential parallelism.  Running 1024 separate hardware threads would be inefficient, but systems like Cilk (and TBB) are careful to turn potential parallelism into actual parallelism only when necessary.  Indeed, with these systems you should specify as much fork-join parallelism as you can, and let the system determine when, where, and how much to exploit.
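To make the recursive-splitting idea concrete, here is a minimal sketch (not from this blog's application; the function name, element type, and 1024-element grain size are my own choices). It uses the serial-elision defines discussed at the end of this blog, so any C++ compiler accepts it; a Cilk-aware compiler would omit the defines and get real parallelism:

```cpp
#include <cassert>
#include <cstddef>

// Serial elision: a Cilk-aware compiler would treat these as keywords instead.
#define cilk_spawn
#define cilk_sync

// Recursively split the range in half; each split is a potential fork,
// so k levels of two-way splits expose 2^k-way potential parallelism.
long sum(const long* a, std::size_t n) {
    if (n <= 1024) {                        // grain size: small enough to run serially
        long s = 0;
        for (std::size_t i = 0; i < n; ++i) s += a[i];
        return s;
    }
    long left = cilk_spawn sum(a, n/2);     // may run in parallel with the caller
    long right = sum(a + n/2, n - n/2);     // caller keeps working on this half
    cilk_sync;                              // wait for the spawned half
    return left + right;
}
```

Each two-way split doubles the potential parallelism, yet the scheduler pays for actual parallelism only when an idle hardware thread steals the spawned half.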

An important point is that the subproblems should fit in cache; otherwise multiple threads can easily run up against memory bandwidth limitations.  Sometimes it is hard to determine what a cache-sized subproblem is.  When in doubt, go as small as practical.  The subproblems are too small if the overhead of scheduling a chunk becomes significant compared to the work of solving the subproblem serially.  Because Cilk Plus is tightly integrated into the compiler, and has highly structured parallelism, it tends to have lower chopping overhead than systems like TBB that are less structured and supplied as libraries.  So you can often afford to chop into finer subproblems than you might with a system such as TBB.  [If you are interested in a deep comparison of Cilk vs. TBB, let me know, and I'll post some blogs on the subject.]

A particularly effective form of chopping is exhibited by cache-oblivious algorithms.  These optimize for all levels of cache that might exist (including virtual memory!), oddly enough without knowing anything about the levels.  See the link for more information.
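Here is a sketch of the cache-oblivious pattern (the function name and 16×16 base case are my own choices, not code from this blog). A recursive matrix transpose keeps halving the larger dimension of the current tile, so at some recursion depth the tiles fit in every cache level, without the code ever knowing any cache sizes:

```cpp
#include <cassert>

// Cache-oblivious transpose of an n x n matrix: B = A^T.
// Recursively halve the larger dimension of the tile [r0,r1) x [c0,c1).
void transpose(const double* A, double* B,
               int r0, int r1, int c0, int c1, int n) {
    if (r1 - r0 <= 16 && c1 - c0 <= 16) {
        for (int i = r0; i < r1; ++i)       // base case: tile small enough
            for (int j = c0; j < c1; ++j)   // to fit in any reasonable cache
                B[j*n + i] = A[i*n + j];
    } else if (r1 - r0 >= c1 - c0) {
        int rm = (r0 + r1) / 2;             // split the taller dimension
        transpose(A, B, r0, rm, c0, c1, n);
        transpose(A, B, rm, r1, c0, c1, n);
    } else {
        int cm = (c0 + c1) / 2;             // split the wider dimension
        transpose(A, B, r0, r1, c0, cm, n);
        transpose(A, B, r0, r1, cm, c1, n);
    }
}
```

The two recursive halves write disjoint parts of B, so in Cilk the first call of each pair could safely be a cilk_spawn, combining the cache-oblivious chopping with fork-join parallelism.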

Quick Introduction to Cilk Notation

There are two ways to express fork-join parallelism in Cilk:

    • cilk_spawn / cilk_sync

    • cilk_for

For example, the statement cilk_spawn f(); asynchronously calls a function f.  Here asynchronous means that the caller keeps going without waiting for the callee to return.  This is a neat way to express parallelism because most of what you learned about subroutine calls still applies.  The actual arguments are evaluated, bound to formal arguments, and the callee's body is invoked.  The only thing that changes is that the caller continues to execute after the callee is entered, if there is a hardware thread available to continue execution of the caller.

Eventually a caller must wait on its callees.  To do so, the caller invokes cilk_sync.  For example, the following code lets f(x), g(y), and h(z) run in parallel if there are sufficient hardware resources. It waits for them to complete before printing their results.

a = cilk_spawn f(x);
b = cilk_spawn g(y);
c = h(z);
cilk_sync;
cout << a << b << c << "\n";

Here is a picture that shows the execution order.

The last call is not spawned as a matter of good Cilk style.  It could be spawned, but doing so is considered poor style, because there is no other work to do between that call and the cilk_sync.  Doing nothing in parallel with doing something is pointless overhead.

The semantics of cilk_sync is that it waits for all cilk_spawn calls that were issued by the current Cilk block. A Cilk block is the body of a function, or the body of a cilk_for loop. Complex control flow is allowed between cilk_spawn and cilk_sync.  For example, this code:

for( list<int>::iterator i=x.begin(); i!=x.end(); ++i )
    cilk_spawn f(*i);
cilk_sync;
walks a linked list x sequentially and issues calls f(*i) that run concurrently.  Execution waits for all of them to complete at the cilk_sync.

There is always an implicit cilk_sync when a routine returns.  Hence a callee cannot accidentally leave "dangling tasks" running after it returns. This feature makes it much easier to reason about Cilk parallelism than it is to reason about raw threads.

The callee can be a function object (functor) too.  Since a C++11 lambda expression returns a function object, you can also spawn a block of work like this:

cilk_spawn [&]{for( int i=0; i<5; ++i ) cout<<"quack ";} ();

other_work(); // Do other work while quacking.


In the example, the lambda expression [&]{...} returns a function object.  The cilk_spawn...() causes it to execute asynchronously.  Do not forget the trailing parentheses, or the compiler will balk.

The notation cilk_for is like a C/C++ for, except that iterations run in parallel if resources permit.   For example:

cilk_for( vector<int>::iterator i=x.begin(); i!=x.end(); ++i )


allows, but does not mandate, each loop iteration to run in parallel.  A cilk_for loop is usually more efficient than a serial for loop wrapped around cilk_spawn requests.  The reason is that a cilk_for can distribute the work across processors in parallel, much more efficiently than the combination of for and cilk_spawn.  However, a cilk_for is more limited than a plain for.  The iteration variable must be of an integral type or a random-access iterator, and the cilk_for must fit a certain pattern, which permits the number of iterations to be computed before the loop iterations commence.  See the documentation here for details on the limitations.
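For contrast with the linked-list loop earlier, here is a small sketch of a conforming cilk_for (the function name and element type are my own choices). The loop variable is integral and the bounds are simple, so the trip count is computable before any iteration starts. The serial-elision define lets it compile without a Cilk-aware compiler:

```cpp
#include <cassert>
#include <vector>

// Serial elision so this sketch compiles with any C++ compiler.
#define cilk_for for

// Conforming cilk_for: integral loop variable, simple bounds and stride,
// so the number of iterations is known before the loop body runs.
std::vector<double> scale(const std::vector<double>& x, double k) {
    std::vector<double> y(x.size());
    cilk_for( int i = 0; i < (int)x.size(); ++i )
        y[i] = k * x[i];   // iterations are independent: safe to run in parallel
    return y;
}
```

A loop over a linked list cannot be a cilk_for, because the trip count is unknown until the list has been walked; that case needs the for/cilk_spawn combination shown earlier.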

Another key feature of Intel® Cilk Plus is hyperobjects.  They handle the "join" part of fork-join parallelism when there is data to be joined, such as in reductions.  Hyperobjects are a topic for another day.


If you've read my other recent blogs, you know I'll use Seismic Duck as the example.  Two parts of Seismic Duck lend themselves to fork-join parallelism. The first is the three physical models that can be updated in parallel:

    • wavefield propagation

    • seismogram

    • reservoir

This parallelism can be expressed with cilk_spawn and cilk_sync.  Here is the code from the actual source:

WavefieldFunctor wf(subsurface, pausedRequest);
cilk_spawn wf();
SeismogramFunctor sf(seismogramClip, pausedRequest&NimbleDraw);
cilk_spawn sf();
ReservoirFunctor rf(pausedRequest);
rf();

Two of the functors turn out to be just wrappers around function calls.  They were wrapped because I was using TBB.  I really don't need the wrappers in Cilk.  For example, SeismogramFunctor looks like:

class SeismogramFunctor {
    const NimbleRequest request;
    NimblePixMap seismogramClip;
public:
    SeismogramFunctor( const NimblePixMap& seismogramClip_, NimbleRequest request_ ) :
        request(request_),
        seismogramClip(seismogramClip_)
    {}
    void operator()() const {
        SeismogramUpdateDraw( seismogramClip, request, TheColorFunc, IsAutoGainOn );
    }
};

so instead of writing that class definition and the lines:

    SeismogramFunctor sf(seismogramClip, pausedRequest&NimbleDraw);

    cilk_spawn sf();

I could have discarded the wrapping and just written:

    cilk_spawn SeismogramUpdateDraw( seismogramClip, request, TheColorFunc, IsAutoGainOn );

That's short and sweet.

The second part that can be profitably parallelized is wavefield propagation, by using geometric decomposition and the ghost cell pattern, as described in another blog.  There are two phases:

    • Exchange boundary information between panels.

    • Update each panel independently.

The second phase dominates execution time, so that's the part I'll parallelize.  The parallel code is:

cilk_for( int p=0; p<NumPanel; ++p ) {
    if( request&NimbleUpdate )
        WavefieldUpdatePanel( p );
    if( request&NimbleDraw )
        WavefieldDrawPanel( p, map );
}
All I had to do was change the keyword for to cilk_for.  As I mentioned earlier, to fully utilize the processor, we also need SIMD parallelism.  The panel updates are further divided into sequences of tile updates that improve cache reuse.  Part 1 explained how a tile update can be expressed concisely with Array Notation.

Serial Semantics

It's no accident that the parallel code is almost identical to the original serial code.  Cilk is essentially annotations on a serial program that say where parallelism is permitted.  In fact, your code can still compile with non-Cilk compilers, albeit without the parallelism. Just use the following defines to undo the annotations:

#define cilk_spawn

#define cilk_sync

#define cilk_for for

Code compiled this way behaves exactly the way the parallel program behaves if limited to a single hardware thread.  I hope that eventually all C/C++ compilers recognize Cilk notation so that the #define trick will not be necessary and Cilk parallelism becomes ubiquitous.
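As a quick check of serial semantics, here is a tiny self-contained sketch (functions f, g, and demo are hypothetical) that uses exactly those defines. Compiled by any C++ compiler, the annotations vanish and the program is plainly serial; a Cilk-aware compiler would instead treat the spawn as real potential parallelism, with the same result:

```cpp
#include <cassert>

// The serial-elision defines from above: annotations vanish,
// leaving an ordinary serial C++ program.
#define cilk_spawn
#define cilk_sync
#define cilk_for for

int f(int x) { return x + 1; }   // hypothetical work functions
int g(int x) { return x * 2; }

int demo(int x) {
    int a = cilk_spawn f(x);     // under elision: a plain function call
    int b = g(x);                // runs while f(x) executes, given a Cilk compiler
    cilk_sync;                   // under elision: an empty statement
    return a + b;
}
```

Because Cilk only annotates where parallelism is permitted, demo returns the same value either way; that is what makes the parallel program as easy to reason about as the serial one.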


Cilk notation makes fork-join parallelism easy.  Array Notation makes SIMD parallelism easy.  Intel® Cilk Plus provides both.  Combined, they are a powerful pair of tools for squeezing high performance from modern microprocessors.

For more complete information about compiler optimizations, see our Optimization Notice.