Thread Parallelism Using Cilk Notation for C/C++

Getting top performance out of  modern processors requires both SIMD and thread  parallelism.   Intel® Cilk Plus is an easy way to express both.   My first blog covered the SIMD part.   This blog explains the thread part.

Background


As outlined in the first blog, fork-join and SIMD parallelism can be combined to solve a computational problem:

    • fork-join parallelism can specify which subproblems can be solved in parallel by different hardware threads.

    • SIMD parallelism can be applied to each subproblem, which helps the compiler exploit SIMD instructions.



In fork-join parallelism, control flow forks into separate flows, and later the flows join back together.  Recursive fork-join can be particularly effective.  For example, 10 levels of two-way splits create 1024-way potential parallelism.  Creating 1024 separate hardware threads would be inefficient, but systems like Cilk (and TBB) are careful to turn potential parallelism into actual parallelism only when necessary.  Indeed, with these systems you should specify as much fork-join parallelism as you can, and let the system determine when, where, and how much to exploit.

An important point is that the subproblems should fit in cache; otherwise multiple threads can easily run up against memory bandwidth limitations.  Sometimes it is hard to determine what a cache-sized subproblem is.  When in doubt, go as small as practical.  The subproblems are too small if the overhead of scheduling a chunk becomes significant compared to the work of solving the subproblem serially.  Because Cilk Plus is tightly integrated into the compiler and has highly structured parallelism, it tends to have lower chopping overhead than systems like TBB that are less structured and supplied as libraries.  So you can often afford to chop into finer subproblems than you might with a system such as TBB.  [If you are interested in a deep comparison of Cilk vs. TBB, let me know, and I'll post some blogs on the subject.]
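
To make that concrete, here is a minimal sketch of recursive fork-join with a serial cutoff.  It is a made-up example, not code from any real program, and the cutoff of 10000 is arbitrary; the point is the shape of the recursion, not the numbers.

#include <cilk/cilk.h>
#include <cstddef>

float sum( const float* a, std::size_t n ) {
    if( n<=10000 ) {
        // Below the cutoff, scheduling overhead would rival the work,
        // so just run serially.
        float s = 0;
        for( std::size_t i=0; i<n; ++i )
            s += a[i];
        return s;
    } else {
        // Fork: the left half may run on another hardware thread...
        float left = cilk_spawn sum( a, n/2 );
        // ...while the caller keeps going with the right half.
        float right = sum( a+n/2, n-n/2 );
        // Join: wait for the spawned half before combining results.
        cilk_sync;
        return left+right;
    }
}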

A particularly effective form of chopping is exhibited by cache-oblivious algorithms.  These optimize for all levels of cache that might exist (including virtual memory!), oddly enough without knowing anything about those levels.  See the link for more information.
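
For a flavor of how that works, here is a sketch of a cache-oblivious out-of-place transpose.  It is my own illustration of the standard technique, not code from this project; the names and the base-case size of 32 are arbitrary.  The recursion halves the larger dimension until a block is small, so at some depth every subproblem fits whatever caches exist, even though the code never names them.

#include <cilk/cilk.h>

const int Cutoff = 32;  // base-case block edge, small enough to sit comfortably in L1

// Transpose the block [r0,r1) x [c0,c1) of a (rows x cols) into b (cols x rows).
void transpose( const float* a, float* b, int rows, int cols,
                int r0, int r1, int c0, int c1 ) {
    int dr = r1-r0, dc = c1-c0;
    if( dr<=Cutoff && dc<=Cutoff ) {
        // Small block: do it serially.
        for( int i=r0; i<r1; ++i )
            for( int j=c0; j<c1; ++j )
                b[j*rows+i] = a[i*cols+j];
    } else if( dr>=dc ) {
        // Split the taller dimension; the two halves touch disjoint data.
        int rm = r0+dr/2;
        cilk_spawn transpose( a, b, rows, cols, r0, rm, c0, c1 );
        transpose( a, b, rows, cols, rm, r1, c0, c1 );
        cilk_sync;
    } else {
        int cm = c0+dc/2;
        cilk_spawn transpose( a, b, rows, cols, r0, r1, c0, cm );
        transpose( a, b, rows, cols, r0, r1, cm, c1 );
        cilk_sync;
    }
}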

Quick Introduction to Cilk Notation


There are two ways to express fork-join parallelism in Cilk:

    • cilk_spawn / cilk_sync

    • cilk_for



For example, the statement cilk_spawn f(); asynchronously calls a function f.  Here asynchronous means that the caller keeps going without waiting for the callee to return.  This is a neat way to express parallelism because most of what you learned about subroutine calls still applies.  The actual arguments are evaluated, bound to formal arguments, and the callee's body is invoked.  The only thing that changes is that the caller continues to execute after the callee is entered, if there is a hardware thread available to continue execution of the caller.

Eventually a caller must wait on its callees.  To do so, the caller invokes cilk_sync.  For example, the following code lets f(x), g(y), and h(z) run in parallel if there are sufficient hardware resources. It waits for them to complete before printing their results.

a = cilk_spawn f(x);
b = cilk_spawn g(y);
c = h(z);
cilk_sync;
cout << a << b << c << "\n";


Here is a picture that shows the execution order.

As a matter of good Cilk style, the last call is not spawned.  It could be spawned, but doing so is considered poor style, because there is no other work to do between that call and the cilk_sync.  Doing nothing in parallel with doing something is pointless overhead.

The semantics of cilk_sync is that it waits for all cilk_spawn calls that were issued by the current Cilk block. A Cilk block is the body of a function, or the body of a cilk_for loop. Complex control flow is allowed between cilk_spawn and cilk_sync.  For example, this code:

for( list<T>::iterator i=x.begin(); i!=x.end(); ++i )
    cilk_spawn f(*i);
cilk_sync;


walks a linked list x sequentially and issues calls f(*i) that run concurrently.  Execution waits for all of them to complete at the cilk_sync.

There is always an implicit cilk_sync when a routine returns.  Hence a callee cannot accidentally leave "dangling tasks" running after it returns. This feature makes it much easier to reason about Cilk parallelism than it is to reason about raw threads.
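
For example, here is a tiny sketch of my own (step1 and step2 are placeholders, not real routines):

#include <cilk/cilk.h>
#include <iostream>

void step1() { std::cout << "step1\n"; }
void step2() { std::cout << "step2\n"; }

void f() {
    cilk_spawn step1();   // runs asynchronously...
    step2();              // ...while the caller does this
}   // implicit cilk_sync here: f does not return until step1() has finished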

The callee can be a function object (functor) too.  Since a C++1x lambda expression returns a function object, you can also spawn a block of work like this:

cilk_spawn [&]{for( int i=0; i<5; ++i ) cout<<"quack ";} ();

other_work(); // Do other work while quacking.

cilk_sync;


In the example, the lambda expression [&]{...} returns a function object.  The  cilk_spawn...() causes it to execute asynchronously.  Do not forget the trailing parentheses, or the compiler will balk.

The notation cilk_for is like a C/C++ for, except that iterations run in parallel if resources permit.   For example:

cilk_for( vector<T>::iterator i=x.begin(); i!=x.end(); ++i )
    f(*i);


allows, but does not mandate, each loop iteration to run in parallel.  A cilk_for loop is usually more efficient than a serial for loop wrapped around cilk_spawn requests, because cilk_for can distribute the work across processors in parallel, and much more efficiently, than the combination of for and cilk_spawn.  However, a cilk_for is more limited than a plain for.  The iteration variable must be of an integral type or a random access iterator, and the cilk_for must fit a certain pattern that permits the number of iterations to be computed before the loop iterations commence.  See the documentation here for details on the limitations.
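
As a rough illustration of that pattern (these are my own hypothetical loops, not taken from the documentation): the first loop below is a legal cilk_for because its trip count can be computed before the first iteration runs, while the second must stay a serial for because its exit condition depends on data examined inside the loop.

#include <cilk/cilk.h>
#include <cstddef>
#include <vector>

void scale( std::vector<float>& v, float s ) {
    // Integral index, loop bound fixed before the loop starts: fits the cilk_for pattern.
    cilk_for( std::size_t i=0; i<v.size(); ++i )
        v[i] *= s;
}

std::size_t first_negative( const std::vector<float>& v ) {
    // Exit depends on the values being scanned: trip count is unknown in advance,
    // so this must remain a serial for (or be restructured).
    std::size_t i = 0;
    for( ; i<v.size() && v[i]>=0; ++i )
        ;
    return i;
}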

Another key feature of Intel® Cilk Plus is hyperobjects.  They deal with the "join" part of fork-join parallelism when there is data to be joined, such as in reductions.  Hyperobjects are a topic for another day.
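
To give just a taste: cilk::reducer_opadd and its header are part of Intel® Cilk Plus, but the little loop below is my own sketch, so treat the details as illustrative.  Each strand gets its own private view of the sum, and the views are combined at the joins, so there is no data race and no lock.

#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>

float total( const float* a, int n ) {
    cilk::reducer_opadd<float> sum;   // hyperobject: one private view per strand
    cilk_for( int i=0; i<n; ++i )
        sum += a[i];                  // no race: each strand updates its own view
    return sum.get_value();           // all views have been reduced by this point
}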

Example


If you've read my other recent blogs, you know I'll use  Seismic Duck as the example.  Two parts of Seismic Duck lend themselves to fork-join parallelism. The first is the three physical models that can be updated in parallel:

    • wavefield propagation

    • seismogram

    • reservoir



This parallelism can be expressed with cilk_spawn and cilk_sync.  Here is the code from the actual source:

WavefieldFunctor wf(subsurface, pausedRequest);
cilk_spawn wf();
SeismogramFunctor sf(seismogramClip, pausedRequest&NimbleDraw);
cilk_spawn sf();
ReservoirFunctor rf(pausedRequest);
rf();
cilk_sync;


Two of the functors turn out to be just wrappers around function calls.  They were wrapped because I was using TBB.  I really don't need the wrappers  in Cilk.  For example, SeismogramFunctor looks like:

class SeismogramFunctor {
    const NimbleRequest request;
    NimblePixMap seismogramClip;
public:
    SeismogramFunctor( const NimblePixMap& seismogramClip_, NimbleRequest request_ ) :
        seismogramClip(seismogramClip_),
        request(request_)
    {}
    void operator()() const {
        SeismogramUpdateDraw( seismogramClip, request, TheColorFunc, IsAutoGainOn );
    }
};


so instead of writing that class definition and the lines:

    SeismogramFunctor sf(seismogramClip, pausedRequest&NimbleDraw);
    cilk_spawn sf();


I could have discarded the wrapping and just written:

    cilk_spawn SeismogramUpdateDraw( seismogramClip, pausedRequest&NimbleDraw, TheColorFunc, IsAutoGainOn );


That's short and sweet.

The second part that can be profitably parallelized is wavefield propagation, by using geometric decomposition and the ghost cell pattern, as described in another blog.  There are two phases:

    • Exchange boundary information between panels.

    • Update each panel independently.



The second phase dominates execution time, so that's the part I'll parallelize.  The parallel code is:

cilk_for( int p=0; p<NumPanel; ++p ) {
    if( request&NimbleUpdate )
        WavefieldUpdatePanel( p );
    if( request&NimbleDraw )
        WavefieldDrawPanel( p, map );
}


All I had to do was change the keyword for to cilk_for.  As I mentioned earlier, to fully utilize the processor, we also need SIMD parallelism.  The panel updates are further divided into sequences of tile updates that improve cache reuse.  Part 1 explained how a tile update can be expressed concisely with Array Notation.
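
Here is a hypothetical sketch of how the two levels nest (not the Seismic Duck source; all the names are made up): cilk_for hands whole tiles to different hardware threads, and Array Notation expresses the per-row arithmetic inside a tile so the compiler can generate SIMD instructions.

#include <cilk/cilk.h>

void update_tiles( float* u, const float* v,
                   int numTiles, int tileRows, int rowWidth ) {
    cilk_for( int t=0; t<numTiles; ++t ) {       // thread parallelism across tiles
        float* ut = u + t*tileRows*rowWidth;
        const float* vt = v + t*tileRows*rowWidth;
        for( int r=0; r<tileRows; ++r )
            // SIMD parallelism within a row, via an Array Notation section.
            ut[r*rowWidth : rowWidth] += vt[r*rowWidth : rowWidth];
    }
}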

Serial Semantics


It's no accident that the parallel code is almost identical to the original serial code.  Cilk is essentially annotations on a serial program that say where parallelism is permitted.  In fact, your code can still compile with non-Cilk compilers, albeit without the parallelism. Just use the following defines to undo the annotations:

#define cilk_spawn
#define cilk_sync
#define cilk_for for


Code compiled this way behaves exactly the way the parallel program behaves if limited to a single hardware thread.  I hope that eventually all C/C++ compilers recognize Cilk notation so that the #define trick will not be necessary and Cilk parallelism becomes ubiquitous.
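
One way to keep a single source file that builds either way (my own convention; the HAVE_CILK macro is hypothetical, not part of Cilk Plus) is to guard the stubs:

#ifdef HAVE_CILK
#include <cilk/cilk.h>      // the real cilk_spawn, cilk_sync, cilk_for keywords
#else
#define cilk_spawn          // serial elision: the annotations simply disappear
#define cilk_sync
#define cilk_for for
#endif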

Summary


Cilk notation makes fork-join parallelism easy.  Array Notation makes SIMD parallelism easy.  Intel® Cilk Plus provides both.  Combined, they are a powerful pair of tools for squeezing high performance from modern microprocessors.


Comments

anonymous:

Please post a detailed comparison of Cilk Plus vs TBB -- as you've offered to, near the beginning of your blog. It'll be very helpful.

lightwave22:

I would be very interested in a blog post on Cilk vs. TBB (and possibly also OpenMP). I am an undergrad and the amount of information out there can be somewhat overwhelming at times.

My academic curriculum offers little exposure to parallel programming, so I am having difficulty choosing a platform to just start coding in. With a background heavy in object-oriented C++, it appears OpenMP, TBB, and Cilk have some similarities with regards to C++. Thanks for the blog posts.

Arch D. Robison (Intel):

Cilk has evolved over time. A short history is "Cilk -> Cilk++ -> Cilk Plus". Neither Cilk++ nor Cilk Plus supports abort or inlet; instead they focus on strictly deterministic computation. Non-deterministic cancellation can be simulated by polling. Hyperobjects have a fairly efficient implementation for doing reductions, so it's not clear if extra support for non-deterministic reductions is still necessary.

We're evolving Cilk Plus with emphasis on keeping it simple. This is the first release. We'll grow it based on experience.

The most radical change so far is another old Cilk notation that we removed: the keyword "cilk". It was required in the original Cilk to mark routines that were Cilk code, because those routines had an unusual linkage convention. Cilk++ had something similar in the form of 'extern "Cilk++"' declarations. These were a real pain in highly templated code, because C++ templates have no deduction rules for linkage. Cilk Plus uses the same calling convention as C/C++, so the nuisance of incompatible linkage is gone.

Dmitry Vyukov:

Interestingly, Cilk notation includes the following keywords:

    • spawn

    • sync

    • abort

    • inlet

spawn and sync are the same as cilk_spawn and cilk_sync. abort provides non-deterministic cancellation of tasks, and inlet provides non-deterministic reductions.
