Serial Equivalence of Cilk Plus programs

The serial equivalence of a Cilk™ Plus parallel program

There is a trend in the C++ community to grow capabilities thru more libraries and as much as possible, avoid adding language keywords.
Consistent with these trends are Intel’s Threading Building Blocks and Microsoft’s Parallel Patterns Library.
The question arises, then, why implement Intel’s Cilk™ Plus as language extensions rather than a library?
One of the answers is that the language is implemented by compilers, and compilers can provide certain guarantees. One such guarantee is serial equivalence.
Every Cilk Plus program that uses the 3 taking keywords for parallelism has a well-defined serial elision.
The serial elision is defined by replacing each cilk_spawn and each cilk_sync with white spaces, and each cilk_for with the for keyword.
Obviously, the serial elision of a Cilk Plus program is a valid C/C++ program.

A program has a determinacy race if two logically parallel strands both access the same memory location and at least
one of them modifies the memory location.
If a Cilk Plus parallel program has no determinacy race, then it will produce the same results as its serial elision.
What are the compiler’s contributions to the serial equivalence guarantees? Consider the following code illustration:

    int x1 = func1();
    int d1 = 0;
    int x2 = cilk_spawn child1(bar1(), bar2());  //spawn a funciton whose arguments are function calls
int x3 = cilk_spawn child2(&d1);                        // pass a stack address to a spawned function
    int x4 = func3();                                                              // func3 can execute in parallel with child1 and child2
   cilk_sync;                                                                          // wait for the child tasks to complete before their results can be used
return x1 + x2 + x3 + x4 + d1;

The function foo spawns the function child1() so that it can execute concurrently with the function func2().
The statement cilk_sync causes execution to wait until child1 and child2 return, so that their return values can be used.
The compiler makes 3 contributions that determine the execution of the code here:
1. The functions bar1() and bar2(), which produce values that are arguments to child1(), are evaluated sequentially by the parent thread,
i.e. the same thread on which the function foo executes. A different execution models would allow them to execute in parallel to each other,
or on a thread that is different from the parent thread. However, only this execution model corresponds to the way in which the underlying,
C/C++ languages work.
2. The compiler inserts a cilk_sync statement before the return statement in function foo().
The compiler inserts such a statement before the return of every function that includes a cilk_spawn.
This enforces structured fork – join parallelism. It makes the program behavior easier to understand.
On the practical side, the “implicit” cilk_sync inserted by the compiler ensures that the stack frame of foo() is in place throughout the
execution of functions it spawns.
In this illustration, since foo passed the address of d1 to child2(), it guarantees that when child2 writes into the location of d1,
the memory location is as expected, in the stack of foo().
3. The thread that executes foo proceeds to execute child1(). The continuation of foo(), starting from the statement that follows the
cilk_spawn statement, is what it enqueued for later execution and being made available for stealing by other threads. This is called ‘parent stealing’.
Library implementations of work stealing use ‘child stealing’. In this illustration, child stealing would mean that child1() would be
enqueued for later evaluation and be made available for other workers to steal. However, only parent stealing is equivalent to the execution of
the sequential program.
Here is an example with a small code fragment that actually does something. Assume you have a ternary tree, in which nodes points to left, middle and right child nodes, and in addition have a color.  A linked list of all red nodes can be constructed with a recursive traversal of the tree, where each red node gets pushed onto a forming linked list. When parallelizing the recursion, care has to be taken not to create a data race when pushing nodes onto the global linked list. The recommended way to resolve the data race in Cilk Plus is to use a hyper object for the linked list. The hyper object provides a local view for each strand. A possible implementation of the parallelized recursion is here:

cilk::reducer_list_append<terntreenode *> root;
find_reds_par(terntreenode *p)
      if (p->color == red) {
      if (p->left) cilk_spawn find_reds_par(p->left);
if (p->middle) cilk_spawn find_reds_par(p->middle);
if (p->right) find_reds_par(p->right);

The cilk Plus parallel traversal will produce the same linked list, with the same order of nodes, as the serial elision of the program.

For more complete information about compiler optimizations, see our Optimization Notice.


Matthias H. (Intel)'s picture

well, over time you find a lot enhancements / additions to languages. If you want to run OpenMP the compiler has to support this. Staying with the OpenMP example: Intel has introduced OpenMP additions like taskq in the past which are now more or less in the OpenMP 3 standard. So probably the root of your question is whether Cilk will be part of the standard at some stage. In my opinion this would definitely be a good thing.

anonymous's picture

By introducing new click keywords in the core language itself, the core objective of C++ is defeated. When Intel TBB allows parallelism, why these new keywords?. Are there any guidelines on when to use TBB or click keywords?.

Add a Comment

Have a technical question? Visit our forums. Have site or software product issues? Contact support.