Left versus Right Holders in Intel® Cilk Plus

This article covers a topic of interest to both Cilk and TBB programmers. It's about holders in Intel® Cilk and a similar pattern in TBB, which are good to know about when programming with those frameworks. The key point is the difference between "left holders" and "right holders", and the scenarios where each is appropriate.

Two Kinds of Holders

A "holder" in Cilk Plus is a hyperobject for a reduction operation opthat "forgets" one of the values. There are two kinds of holders:

  • "left holders", where x op y x.
  • "right holders", where x op y y.

TBB does not have direct support for holders like Cilk does, but does support parallel reductions using tbb::parallel_reduce, and you can specify either variant of op for the reduction operation. For both TBB and Cilk, op is associative but not commutative. Reducing the sequence a b c ... z results in a for a left holder, and z for a right holder. Here is one way to implement a right holder in Cilk Plus:

    #include <cilk/reducer.h>
    template<typename U>
    class RightHolder {
        struct Monoid : cilk::monoid_base<U> {
            static void reduce(U *left, U *right) {
                swap(*left,*right);
            }
        };

        cilk::reducer<Monoid> impl;
    public:
         inline U& get_view() {
            return impl.view();
        }
    };

Method reduce must implement *left = *left op *right, and can assume that *right is never used again. Invoking swap does this, and is more efficient than doing *left=*right in some contexts. In C++0x, using a "move assignment" instead of swap would be natural. A left holder is similar, but with the swap removed. Thus left holders have a slighty efficiency advantage, and work even when type U does not support swap. Earlier I said that reducing the sequence a b c ... z results in a for a left holder and z for a right holder. So if you are using the final result, it's obvious which one to use. However, sometimes holders are used when the final result is notgoing to be used, and only intermediate results are of interest. The next three sections give examples of this use case, and advice on how to choose left versus right.

Pattern 1: Optional Loop Carried Dependence

The pattern is when there is a loop carried dependence that can be optionally removed, but the removal is expensive. Use a right holder to reduce the expense to where it is required for actual parallelism instead of paying it for potential parallelism. Here is a sketch of the pattern, which is quite common in serial programming.

    x = initial_value;
    for( int k=0; k<n; ++k ) {
        foo(x);
        x = next(x);
    }

Assume that it is safe to run multiple invocations of foo(x) in parallel. The only thing stopping us from executing iterations in parallel is the assignment x=next(x). This is called a loop carried dependence. Such dependences can sometimes be eliminated by transforming the code. For example, if next(x) is "x+c", the value of x for the invocation of foo is always foo(initial_value+k*c). Let's generalize this transformation by supposing that we can create a function kth(initial_value,k) that computes the value that x should have when entering the kth iteration. The general transformation is:

    x = initial_value;
    cilk_for( int k=0; k<n; ++k ) {
        foo(xkth(initial_value,k));
        x = next(x);
    }
    x = kth(initial_value,n);

When doing this kind of hand transformation, do not forget to compute the last value of x if it is used after the loop. Sometimes kth is much more expensive to evaluate than next, possibly undoing gains from parallelism. The ideal solution will use next when evaluating consecutive iterations, and only use kth when we do not know the previous value of x. For example, in the prime finder example that comes in the TBB distribution, x is a std::vector full of counters that are advanced by various stride values by the foo operation. In that code, next requires doing nothing, but kthrequires doing some arithmetic to compute each vector element. Here's how to implement the ideal solution using a holder. Have a special value that identifies a holder view as uninitialized. It's simplest if that value can be the default value. For example, in the "primes" example , x is a std::vector that is not normally empty, so an empty vector can serve as the special value. Then a right holder does the trick. The general scheme is:

    RightHolder<X> x;
    ...
    x = initial_value;
    cilk_for( int k=0; k<n; ++k ) {
        if( x is default value )
            x = kth(initial_value,k);
        foo(x);
        x = next(x);
    }

To see why it is important to use a right holder, and not a left holder, suppose that n==6. The serial execution of the loop can be summarized as shown:

Each ... stands for "foo(x); x=next(x);" In the serial execution, x is never the default value. There are many possible parallel executions of the loop. Here is one to demonstrate why a right holder is appropriate:

Each of the three arrows is a strand of execution. Each strand gets its own view of x. The strand with "k=0" gets the original view. The strand with "k=2" gets a default-initialized view. At the join point, two views reduce to a single view that is used by the strand with "k=4". When that strand executes the iteration with "k=4", it should be using the value of x updated by "x=next(x) for k=3, not the value of x after "x=next(x)" for k=1. Thus the right view should be propagated and the left one thrown away. That's what a right holder does. When I wrote the TBB primes example that uses the aforementioned pattern, I mistakenly used x op y x for the reduction operation in tbb::parallel_reduce. The error went unnoticed until recently because executions where the reduction value was used were rare.

Pattern 2: Scratch Object

Sometimes in serial loops this pattern arises:

    for( int k=0; k<n; ++k ) {
        ExpensiveToConstructObject x;
        ...use x...
    }

If it does not matter in the ... whether the object is freshly constructed, a programmer will often optimize by hoisting construction/destruction of the object, like this:

    {
        ExpensiveToConstructObject x;
        for( int k=0; k<n; ++k ) {
            ...use x...
        }
    }

This scenario often arises when ExpensiveToConstructObject is a heap-allocated array that is used for scratch space. Though good for faster serial optimization, the transform prevents parallelization using cilk_for, because concurrent iterations will interfere with each other. What we want is to create a ExpensiveToConstructObjectonly when necessary to prevent interference. Making x into a holder does the trick:

    {
        LeftHolder<ExpensiveToConstructObject> x;
        for( int k=0; k<n; ++k ) {
            ...use x...
        }
    }

Each time a strand starts ab initio, it gets a freshly constructed ExpensiveToConstruct object as its view. The view will be carried across subsequent iterations on the same strand. At join points, one of the incoming strands will donate its view for use by the outgoing strand. It does not matter whether it is a left or right holder, because it does not matter what value is carried between iterations or returned by the reduction. However, the reason for choosing a LeftHolder over RightHolder is that RightHolder requires a swap operation that LeftHolder does not. ExpensiveToConstructObjectmight not support a swap operation, or the swap operation might be expensive.

Pattern 3: Detecting a Steal

My blog on detecting theft covered this application of a holder. There I used a left holder for simplicity, but it really does not matter if the holder is a right holder in that blog's example. The reduction is always over "true op true", so it doe not matter which argument op returns. However, Pablo Halpern pointed out to me that a right holder should be used in the following scenario. Suppose that a function spawns work, and after a cilk_sync you want to detect whether there were anysteals. Then a right holder should be used, like this:

    void f() {
        RightHolder<bool> x;
        x.get_view() = true;
        cilk_spawn g();
        cilk_spawn h();
        cilk_sync;
        if( !x.get_view() )
            printf("continuation of g() or h() was stolen\n";
    }

If a continuation after g() or h() is stolen, it gets a fresh view of x, default initialized to false. When execution reaches the cilk_sync, the views merge. Using a right holder ensures that the final view will be one of the fresh views, if any were generated. Thus x will be false after the cilk_syncunless no stealing occured.

Summary

Holders come in left and right forms. Right holders can be useful for exploiting optional loop-carried dependences in parallel code. Both left and right holders are useful for optimizing temporary objects used in loops when it does not matter what value is carried between iterations, though left holders have a slight advantage in that case. Though the concepts directly map to Cilk Plus, they also apply to tbb::parallel_reduce.

Einzelheiten zur Compiler-Optimierung finden Sie in unserem Optimierungshinweis.