Using Pedigrees in Intel® Cilk™ Plus

Last time, I talked about DotMix, a contributed code for a deterministic parallel random-number generator (DPRNG) in Cilk™ Plus. To quickly recap, DotMix deterministically generates random numbers by hashing pedigrees  — deterministic names for strands in a Cilk Plus program. Pedigrees are a new feature implemented in Intel Cilk Plus and currently available in Intel® Composer XE 2013. Today, I will put on my Intel® Cilk™ Plus runtime-developer hat, and explain what pedigrees are, how they work, and how you can use them in Cilk Plus.



Strands and Execution Dags

To briefly review the execution model of Cilk Plus, recall that a Cilk Plus program can be conceptually decomposed into strands — serial chains of instructions without any parallel control. Intuitively, the three Cilk Plus keywords define boundaries for strands. For example, whenever execution encounters a cilk_spawn or cilk_sync statement, the currently executing strand ends and a new one begins. Every execution of a Cilk Plus program can be conceptually thought of as generating an execution dag — a directed acyclic graph of strands. For more details about the Cilk Plus execution model, see the following online documentation.

Before diving into the details, I will make two remarks. First, to be consistent with the online documentation, I should technically be using the term “maximal strand” (a strand that is not part of a longer strand) instead of just “strand,” since one can always keep breaking a strand into multiple strands until we get down to strands containing a single instruction. Strands that are not maximal are not really interesting for this discussion though, so I will use the term “strand” to mean “maximal strand” unless I specifically indicate otherwise. Second, for now I am going to skip over the discussion of how exactly to model the strands generated by a cilk_for loop in the execution dag. In the Cilk Plus runtime, a cilk_for loop decomposes into a recursive divide-and-conquer algorithm using cilk_spawn and cilk_sync, so in theory cilk_for does not add anything new. But as I'll describe later, it turns out we can do more interesting optimizations of pedigrees for cilk_for.

With those caveats out of the way, let’s look at a concrete example of an execution dag and strands. The computation of fib(4) generates an execution dag with strands as shown in Figure 1 below. The strands in Figure 1 are labeled with letters. For example, A is the initial strand in fib(4), B is the initial strand in fib(3) which was spawned from fib(4), and L is the strand for the continuation of that cilk_spawn.

 int fib(int n) { if (n < 2) { return n; } else { int x, y; x = cilk_spawn fib(n-1); y = fib(n-2); cilk_sync; return x+y; } } 
Figure 1: Execution dag for fib(4).

I’ve intentionally skipped a few letters in this figure. Recall that I ended my last post with the puzzle of how you might define pedigrees for this rand_fib(4). The rand_fib(4) call generates a dag which looks similar to fib(4), but has a few extra strands because of DPRNG calls, as shown in Figure 2. In particular, I’ve split strands so that each call to get() falls in a different strand. For example, strand L in the dag for fib(4) becomes K and L in the dag for rand_fib(4). More precisely, K names the strand with the get() call in Line 10 for rand_fib(4), and L names the strand with the get() call in Line 08 of rand_fib(n-2).

 int rand_fib(int n) { if (n < 2) { sum += global_rng.get() * n; return n; } else { int x, y; sum += global_rng.get() * n; x = cilk_spawn rand_fib(n-1); sum += global_rng.get(); y = rand_fib(n-2); cilk_sync; sum += global_rng.get() * n; return x+y; } } 
Figure 2: Execution dag for rand_fib(4).

Many Cilk Plus programs are dag-invariant, that is, when run on a given fixed input, they always generate the same execution dag every time, no matter how many workers are used to execute the program. Being completely precise about when a program is dag-invariant seems quite tricky, but it is usually fairly intuitive to spot a dag-invariant program when you see one. Usually, as long as your Cilk Plus program does not have a structure of cilk_spawn, cilk_sync, and cilk_for statements that depends on how the program was actually scheduled, the execution dag should remain the same between runs. This condition usually holds for many Cilk Plus programs. To violate the condition and have a nondeterministic execution dag, you often have to do something ugly like query the number of workers and write conditional code based on that value, or use some other mechanisms like locks or exceptions in a way that changes the parallel control flow.



Pedigrees in Cilk Plus

The pedigree of a strand can be thought of as a unique "name" for that strand during a program execution. The idea is that pedigrees should be deterministic for dag-invariant Cilk Plus programs. Conceptually, imagine taking a dag-invariant Cilk Plus program, and changing it so that every strand also prints out its name (pedigree) as output. Sort the list of named strands in some canonical order, for example, the order that strands are encountered in a serial execution of the program. First, the sorted list should have no duplicates, that is, every pedigree should be unique. Second, this sorted list of names should be deterministic across multiple runs of the program, i.e., the list is always the same, no matter how many workers you use to execute the program.

The tricky part is, of course, figuring out how to name the strands. We can’t actually just simply name the strands of fib(4) as A through Q, because when two subdags execute in parallel, we don’t know beforehand how many strands will be needed in each subdag. For example, when we are executing what is labeled as strand L in Figure 3 below, (L is the continuation of the cilk_spawn of fib(3)), we don’t know how many letters are needed inside fib(3). Thus, in Cilk Plus, the pedigree of strand is a variable-length sequence of 64-bit integer values, rather than a single value.

What do pedigrees actually look like in Cilk Plus? Rather than try to explain the algorithm for maintaining pedigrees in gory detail, it is more intuitive to just show you the pedigrees Cilk Plus generates for fib(4):

A: [0, 0]
B: [0, 0, 0]
C: [0, 0, 0, 0]
D: [0, 0, 0, 0, 0]

F: [0, 0, 0, 1]
G: [0, 0, 0, 2]

I: [0, 0, 1]
J: [0, 0, 2]

L: [0, 1]
M: [0, 1, 0]

O: [0, 2]
P: [0, 3]
Q: [0, 4]
Figure 3: Pedigrees for fib(4).

Why do we get these particular numerical values for pedigrees? I claim that the exact values don’t actually matter. In fact, the exact numbers should be considered to be something that can vary between different pedigree implementations. What is significant, however, is that pedigrees in Cilk Plus satisfy three important properties:

  1. Uniqueness: Each strand in the dag has a unique pedigree. In the example above, each strand A through Q has a different pedigree.
  2. Lexicographic ordering: In a serial execution of the program, the pedigrees of strands is increasing in “dictionary” order. In the above example, a serial execution encounters strands in alphabetical order, from A through Q. Note that the pedigrees are also in dictionary order, with the first term being most significant, and the last term being the least significant.
  3. Determinism: For a dag-invariant Cilk Plus program, the pedigrees of strands is the same across multiple runs of the program. Thus, no matter how many workers we use to execute fib(4), we should always get the same pedigrees shown above.

The DotMix DPRNG exploits the properties of uniqueness and determinism. Uniqueness ensures that two different strands won’t generate the same random number unless the hashes of their pedigrees happen to collide. Determinism guarantees that the random numbers generated are repeatable across multiple runs of the program. Lexicographic ordering turns out not to be useful for DotMix, since hashing the pedigree is supposed to destroy any ordering information anyway.

A natural question to ask is, how long can the pedigree be? Theoretically, that detail might vary by implementation. But currently in Cilk Plus, the length of the pedigree increases with the nesting depth of spawned functions. For example, notice that L has pedigree of [0, 1], and the after the cilk_spawn of fib(1), we add a least-significant term of 0 to the pedigree, to get a strand M with pedigree [0, 1, 0].

As another example, Figure 4 shows our answer to the puzzle I posted last week, about pedigrees for rand_fib(4) using DotMix. The main difference between rand_fib(4) and fib(4) is that rand_fib(4) has extra strands at E, H, K, and N  — the places where DotMix generated more than one random number in the corresponding strand in fib(4).

A: [0, 0]
B: [0, 1, 0]
C: [0, 1, 1, 0]
D: [0, 1, 1, 1, 0]
E: [0, 1, 1, 2]
F: [0, 1, 1, 3]
G: [0, 1, 1, 5]
H: [0, 1, 2]
I: [0, 1, 3]
J: [0, 1, 5]
K: [0, 2]
L: [0, 3]
M: [0, 4, 0]
N: [0, 5]
O: [0, 6]
P: [0, 8]
Q: [0, 10]
Figure 4: Pedigrees for rand_fib(4).


Cilk Plus Runtime Interface for Pedigrees

The Cilk Plus runtime provides an interface for querying the pedigree of any currently executing strand. To use this API, include cilk/cilk_api.h in your program.

An easy way to understand the interface is to look at an example. The following method

  1. Reads in the current pedigree to determine its length,
  2. Stores the terms of the current pedigree into an array, and
  3. Prints out the pedigree in order, from most-significant to least-significant term.

This method also advances the current pedigree if advance_pedigreeis nonzero.

 inline void get_and_print_current_pedigree(const char* desc, int advance_pedigree) { // Get the current pedigree. This function returns a copy of the // least-significant node in list representing the current // pedigree. __cilkrts_pedigree ls_node = __cilkrts_get_pedigree(); // Traverse the linked list for the current pedigree to figure out // its length. The length should never be 0. int len = 0; const __cilkrts_pedigree *ped = &ls_node; while (NULL != ped) { ped = ped->parent; len++; } assert(len > 0); // Now allocate an array to store the pedigree values. uint64_t* pedigree = new uint64_t[len]; // Reset, and walk the list again. // Walk the pedigree list again and store the pedigree into an // array in order of most-significant node to least-significant // node. Note that our original pointer is still valid because // the current "Cilk block" has not ended yet. ped = &ls_node; int i = 0; while (i < len) { assert(ped != NULL); // Store the rank into the pedigree. pedigree[len - 1 - i] = ped->rank; ped = ped->parent; i++; } // Now print out the pedigree we just read in. printf("%15s: [ ", desc); for (int i = 0; i < len; ++i) { printf("%llu ", pedigree[i]); } printf("]\n"); if (advance_pedigree) { // Increment the current rank. __cilkrts_bump_worker_rank(); } delete[] pedigree; } 

Intuitively, in Cilk Plus, each worker thread maintains the pedigree of its currently executing strand as a singly linked list of pedigree nodes, each of type __cilkrts_pedigree. This list is maintained in “reverse” order, with the least-significant pedigree node at the beginning of the list and the most-significant node at the end of the list. The method __cilkrts_get_pedigree() returns a copy of the least-significant node in this list. Each pedigree node has a rank field  — a term in the pedigree, and a const parent field  — the pointer to node's parent in the chain. One can read in the current pedigree on a worker by following the chain of parent pointers up to the root.

The __cilkrts_bump_worker_rank() function advances the current pedigree. In other words, the value of the current pedigree immediately after this call is guaranteed to be different from the value of the current pedigree value immediately before this call. The uniqueness property of pedigrees in Cilk Plus says that we always have a one-to-one mapping between pedigrees and strands. To preserve this property, we conceptually consider any call to __cilkrts_bump_worker_rank() as ending the currently executing strand, and starting a new strand. (Alternatively, you might choose to view this function as associating multiple pedigrees to the same strand, although then you would need to extend the uniqueness property of pedigrees appropriately.)

In DotMix, every call to get() invokes __cilkrts_bump_worker_rank(). This call leads to the differences in pedigrees we get between fib(4) in Figure 3 and rand_fib(4) in Figure 4. The get() call effectively "breaks" the currently executing strand into multiple strands, so that we can say that each strand makes at most one call to get a random number. Then, the hash of the pedigree of the strand becomes its random number.

The __cilkrts_bump_worker_rank() function has a slightly strange name in my opinion, as it describes the specific behavior of the current implementation of pedigrees in Cilk Plus, rather than the purpose of the function. A call to this function “bumps” (by incrementing in the implementation) the rank of the least-significant node in the current pedigree. This least-significant node happens to be stored in the current worker, hence the words "worker rank" in the function name. Perhaps we might consider changing the name of this function to something like __cilkrts_advance_pedigree()? Who knows... :)


NOTE:  I compiled the program above using Intel® Composer XE 2013. You will need a compiler that implements Cilk Plus ABI 1 to use pedigrees. Intel® Composer XE 2013 supports ABI 1, but as of today (2013-04-16) the Cilk Plus branch of GCC currently supports only ABI 0.

Other Details

Here are a few other miscellaneous details about pedigrees that may be useful to keep in mind.

  1. In general, one should not expect or rely upon pedigrees having specific numerical values. In fact, the same program compiled may generate slightly different pedigree values when compiled at different optimization levels. This variation can arise in part because of how different compilers may choose to optimize implicit cilk_sync statements.
  2. The parent field of the __cilkrts_pedigree object has type const __cilkrts_pedigree*, to keep users from accidentally modifying the __cilkrts_pedigree object. Of course, since we are working in C or C++, there is not really a good way to actually stop you from modifying a memory location. But you have officially been warned...
  3. The rank field of a __cilkrts_pedigree stores a term in the current pedigree. Each rank is an unsigned 64-bit integer. Theoretically, one could overflow this counter, and then two strands in the program could have the same pedigree. But in our current implementation, ranks change only by incrementing by 1. Thus, if your Cilk Plus program actually manages to overflow a rank field, chances are your code may have other problems... :)
  4. In the current pedigree implementation, the depth of the current pedigree is roughly the current depth of nested cilk_spawn statements, plus the few extra terms at the beginning. In most programs, the depth of nested cilk_spawn’s ends up being relatively small. It is possible, of course, to construct programs where the spawn depth is large, and reading the pedigree may be expensive for these programs.

Finally, I will briefly mention a potentially thorny topic, prompted by the following question: how long do the pointers to __cilkrts_pedigree nodes in the pedigree list remain valid? The documentation in cilk_api.h states that the list remains valid until the end of the current function. Thus, in our get_and_print_current_pedigree, function, we only need to call __cilkrts_get_pedigree() once. Outside this function, however, the list may become invalid.

Note that the list being valid is NOT equivalent to the pedigree remaining the same. For example, after the call to __cilkrts_bump_worker_rank(), the current pedigree has changed, but the list representing the old pedigree is still valid. If we had made another call to __cilkrts_get_pedigree() after the call to __cilkrts_bump_worker_rank(), we would get a node for a second pedigree list, and both lists would be valid for traversal until the current function ends.

It turns out that this condition about the lifetime of __cilkrts_pedigree nodes is actually somewhat conservative for the current implementation of pedigrees. Could we be more precise and less conservative? Probably, but it is not a good idea to rely on that behavior. When implementing DotMix, my coauthors and I actually optimized the implementation of what we called “scoped pedigrees” by storing pointers to __cilkrts_pedigree nodes and using them in an undocumented (i.e., implementation-specific) way to only read in a suffix of the current pedigree, instead of the entire pedigree. But you should consider that part of the code to be experimental at the moment. For now, the safe and portable thing to do is to stick to the documented pedigree interface. :)


Pedigrees for cilk_for Loops

Two additional wrinkles arise when we consider pedigrees and cilk_for loops, namely the issues of “flattened” pedigrees and the loop grainsize.

Flattened pedigrees are an optimization for reducing the length of a pedigree of cilk_for iterations. In Cilk Plus, cilk_for loops are implemented in the runtime using a recursive divide-and-conquer algorithm using only cilk_spawn and cilk_sync. Thus, in principle, we don’t need to do anything special with pedigrees for cilk_for loops. This approach has the disadvantage, however, that a cilk_for loop of n iterations may generate a pedigree with roughly lg n terms, due to the lg n levels of nested spawns in the recursive divide-and-conquer. Instead, one can flatten the pedigree, and use the index of the loop as a term in the pedigree.

The flattening of pedigrees is implemented automatically in Cilk Plus. If one queries the pedigree inside a cilk_for loop, that loop conceptually generates only two additional terms in the pedigree: a more significant term that corresponds to the index in the loop, and a less significant term that (roughly) counts strands within the body of particular iteration.

The grainsize of a cilk_for loop turns out to introduce complications that may require some programmer intervention however. Conceptually, we would like to think of the beginning of every iteration of a cilk_for loop as creating a new separate strand. In actual implementation, however, it is sometimes more convenient to actually have new strands begin at every grain of a cilk_for loop. Thus, in the current Cilk Plus implementation, if we query the current pedigree in every iteration of a cilk_for loop, consecutive iterations may see the same pedigree if they are part of the same grain.

This issue also complicates the determinism property of pedigrees, because the default grainsize of a cilk_for loop sometimes varies depending on the number of workers used to execute a program. Therefore, one might see different pedigree values when comparing two runs of the same program that use different number of workers, even though conceptually the original program was dag-invariant. Thus, by default, with cilk_for, we can still get the same pedigrees between two program runs that use the same number of workers, but the pedigrees might differ if we change the number of workers.

Fortunately, we can force each iteration of a cilk_for loop to have a distinct pedigree by calling the __cilkrts_bump_loop_rank() function once in the body of the cilk_for loop at the beginning of the iteration. Using this call, we can preserve the determinism property of pedigrees for programs with cilk_for loops.

 cilk_for(int i = 0; i < n; ++i) { __cilkrts_bump_loop_rank(); // Body of loop as normal. ... } 

The __cilkrts_bump_loop_rank() call conceptually advances the pedigree to the next iteration of the cilk_for loop. Making the call at the end of the iteration is also usually safe, although you might get some counterintuitive pedigree values in the destructors of any objects declared in the loop body. Note that adding this call may inhibit vectorization within the grain of a cilk_for loop and hurt performance, so use this function only when you really care about pedigrees within that cilk_for loop.


Pedigrees in a Nondeterministic Cilk Plus Program

As discussed earlier, pedigrees satisfy the determinism property for programs that are dag-invariant. For programs that are not dag-invariant, one can still read in the pedigree of a currently executing strand, but since the execution dag will be nondeterministic, pedigrees will also be nondeterministic.

For example, the following Cilk Plus program that uses exceptions is not dag-invariant.

 void g(int i) { if (i == 1) throw 5; else printf("Executed g(%d)\n', i); } int main(void) { try { cilk_spawn g(1); cilk_spawn g(2); g(3); cilk_sync; catch (...) { printf("Caught exception\n"); } return 0; } 

In this program above, the continuation of the cilk_spawn in g(2) may or may not execute, depending on whether the continuation of g(1) is stolen and executed before g(1) manages to throw its exception. Thus, the pedigree of the final strand may vary, depending on how the program executed. Other ways you might get nondeterministic behavior include using other constructs such as explicit locking or thread-local storage.



Summary

An execution of a Cilk™ Plus program can be conceptually modeled as generating an execution dag of strands, with each strand being named by a unique pedigree. Many Cilk Plus programs are dag-invariant, in that they generate the same execution dag on a given input, no matter how many workers are used to execute the program. For dag-invariant programs, pedigrees are deterministic. Intel® Cilk™ Plus provides an interface for querying the pedigree of a currently executing strand.

The DotMix DPRNG is one example of a code that uses pedigrees, by hashing pedigrees to deterministically generate random numbers. Are there other use cases for pedigrees? If pedigrees seem like they would be a useful for your program, I'd encourage you to try them out in Intel® Composer XE 2013 and send us your feedback!

For more information about Intel Cilk Plus, see the website http://cilkplus.org . For questions and discussions about Intel Cilk Plus, see the forum http://software.intel.com/en-us/forums/intel-cilk-plus.

Einzelheiten zur Compiler-Optimierung finden Sie in unserem Optimierungshinweis.