A DPRNG for Cilk™ Plus?

Last time, I introduced DotMix, a deterministic parallel random-number generator (DPRNG) for Intel® Cilk™ Plus. To briefly summarize the last post, DotMix is a new contribution that you can download from our Contributed Code section. I previously mentioned two advantages that DotMix has over a traditional serial random-number generator (RNG) such as rand():

  1. DotMix avoids some of the scalability bottlenecks of a traditional serial RNG, and
  2. DotMix generates random numbers deterministically for most Cilk Plus programs when using the same seed and program input.

Today, I will discuss these advantages in greater detail, and explain why the problem of deterministic random-number generation for Cilk Plus turns out to be more complicated than one might initially expect.


Origins of DotMix


DotMix is a code developed by Tao B. Schardl and myself, with input from Charles E. Leiserson. DotMix comes out of some recent research done at MIT. For those of you who are not familiar with the history of Cilk, Cilk Plus is a descendant of the Cilk project from MIT in the mid 90's. Since that time, there has been a long tradition of Cilk-related research coming out of MIT, particularly from the SuperTech research group, headed by Charles E. Leiserson. In general, at any point in time, there are a number of ongoing research projects relevant to the Cilk Plus community, happening both at MIT and elsewhere. But I am able to tell you the most details about DotMix, since I had the opportunity to participate in this particular research effort.

Before diving in any further, it is useful for me to give you a bit of background about myself. My day job is Intel® Cilk™ Plus runtime developer, which means I am one of a number of Intel employees whose job it is to make sure that Cilk Plus works as it should, and to help fix it when it doesn’t. Before joining Intel however, I was an MIT student, member of the SuperTech group, and a frequent user of Cilk technologies. Thus, you could say that one of my long-time hobbies is writing Cilk code and experimenting with new ways to make Cilk better. In today’s post, I'll remove my Intel® Cilk™ Plus runtime developer hat, put on my Cilk researcher and individual contributor hat, and tell you the story of DotMix.

To the best of my knowledge, the research into what eventually led to DotMix began sometimes around 2010, while I was a student at MIT. As those of you who have worked on similar projects may be aware, it is often the case with any kind of research that the origins of any particular idea or research topic are often impossible to attribute to any one person or event. Moreover, the journey towards a solution one takes towards a solution is inevitably more convoluted than it “should have been” in hindsight. Nevertheless, for the purposes of telling this story, I will describe a relatively straightforward path to the solution we (my coauthors at MIT and I) ended up with. I will also pretend that our investigations began with a variant of our favorite Cilk program: fib. For the latter, I suspect that I'm not really stretching the truth too far. :)

In particular, we started with (the conceptual equivalent of) the following Cilk Plus program.

#include <cstdio>
#include <cstdlib>
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>

// Global reducer to add up random numbers.
cilk::reducer_opadd<uint64_t> sum(0);

int rand_fib(int n) {
   if (n < 2) {
      // Generate a random number, add it to sum.
      sum += rand() * n;
      return n;
   }
   else {
      int x, y;
      sum += rand() * n;
      x = cilk_spawn rand_fib(n-1);
      sum += rand() * n;
      y = rand_fib(n-2);
      cilk_sync;
      sum += rand() * n;
      return x+y;
  }
}

int main(void) {
   int n = 30;
   int ans = rand_fib(n);
   std::printf("fib(%d) = %d\n", n, ans);
   std::printf("sum is %llu\n", sum.get_value());
   return 0;
}

In this program, the rand_fib() method computes the nth Fibonacci number, but also uses a (pseudo)random-number generator (RNG) to generate a random number at various points in the computation. These random numbers are multiplied by the current value of n, and added together into the reducer sum. We would like this program to run efficiently in parallel. After all, the whole point of parallelizing a program is to make it run faster!

On the other hand, we might also want the program to generate deterministic output, that is, output the same final sum on each run of the program. Using a typical implementation of rand(), this program has deterministic output for a serial execution, since rand() generates a deterministic stream of random numbers. For a parallel execution of this program, however, we don't always get deterministic behavior. As an aside, the multiplication of the random number by n is significant here. If we simply added all the random numbers together, we might still get deterministic output because integer addition is commutative!

I'm sure many of you have experienced firsthand the frustration that stems from trying to track down a bug in a program that you can't consistently reproduce because the program behaves differently every time you run it. All other things being equal, for a parallel execution it would be nice to have a deterministic parallel random-number generator — an RNG that generates the same random numbers each time you run the code, even in parallel. Using such an RNG eliminates a source of nondeterminism in a parallel program, which I assert makes parallel programming easier.

From this point forward, I'm going to use the acronym defined earlier: DPRNG. I claim this acronym rolls off the tongue a bit more easily than "deterministic parallel random-number generator." :)

Can we achieve both good performance and deterministic execution? As we thought about the problem, we realized that achieving both goals simultaneously was trickier than we first thought.


Stream RNGs and Parallel Loops


To understand some of the issues, it is helpful to review how a typical RNG such as rand() is implemented. RNGs like rand() are typically stream RNGs — RNGs designed to output a stream of numbers, one at a time, based on the generator's internal state. Seeding the RNG initializes the RNG’s internal state. After every call to rand(), the RNG transforms this state, and then uses it to output the next number in the stream.

We can be slightly more mathematical in this description. Let xi be the internal state of the RNG after the ith call to rand(), and suppose each call to rand() uses some function f to transform the internal state. Then, a sequence of n calls to the RNG transforms the internal state into xn as follows:

x0
x1 = f(x0)

x2 = f(x1) = f(f(x0))

x3 = f(x2) = f(f(f(x0)))

x4 = f(x3) = f(f(f(f(x0))))

...

xn = f(xn-1) = f(n)(x0)

Over the years, many people have put significant effort into designing good RNGs, and the functions used as f can get pretty complicated. I won't pretend I understand enough about them to explain all the details. Fortunately, understanding the exact function f is not actually important for our discussion. The important thing that the math tells us is that for a generic f, there is a serial chain of dependencies to compute xn from x0.

With that fact in mind, let's consider what happens when we use a stream RNG in a simple parallel loop. In the following program, each iteration of the parallel loop generates a random number and uses it to update a reducer sum.

#include <cstdio>
#include <cstdlib>
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>

int main(void) { 
    cilk::reducer_opadd<int> sum(0); 
    int n = 1000000; 
    cilk_for(int i = 0; i < n; ++i) { 
         sum += (i+1) * rand(); 
    } 
    std::printf("Final sum = %d\n", sum); 
    return 0; 
}

How does this program behave? One of two scenarios seems likely, depending on the specific implementation of rand().

  1. Suppose the rand() implementation uses a single stream RNG for all worker threads. In this case, multiple worker threads in Cilk are trying to manipulate the RNG's internal state concurrently to generate a random number. Consider two subcases.
    1. If the implementation of rand() is not properly synchronized, then there is a race condition, since multiple worker threads are trying to changes the RNG's internal state concurrently. According to the man page on my Linux desktop, the function rand() is not thread-safe, so technically the original rand_fib() function has a bug!
    2. If the implementation of rand() were thread-safe, e.g., the internal state is protected by a lock, then the program should run correctly. But in a parallel execution, the internal RNG state that iteration i sees for its call to rand() depends on the order that the iterations manage to acquire the lock, and this order is likely to vary between program runs. Thus, this approach does not guarantee deterministic output.
    Both subcases (a) and (b) have a potential performance bug. For (b), there is clearly contention on the lock. But for (a), even though we don't have a lock, the cache lines storing the RNG's internal state are still likely to be highly contended between all the worker threads, since all worker threads are sharing the same RNG. Thus, the cilk_for loop above is likely to exhibit in either no speedup or even slowdown when executed using more than one worker.
  2. Suppose the rand() implementation maintains a separate stream RNG for each worker thread, for example, in thread-local storage. This implementation avoids the performance bug of the previous scenario, since each thread has its own private RNG state. But it still has the problem that the random numbers that are generated are nondeterministic, since the runtime scheduler may execute a given iteration of the cilk_for loop on a different worker between different runs.

    To give a concrete example, in two consecutive runs of the same Cilk Plus program on 3 workers, the mapping of iterations to worker threads could be:
    Run 1:
    Iterations 0 to 499999 on worker 0
    Iterations 750000 to 999999 on worker 1
    Iterations 500000 to 749999 on worker 2

    Run 2:
    Iterations 0 to 249999 on worker 0
    Iterations 500000 to 749999 on worker 1
    Iterations 750000 to 999999, 250000 to 499999 on worker 2
    Run 1 takes 500000 numbers from worker 0's RNG, while Run 2 takes only 250000 numbers. Thus, the final sum is likely to be different between the two runs.

The nondeterminism in the second case comes from the combination of the dynamic load balancing with thread-local storage. In particular, the Cilk Plus runtime uses a randomized work-stealing scheduler, where one worker can steal work from another when it runs out of work to do. This scheme has the advantage of efficient scheduling even when the work of each loop iteration varies significantly. The tradeoff is, however, that with dynamic load balancing, the identity of the worker that executes a given iteration now varies from run to run in our sample program.

As an aside, this nondeterministic behavior is exactly why we should generally frown upon the use of thread-local storage in Cilk Plus. By default, thread-local storage in Cilk Plus is in some sense, "worker-local" storage, because the state gets associated with a given worker thread. The use of worker-local storage exposes the underlying nondeterminism in the runtime scheduler to the programmer writing a Cilk Plus application, something that the design of Cilk Plus specifically tries to avoid. But I digress, as this topic is best left for another day...

Returning to the story of DotMix, after some investigation of the simple parallel loop, we concluded that these two implementations of rand() left us with nondeterministic Cilk Plus programs, a situation that was less than ideal. Could we make these programs deterministic?


Strand-local RNGs


After thinking about the parallel loop example for a bit, we came up with a third, deterministic implementation for rand(), based on "strand-local" RNGs. In the previous code example, each iteration of the cilk_for loop represents a separate strand, a serial chain of instructions without any parallel control in it. Moreover, each strand has an obvious name that does not change between program runs  — the value of the loop index for the iteration. If we create a new RNG for each loop iteration, with the loop index being the RNG's seed, then each iteration uses an RNG whose identity is independent of the worker the Cilk Plus runtime used to execute that iteration. Thus, we get a deterministic parallel RNG for the parallel loop.

This notion could be generalized to an arbitrary Cilk Plus program by creating a new RNG for every strand. As long as you can find a unique and deterministic name for every strand to use as the initial seed for the RNG, then you get a deterministic RNG. One can think of this approach as using a "strand-local" RNG, since each strand has a separate RNG.

As we were investigating prior work on the topic of parallel random-number generation, we discovered that this notion of a strand-local RNG was conceptually equivalent to solutions that were already implemented and used in practice. For instance, SPRNG, a popular library for parallel random-number generation, provides an API for essentially spawning new RNGs. If we spawned a new RNG for every cilk_spawn statement in rand_fib, then we get a deterministic parallel RNG. Problem solved... right? :)

As you might guess, the answer was "not quite." Using strand-local RNGs leads to a number of issues in Cilk Plus, but the biggest one we encountered was overhead. Creating a new stream RNG can be an expensive operation, possibly much more expensive than the overhead of a cilk_spawn operation itself. For a program like rand_fib, creating a new RNG for each strand of execution is analogous to going on a vacation and buying a new car at my destination instead of renting one at the airport. While it might be nice to have my own car while I am traveling, it is not going to be worth the cost unless I am going to be on vacation for a really long time.

The situation with a typical stream RNG is analogous. A good stream RNGs is designed to generate a long stream of n random numbers, where n is usually measured in at least millions or billions. On the other hand, a typical Cilk Plus program like rand_fib might have millions of strands, but with each strand asking for at most ten or a hundred random numbers. Creating and seeding a new stream RNG for each strand can be quite expensive, and most of its capabilities are going to waste.

In parallel programs where users create and manage pthreads explicitly, creating a new stream RNG for each pthread is usually affordable, since creating a new pthread is relatively expensive. In contrast, strands in Cilk Plus are created at every cilk_spawn or cilk_sync, and the cost of creating a new RNG is often unacceptable. Using a new stream RNG for every strand works, but it does not really use the RNG in the way it was originally meant to be used.


Hashing for DPRNGs


After that long explanation, I've finally arrived at the approach we settled upon for implementing a DPRNG in Cilk Plus. We decided to implement the DotMix DPRNG based on two ideas: hashing and pedigrees.

Hashing enables DotMix to generate random numbers for strands in a Cilk Plus program. Consider a special case, where every strand only needs to generate a single random number. One can conceptually break a strand into multiple strands if more than one random number is required. Instead of creating a new stream RNG for each strand, one can instead generate a random number by applying a "good" hash function to the strand. At the extreme, one could imagine using a cryptographic hash function, like a variant of SHA or md5sum to generate a number that looks sufficiently random. Alternatively, using a simpler hash function might produce numbers that are "less random", but might be more efficient to compute. But what does it mean to hash a “strand”? That is where pedigrees come in.

Pedigrees provide the key to guaranteeing deterministic execution for DotMix. Suppose that the Cilk Plus runtime provided support for figuring out the pedigree of a currently executing strand — a unique name for that strand that remains the same between different runs of the program. Then, one can generate a random number by hashing the pedigree of strand. Conceptually, the idea of pedigrees seems fairly simple and intuitive. The hard part is, of course, figuring out exactly what the pedigrees should be... more on this topic in a bit.

Once we had a way to assign pedigrees to strands, hashing pedigrees gave us a way to implement an efficient DPRNG for Cilk Plus. Our DPRNG is called DotMix because it hashes a pedigree by computing a dot product of the pedigree with entries from a small table of pre-generated random numbers, and then mixes the result to output a random number. In our paper, we show that the hash function satisfies some nice mathematical properties, namely that the probability of getting collisions is reasonably small.

We also ran some tests on DotMix's hash function using the Dieharder test suite, and observed that DotMix seems to generate random numbers of reasonably high quality. If your application has stringent requirements on random-number quality, then DotMix may or may not be appropriate. But DotMix's hash function is intended to be more than good enough for codes where you simply need output that "looks" random.

If you don't want to take our word for it that DPRNG via hashing is a viable approach, I would encourage you to look at this recent paper on parallel random-number generation, which won the Best Paper Award at Supercomputing 2011. This work, which was done independently from our research at about the same time, discusses ways to generate random numbers by hashing counter values. This approach is conceptually equivalently to hashing a pedigree which is the loop index in a parallel loop. Their paper focused more on the question of picking good hash functions, as opposed to describing a way to maintain pedigrees in programs with complex parallelism structure.

There is more that one might say about hashing for a DPRNG, but I will refer the interested reader to the relevant papers for details. Instead, I'll spend the remainder of this post talking about pedigrees, since in my (obviously biased) opinion, pedigrees are the more exciting concept for Cilk Plus. :)



Pedigrees in Cilk Plus?


For programs whose parallelism has a nice, predictable structure, there are often obvious ways to conceptually assign pedigrees to strands. For example, in a simple cilk_for loop where each iteration requires a single random number, one can use the loop index as the pedigree. In a slightly more complicated case, if iteration i requires ni random numbers, the pedigree could be a pair of numbers  — the loop index i, plus the count of how random numbers we've taken from iteration i thus far. This latter case conceptually requires two terms in the pedigree instead of one, because in a parallel execution we need to be able to figure out the pedigree of each iteration i independently, without knowing exactly how many random numbers will be needed in other iterations.

For a Cilk Plus program with a more complicated parallelism structure (e.g., rand_fib), how can we define pedigrees? Recall the sample rand_fib program using DotMix that I showed last time (also repeated in the figure below). Notice that the code above makes no mention of pedigrees anywhere, which means that Cilk Plus and the DotMix library are secretly working to maintain and utilize pedigrees under the covers.

How does it all work? Since this post is long enough already, I will actually postpone that discussion until next time. Instead, I'll conclude with an open-ended puzzle.

The figure below shows a dag (directed acyclic graph) representing the calls to the DotMix DPRNG in an execution of rand_fib(4). Each letter conceptually represents a call to generate a random number. For example, A is the first call to get() in rand_fib(4) before the cilk_spawn of rand_fib(3), and K is the call to get() immediately after the cilk_spawn. Each call to get() needs to read in a unique but deterministic pedigree. How might we assign pedigree values to each letter?

int rand_fib(int n) {
   if (n < 2) {
      sum += global_rng.get() * n;
      return n;
   }
   else {
      int x, y;
      sum += global_rng.get() * n;
      x = cilk_spawn rand_fib(n-1);
      sum += global_rng.get();
      y = rand_fib(n-2);
      cilk_sync;
      sum += global_rng.get() * n;
      return x+y;
  }
}

In general, many answers are possible. Next time, I'll describe the answer we ended up with. Keep in mind, however, that a good scheme for assigning pedigrees should not introduce additional dependencies into the dag below. For example, the pedigree value for K, should not depend on how many calls to get() happen within rand_fib(3). For those of you who want the answers now, you can always check our paper or other documentation. :)

Summary


The DotMix deterministic parallel random-number generator (DPRNG) avoids the performance problems and nondeterminism that you can run into when using a traditional serial stream-based random-number generator in Cilk Plus. DotMix generates deterministic random numbers by hashing pedigrees, names for strands in the execution of a Cilk Plus program. Stay tuned for my next post, where I will talk more about pedigrees and how you can use them in Cilk Plus!

For more information about Intel Cilk Plus, see the website http://cilkplus.org. For questions and discussions about Intel Cilk Plus, see the forum http://software.intel.com/en-us/forums/intel-cilk-plus.

For more complete information about compiler optimizations, see our Optimization Notice.