A tour of Haskell CnC schedulers

This is Part 2 in a series of posts on Haskell CnC (Part 1 here).

As you can see in the API documentation, Haskell CnC exposes several different versions of the Intel.CnC module, including Intel.Cnc3, Intel.Cnc5, Intel.Cnc6, and Intel.Cnc8.  Each of these contains a different runtime scheduler.  (Perhaps they deserve names, but for now they get numbers.)  All implementations expose the same API with the same documentation, but if you click on the "Source" links you can see how the source differs.

From the perspective of the CnC team, part of the purpose of Haskell CnC is to provide a platform for rapid experimentation on scheduling policies and programming model extensions.  This initial spate of schedulers is part of that (with more to come).  Below we will dive into the details of these schedulers, and the conversation will become rather GHC-specific.

Currently, there are three major kinds of scheduler for Haskell CnC.  Here I will begin to describe the strategies used by these schedulers, and will hopefully shed some light on the criteria that should be used to select one.   Future posts will dive into much greater detail on the performance of particular benchmarks.

  • Scheduler 3 is based on Haskell IO threads.

  • Schedulers 5 & 6 use a global task pool.

  • Scheduler 8 uses Cilk style nested-parallelism (based on pure parallelism -- that is, pure functions with par annotations).

IO Threads: Haskell has very lightweight user threads.  (For example, for a long time Haskell won the "language shootout" Threadring benchmark.)  In Scheduler 3 we map each CnC step onto its own IO thread (e.g. one forkIO per step).  Therefore Scheduler 3 is simple and predictable.  But, alas, it suffers on programs with finer-grained steps.  Haskell threads, while lightweight, are still preemptable (and require per-thread stack space).  Thus they are overkill for CnC steps, which need not be preempted.
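The one-thread-per-step idea can be sketched in a few lines.  This is an illustrative toy, not the real Haskell CnC internals: `forkStep`, `demo`, and the use of a single MVar as a stand-in for an item collection are all assumptions made for the example.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar

-- Fork one IO thread per step, Scheduler-3 style.
forkStep :: IO () -> IO ()
forkStep step = do _ <- forkIO step; return ()

-- A tiny two-step "graph": a producer step fills an item cell (here just
-- an MVar acting as a single-assignment cell) and a consumer step reads it.
demo :: IO Int
demo = do
  cell   <- newEmptyMVar
  result <- newEmptyMVar
  forkStep (putMVar cell 42)                                 -- producer step
  forkStep (takeMVar cell >>= \x -> putMVar result (x + 1))  -- consumer step
  takeMVar result       -- blocks until the consumer step has run
```

Note how the consumer's `takeMVar` blocks the whole (lightweight) thread until the item arrives; this is exactly the per-step blocking that makes Scheduler 3 simple but costly for fine-grained steps.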

Note that Haskell applications can contain both imperative parallelism (IO threads) and pure parallelism (par annotations).  A given system thread may be executing either type of computation.  Scheduler 3 puts pressure on the IO thread system and MVar synchronization.  Perhaps, if the larger application's own workload uses IO threads rather than pure parallelism, Scheduler 3 would be a better fit than, say, Scheduler 8, which relies on par for parallelism.  This remains an untested hypothesis at the moment.

Global task pool: These implementations use a global stack or queue of steps, with all worker threads feeding from that pool.  The number of worker threads is roughly equal to the number of processors.
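A work-sharing pool of this shape can be sketched with a shared `Chan` and a fixed set of workers.  The names (`Task`, `worker`, `runPool`) and the `Done`-sentinel shutdown protocol are assumptions for the example, not the actual Scheduler 5/6 code:

```haskell
import Control.Concurrent
import Control.Monad (replicateM, replicateM_)

-- A task is either a step to run or a shutdown sentinel.
data Task = Task (IO ()) | Done

-- Each worker loops, pulling steps from the shared pool until told to stop.
worker :: Chan Task -> MVar () -> IO ()
worker pool fin = loop
  where loop = do
          t <- readChan pool
          case t of
            Done   -> putMVar fin ()   -- signal that this worker finished
            Task a -> a >> loop

-- Run all tasks on nWorkers threads feeding from one global queue.
runPool :: Int -> [IO ()] -> IO ()
runPool nWorkers tasks = do
  pool <- newChan
  fins <- replicateM nWorkers newEmptyMVar
  mapM_ (\f -> forkIO (worker pool f)) fins
  mapM_ (writeChan pool . Task) tasks
  replicateM_ nWorkers (writeChan pool Done)  -- one sentinel per worker
  mapM_ takeMVar fins                         -- wait for all workers
```

In a real scheduler the pool would also have to cope with steps that block on a `get`, but the basic feed-from-one-queue structure is the same.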

Cilk style: CnC steps are pure computations in spite of being represented by IO actions in these schedulers.  In this version we imitate a Cilk spawn on steps using unsafePerformIO and par, allowing us to put steps into the "spark pool" for execution by GHC's parallel runtime.  But because sparks may be dropped, we also tuck sparked steps away in a hidden list (added to the monad through a State monad transformer, of course).  That list is flushed at the end of every step (a sync in Cilk terminology).
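The spawn/sync trick can be sketched as follows.  For simplicity this version threads the hidden list explicitly rather than through a State monad transformer, and `spawnStep`/`syncSteps` are illustrative names, not the real internals.  It relies on the assumption stated above: steps are pure, so forcing an already-run thunk at the sync is harmless.

```haskell
import GHC.Conc (par, pseq)
import System.IO.Unsafe (unsafePerformIO)

-- The hidden list: thunks for sparked steps, kept in case GHC drops the spark.
type SparkLog = [()]

-- "Spawn": turn the IO step into a pure thunk, spark it with par, and also
-- tuck the same thunk into the log.  Because the spark and the log share one
-- thunk, the step runs at most once.
spawnStep :: IO () -> SparkLog -> SparkLog
spawnStep step sparked = thunk `par` (thunk : sparked)
  where thunk = unsafePerformIO step

-- "Sync" (end of a step): force every remembered thunk, executing any
-- sparks that were dropped rather than stolen.
syncSteps :: SparkLog -> IO ()
syncSteps sparked = foldr pseq (return ()) sparked
```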

Using this infrastructure, the scheduler begins a mostly depth-first traversal of a CnC graph by simply executing the initialize step.  When that initialize step uses putt to produce tags, downstream steps are exposed for parallel execution via work stealing.  These downstream steps become the children of the initialize step, and it becomes their parent.  If no steals occur, the graph is traversed in a depth-first order.  A get operation on an unavailable item terminates the current step and registers it in a wait list before "returning" to the parent.  An item put, therefore, spawns as children any steps waiting on that item.
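The wait-list mechanism for items can be sketched with an `IORef` cell that is either full or holds the continuations of blocked steps.  The names (`Cell`, `getItem`, `putItem`) and the continuation-passing shape are assumptions for illustration; the real item collections are concurrent hash maps of such cells.

```haskell
import Data.IORef

-- An item cell: either a value, or the wait list of steps blocked on it.
data Cell a = Full a | Waiting [a -> IO ()]

newCell :: IO (IORef (Cell a))
newCell = newIORef (Waiting [])

-- A get on an empty cell registers the rest of the step (its continuation)
-- and "returns to the parent"; on a full cell it continues immediately.
getItem :: IORef (Cell a) -> (a -> IO ()) -> IO ()
getItem r k = do
  c <- readIORef r
  case c of
    Full v     -> k v
    Waiting ks -> writeIORef r (Waiting (k : ks))

-- A put fills the cell and spawns every waiting step as a child.
putItem :: IORef (Cell a) -> a -> IO ()
putItem r v = do
  c <- readIORef r
  case c of
    Waiting ks -> do writeIORef r (Full v)
                     mapM_ ($ v) ks
    Full _     -> error "single-assignment violated"
</imports-placeholder>
```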

Note that because of its depth-first nature, Scheduler 8 will not work for CnC graphs with cycles: it will overflow the stack.

Scheduler 8 is the most complicated of these schedulers and was intended to be the most efficient.  But as of this writing it has unsolved problems with parallel performance.

Schedulers 5 and 6, on the other hand, have performed remarkably well so far, in spite of the conventional wisdom that such a work-sharing approach should be less efficient and scalable than work stealing.  Therefore Scheduler 6 (an improvement on Scheduler 5) is the default and should be used in most situations for the time being.