Parallel performance in Intel Concurrent Collections for Haskell: an in-depth example.

This is Part 4 in my ongoing series of posts on Haskell CnC (Part 1 here, Part 2 here, Part 3 here).

The mandelbrot program, found in examples/mandel.hs within the source package, generates fractals as seen in the images here.  It is a simple benchmark that illustrates some of the issues that can affect performance in Haskell CnC.

Looking at the program, we see the following structure (abridged):








-- Serial function that does the math:
mandel :: Int -> Complex Double -> Int
mandel max_depth c = -- ...omitted...

-- Function to build the graph.
-- (CDbl abbreviates Complex Double.)
mandelGrph max_row max_col max_depth = do
    position :: TagCol  (Int,Int)      <- newTagCol
    dat      :: ItemCol (Int,Int) CDbl <- newItemCol
    pixel    :: ItemCol (Int,Int) Int  <- newItemCol

    -- A step in the CnC graph.
    -- This step is a wrapper around "mandel".
    -- It gets its input, computes, and puts its output.
    let mandelStep tag = do
          cplx <- get dat tag
          put pixel tag (mandel max_depth cplx)

    -- Wire the step into the graph:
    prescribe position mandelStep

    -- Loop to put out tags that invoke the steps:
    initialize $
      forM_ [0..max_row] $ \i ->
        forM_ [0..max_col] $ \j ->
          let z = (4.0 / fromIntegral max_row * fromIntegral j - 2.0) :+
                  (4.0 / fromIntegral max_col * fromIntegral i - 2.0)
          in do put dat (i,j) z
                putt position (i,j)

    -- Loop to read back all results:
    finalize $
      foldM (\acc i ->
              foldM (\acc j ->
                      do p <- get pixel (i,j)
                         -- ...do whatever you like with the pixels...
                    )
                    acc [0..max_col])
            0 [0..max_row]





This is typical of simple CnC programs.  The graph has only one step node, so any parallelism is data parallelism across the data elements processed by that node.

Alright, if we run mandelGrph with inputs 300 300 4000, what kind of performance do we get?  Well, that run spawns 90,000 step instances, and under GHC 6.12.1 the parallel scaling is a mess.  Below is a graph relating the number of threads to the speedup over the serial version.  I use the convention that the "0 thread" setting represents the serial program compiled without threading support in the runtime at all.
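(As a quick aside, here is roughly what a driver for such a run looks like.  This is only a sketch: the real mandel.hs handles its arguments and output a bit differently, and I'm assuming the runGraph entry point and the Intel.Cnc module name from the Haskell CnC API.)

    import System.Environment (getArgs)
    import Intel.Cnc  -- module name / scheduler selection may differ in your build

    -- Hypothetical driver sketch: parse "rows cols depth" from the command
    -- line, execute the CnC graph, and print whatever finalize returned.
    main :: IO ()
    main = do
      [rows, cols, depth] <- fmap (map read) getArgs
      print (runGraph (mandelGrph rows cols depth))

The number of worker threads is not baked into the program; it is chosen at run time with GHC's +RTS -N flag, after compiling with -threaded.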



This was run on a single processor, a Core i7-980X (3.33 GHz) with four cores.  For the data point at "8" threads we're using hyper-threading -- yes, that went especially badly.

We won't get into all the details here, but suffice it to say that GHC 6.12 had problems with how "blackholes" were handled, and those problems have since been fixed.  If you do the legwork to build the latest GHC from source (I'll walk through this in a future post), things look much sunnier:



Doing better, but still less than a 3X speedup on a four-core machine.  Scheduler 3 forks a lightweight thread for each step instance, so it gets swamped by the sheer number of tasks here.  Scheduler 8 has problems that remain a mystery (forget about it for now).

OK, now let's optimize.  In this post we try two things: first, we change the data type used for the tags (keys) of the item collections; second, we change the way we spawn work, using cncFor as described in the previous post.

(1) Change the keys: The keys in the mandel program were originally (Int,Int).  If we change them to (Int16,Int16), each key fits into a single machine word, which allows a more efficient data structure to be swapped in automagically under the hood.  (Thank you, indexed type families.)
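Concretely, only the collection declarations (and, correspondingly, the loop bounds) need to change; fromIntegral already handles the conversions in the coordinate math.  A sketch of just the affected lines, with Int16 coming from Data.Int:

    import Data.Int (Int16)

    -- Inside mandelGrph, only the key types change; the rest of the graph
    -- code stays the same.  A pair of Int16s packs into one word, so the
    -- library can pick a faster map representation behind the scenes.
    mandelGrph max_row max_col max_depth = do
        position :: TagCol  (Int16,Int16)      <- newTagCol
        dat      :: ItemCol (Int16,Int16) CDbl <- newItemCol
        pixel    :: ItemCol (Int16,Int16) Int  <- newItemCol
        -- ... the rest is exactly as before ...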



Not bad!  Now we're pushing close to a 3.5X speedup.

(2) Employ cncFor. Here we simply replace the forM_ loops in the original with a single cncFor2D.
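Take the following as a sketch rather than the exact code from the optimized example: I'm assuming here that cncFor2D takes the inclusive bounds of the 2D iteration space plus a body receiving the (i,j) pair, so check the actual signature in the source if you try this.

    -- Sketch: the nested forM_ loops from the initialize step, replaced by
    -- a single cncFor2D over the whole (row, column) iteration space.  The
    -- body is unchanged; the idea, as in the previous post, is that cncFor
    -- lets the library decide how that work gets spawned and chunked.
    initialize $
      cncFor2D (0, 0) (max_row, max_col) $ \(i, j) ->
        let z = (4.0 / fromIntegral max_row * fromIntegral j - 2.0) :+
                (4.0 / fromIntegral max_col * fromIntegral i - 2.0)
        in do put dat (i, j) z
              putt position (i, j)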



Now we're talking!  Schedulers 5 and 6 chug along steadily and get close to a 3.75X speedup.  This is why Scheduler 6 is currently the default!

By the way, you can make scaling graphs like this yourself using the "run_all_examples.sh" and "scaling.hs" scripts in the source distribution. Just change the for loop in that bash script to substitute your program's name and its arguments.