As per the agreement with Intel, I am reporting my experiences with the Intel Manycore Testing Lab (Linux). This was my first time in the lab, and I wanted to test GHC's SMP parallelism features.
The first challenge was to actually get GHC to work in the lab. There was a working version of GHC at /opt/ghc6.13/bin/ghc, but I really needed GHC 7. So first I built GHC 7.0.2-rc2, which worked without much trouble.
The next step was to get all the necessary libraries in place. Since the lab has no direct internet access, cabal-install wouldn't be of much use. Instead, I downloaded a snapshot of Hackage with the latest version of every package and manually installed the packages I needed. A bit boring, but doable.
Finally I was ready to compile my programs and test. First thing I tried was an existing algorithm I had which, at some point, takes a list of about 500 trees and, for each tree, computes a measure which is expressed as a floating point number. This is basically a map over a list transforming each tree into a float. Each operation is independent of the others, and all require the same input, so it seems ideal for parallelisation. A quick benchmark revealed the following running times:
(Note the non-linear number of cores at the end of the x-axis.) Apparently there are performance gains with up to 6 cores; adding more cores after this makes the total running time worse.
While this might sound bad, do note that parallelising this algorithm required only a one-line change: at the point where the list of floats `l` is generated, it is replaced with ``l `using` parList rdeepseq``. This change, together with recompilation using -threaded (and running the program with the `+RTS -N` option to select the number of cores), is all that is necessary to parallelise this program.
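To make the one-line change concrete, here is a minimal sketch. The real algorithm is not shown in this post, so `Tree`, `measure`, `mkTree` and `sampleTrees` are hypothetical stand-ins; only `using`, `parList` and `rdeepseq` (from Control.Parallel.Strategies in the parallel package) correspond to what was actually used.

```haskell
-- Sketch only: Tree, measure, mkTree and sampleTrees are hypothetical
-- stand-ins for the real algorithm, which this post does not show.
-- Needs the "parallel" package (Control.Parallel.Strategies).
-- Build: ghc -O2 -threaded -rtsopts Sketch.hs
-- Run:   ./Sketch +RTS -N4
import Control.Parallel.Strategies (parList, rdeepseq, using)

data Tree = Leaf | Bin Int Tree Tree

-- Hypothetical per-tree measure yielding a floating point number.
measure :: Tree -> Double
measure Leaf        = 0
measure (Bin x l r) = fromIntegral x + 0.5 * (measure l + measure r)

-- Sequential version: a plain map over the list of trees.
measuresSeq :: [Tree] -> [Double]
measuresSeq ts = map measure ts

-- Parallel version: the same list, but each element is fully
-- evaluated in its own spark.
measuresPar :: [Tree] -> [Double]
measuresPar ts = map measure ts `using` parList rdeepseq

-- Hypothetical input: complete trees of a given depth.
mkTree :: Int -> Tree
mkTree 0 = Leaf
mkTree d = Bin d (mkTree (d - 1)) (mkTree (d - 1))

-- About 500 trees, mirroring the size of the list mentioned above.
sampleTrees :: [Tree]
sampleTrees = [mkTree (10 + i `mod` 5) | i <- [1 .. 500]]

main :: IO ()
main = print (sum (measuresPar sampleTrees))
```

The strategy only changes how the list is evaluated, not its value, so `measuresPar` and `measuresSeq` return the same list.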
Later I performed a more accurate benchmark, this time using the equality function (take two elements and compare them for equality). The first step was to parallelise the equality function, which, again, is a very simple task:
import Control.Parallel (par, pseq)

-- Tree datatype
data Tree a = Leaf | Bin a (Tree a) (Tree a)

-- Parallel equality: spark the comparison of the left subtrees,
-- compare the right subtrees, then combine the two results
eqTreePar :: Tree Int -> Tree Int -> Bool
eqTreePar Leaf Leaf = True
eqTreePar (Bin x1 l1 r1) (Bin x2 l2 r2) = x1 == x2 && par l (pseq r (l && r))
  where l = eqTreePar l1 l2
        r = eqTreePar r1 r2
eqTreePar _ _ = False
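To try the function out, a small harness along the following lines could be used. This is only a sketch: `mkTree`, the sequential baseline `eqTreeSeq` and the tree depth are hypothetical choices, not the benchmark behind the measurements reported here.

```haskell
-- Sketch only: mkTree, eqTreeSeq and the depth 16 are hypothetical choices,
-- not the benchmark used for the measurements in this post.
-- Needs the "parallel" package (Control.Parallel).
-- Build: ghc -O2 -threaded -rtsopts EqBench.hs
-- Run:   ./EqBench +RTS -N2
import Control.Parallel (par, pseq)

data Tree a = Leaf | Bin a (Tree a) (Tree a)

-- Sequential baseline to compare against.
eqTreeSeq :: Tree Int -> Tree Int -> Bool
eqTreeSeq Leaf Leaf = True
eqTreeSeq (Bin x1 l1 r1) (Bin x2 l2 r2) =
  x1 == x2 && eqTreeSeq l1 l2 && eqTreeSeq r1 r2
eqTreeSeq _ _ = False

-- Parallel equality, as defined in the post.
eqTreePar :: Tree Int -> Tree Int -> Bool
eqTreePar Leaf Leaf = True
eqTreePar (Bin x1 l1 r1) (Bin x2 l2 r2) = x1 == x2 && par l (pseq r (l && r))
  where l = eqTreePar l1 l2
        r = eqTreePar r1 r2
eqTreePar _ _ = False

-- Hypothetical input: a complete tree whose nodes store their depth.
mkTree :: Int -> Tree Int
mkTree 0 = Leaf
mkTree d = Bin d (mkTree (d - 1)) (mkTree (d - 1))

main :: IO ()
main = do
  let t1 = mkTree 16
      t2 = mkTree 16
  print (eqTreePar t1 t2)           -- equal trees: True
  print (eqTreePar t1 (mkTree 15))  -- roots differ: False
```

Comparing `eqTreePar` against `eqTreeSeq` on the same inputs is a cheap sanity check before timing anything.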
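`par` and `pseq` are the two primitives for parallelisation in GHC: `par` sparks its first argument for possible evaluation in parallel, and `pseq` evaluates its first argument before returning its second. The performance graph follows: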
(This time I ran the benchmark several times; the error bars on the graph show the standard deviation.) Again performance improves with up to 6 cores and degrades after that. What I find really nice is the improvement with two cores, which is almost a 50% decrease in running time. Relative to the running time with 1 core, the ratios for 2, 3, and 4 cores are 0.52, 0.39, and 0.35, respectively. This is really good for such a simple change in the source code, and most people only have up to 4 cores anyway. In any case, the results of this (very preliminary) experiment seem to indicate that GHC's SMP parallelism is not (yet) particularly optimized for a high number of cores.
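The ratios above can also be read as speedups and parallel efficiencies. The sketch below does that arithmetic; the names `speedup` and `efficiency` are mine, and the only data used are the three ratios quoted above. The speedups come out at roughly 1.9, 2.6 and 2.9, and the efficiencies (speedup divided by number of cores) at roughly 0.96, 0.85 and 0.71.

```haskell
-- Running-time ratios relative to 1 core, as quoted in the post:
-- (number of cores, ratio).
ratios :: [(Int, Double)]
ratios = [(2, 0.52), (3, 0.39), (4, 0.35)]

-- Speedup is the inverse of the running-time ratio.
speedup :: Double -> Double
speedup ratio = 1 / ratio

-- Parallel efficiency: speedup divided by the number of cores used.
efficiency :: Int -> Double -> Double
efficiency cores ratio = speedup ratio / fromIntegral cores

main :: IO ()
main = mapM_ report ratios
  where
    report (n, r) =
      putStrLn (show n ++ " cores: speedup " ++ show (speedup r)
                ++ ", efficiency " ++ show (efficiency n r))
```

Seen this way, each additional core contributes less than the one before, even within the range where the total running time still improves.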
I'm planning to explore this line of research further, and I'm hoping to be able to conduct more experiments in the near future. Feel free to contact me if you want more information on what I've done.