I've been using Cilk++ for about a year now as a graduate student doing research in parallelism. I was excited to hear that Cilk Plus is supposed to have improved resource management over Cilk++, so I've been porting my old code to see what performance benefit I can get. The results are not quite what I expected, so it'd be great if the experts could give me their opinions.
My benchmark is a tiled Cholesky factorization that heavily relies on BLAS software.
Initially I ran my benchmark on our Opteron 6100 server with four 12-core Opteron 6168 processors. The Cilk++ code is compiled with GCC 4.2.4 (Cilk Arts Build 8503). The Cilk Plus code is compiled with ICC version 12.0.4. Both versions were compiled with only the -O2 option and use the ACML BLAS.
Unfortunately, the performance numbers show the Cilk Plus code doing about 10-15% worse than the Cilk++ code across the board, for different work/span sizes and numbers of processors. This was surprising. I know this benchmark is on an AMD processor, but I at least expected Cilk Plus to be no worse than Cilk++.
I don't have easy access to a server with Intel processors, but when I try the benchmark on my laptop's Core i5-540M I get different results. In this environment the ICC and GCC versions perform almost identically.
I'm hoping that I'm just missing a compiler option or something. Will compiling with -O3 make that big of a difference over Cilk++ with -O2? Are there any official Intel benchmarks to show what kind of speedup I could hope to get?
Other than that, the biggest difference I can think of between Cilk Plus and Cilk++ is the ICC vs GCC compiler. As far as I know there's not an ICC version of Cilk++ or a GCC version of Cilk Plus to directly test the difference, is there? Does our AMD processor mean that we just aren't going to get as good performance from Cilk Plus?