Cilk Plus Performance Help


I've been using Cilk++ for about a year now as a graduate student doing research in parallelism. I was excited to hear that Cilk Plus is supposed to have improved resource management over Cilk++ so I've been porting my old code to see what performance benefit I can get. The results are not quite what I expected, so it'd be great if the experts could give me their opinion.

My benchmark is a tiled Cholesky factorization that heavily relies on BLAS software.

Initially I ran my benchmark on our Opteron 6100 server with four 12-core Opteron 6168 processors. The Cilk++ code is compiled with GCC 4.2.4 (Cilk Arts Build 8503). The Cilk Plus code is compiled with ICC version 12.0.4. Both versions were compiled with just the -O2 option and using the ACML BLAS.

Unfortunately, these performance numbers show the Cilk Plus code doing about 10-15% worse than the Cilk++ code across the board for different work/span sizes and numbers of processors. This was surprising. I know this benchmark is on an AMD processor, but I at least expected Cilk Plus to be no worse than Cilk++.

I don't have easy access to a server with Intel processors but when I try the benchmark on my laptop's Core-i5 540M I get different results. In this environment both the ICC and GCC versions are almost exactly equal.

I'm hoping that I'm just missing a compiler option or something. Will compiling with -O3 make that big of a difference over Cilk++ with -O2? Are there any official Intel benchmarks to show what kind of speedup I could hope to get?

Other than that, the biggest difference I can think of between Cilk Plus and Cilk++ is the ICC vs GCC compiler. As far as I know there's not an ICC version of Cilk++ or a GCC version of Cilk Plus to directly test the difference, is there? Does our AMD processor mean that we just aren't going to get as good performance from Cilk Plus?

For such a broad range of applications as you stated, there is no general answer as to the efficacy of CILK Plus. The answer is best covered by YMMV. It depends on the CPU/number of threads, cache, nature of CILK transformations, and whether there is significant need for synchronization between threads.

I have observed cases where using CILK PLUS has given

(i) speed increase over serial code of over a factor of 3 with 4 cores (Intel i7-2720QM),

(ii) speed increase of a factor 1.7 with a dual core Athlon X2, and

(iii) speed decrease of about 10 to 20 percent.

Thus, in general, you have a good chance of obtaining significant parallelism, but your algorithm must permit it and your CILK directives must be placed intelligently.

As you noted, you might want to try the -O3 optimization level on ICC. It does more aggressive optimization and may help. You might also want to try -ipo (-Qipo on Windows) to turn on interprocedural optimization. I'm not sure if -O3 turns that on.

You might also look into using the array notation (the "Plus" in "Cilk Plus" :o) which can help clue the compiler in on vectorization optimizations it can use.
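To make the array notation suggestion concrete, here is a minimal sketch of a SAXPY-style update. The function name `saxpy` is made up for illustration, and the sketch assumes the compiler defines `__cilk` when Cilk Plus is enabled; on any other compiler it falls back to the equivalent scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y = a * x + y over n elements (hypothetical example kernel).
void saxpy(std::size_t n, float a, const float* x, float* y) {
    assert(n == 0 || (x != nullptr && y != nullptr));
#if defined(__cilk)
    // Cilk Plus array notation: ptr[start:length] names a whole array section,
    // telling the compiler the element-wise operations are independent.
    y[0:n] = a * x[0:n] + y[0:n];
#else
    // Fallback for compilers without Cilk Plus: the equivalent scalar loop.
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
#endif
}
```

In a tiled Cholesky code that spends most of its time inside BLAS calls, this would probably only matter for whatever loops sit outside the BLAS routines.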

We are developing a set of benchmarks for internal use to compare the various threading packages and demonstrate their strengths (and weaknesses). I'm not sure what the policy is on releasing them or how ready they are. One of the things that has been shown is that the combination of Cilk's work stealing and the improved use of the vector instructions can result in amazing performance improvements.

- Barry

I'll look into those options and the vector instructions and let you know how it turns out.

Are there any publications that detail how Cilk Plus differs significantly from Cilk++ in terms of the runtime system? I'm working with a fairly complex system (Nabbit) built on top of Cilk++ so we might be interacting in unexpected ways.

I believe what you're looking for is details on the implementation of the runtime. No, that's not well documented.

The major change is that Cilk++ used a cactus stack for its frames, much like MIT Cilk does. Intel Cilk Plus uses multiple "standard" stacks. This means that Cilk Plus loosens some of the memory bounds that were promised by Cilk++. The additional memory use may be some of the cause of your performance degradation.

Using standard stacks allowed us to remove "Cilk linkage" functions, which means that you can now freely call from C/C++ code into Cilk code. Cilk++ didn't allow you to do that without creating a whole new Cilk context and starting up a new batch of workers. On the down side, this means that every spawning function (any function that contains a cilk_spawn statement) has code in its prolog to check whether it needs to start up the Cilk runtime.
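As a rough illustration of the point about linkage, the sketch below calls a spawning function straight from an ordinary C++ function with no context setup. The names `fib` and `fib_from_plain_cpp` are hypothetical, and the `__cilk` guard is an assumption so that the same file also builds serially on compilers without Cilk Plus.

```cpp
#include <cassert>

#if defined(__cilk)
#include <cilk/cilk.h>   // provides the cilk_spawn / cilk_sync keywords
#else
#define cilk_spawn       // serial fallback: the keywords simply vanish
#define cilk_sync
#endif

// A spawning function: because it contains cilk_spawn, under Cilk Plus its
// prolog carries the check for whether the runtime needs to be started.
long fib(int n) {
    assert(n >= 0);
    if (n < 2) return n;
    long a = cilk_spawn fib(n - 1);  // may run in parallel with the next line
    long b = fib(n - 2);
    cilk_sync;                       // wait for the spawned call before using a
    return a + b;
}

// An ordinary C++ function with no special linkage and no explicit context.
long fib_from_plain_cpp(int n) {
    return fib(n);
}
```

The fine-grained recursion here is only for illustration; a real code would stop spawning below some cutoff so that the per-spawn overhead stays small relative to the work.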

There are probably additional issues here and there, but those are the big ones I can think of off-hand.

- Barry

Hi dsf,

Glad to hear your excitement about Cilk Plus. Let me respond on a couple of points.

-O3 would be a good option to try if your performance hotspots are especially loopy. -O3 enables a lot of loop optimizations/transformations. There are many other potential performance options provided by the compiler to look at, but they depend a bit on your performance characteristics. For example, if your code is really branchy in hotspots, PGO would be something worthwhile to look into.

Regarding the performance of Intel Cilk Plus and the Intel C++ Compiler on AMD* platforms, we do test and support the compiler on AMD platforms, and if you are seeing less performance vs. a competing compiler like gcc, we consider that an issue and will work with you to resolve it.

So to really diagnose this issue at this point, the most useful thing would be to have a test case reproducible on our end that shows the issue. You can attach code here or create a new issue on our Premier Support website. It would also be useful to know precisely what compiler options you are using in your builds. I'll be happy to work with you on this if you can provide some more specific information along these lines. Just let me know.

Barry's comment about memory use sounds most reasonable off the top of my head. It sounds like the Cilk Plus scheme generates more communication during a steal, so maybe the expanded memory use isn't so much a factor as the expanded communication overhead? We've already demonstrated our computations are bandwidth limited under Cilk++ so that sounds like the most elegant explanation.

I realized the simplest way to test ICC vs GCC would be to just compare the serialization of both programs. I suppose I got so caught up in the parallelism aspect the obvious solution just flew over my head. The serializations of both versions perform almost exactly equal on the AMD server and ICC performed better on my Intel laptop, so that strongly indicates it's not a compiler issue.
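For anyone wanting to repeat the serialization comparison, here is a minimal sketch of one way to set it up. `SERIALIZE` is a hypothetical build macro invented for this example, and `tree_sum` and `time_tree_sum` are stand-ins for the real benchmark, not the actual Cholesky code. I believe the Cilk Plus header `cilk/cilk_stub.h` stubs out the keywords in much the same way, but treat that as something to check against your compiler's documentation.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Build with -DSERIALIZE (hypothetical macro) to time the serialization.
#if defined(__cilk) && !defined(SERIALIZE)
#include <cilk/cilk.h>
#else
#define cilk_spawn   // keywords removed: the program becomes its serialization
#define cilk_sync
#endif

// Divide-and-conquer sum, standing in for the real workload.
double tree_sum(const double* v, std::size_t n) {
    if (n <= 1024) {                 // serial base case
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += v[i];
        return s;
    }
    std::size_t half = n / 2;
    double left = cilk_spawn tree_sum(v, half);
    double right = tree_sum(v + half, n - half);
    cilk_sync;
    return left + right;
}

// Runs tree_sum once, stores the sum in *result, returns elapsed seconds.
double time_tree_sum(const std::vector<double>& v, double* result) {
    assert(result != nullptr);
    auto start = std::chrono::steady_clock::now();
    *result = tree_sum(v.data(), v.size());
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Timing the serialized build under each compiler isolates code generation from the runtime, while comparing the parallel build on one worker against the serialization gives a rough measure of spawn overhead.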

I'm going to see if I or my coworkers can devise a test for the issues raised above. If we get stuck in a rut I'll look into making the source available for testing.

Thanks for all your quick and expert replies!

