How to cause cilk_for to use all cores?

I ran a (very) simple cilk_for loop on a Core i5-2400 CPU under Windows XP 32-bit.

The code is attached. It was compiled and built with the latest Intel compiler using MSDEV 2010.

The cilk_for loop runs only a little faster than the same loop implemented with intrinsic C.

But my CPU has 4 cores.

I expected the Cilk code to run 4 times faster.

How can I cause all cores to participate in the calculation?



Attachments: cilk1.h (117 bytes), cilk1.cpp (1.03 KB)

By default the Intel Cilk Plus runtime will query the OS for the number of cores and use all of them.  You can override this using the Cilk Plus API, or by setting the CILK_NWORKERS environment variable.
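To make the precedence described above concrete, here is a minimal sketch in portable C++. It is not the runtime's own code: resolve_worker_count is a hypothetical helper, and the way it treats a missing or malformed CILK_NWORKERS value (fall back to the detected core count) is an assumption made for illustration.

```cpp
#include <cstdlib>
#include <thread>

// Hypothetical helper: models "use CILK_NWORKERS if set, otherwise ask the OS".
// It does not call the Cilk Plus runtime.
unsigned resolve_worker_count() {
    if (const char* text = std::getenv("CILK_NWORKERS")) {
        char* end = nullptr;
        const long requested = std::strtol(text, &end, 10);
        // Assumption: only a plain positive integer counts as an override.
        if (end != text && *end == '\0' && requested > 0)
            return static_cast<unsigned>(requested);
    }
    // hardware_concurrency() may return 0 when the count is unknown.
    const unsigned detected = std::thread::hardware_concurrency();
    return detected > 0 ? detected : 1u;
}
```

Running the program with CILK_NWORKERS set to 1, 2 and 4 in turn and comparing the timings is a quick way to see how the loop scales with the number of workers.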

On Windows* you can see the cores being used by bringing up Task Manager (right click in the task bar and select "Start Task Manager")  and click the "Performance" tab.  You'll see a graph for each of your CPUs in the block "CPU Usage History".  You should see a spike on each of the cores when your program runs.

The question is whether your application is actually doing enough work to engage all of the cores.  Intel Cilk Plus distributes work using a technique called "work stealing".  Each of the idle cores will randomly pick another core and try to steal work from it.  If it fails to steal work, then it pauses briefly and then tries again. If the work gets done faster than it can be stolen, then the other cores won't have an opportunity to contribute. As Arch pointed out earlier, your application will be memory bandwidth limited.

    - Barry

The code as you've written it should use all of the available cores.  Here are a few things to consider:

1. Is your machine relatively "quiet" when you run this test?  If other processes are making significant use of the CPU, then you will not get linear speedup.

2. What kind of numbers are you seeing?  My concern is that the loop is doing so little work (even at 10 million iterations) that you are running into precision problems with the timers.  

3. Another possibility (related to 2) is that the work per iteration is too small and that parallel overhead is overpowering the benefits.  I recommend that you run cilkview on the program and look at the "burdened span" and expected speedup numbers.

4. Also related to the lack of work: you have only one parallel section in the program.  The timing you are getting includes the start-up cost for spinning up the Cilk worker threads.  On Windows, this cost can be substantial.  To exclude the startup from your timing, you can try calling __cilkrts_init() explicitly before starting the timer (you need to #include <cilk/cilk_api.h> to use __cilkrts_init() ).
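Points 2 and 4 can be combined into a small timing harness: run the loop once untimed so that one-time startup cost is paid up front, then time several repetitions and keep the best. The sketch below uses portable C++ and std::chrono rather than the Windows timer from the attached code; best_seconds_after_warmup is a hypothetical helper, and the loop to be measured is passed in as a callable.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Hypothetical helper: calls body() once untimed, then returns the shortest
// of `reps` timed calls, in seconds.
template <typename Body>
double best_seconds_after_warmup(Body&& body, int reps) {
    using clock = std::chrono::steady_clock;
    body();  // warm-up: absorbs one-time costs such as worker start-up
    double best = std::numeric_limits<double>::infinity();
    for (int rep = 0; rep < reps; ++rep) {
        const auto start = clock::now();
        body();
        const auto stop = clock::now();
        best = std::min(best, std::chrono::duration<double>(stop - start).count());
    }
    return best;
}
```

With Cilk Plus, the callable would be a lambda wrapping the cilk_for loop. Taking the minimum over several repetitions also reduces the effect of timer resolution and background activity on a very short loop.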

Let me know if any of that helps,


The problem is likely the same as for -- memory bandwidth, not the arithmetic units, is the limiting resource.

Hi All,

I checked with Task Manager. There is a spike on all cores.
When each iteration does 3 operations (and not 1 as in my code), the Cilk code runs ~40% faster than the regular code.

I also tried calling __cilkrts_init(). With it, the first iteration runs much faster (about as fast as the following ones).

Your help is highly appreciated.

Best regards,


It might be worthwhile to learn about cache-oblivious and cache-aware approaches to dealing with the flops/memory issue. would be a good start on cache-oblivious algorithms. describes a common cache-aware approach.

The first time you issue a cilk_for (or any other cilk_... construct that instantiates the thread pool) you will encounter additional overhead. Try repeating the timed loop a few times and compare the first time with the later ones:

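for (iRep = 0; iRep < 5; ++iRep)
{
  QueryPerformanceCounter(&startTick);   // assumes the LARGE_INTEGER timer variables from the original code
  cilk_for (int i = 0; i < N_ELEMENTS; i++)
    R[i] = CalcOneElement(I[i], Q[i]);
  QueryPerformanceCounter(&stopTick);
  totalTime = (double)(stopTick.QuadPart - startTick.QuadPart)/(double)ticksPerSecond.QuadPart;
  printf("cilk time=%f\n", totalTime);   // the first repetition includes thread-pool startup
}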

Jim Dempsey
