I wrote the attached code and built it using MSDEV 2008.
My PC is a Core 2 Duo (E8400) running 32-bit Windows 7 Pro.
For some reason the for loop runs faster (0.049946 sec) than the cilk_for loop (0.067103 sec).
Can I be sure that both cores are executing the for loop?
The example involves going through data that is 240,000,000 bytes (10,000,000 doubles * 8 bytes/double * 3 arrays). That's much larger than the outer-level cache. The benchmark has a high ratio of memory accesses to flops (three memory accesses for each floating-point operation), so it is really measuring how fast the memory system can feed the processors. A single core is likely capable of using the full memory bandwidth for this benchmark. The Cilk code may be slower because the Cilk run-time takes some time to get started the first time Cilk is invoked. (After that, the Cilk threads are parked so that they can be woken up instead of created from scratch.) One way to see whether the initial startup is part of the issue is to repeat the two benchmark loops several times and see if the Cilk times improve the second time around.
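A minimal sketch of that repeat-and-compare test follows. The attached code is not shown in this thread, so the kernel a[i] = b[i] + c[i] and the names serial_add, cilk_add, working_set_bytes, time_repeats and run_benchmark are assumptions made for illustration. The cilk_for version is compiled only when the compiler defines __cilk, so the sketch still builds without Cilk support.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

#if defined(__cilk)
#include <cilk/cilk.h>
#endif

// Assumed kernel: one addition per element and three memory accesses
// (two loads, one store), matching the ratio described above.
void serial_add(std::vector<double>& a, const std::vector<double>& b,
                const std::vector<double>& c) {
    for (std::size_t i = 0; i < a.size(); ++i) a[i] = b[i] + c[i];
}

#if defined(__cilk)
// Same kernel with cilk_for; only built when Cilk is available.
void cilk_add(std::vector<double>& a, const std::vector<double>& b,
              const std::vector<double>& c) {
    cilk_for (std::size_t i = 0; i < a.size(); ++i) a[i] = b[i] + c[i];
}
#endif

// Bytes touched by three arrays of n doubles.
std::size_t working_set_bytes(std::size_t n) {
    return n * sizeof(double) * 3;
}

// Run the kernel `repeats` times and return the wall-clock seconds of
// each run, so the first (cold) run can be compared with later ones.
template <typename Kernel>
std::vector<double> time_repeats(Kernel kernel, int repeats) {
    assert(repeats > 0);
    std::vector<double> seconds;
    for (int r = 0; r < repeats; ++r) {
        std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
        kernel();
        std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    return seconds;
}

// Print per-repeat times for the serial loop and, if available, cilk_for.
void run_benchmark(std::size_t n, int repeats) {
    std::vector<double> a(n), b(n, 1.0), c(n, 2.0);
    std::cout << "working set: " << working_set_bytes(n) << " bytes\n";

    std::vector<double> s = time_repeats([&] { serial_add(a, b, c); }, repeats);
    for (std::size_t r = 0; r < s.size(); ++r)
        std::cout << "for      run " << r << ": " << s[r] << " sec\n";

#if defined(__cilk)
    std::vector<double> p = time_repeats([&] { cilk_add(a, b, c); }, repeats);
    for (std::size_t r = 0; r < p.size(); ++r)
        std::cout << "cilk_for run " << r << ": " << p[r] << " sec\n";
#endif
}
```

Calling run_benchmark(10000000, 5) from main reproduces the 240,000,000-byte case; a much smaller n that fits in cache makes the comparison less dependent on memory bandwidth. If only the first cilk_for run is slow, run-time startup is the likely cause.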
Dear Mr. Robison,
You are right!
On the second iteration, with 1000 elements (smaller than the outer cache), cilk_for was faster.