My program uses cilk_for in the main loop and a parallel reducer in the inner for loop.
Cilk really is magic to me :). I get a 2x speedup over OpenMP with any number of worker threads, including 1 (I set 1 thread as described in http://software.intel.com/en-us/forums/showthread.php?t=83541). How is this possible? The CPU graphs in Task Manager also look very strange to me. With 2 worker threads, at the beginning only the first core is working hard, and only after some time does the second core start working as well. The period before the second core starts working is quite long. With OpenMP I get full load on both cores from beginning to end, but the execution time is twice as long. I don't understand where Intel Cilk gets this speedup from with 1(!) or 2 workers, especially with 1. How can this be explained?