Intel® Cilk™ Plus

cilk_spawn inside cilk_for


For some reason, whenever I have a spawn and a sync inside a cilk_for, the spawn does not seem to be recognized. I end up with the compile-time error "Expected _Cilk_spawn before _Cilk_sync". As an example, consider the following (overly simple) program:

#include <cilk/cilk.h>
#include <iostream>
using namespace std;

void foo(){
    cout << "foo";
}

void bar(){
    cout << "bar";
}

void baz(){
    cout << "baz";
}

int main(){
    cilk_for(int i=0; i<10; i++){
        cilk_spawn foo();


Cilk™ Plus Trademark License for product distribution

Dear all,

I need help clarifying the Cilk Plus license for my customer.

I bought a license for Intel Parallel Studio XE 2013 (which includes Cilk Plus) to develop a product for my customer.

Now my customer wants to distribute the product to the market.

Does my customer need to buy a license as well, given that I already bought one? Can you give me some evidence that I can use to negotiate with my customer?


Tam Nguyen

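Using Cilk™ Plus Reducers to Resolve Data Races and Ordered-Computation Problems in Parallel Programs

    Parallelizing a program with Cilk™ Plus is much easier than creating and managing threads with a traditional Pthreads library. In most cases the keywords cilk_sync and cilk_for make it easy to rewrite serial code as parallel code, although in some complex parallel cases cilk_sync and cilk_for alone cannot resolve the data races and thread-coordination problems inherent in the program.

    Note that Cilk™ Plus is more than a set of keywords: the Cilk™ Plus library also includes features for dealing with races, locks, and thread coordination in parallel programs. This article covers the Cilk reducer, which helps resolve the data races and ordered-computation problems that commonly arise in accumulation-style algorithms.

1.    Accumulation-style algorithms typically update a single variable repeatedly by adding to it, as in the following code: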
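
As a rough illustration of the idea behind a reducer, here is a minimal sketch in standard C++. It does not use the Cilk™ Plus reducer API, and the names `accumulate_serial` and `accumulate_with_views` are hypothetical. Each chunk of the loop appends to its own private view, and the views are then merged from left to right, so the result matches the serial order even for a non-commutative operation such as string concatenation. The chunks run one after another here to keep the sketch portable; each one is independent and could be handed to a different worker.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Serial accumulation: one shared variable is updated on every iteration.
// If the iterations ran in parallel without protection, the += below
// would be a data race.
std::string accumulate_serial(const std::vector<std::string>& items) {
    std::string total;
    for (const std::string& s : items) total += s;
    return total;
}

// Reducer-style accumulation (illustrative only, not the Cilk Plus API):
// each chunk appends to its own private view, so chunks never touch the
// same variable, and the views are merged left to right afterwards.
std::string accumulate_with_views(const std::vector<std::string>& items,
                                  std::size_t chunks) {
    if (chunks == 0) chunks = 1;
    std::vector<std::string> views(chunks);
    const std::size_t n = items.size();
    for (std::size_t c = 0; c < chunks; ++c) {
        // Chunk c owns the half-open index range [begin, end).
        const std::size_t begin = n * c / chunks;
        const std::size_t end = n * (c + 1) / chunks;
        for (std::size_t i = begin; i < end; ++i) views[c] += items[i];
    }
    // Merging in chunk order preserves the serial ordering of the items.
    std::string total;
    for (const std::string& v : views) total += v;
    return total;
}
```

Both functions return the same string for the same input, whatever chunk count is chosen.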

  • Developers
  • C/C++
  • Intel® C++ Composer XE
  • Intel® Cilk™ Plus
  • License Agreement: 

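    How the Serial Equivalent of a Cilk™ Plus Parallel Program Executes

        In recent years the trend in the C++ community has been to add functionality through libraries rather than new language keywords, for example Threading Building Blocks and the Parallel Patterns Library. Intel's Cilk™ Plus goes against this trend and takes the latter approach, language keywords. This article analyzes that choice.

        Every Cilk™ Plus program that expresses its parallelism with keywords has a serial semantics defined by the compiler implementation. Replacing every cilk_sync and cilk_spawn with nothing, and every cilk_for with the for keyword, turns a parallel Cilk™ Plus program into a valid serial C/C++ program. A race occurs when two logically parallel threads access the same memory location at the same time and at least one of the accesses is a write. If a Cilk™ Plus parallel program has no races, it produces the same result as its serial equivalent. How does the compiler guarantee that the results match? Consider the following code: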
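
The replacement described above can be sketched by hand with three macros. This is an illustration under the assumption that no Cilk™ Plus compiler is available, not the article's own listing; `fib` and `squares` are hypothetical example functions.

```cpp
#include <cstddef>
#include <vector>

// Serial elision done by hand: cilk_spawn and cilk_sync become nothing,
// and cilk_for becomes an ordinary for loop.
#define cilk_spawn
#define cilk_sync
#define cilk_for for

// Written with Cilk Plus keywords; after the macros above it is plain C++.
long fib(int n) {
    if (n < 2) return n;
    long x = cilk_spawn fib(n - 1);  // in Cilk Plus this call may run in parallel
    long y = fib(n - 2);
    cilk_sync;                       // in Cilk Plus: wait for the spawned call
    return x + y;
}

// Each iteration writes a different element, so the loop has no race and
// the parallel and serial versions produce the same contents.
void squares(std::vector<long>& out) {
    cilk_for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = static_cast<long>(i) * static_cast<long>(i);
    }
}
```

Because neither function contains a race, a Cilk™ Plus build of the same source would be expected to return the same values as this serial version.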



    I've been trying to understand what the implicit_index intrinsic may be intended for.  It's tricky to get adequate performance from it, and apparently not possible in some of the more obvious contexts (unless the goal is only to get a positive vectorization report).

    It seems to be competitive for the usage of setting up an identity matrix.
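
    For readers unfamiliar with the intrinsic: in Cilk Plus array notation, __sec_implicit_index(k) yields the current index along dimension k of the array section being assigned. The sketch below shows, in plain C++, what an identity-matrix setup of that kind computes. The array-notation statement in the comment is an assumed form and is not taken from the post; `N`, `Matrix`, and `identity` are hypothetical names.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 4;  // hypothetical size, chosen for illustration
using Matrix = std::array<std::array<float, N>, N>;

// Plain-loop equivalent of an array-notation statement along the lines of
//   a[0:N][0:N] = (__sec_implicit_index(0) == __sec_implicit_index(1));
// (assumed form). Element (i, j) becomes 1 where the row and column
// indices match and 0 everywhere else.
Matrix identity() {
    Matrix a{};
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = 0; j < N; ++j)
            a[i][j] = (i == j) ? 1.0f : 0.0f;
    return a;
}
```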

    In the context of dividing its result by 2, different treatments are required on MIC and host:

    Optimizing Big Data processing with Haswell 256-bit Integer SIMD instructions

    Big Data requires processing huge amounts of data. Intel Advanced Vector Extensions 2 (aka AVX2) promoted most of the Intel AVX 128-bit integer SIMD instructions to 256 bits. Intel AVX brought 256-bit floating-point SIMD instructions, but it didn't include 256-bit integer SIMD instructions. Intel AVX2 allows you to operate on the 256-bit-wide AVX YMM registers with integer data types. In this post, I’ll explain how developers can speed up big data processing with the new 256-bit integer SIMD instructions.
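
    As a small illustration of a 256-bit integer operation, the sketch below adds two arrays of 32-bit integers eight lanes at a time with the AVX2 intrinsic _mm256_add_epi32. It is a minimal example under stated assumptions, not code from the post: the function name `add_i32` is hypothetical, the AVX2 path is only compiled when the compiler defines __AVX2__ (for example with -mavx2), and a scalar loop handles the remainder or the whole array otherwise.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Element-wise out[i] = a[i] + b[i] for n 32-bit integers.
void add_i32(const std::int32_t* a, const std::int32_t* b,
             std::int32_t* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX2__)
    // One YMM register holds eight 32-bit integers.
    for (; i + 8 <= n; i += 8) {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i),
                            _mm256_add_epi32(va, vb));
    }
#endif
    // Scalar tail, and the full loop when AVX2 is not enabled.
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

    The unaligned load and store intrinsics are used so that the caller does not have to guarantee 32-byte alignment.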

    Lower performance on 16 cores than on 4?!

    Hi there,

    I evaluated my Cilk application using "taskset -c 0-(x-1) MYPROGRAM" to analyze its scaling behavior.

    I was very surprised to see that performance increases up to a certain number of cores but decreases afterwards.

    With 2 cores I get a speedup of 1.85, with 4 cores 3.15, and with 8 cores 4.34. With 12 cores, however, performance drops
    to a speedup close to the one obtained with 2 cores (1.99).
    16 cores perform slightly better (2.11).
