Using cilk_for, but no speed up!

I want to convert about 34 GB of data in RAM from unsigned 16-bit integers to double. I chose cilk_for to run the loops in parallel. This is my code:

#include "stdafx.h"
#include <iostream>
#include <ctime>
#include <stdlib.h>
#include <cilk\cilk.h>

using namespace std;

long long itr_1 = (long long)(18e9);
unsigned __int16 * myArr_1 = new unsigned __int16[itr_1];
//unsigned __int16 * myArr_2 = new unsigned __int16[itr_1];
double * myArr_2 = new double[itr_1];

int _tmain(int argc, _TCHAR* argv[])
{
    // Fill the source array with values in [1, 1000].
    cilk_for(long long k = 0; k < itr_1; ++k)
        myArr_1[k] = rand() % 1000 + 1;

    // Convert each 16-bit value to double.
    cilk_for(long long i = 0; i < itr_1; ++i) {
        myArr_2[i] = (double)myArr_1[i];
    }

    return 0;
}

When I use cilk_for to initialize my first array, execution time decreases from 239 seconds to 36 seconds.

But when I use cilk_for to convert my first array to double and store the result in my second array, execution time increases from 70 s to 187 s. Why doesn't cilk_for speed up the conversion loop?

Environment: I'm using Intel C++ 2017 with Microsoft Visual Studio 2013, and the OS is Windows Server 2016. The hardware is Intel Xeon CPU E5-2699, two nodes, with 192 GB of RAM per node. My first array takes about 34 GB of RAM and the second one takes about 136 GB.


It is possible that the default number of Cilk workers (equal to the number of logical processors) is too large for the given work.

Can you try a smaller number by setting the environment variable CILK_NWORKERS?

It is also worth trying an OpenMP loop instead, since it may give a better result for regular data parallelism like this.
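For reference, the OpenMP variant of the conversion loop could look roughly like the sketch below. This is only an illustration, not the original poster's code: the function name convert_to_double and the use of std::vector are assumptions made to keep the example small and self-contained. OpenMP has to be enabled at build time (for example /Qopenmp with Intel C++ on Windows, or -fopenmp with GCC/Clang); otherwise the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: converts 16-bit unsigned values to double with an
// OpenMP parallel loop. Each iteration is independent, so a static
// schedule gives every thread one contiguous chunk of the arrays.
void convert_to_double(const std::vector<std::uint16_t>& src,
                       std::vector<double>& dst)
{
    assert(src.size() == dst.size());

    // A signed loop index keeps the loop valid for older OpenMP versions.
    const long long n = static_cast<long long>(src.size());
    const std::uint16_t* in = src.data();
    double* out = dst.data();

    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i)
        out[i] = static_cast<double>(in[i]);
}
```

Calling convert_to_double(src, dst) with two equally sized vectors fills dst with the converted values. Whether this beats cilk_for on the machine described above would have to be measured.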


The function rand() contains a critical section. In other words, it is a serializing section, with the added overhead of managing the mutex.

If you are simply setting up test routines to determine the effectiveness of parallelization, then select statements without serializing function calls. If you really need fast random number generation, then consult the MKL Random Number Generators.

Jim Dempsey
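As a rough illustration of a fill loop that does not go through rand(), the sketch below gives each thread its own generator from the standard <random> header. Note that this swaps in std::mt19937 rather than the MKL generators mentioned above. The names next_value and fill_random are hypothetical, and the code assumes a compiler with C++11 thread_local support.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical helper: returns a value in [1, 1000], like rand() % 1000 + 1,
// but from an engine that is private to the calling thread, so threads do
// not share generator state.
std::uint16_t next_value()
{
    thread_local std::mt19937 engine{std::random_device{}()};
    thread_local std::uniform_int_distribution<int> dist(1, 1000);
    return static_cast<std::uint16_t>(dist(engine));
}

// Hypothetical helper: fills out[0..n) in parallel. The pragma is ignored
// (and the loop runs serially) if OpenMP is not enabled at build time.
void fill_random(std::uint16_t* out, long long n)
{
    assert(out != nullptr || n == 0);

    #pragma omp parallel for schedule(static)
    for (long long i = 0; i < n; ++i)
        out[i] = next_value();
}
```

The same next_value() call could be used inside a cilk_for body instead of the OpenMP loop. The values are not reproducible from run to run, because each engine is seeded from std::random_device.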

I changed the array initialization statement to this:

cilk_for(long long k = 0; k < itr_1; ++k)
    myArr_1[k] = k % 1000 + 1;

But there was no improvement! I even changed it to:

cilk_for(long long k = 0; k < itr_1; ++k)
    myArr_1[k] = 1;

Still no improvement! I think memory read and write speed is the issue.
I am trying to solve it and will report the results.

I tried lots of approaches, but unfortunately there was no improvement! Finally I ran a memory test with MemTest86. The results are shown in the picture below.

As you can see, the L1 cache speed is very low for such a CPU, and so is the memory speed.
When I tried the simple code from my previous post, the write speed was about 1.2 GB/s.
I am really puzzled why the speed is so low! Does anybody have any suggestions?
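For anyone who wants to reproduce a write-speed figure like the one above, here is a minimal single-threaded sketch of how such a number can be measured. It is an assumption about the method, not the code that produced the 1.2 GB/s result. The function name measure_write_gbps is hypothetical, and 1 GB is taken as 1e9 bytes.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: times one pass that writes every element of buf and
// returns the achieved write bandwidth in GB/s (1 GB = 1e9 bytes).
// The vector is already zero-initialized by its constructor, so the timed
// pass does not include first-touch page faults.
double measure_write_gbps(std::vector<std::uint16_t>& buf, std::uint16_t value)
{
    assert(!buf.empty());
    const std::size_t n = buf.size();
    std::uint16_t* p = buf.data();

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i)
        p[i] = value;
    const auto stop = std::chrono::steady_clock::now();

    const double seconds = std::chrono::duration<double>(stop - start).count();
    const double bytes = static_cast<double>(n) * sizeof(std::uint16_t);
    return bytes / seconds / 1e9;
}
```

The buffer should be much larger than the last-level cache, otherwise the result reflects cache speed and not RAM speed, and very small buffers can finish faster than the timer can resolve.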
