No parallelism in my cilk code.

No parallelism in my cilk code.

I am very new to Cilk++ and I have read the manual a few times. So far I have not hadproblemsconverting C++ code into Cilk++ or compiling them into executable files. However I have not been able to achieve anyparallelism yet. So I must be doing something wrong.If anyone could kindly provide any advice, I would really appreciate it.

The work below was done exclusively ina 64-bit Unix workstation. The workstation is equipped with two quad-core CPUs (AMD Opteron Processor 2354) and 64Gb memory. Cilk++ was installed in the path /usr/local and therefore to call the compiler, I use/usr/local/cilk/bin/cilk++; to use the header file, I call /usr/local/cilk/include/cilk++/cilk.h.

hcui@cos:~$ pwd
/home/hcui
hcui@cos:~$ cat helloworld.race.cilk
#include
#include "/usr/local/cilk/include/cilk++/cilk.h"
using namespace std;

int n = 10; // declare a variable that is visible to three functions

void output() { cout << "n = " << n << endl; }
void strand1() { n = 1; }
void strand2() { n = 2; }

int cilk_main()
{
output(); // output current n value
cilk_spawn strand1(); // modify n value
strand2(); // modify n value again
cilk_sync;
output(); // output modified n value
}
hcui@cos:~$ /usr/local/cilk/bin/cilk++ helloworld.race.cilk -o helloworld.race.cilk.exe -O2
hcui@cos:~$ ./helloworld.race.cilk.exe
n = 10
n = 2

This is practically the same as the data race example from pp87-88 in the Cilk++ manual. I am expecting data race here. That means I want to see in the second line of the resultn = 2 or n = 1 depending on whether strand2() or strand1() wins the race. However theresult above stays the same after many runs and compilations. The result from cilkscreen though indicates data race.

I also wrotea program inCilk++ to createFibnacci numbers and, oddly Cilk++ running time was slower then the C++ code. However, when I checked with cilkview it does show the speedup. I guess it's also because I have not been able to achieve parallelism in that cilk code either.

I am not sure what went wrong and will really appreciate any directions and advice.

Regards,
Hailong

8 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

The use of cilk_spawn does not command parallelism. It's a note to the runtime that there is an opportunity for parallelism. When you start your Cilk++ program, "worker threads" are started to run parallel sections of your code. Each of the threads is randomly choosing another worker to try to steal from.

Meanwhile, your main thread has continued executing. There is so little work being done in strand1(), I'm guessing that it's completing before it can be stolen. Try adding something more time consuming. But be careful if you add a loop that the compiler doesn't get clever and optimize it away.

- Barry

Pablo Halpern (Intel)'s picture

As barry said, the spawned function is not doing enough work for you to see parallelism. Even if the parent is stolen, the child will almost certainly finish before the continuation has fully begun. Thus, strand2() will always be the last thing that executes before the sync.

If you add a busy loop to strand1() (e.g., increment a volatile counter 10000 times) before setting n to 1, you will likely end up with the reverse behavir. The child will always complete after the continuation, and you will see a final result of n = 1. With a trivial example like this one, it is almost impossible to see nondeterministic behavior manefest as random output.

If you want to see random output caused by a race, try a large cilk_for loop like the following:

unsigned x;
cilk_for (int i = 1; i <= 1000000; ++i)
x ^= x << 1 ^ i;

I haven't tried this, but I suspect that you will get different results each time you run it on a multicore machine.

As for fibonacci, you would need to post the code for us to be able to analyse it. I'm guessing that you've structured it such that the overhead is overcoming the natural parallelism.

- Pablo

Thank you so muchBarry and Pablo. It worked!! And I am learning.:)

Below is my Cilk++ code to compute theFibnacci numbers. I know that the algorithm isnot a good one for this task and I just wanted tosee the performance increase.

hcui@cos:~$ pwd
/home/hcui
hcui@cos:~$ cat fib.cilk
#include
#include
#include "/usr/local/cilk/include/cilk++/cilk.h"
#define uint64 unsigned long long int // max: 2^64-1
using namespace std;

// no error check was performed for the argument type or value

uint64 fib(int i)
{
// F(0) = 0, F(1) = 1.
if (i < 2) return (i);
else // F = F(r-1) + F(r-2) for r > 2
{
uint64 x, y;
x = cilk_spawn fib(i - 1);
y = fib(i - 2);
cilk_sync;
return (x + y);
}
}

int cilk_main(int argc, char* argv[])
{
// take argument n from input for F(n)
int n = atoi(argv[1]);

// compute and output F(n)
uint64 m = fib(n);
cout << "Fib(" << n << ") = "<< m << " computed in Cilk++\n";

return 0;
}

hcui@cos:~$ /usr/local/cilk/bin/cilk++ fib.cilk -o fib.cilk.exe -O2
hcui@cos:~$ time ./fib.cilk.exe 40
Fib(40) = 102334155 computed in Cilk++

real 0m2.750s
user 0m10.310s
sys 0m0.270s

A comparison of time spent in C++ and Cilk++ is given below.

Fib(30) = 832040 computed in C++
real 0m0.030s
user 0m0.030s
sys 0m0.000s

Fib(30) = 832040 computed in Cilk++
real 0m0.144s
user 0m0.080s
sys 0m0.160s

Fib(40) = 102334155 computed in C++
real 0m1.613s
user 0m1.610s
sys 0m0.000s

Fib(40) = 102334155 computed in Cilk++
real 0m2.798s
user 0m10.390s
sys 0m0.070s

Fib(50) = 12586269025 computed in C++
real 3m14.171s
user 3m14.170s
sys 0m0.000s

Fib(50) = 12586269025 computed in Cilk++
real 3m14.776s
user 21m7.700s
sys 0m0.020s

The results above indicate that the C++ version of this code is much faster, andthere seems to be a linearrelationship between C++ and Cilk++ running time, with C++about 8 times faster than Cilk++. However, cilkview shows the following and suggests that Cilk++ is faster.

hcui@cos:~$ /usr/local/cilk/bin/cilkview ./fib.cilk.exe 40
cilkview: generating scalability data
Fib(40) = 102334155 computed in Cilk++
Whole Program Statistics:

Cilkview Scalability Analyzer V1.1.0, Build 8503
1) Parallelism Profile
Work : 27,490,134,827 instructions
Span : 3,835,526 instructions
Burdened span : 4,131,771 instructions
Parallelism : 7167.24
Burdened parallelism : 6653.35
Number of spawns/syncs: 165,580,140
Average instructions / strand : 55
Strands along span : 79
Average instructions / strand on span : 48,550
Total number of atomic instructions : 12
Frame count : 496740424
2) Speedup Estimate
2 processors: 1.90 - 2.00
4 processors: 3.80 - 4.00
8 processors: 7.60 - 8.00
16 processors: 15.20 - 16.00
32 processors: 30.40 - 32.00

Any advice on the speedup problem will be appreciated.

Thanks!
Hailong

Pablo Halpern (Intel)'s picture

This is strange, indeed. If you look the Real vs. User time for the last run, the results are telling:

Fib(50) = 12586269025 computed in C++
real 3m14.171s
user 3m14.170s

Fib(50) = 12586269025 computed in Cilk++
real 3m14.776s
user 21m7.700s

In the serial C++ case, the real (wall clock) time is almost identical to the user (cpu) time. In the Cilk++ case, the real time is only 1/6.5 as much as the user time. This suggests that the Cilk++ code is getting a real parallelism of about 6.5 vs. the single-threaded case. You can test this by setting the environment variable CILK_NPROC=1 and re-running the Cilk++ code. The real time will be the same or larger than the user time on one worker.

The question, then, is why there is so much more CPU time being expended in the Cilk++ code than in the serial C++ code. Are you sure that you are using identical algorithms? (You might want to post your serial fib code.) Are you using the same compiler optimization switches?

Thanks Pablo for the prompt response and the advice.

Here are myC++ & Cilk++ codes:

hcui@cos:~$ cat fib.cpp
#include
#include
#define uint64 unsigned long long int // max: 2^64-1
using namespace std;

// no error check was performed for the argument type or value

uint64 fib(int i)
{
// F(0) = 0, F(1) = 1.
if (i < 2) return (i);
else // F = F(r-1) + F(r-2) for r > 2
{
uint64 x, y;
x = fib(i - 1);
y = fib(i - 2);
return (x + y);
}
}

int main(int argc, char* argv[])
{
// take argument n from input for F(n)
int n = atoi(argv[1]);

// compute and output F(n)
uint64 m = fib(n);
cout << "Fib(" << n << ") = "<< m << " computed in C++\n";

return 0;
}

hcui@cos:~$ cat fib.cilk
#include
#include
#include "/usr/local/cilk/include/cilk++/cilk.h"
#define uint64 unsigned long long int // max: 2^64-1
using namespace std;

// no error check was performed for the argument type or value

uint64 fib(int i)
{
// F(0) = 0, F(1) = 1.
if (i < 2) return (i);
else // F = F(r-1) + F(r-2) for r > 2
{
uint64 x, y;
x = cilk_spawn fib(i - 1);
y = fib(i - 2);
cilk_sync;
return (x + y);
}
}

int cilk_main(int argc, char* argv[])
{
// take argument n from input for F(n)
int n = atoi(argv[1]);

// compute and output F(n)
uint64 m = fib(n);
cout << "Fib(" << n << ") = "<< m << " computed in Cilk++\n";

return 0;
}

Here is how I compiled both codes:

hcui@cos:~$ g++ fib.cpp -o fib.cpp.exe
hcui@cos:~$ /usr/local/cilk/bin/cilk++ fib.cilk -o fib.cilk.exe -O2

Here is the running time for computing F(50). I waiited unti C++ code to finish before running Cilk++ code.

hcui@cos:~$ time ./fib.cpp.exe 50
Fib(50) = 12586269025 computed in C++

real 3m21.873s
user 3m21.860s
sys 0m0.000s
hcui@cos:~$ time ./fib.cilk.exe 50
Fib(50) = 12586269025 computed in Cilk++

real 4m3.214s
user 21m5.430s
sys 0m0.240s

Here are the results from using different number of cores. Note that I have a two quad-core CPUs instead of one eight-core CPU. Could that be a reason?

hcui@cos:~$ time ./fib.cilk.exe 50 CILK_NPROC=1
Fib(50) = 12586269025 computed in Cilk++

real 4m15.389s
user 21m7.450s
sys 0m0.180s
hcui@cos:~$ time ./fib.cilk.exe 50 CILK_NPROC=2
Fib(50) = 12586269025 computed in Cilk++

real 3m32.036s
user 21m6.630s
sys 0m0.360s
hcui@cos:~$ time ./fib.cilk.exe 50 CILK_NPROC=4
Fib(50) = 12586269025 computed in Cilk++

real 3m48.440s
user 21m47.550s
sys 0m0.300s
hcui@cos:~$ time ./fib.cilk.exe 50 CILK_NPROC=8
Fib(50) = 12586269025 computed in Cilk++

real 3m48.397s
user 21m10.390s
sys 0m0.280s

Thanks,
Hailong

Pablo Halpern (Intel)'s picture

Your command line for setting CILK_NPROC is incorrect. It should be:
# CILK_NPROC=1 time ./fib.cilk.exe

Putting CILK_NPROC=1 at the end of the line makes it a command-line argument, not an environment variable.

For this particular problem, there should be virtually no difference between a single-socket 8-core and a dual-socket 2x4-core machne.

- Pablo

Thanks Pablo, and I apologize for the late reply. I ran further experiments, and here is the result.

n

F(n)

Computing time for F(n)

C++

Cilk++

20

6765

0m0.005s

0m0.015s

30

832040

0m0.030s

0m0.144s

40

102334155

0m1.613s

0m2.798s

50

12586269025

3m17.062s

3m0.767s

52

32951280099

8m34.267s

7m52.261s

54

86267571272

22m13.255s

20m41.340s

56

225851433717

58m8.163s

47m17.228s

58

591286729879

157m22.413s

142m14.375s

Here parallelization begins to pay around n = 50. Using multiple cores does increase the speed also. I was confused with the Unix utility time.

Thanks a lot for the help.

Regards,

Hailong

Login to leave a comment.