Forum Jump

Select Group :
Select Forum :
Sorted By :
Sort Order :
From The :
 
Thread Tools  Search this thread 
kickingf
Total Points:
100
Status Points:
50
Green Belt
November 5, 2009 4:10 AM PST
comparison cilk++, openmp, pthreads first results
Hello,

Here the first results of my comparison of cilk++ and openmp, on my application which is 3D meshgeneration. I have done here the first part which is octree-refinement + inserting triangles. The reference is a pthreaded version of the code.
Please consider, that these are first numbers. The difference between cilk++ and openmp is also due to the used compiler-version (but cilk is done only for the 4.2.4).

best regards

Ferdinand
 Attachments 
Dmitriy Vyukov
Total Points:
25,382
Status Points:
25,382
Black Belt
November 5, 2009 4:27 AM PST
Rate
 
#1
Quoting - kickingf
Hello,

Here the first results of my comparison of cilk++ and openmp, on my application which is 3D meshgeneration. I have done here the first part which is octree-refinement + inserting triangles. The reference is a pthreaded version of the code.
Please consider, that these are first numbers. The difference between cilk++ and openmp is also due to the used compiler-version (but cilk is done only for the 4.2.4).


What does "nproc=8(4)" mean? Does it mean 2x over-subscription of threads, i.e. 4 physical cores and 8 OS threads?
Also it would be interesting to see some brief characteristics of the implementations, for example how does pthread version divide, distribute and schedule work?



kickingf
Total Points:
100
Status Points:
50
Green Belt
November 5, 2009 4:41 AM PST
Rate
 
#2 Reply to #1
Yes 8(4) mean 8 threads on 4 physical cores. For cilk I used the CILK_NPROC environment variable and for openmp the OMP_NUM_THREADS environment variable.

The threaded version is also only a loop parallelization dividing the loop into num_of_threads parts. The tree is stored in a list of leaves, and the data structure for the list is array based. So in this list there are first the (coarse) boxes, then there come the next finer level and so on. This is not (cache) optimal but also not bad, since (due to the algorithm) only the finest level of boxes will be refined (a subset of them), and the (every actual) finest level of boxes will appear in the list as z-curve.

hope this answers your question.

best regards

Ferdinand


Dmitriy Vyukov
Total Points:
25,382
Status Points:
25,382
Black Belt
November 5, 2009 4:56 AM PST
Rate
 
#3 Reply to #2
Quoting - kickingf
Yes 8(4) mean 8 threads on 4 physical cores. For cilk I used the CILK_NPROC environment variable and for openmp the OMP_NUM_THREADS environment variable.

The threaded version is also only a loop parallelization dividing the loop into num_of_threads parts. The tree is stored in a list of leaves, and the data structure for the list is array based. So in this list there are first the (coarse) boxes, then there come the next finer level and so on. This is not (cache) optimal but also not bad, since (due to the algorithm) only the finest level of boxes will be refined (a subset of them), and the (every actual) finest level of boxes will appear in the list as z-curve.

hope this answers your question.


Thanks, Ferdinand.

As far as I see, OpenMP is better than Cilk++, and pthread is better than OpenMP on this workload.
Also it seems that Cilk++ badly handles thread oversubscription.




themightywilltor
Total Points:
20
Registered User
November 5, 2009 8:10 AM PST
Rate
 
#4 Reply to #3
Hi kickingf,

I'm on the Cilk team and I'd like to look more deeply at these experiments.  It is surprising to me that pthreads were able to keep up with either OpenMP or Cilk++.  Are you able to provide the source code and data sets you used?  Also, if possible, which compiler and compiler options you used would be helpful.


Matteo Frigo (Intel)
Total Points:
130
Status Points:
80
Green Belt
November 5, 2009 8:17 AM PST
Rate
 
#5 Reply to #2
Hi Ferdinand.

How do you accumulate the output list?  Do you use locks?


kickingf
Total Points:
100
Status Points:
50
Green Belt
November 5, 2009 9:03 AM PST
Rate
 
#6 Reply to #4
Hi kickingf,

I'm on the Cilk team and I'd like to look more deeply at these experiments.  It is surprising to me that pthreads were able to keep up with either OpenMP or Cilk++.  Are you able to provide the source code and data sets you used?  Also, if possible, which compiler and compiler options you used would be helpful.

Hello,

Access to the whole source is not possible, but I could extract the octree-part only, where the tests are performed on. I also want to state that this are just the first results. I used for the cilk-version the cilk++ 4.2.4 (Cilk-Arts). For the pthread-version I used the g++ 4.2.4 (Cilk-Arts). The iteresting point here is, that if I compile the serial source just swithing to cilk_main, it is significantly slower than the g++ variant. So I assume here, that some other opimization settings within the compiler are used. The OpenMp version is compiled with the g++ 4.3.2. For optimization I used -O2. All exes are 64bit.

To be more precise, I was investing a lot in cilk++ in optimizing the e.g. the cilk_grainsize for each loop, but all this does not have a big effect, all this is not able to compensate what is lost at the serial side. But I like very much the coding, since this is easy and readable. A colleage told me, that if I am already testing, I could also test openMP. Here I did nothing, but added the directive for the loop, and got this results, but alse here the serial version is better optimized. More details on the application you can find on my homepage: www.meshing.at

best regards

Ferdinand



kickingf
Total Points:
100
Status Points:
50
Green Belt
November 5, 2009 9:10 AM PST
Rate
 
#7 Reply to #5
Hi Ferdinand.

How do you accumulate the output list?  Do you use locks?

Hello Matteo,

No, I don't use locks. It works, that I allocate all the memory (possibly) needed by the sons. Then I perform the calculations (testing if the triangles of the father-boxes are in the son-boxes) in parallel. After this I check if all the memory was needed and free the rest. I observed, that this (the memory allocation) is in my application neglectable. When I was starting with the pthreaded version some years ago, I also tried to use locks for be able to write on the same structure (with all threads) but it was a mess, or I was simply not able to achieve good speedups.

best regards

Ferdinand


Matteo Frigo (Intel)
Total Points:
130
Status Points:
80
Green Belt
November 5, 2009 10:05 AM PST
Rate
 
#8 Reply to #6
Quoting - kickingf

Hello,

Access to the whole source is not possible, but I could extract the octree-part only, where the tests are performed on. I also want to state that this are just the first results. I used for the cilk-version the cilk++ 4.2.4 (Cilk-Arts). For the pthread-version I used the g++ 4.2.4 (Cilk-Arts). The iteresting point here is, that if I compile the serial source just swithing to cilk_main, it is significantly slower than the g++ variant. So I assume here, that some other opimization settings within the compiler are used. The OpenMp version is compiled with the g++ 4.3.2. For optimization I used -O2. All exes are 64bit.

To be more precise, I was investing a lot in cilk++ in optimizing the e.g. the cilk_grainsize for each loop, but all this does not have a big effect, all this is not able to compensate what is lost at the serial side. But I like very much the coding, since this is easy and readable. A colleage told me, that if I am already testing, I could also test openMP. Here I did nothing, but added the directive for the loop, and got this results, but alse here the serial version is better optimized. More details on the application you can find on my homepage: www.meshing.at

best regards

Ferdinand


One issue with Cilk Arts Cilk++ is that it adds some overhead to each function call.  Consequently, inlining becomes more important if you have many small functions.  Can you try -O3, which enables automatic inlining?  Can you also try -finline-limit=1000 or so?

Thanks for your feedback.
Cheers,
Matteo Frigo


Matteo Frigo (Intel)
Total Points:
130
Status Points:
80
Green Belt
November 5, 2009 10:12 AM PST
Rate
 
#9 Reply to #6
Quoting - kickingf

Access to the whole source is not possible, but I could extract the octree-part only, where the tests are performed on.

Ferdinand,

any code that you are willing to share with us would be really appreciated.  We need the help of people like you to improve Cilk++.

Cheers,
Matteo


Dmitriy Vyukov
Total Points:
25,382
Status Points:
25,382
Black Belt
November 5, 2009 10:26 PM PST
Rate
 
#10 Reply to #8
One issue with Cilk Arts Cilk++ is that it adds some overhead to each function call.

Btw, Matteo, I'm curious why you add overhead to *each* function call? Why you do not penalize only spawns?
I would expect something along the lines of: cilk compiler generates normal version of a function, and a wrapper that copes with all these __clik_box<>, etc and then calls normal function. cilk_spawn uses wrapped version, while plain calls call normal version of a function.
Are there any principal reasons to not do that? Or it's just "not done yet"? Or I am missing something?



Matteo Frigo (Intel)
Total Points:
130
Status Points:
80
Green Belt
November 6, 2009 6:01 AM PST
Rate
 
#11 Reply to #10
Quoting - Dmitriy Vyukov

Btw, Matteo, I'm curious why you add overhead to *each* function call? Why you do not penalize only spawns?
I would expect something along the lines of: cilk compiler generates normal version of a function, and a wrapper that copes with all these __clik_box<>, etc and then calls normal function. cilk_spawn uses wrapped version, while plain calls call normal version of a function.
Are there any principal reasons to not do that? Or it's just "not done yet"? Or I am missing something?


Hi Dmitriy.

Some of the reasons are discussed in my blog post here.

The fundamental problem is that if A calls a function B that spawns, you must be able to steal A as well as B.  If you don't do that, you might end up with A's stack being busy with no processor working on it.  The Cilk Arts Cilk++ solution was to make all functions stealable by not using a stack at all, and instead allocating all frames from a heap. This heap allocation adds overhead to every function call.


Dmitriy Vyukov
Total Points:
25,382
Status Points:
25,382
Black Belt
November 6, 2009 6:34 AM PST
Rate
 
#12 Reply to #11

Hi Dmitriy.

Some of the reasons are discussed in my blog post here.

The fundamental problem is that if A calls a function B that spawns, you must be able to steal A as well as B.  If you don't do that, you might end up with A's stack being busy with no processor working on it.  The Cilk Arts Cilk++ solution was to make all functions stealable by not using a stack at all, and instead allocating all frames from a heap. This heap allocation adds overhead to every function call.

I am not read the blog yet, but if A calls (not spawns) B, then why I must be able to steal just A? Direct function call does not assume any stealing possibility, right? So, as far as I understand, 1 stack piece for both A and B is enough. I.e. direct function calls must be completely transparent for Cilk++. Cilk++ may think as there is no function calls at all, as they are all inlined, sort of.




Matteo Frigo (Intel)
Total Points:
130
Status Points:
80
Green Belt
November 6, 2009 12:59 PM PST
Rate
 
#13 Reply to #12
Quoting - Dmitriy Vyukov

I am not read the blog yet, but if A calls (not spawns) B, then why I must be able to steal just A? Direct function call does not assume any stealing possibility, right? So, as far as I understand, 1 stack piece for both A and B is enough. I.e. direct function calls must be completely transparent for Cilk++. Cilk++ may think as there is no function calls at all, as they are all inlined, sort of.



Dmitriy,

In Cilk++, when you write ``cilk_spawn foo()'', we do not steal foo().  Instead, we execute foo() immediately as if it were a call, and we steal the ``continuation of the spawn'', i.e., the parent of foo() resumed after the spawn.  The reason why we do it in this way is twofold.  First, a Cilk++ program running on one processor has the same meaning as the C++ program that you obtain by removing all spawns (including implicit properties such as cache behavior etc.).  Second, this program always runs in bounded memory:

for (;;)
cilk_spawn foo();

and it does not create an infinite number of tasks.  (A block compressor such as bzip2 is a program of this form.)

One consequence of this choice is that we steal the whole stack above the spawn, including all function calls.





Intel Software Network Forums Statistics

8435 users have contributed to 31535 threads and 100338 posts to date.
In the past 24 hours, we have 26 new thread(s) 130 new posts(s), and 171 new user(s).

In the past 3 days, the most popular thread for everyone has been Strange results from operator overloading The most posts were made to IVF 11.1.051 freezes during build The post with the most views is Quoting - Steve Lionel (In

Please welcome our newest member dearjason