Comparison of Cilk++, OpenMP, pthreads: first results

Hello,

Here are the first results of my comparison of Cilk++ and OpenMP on my application, which is 3D mesh generation. So far I have done the first part, octree refinement plus inserting triangles. The reference is a pthreaded version of the code.
Please keep in mind that these are first numbers. The difference between Cilk++ and OpenMP is also due to the compiler versions used (but Cilk is available only for the 4.2.4).

best regards

Ferdinand

Attachment: threadsvsclik.pdf (26.2 KB)

Quoting - kickingf
Hello,

Here are the first results of my comparison of Cilk++ and OpenMP on my application, which is 3D mesh generation. So far I have done the first part, octree refinement plus inserting triangles. The reference is a pthreaded version of the code.
Please keep in mind that these are first numbers. The difference between Cilk++ and OpenMP is also due to the compiler versions used (but Cilk is available only for the 4.2.4).

What does "nproc=8(4)" mean? Does it mean 2x oversubscription of threads, i.e. 4 physical cores and 8 OS threads?
It would also be interesting to see some brief characteristics of the implementations; for example, how does the pthread version divide, distribute, and schedule work?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Yes, 8(4) means 8 threads on 4 physical cores. For Cilk I used the CILK_NPROC environment variable, and for OpenMP the OMP_NUM_THREADS environment variable.
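From the shell, the two runtimes are then driven the same way with those variables; the binary names below are placeholders, not from the original application:

```shell
# 8 worker threads for the Cilk++ build
CILK_NPROC=8 ./octree_cilk

# 8 worker threads for the OpenMP build
OMP_NUM_THREADS=8 ./octree_omp
```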

The pthreaded version is also just a loop parallelization, dividing the loop into num_of_threads parts. The tree is stored in a list of leaves, and the data structure for the list is array based. So this list contains first the (coarse) boxes, then the next finer level, and so on. This is not (cache) optimal, but also not bad, since (due to the algorithm) only a subset of the finest level of boxes will be refined, and the current finest level of boxes appears in the list as a z-curve.

I hope this answers your question.

best regards

Ferdinand

Quoting - kickingf
Yes, 8(4) means 8 threads on 4 physical cores. For Cilk I used the CILK_NPROC environment variable, and for OpenMP the OMP_NUM_THREADS environment variable.

The pthreaded version is also just a loop parallelization, dividing the loop into num_of_threads parts. The tree is stored in a list of leaves, and the data structure for the list is array based. So this list contains first the (coarse) boxes, then the next finer level, and so on. This is not (cache) optimal, but also not bad, since (due to the algorithm) only a subset of the finest level of boxes will be refined, and the current finest level of boxes appears in the list as a z-curve.

I hope this answers your question.

Thanks, Ferdinand.

As far as I can see, OpenMP is better than Cilk++, and pthreads are better than OpenMP on this workload.
It also seems that Cilk++ handles thread oversubscription badly.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Hi kickingf,

I'm on the Cilk team and I'd like to look more deeply at these experiments. It is surprising to me that pthreads were able to keep up with either OpenMP or Cilk++. Are you able to provide the source code and data sets you used? Also, if possible, it would be helpful to know which compiler and compiler options you used.

Hi Ferdinand.

How do you accumulate the output list? Do you use locks?

Quoting - themightywilltor
Hi kickingf,

I'm on the Cilk team and I'd like to look more deeply at these experiments. It is surprising to me that pthreads were able to keep up with either OpenMP or Cilk++. Are you able to provide the source code and data sets you used? Also, if possible, it would be helpful to know which compiler and compiler options you used.

Hello,

Access to the whole source is not possible, but I could extract just the octree part, on which the tests are performed. I also want to stress that these are just the first results. For the Cilk version I used the Cilk++ 4.2.4 (Cilk Arts). For the pthread version I used the g++ 4.2.4 (Cilk Arts). The interesting point here is that if I compile the serial source, just switching to cilk_main, it is significantly slower than the g++ variant. So I assume that some other optimization settings are used within the compiler. The OpenMP version is compiled with g++ 4.3.2. For optimization I used -O2. All executables are 64-bit.

To be more precise, I invested a lot in Cilk++, optimizing e.g. the cilk_grainsize for each loop, but all this does not have a big effect; it cannot compensate for what is lost on the serial side. But I like the coding style very much, since it is easy and readable. A colleague told me that since I am already testing, I could also test OpenMP. Here I did nothing but add the directive for the loop, and got these results; but here too the serial version is better optimized. You can find more details on the application on my homepage: www.meshing.at

best regards

Ferdinand

Quoting - Matteo Frigo (Intel)
Hi Ferdinand.

How do you accumulate the output list? Do you use locks?

Hello Matteo,

No, I don't use locks. It works like this: I allocate all the memory (possibly) needed by the sons. Then I perform the calculations (testing whether the triangles of the father boxes are in the son boxes) in parallel. After this I check whether all the memory was actually needed and free the rest. I have observed that this (the memory allocation) is negligible in my application. When I started with the pthreaded version some years ago, I also tried to use locks in order to be able to write to the same structure (with all threads), but it was a mess; or rather, I was simply not able to achieve good speedups.

best regards

Ferdinand

Quoting - kickingf

Hello,

Access to the whole source is not possible, but I could extract just the octree part, on which the tests are performed. I also want to stress that these are just the first results. For the Cilk version I used the Cilk++ 4.2.4 (Cilk Arts). For the pthread version I used the g++ 4.2.4 (Cilk Arts). The interesting point here is that if I compile the serial source, just switching to cilk_main, it is significantly slower than the g++ variant. So I assume that some other optimization settings are used within the compiler. The OpenMP version is compiled with g++ 4.3.2. For optimization I used -O2. All executables are 64-bit.

To be more precise, I invested a lot in Cilk++, optimizing e.g. the cilk_grainsize for each loop, but all this does not have a big effect; it cannot compensate for what is lost on the serial side. But I like the coding style very much, since it is easy and readable. A colleague told me that since I am already testing, I could also test OpenMP. Here I did nothing but add the directive for the loop, and got these results; but here too the serial version is better optimized. You can find more details on the application on my homepage: www.meshing.at

best regards

Ferdinand

One issue with Cilk Arts Cilk++ is that it adds some overhead to each function call. Consequently, inlining becomes more important if you have many small functions. Can you try -O3, which enables automatic inlining? Can you also try -finline-limit=1000 or so?
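On the command line that would be something like the following; the cilk++ driver name and the file names are assumptions, not taken from Ferdinand's setup:

```shell
# -O3 enables automatic inlining; -finline-limit raises the size
# threshold for functions the compiler will consider inlining.
cilk++ -O3 -finline-limit=1000 -o octree_cilk octree.cilk
```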

Thanks for your feedback.
Cheers,
Matteo Frigo

Quoting - kickingf

Access to the whole source is not possible, but I could extract the octree-part only, where the tests are performed on.

Ferdinand,

any code that you are willing to share with us would be really appreciated. We need the help of people like you to improve Cilk++.

Cheers,
Matteo

Quoting - Matteo Frigo (Intel)

One issue with Cilk Arts Cilk++ is that it adds some overhead to each function call.

Btw, Matteo, I'm curious why you add overhead to *each* function call. Why do you not penalize only spawns?
I would expect something along the lines of: the Cilk compiler generates a normal version of a function, plus a wrapper that copes with all these __clik_box<> etc. and then calls the normal version. cilk_spawn uses the wrapped version, while plain calls use the normal version of the function.
Are there any principal reasons not to do that? Or is it just "not done yet"? Or am I missing something?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Dmitriy Vyukov

Btw, Matteo, I'm curious why you add overhead to *each* function call. Why do you not penalize only spawns?
I would expect something along the lines of: the Cilk compiler generates a normal version of a function, plus a wrapper that copes with all these __clik_box<> etc. and then calls the normal version. cilk_spawn uses the wrapped version, while plain calls use the normal version of the function.
Are there any principal reasons not to do that? Or is it just "not done yet"? Or am I missing something?

Hi Dmitriy.

Some of the reasons are discussed in my blog post here.

The fundamental problem is that if A calls a function B that spawns, you must be able to steal A as well as B. If you don't do that, you might end up with A's stack being busy with no processor working on it. The Cilk Arts Cilk++ solution was to make all functions stealable by not using a stack at all, and instead allocating all frames from a heap. This heap allocation adds overhead to every function call.

Quoting - Matteo Frigo (Intel)

Hi Dmitriy.

Some of the reasons are discussed in my blog post here.

The fundamental problem is that if A calls a function B that spawns, you must be able to steal A as well as B. If you don't do that, you might end up with A's stack being busy with no processor working on it. The Cilk Arts Cilk++ solution was to make all functions stealable by not using a stack at all, and instead allocating all frames from a heap. This heap allocation adds overhead to every function call.

I have not read the blog yet, but if A calls (not spawns) B, then why must I be able to steal A? A direct function call does not assume any stealing possibility, right? So, as far as I understand, one stack piece for both A and B is enough. That is, direct function calls should be completely transparent to Cilk++. Cilk++ may behave as if there were no function calls at all, as if they were all inlined, sort of.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - Dmitriy Vyukov

I have not read the blog yet, but if A calls (not spawns) B, then why must I be able to steal A? A direct function call does not assume any stealing possibility, right? So, as far as I understand, one stack piece for both A and B is enough. That is, direct function calls should be completely transparent to Cilk++. Cilk++ may behave as if there were no function calls at all, as if they were all inlined, sort of.

Dmitriy,

In Cilk++, when you write ``cilk_spawn foo()'', we do not steal foo(). Instead, we execute foo() immediately as if it were a call, and we steal the ``continuation of the spawn'', i.e., the parent of foo() resumed after the spawn. The reason why we do it in this way is twofold. First, a Cilk++ program running on one processor has the same meaning as the C++ program that you obtain by removing all spawns (including implicit properties such as cache behavior etc.). Second, this program always runs in bounded memory:

for (;;)
    cilk_spawn foo();

and it does not create an infinite number of tasks. (A block compressor such as bzip2 is a program of this form.)

One consequence of this choice is that we steal the whole stack above the spawn, including all function calls.

A quick question first: isn't it possible to allocate and fill the stack frame *before* the spawn, and not inside the function?
This would (hypothetically) allow not allocating frames for plain calls (as opposed to spawns), and would simplify the boxing/unboxing.

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

Quoting - kickingf
Hello,

Here are the first results of my comparison of Cilk++ and OpenMP on my application, which is 3D mesh generation. So far I have done the first part, octree refinement plus inserting triangles. The reference is a pthreaded version of the code.
Please keep in mind that these are first numbers. The difference between Cilk++ and OpenMP is also due to the compiler versions used (but Cilk is available only for the 4.2.4).

best regards

Ferdinand

As the Cilk++ SDK is still on whatif, I think we have to wait longer for a proper benchmark.

Quoting - ninhngt

As the Cilk++ SDK is still on whatif, I think we have to wait longer for a proper benchmark.

Independently of the whatif status, feedback from users is very valuable to us because it helps us improve Cilk++. We take Ferdinand's report seriously and we are thankful for it.

Cheers,
Matteo Frigo
