Loading...
You are not logged-in Login/Register





  • Posts   Search Threads
  • kickingfNovember 5, 2009 3:10 AM PST   
    comparison cilk++, openmp, pthreads first results

    Hello,

    Here the first results of my comparison of cilk++ and openmp, on my application which is 3D meshgeneration. I have done here the first part which is octree-refinement + inserting triangles. The reference is a pthreaded version of the code.
    Please consider, that these are first numbers. The difference between cilk++ and openmp is also due to the used compiler-version (but cilk is done only for the 4.2.4).

    best regards

    Ferdinand
     Attachments 

    Dmitriy VyukovNovember 5, 2009 3:27 AM PST
    Rate
     
    Re: comparison cilk++, openmp, pthreads first results

    Quoting - kickingf
    Hello,

    Here the first results of my comparison of cilk++ and openmp, on my application which is 3D meshgeneration. I have done here the first part which is octree-refinement + inserting triangles. The reference is a pthreaded version of the code.
    Please consider, that these are first numbers. The difference between cilk++ and openmp is also due to the used compiler-version (but cilk is done only for the 4.2.4).


    What does "nproc=8(4)" mean? Does it mean 2x over-subscription of threads, i.e. 4 physical cores and 8 OS threads?
    Also it would be interesting to see some brief characteristics of the implementations, for example how does pthread version divide, distribute and schedule work?



    ---------------------------------------------
    All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
    www.1024cores.net

    kickingfNovember 5, 2009 3:41 AM PST
    Rate
     
    Re: comparison cilk++, openmp, pthreads first results

    Yes 8(4) mean 8 threads on 4 physical cores. For cilk I used the CILK_NPROC environment variable and for openmp the OMP_NUM_THREADS environment variable.

    The threaded version is also only a loop parallelization dividing the loop into num_of_threads parts. The tree is stored in a list of leaves, and the data structure for the list is array based. So in this list there are first the (coarse) boxes, then there come the next finer level and so on. This is not (cache) optimal but also not bad, since (due to the algorithm) only the finest level of boxes will be refined (a subset of them), and the (every actual) finest level of boxes will appear in the list as z-curve.

    hope this answers your question.

    best regards

    Ferdinand


    Dmitriy VyukovNovember 5, 2009 3:56 AM PST
    Rate
     
    Re: comparison cilk++, openmp, pthreads first results

    Quoting - kickingf
    Yes 8(4) mean 8 threads on 4 physical cores. For cilk I used the CILK_NPROC environment variable and for openmp the OMP_NUM_THREADS environment variable.

    The threaded version is also only a loop parallelization dividing the loop into num_of_threads parts. The tree is stored in a list of leaves, and the data structure for the list is array based. So in this list there are first the (coarse) boxes, then there come the next finer level and so on. This is not (cache) optimal but also not bad, since (due to the algorithm) only the finest level of boxes will be refined (a subset of them), and the (every actual) finest level of boxes will appear in the list as z-curve.

    hope this answers your question.


    Thanks, Ferdinand.

    As far as I see, OpenMP is better than Cilk++, and pthread is better than OpenMP on this workload.
    Also it seems that Cilk++ badly handles thread oversubscription.




    ---------------------------------------------
    All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
    www.1024cores.net

    themightywilltorNovember 5, 2009 7:10 AM PST
    Rate
     
    Re: comparison cilk++, openmp, pthreads first results

    Hi kickingf,

    I'm on the Cilk team and I'd like to look more deeply at these experiments.  It is surprising to me that pthreads were able to keep up with either OpenMP or Cilk++.  Are you able to provide the source code and data sets you used?  Also, if possible, which compiler and compiler options you used would be helpful.


    Matteo Frigo (Intel)November 5, 2009 7:17 AM PST
    Rate
     
    Re: comparison cilk++, openmp, pthreads first results

    Hi Ferdinand.

    How do you accumulate the output list?  Do you use locks?


    kickingfNovember 5, 2009 8:03 AM PST
    Rate
     
    Re: comparison cilk++, openmp, pthreads first results

    Hi kickingf,

    I'm on the Cilk team and I'd like to look more deeply at these experiments.  It is surprising to me that pthreads were able to keep up with either OpenMP or Cilk++.  Are you able to provide the source code and data sets you used?  Also, if possible, which compiler and compiler options you used would be helpful.

    Hello,

    Access to the whole source is not possible, but I could extract the octree-part only, where the tests are performed on. I also want to state that this are just the first results. I used for the cilk-version the cilk++ 4.2.4 (Cilk-Arts). For the pthread-version I used the g++ 4.2.4 (Cilk-Arts). The iteresting point here is, that if I compile the serial source just swithing to cilk_main, it is significantly slower than the g++ variant. So I assume here, that some other opimization settings within the compiler are used. The OpenMp version is compiled with the g++ 4.3.2. For optimization I used -O2. All exes are 64bit.

    To be more precise, I was investing a lot in cilk++ in optimizing the e.g. the cilk_grainsize for each loop, but all this does not have a big effect, all this is not able to compensate what is lost at the serial side. But I like very much the coding, since this is easy and readable. A colleage told me, that if I am already testing, I could also test openMP. Here I did nothing, but added the directive for the loop, and got this results, but alse here the serial version is better optimized. More details on the application you can find on my homepage: www.meshing.at

    best regards

    Ferdinand



    kickingfNovember 5, 2009 8:10 AM PST
    Rate
     
    Re: comparison cilk++, openmp, pthreads first results

    Hi Ferdinand.

    How do you accumulate the output list?  Do you use locks?

    Hello Matteo,

    No, I don't use locks. It works, that I allocate all the memory (possibly) needed by the sons. Then I perform the calculations (testing if the triangles of the father-boxes are in the son-boxes) in parallel. After this I check if all the memory was needed and free the rest. I observed, that this (the memory allocation) is in my application neglectable. When I was starting with the pthreaded version some years ago, I also tried to use locks for be able to write on the same structure (with all threads) but it was a mess, or I was simply not able to achieve good speedups.

    best regards

    Ferdinand


    Matteo Frigo (Intel)November 5, 2009 9:05 AM PST
    Rate
     
    Re: comparison cilk++, openmp, pthreads first results

    Quoting - kickingf

    Hello,

    Access to the whole source is not possible, but I could extract the octree-part only, where the tests are performed on. I also want to state that this are just the first results. I used for the cilk-version the cilk++ 4.2.4 (Cilk-Arts). For the pthread-version I used the g++ 4.2.4 (Cilk-Arts). The iteresting point here is, that if I compile the serial source just swithing to cilk_main, it is significantly slower than the g++ variant. So I assume here, that some other opimization settings within the compiler are used. The OpenMp version is compiled with the g++ 4.3.2. For optimization I used -O2. All exes are 64bit.

    To be more precise, I was investing a lot in cilk++ in optimizing the e.g. the cilk_grainsize for each loop, but all this does not have a big effect, all this is not able to compensate what is lost at the serial side. But I like very much the coding, since this is easy and readable. A colleage told me, that if I am already testing, I could also test openMP. Here I did nothing, but added the directive for the loop, and got this results, but alse here the serial version is better optimized. More details on the application you can find on my homepage: www.meshing.at

    best regards

    Ferdinand


    One issue with Cilk Arts Cilk++ is that it adds some overhead to each function call.  Consequently, inlining becomes more important if you have many small functions.  Can you try -O3, which enables automatic inlining?  Can you also try -finline-limit=1000 or so?

    Thanks for your feedback.
    Cheers,
    Matteo Frigo


    Matteo Frigo (Intel)November 5, 2009 9:12 AM PST
    Rate
     
    Re: comparison cilk++, openmp, pthreads first results

    Quoting - kickingf

    Access to the whole source is not possible, but I could extract the octree-part only, where the tests are performed on.

    Ferdinand,

    any code that you are willing to share with us would be really appreciated.  We need the help of people like you to improve Cilk++.

    Cheers,
    Matteo


    Dmitriy VyukovNovember 5, 2009 9:26 PM PST
    Rate
     
    Re: comparison cilk++, openmp, pthreads first results

    One issue with Cilk Arts Cilk++ is that it adds some overhead to each function call.

    Btw, Matteo, I'm curious why you add overhead to *each* function call? Why you do not penalize only spawns?
    I would expect something along the lines of: cilk compiler generates normal version of a function, and a wrapper that copes with all these __clik_box<>, etc and then calls normal function. cilk_spawn uses wrapped version, while plain calls call normal version of a function.
    Are there any principal reasons to not do that? Or it's just "not done yet"? Or I am missing something?



    ---------------------------------------------
    All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
    www.1024cores.net

Forum jump:  

Intel Software Network Forums Statistics

17,025 users have contributed to 48,317 threads and 172,754 posts to date.

In the past 24 hours, we have 9 new thread(s) 56 new posts(s), and 52 new user(s).

In the past 3 days, the most popular thread for everyone has been How to manage rounding by IVF ?? The most posts were made to Most likely, the issue is that The post with the most views is Optimalization of sine function\'s taylor expansion

Please welcome our newest member redfruit83


For more complete information about compiler optimizations, see our Optimization Notice.