<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated on Wed, 25 Nov 2009 03:39:09 -0800 -->
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <atom:link href="http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/feed" rel="self" type="application/rss+xml" />
    <title>Intel Software Network - <![CDATA[ comparison cilk++, openmp, pthreads first results ]]> feed</title>
    <link>http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681</link>
    <description></description>
    <language>en-us</language>
    <item>
      <title>Re: comparison cilk++, openmp, pthreads first results</title>
      <description><![CDATA[ <div style="margin:0px;">
<div id="quote_reply" style="width: 100%; margin-top: 5px;">
<div style="margin-left:2px;margin-right:2px;">Quoting - <a href="/en-us/profile/450319">kickingf</a></div>
<div style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"><em>Hello, <br /><br />Here are the first results of my comparison of Cilk++ and OpenMP on my application, which is 3D mesh generation. So far I have done the first part, which is octree refinement plus inserting triangles. The reference is a pthreaded version of the code.<br />Please keep in mind that these are first numbers. The difference between Cilk++ and OpenMP is also partly due to the compiler version used (Cilk++ is only available for 4.2.4). <br /></em><br /></div>
</div>
</div>
<br />What does "nproc=8(4)" mean? Does it mean 2x oversubscription of threads, i.e. 4 physical cores and 8 OS threads?<br />It would also be interesting to see some brief characteristics of the implementations, for example how the pthread version divides, distributes, and schedules work.<br /><br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</link>
      <pubDate>Thu, 05 Nov 2009 04:27:20 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: comparison cilk++, openmp, pthreads first results</title>
      <description><![CDATA[ <div style="margin: 0px; height: auto;"></div>
Yes, 8(4) means 8 threads on 4 physical cores. For Cilk++ I used the CILK_NPROC environment variable, and for OpenMP the OMP_NUM_THREADS environment variable. <br /><br />The pthread version is also just a loop parallelization that divides the loop into num_of_threads parts. The tree is stored as a list of leaves, and the data structure for the list is array based. So the list holds the coarse boxes first, then the next finer level, and so on. This is not cache-optimal but also not bad: because of the algorithm, only boxes on the finest level (a subset of them) are refined, and the boxes of each current finest level appear in the list in z-curve order.<br /><br />I hope this answers your question.<br /><br />best regards<br /><br />Ferdinand<br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</link>
      <pubDate>Thu, 05 Nov 2009 04:44:40 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: comparison cilk++, openmp, pthreads first results</title>
      <description><![CDATA[ <div style="margin:0px;">
<div id="quote_reply" style="width: 100%; margin-top: 5px;">
<div style="margin-left:2px;margin-right:2px;">Quoting - <a href="/en-us/profile/450319">kickingf</a></div>
<div style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"><em> Yes, 8(4) means 8 threads on 4 physical cores. For Cilk++ I used the CILK_NPROC environment variable, and for OpenMP the OMP_NUM_THREADS environment variable. <br /><br />The pthread version is also just a loop parallelization that divides the loop into num_of_threads parts. The tree is stored as a list of leaves, and the data structure for the list is array based. So the list holds the coarse boxes first, then the next finer level, and so on. This is not cache-optimal but also not bad: because of the algorithm, only boxes on the finest level (a subset of them) are refined, and the boxes of each current finest level appear in the list in z-curve order.<br /><br />I hope this answers your question.<br /></em><br /></div>
</div>
</div>
<br />Thanks, Ferdinand.<br /><br />As far as I can see, OpenMP does better than Cilk++, and pthreads do better than OpenMP on this workload.<br />It also seems that Cilk++ handles thread oversubscription badly.<br /><br /><br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</link>
      <pubDate>Thu, 05 Nov 2009 04:56:38 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: comparison cilk++, openmp, pthreads first results</title>
      <description><![CDATA[ <div></div>
Hi kickingf,<br /><br />I'm on the Cilk team and I'd like to look more deeply at these experiments.  It is surprising to me that pthreads were able to keep up with either OpenMP or Cilk++.  Are you able to provide the source code and data sets you used?  Also, if possible, it would help to know which compiler and compiler options you used.<br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</link>
      <pubDate>Thu, 05 Nov 2009 08:10:03 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: comparison cilk++, openmp, pthreads first results</title>
      <description><![CDATA[ <div style="margin:0px;"></div>
Hi Ferdinand.<br /><br />How do you accumulate the output list?  Do you use locks?<br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</link>
      <pubDate>Thu, 05 Nov 2009 08:17:44 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: comparison cilk++, openmp, pthreads first results</title>
      <description><![CDATA[ <div style="margin:0px;">
<div id="quote_reply" style="width: 100%; margin-top: 5px;">
<div style="margin-left:2px;margin-right:2px;">Quoting - <a href="/en-us/profile/450884">themightywilltor</a></div>
<div style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"><em> Hi kickingf,<br /><br />I'm on the Cilk team and I'd like to look more deeply at these experiments.  It is surprising to me that pthreads were able to keep up with either OpenMP or Cilk++.  Are you able to provide the source code and data sets you used?  Also, if possible, it would help to know which compiler and compiler options you used.<br /></em></div>
</div>
</div>
<br />Hello, <br /><br />Access to the whole source is not possible, but I could extract just the octree part, on which the tests are performed. I also want to stress that these are just the first results. For the Cilk version I used cilk++ 4.2.4 (Cilk Arts). For the pthread version I used g++ 4.2.4 (Cilk Arts). The interesting point here is that if I compile the serial source and merely switch to cilk_main, it is significantly slower than the g++ variant. So I assume that the compiler uses some different optimization settings. The OpenMP version is compiled with g++ 4.3.2. For optimization I used -O2. All executables are 64-bit.<br /><br />To be more precise, I invested a lot of effort in Cilk++, for example in tuning the cilk_grainsize for each loop, but none of this has a big effect, and it cannot compensate for what is lost on the serial side. But I like the coding style very much, since it is easy and readable. A colleague told me that, since I am already testing, I could also test OpenMP. Here I did nothing but add the directive for the loop, and got these results, but here too the serial version is better optimized. You can find more details on the application on my homepage: www.meshing.at <br /><br />best regards<br /><br />Ferdinand<br /><br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</link>
      <pubDate>Thu, 05 Nov 2009 09:03:02 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: comparison cilk++, openmp, pthreads first results</title>
      <description><![CDATA[ <div style="margin:0px;">
<div id="quote_reply" style="width: 100%; margin-top: 5px;">
<div style="margin-left:2px;margin-right:2px;">Quoting - <a href="/en-us/profile/450888">Matteo Frigo (Intel)</a></div>
<div style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"><em> Hi Ferdinand.<br /><br />How do you accumulate the output list?  Do you use locks?<br /></em></div>
</div>
</div>
<br />Hello Matteo, <br /><br />No, I don't use locks. It works like this: I allocate all the memory that the sons could possibly need. Then I perform the calculations (testing whether the triangles of the father boxes are in the son boxes) in parallel. Afterwards I check whether all the memory was needed and free the rest. I observed that the cost of this memory allocation is negligible in my application. When I started on the pthreaded version some years ago, I also tried using locks so that all threads could write to the same structure, but it was a mess, or I was simply not able to achieve good speedups.<br /><br />best regards<br /><br />Ferdinand<br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</link>
      <pubDate>Thu, 05 Nov 2009 09:10:15 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: comparison cilk++, openmp, pthreads first results</title>
      <description><![CDATA[ <div style="margin:0px;">
<div id="quote_reply" style="width: 100%; margin-top: 5px;">
<div style="margin-left:2px;margin-right:2px;">Quoting - <a href="/en-us/profile/450319">kickingf</a></div>
<div style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"><em>
<div style="margin:0px;"></div>
<br />Hello, <br /><br />Access to the whole source is not possible, but I could extract just the octree part, on which the tests are performed. I also want to stress that these are just the first results. For the Cilk version I used cilk++ 4.2.4 (Cilk Arts). For the pthread version I used g++ 4.2.4 (Cilk Arts). The interesting point here is that if I compile the serial source and merely switch to cilk_main, it is significantly slower than the g++ variant. So I assume that the compiler uses some different optimization settings. The OpenMP version is compiled with g++ 4.3.2. For optimization I used -O2. All executables are 64-bit.<br /><br />To be more precise, I invested a lot of effort in Cilk++, for example in tuning the cilk_grainsize for each loop, but none of this has a big effect, and it cannot compensate for what is lost on the serial side. But I like the coding style very much, since it is easy and readable. A colleague told me that, since I am already testing, I could also test OpenMP. Here I did nothing but add the directive for the loop, and got these results, but here too the serial version is better optimized. You can find more details on the application on my homepage: www.meshing.at <br /><br />best regards<br /><br />Ferdinand<br /><br /></em></div>
</div>
</div>
<br />One issue with Cilk Arts Cilk++ is that it adds some overhead to each function call.  Consequently, inlining becomes more important if you have many small functions.  Can you try -O3, which enables automatic inlining?  Can you also try -finline-limit=1000 or so?<br /><br />Thanks for your feedback.<br />Cheers,<br />Matteo Frigo<br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</link>
      <pubDate>Thu, 05 Nov 2009 10:05:25 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: comparison cilk++, openmp, pthreads first results</title>
      <description><![CDATA[ <div style="margin:0px;">
<div id="quote_reply" style="width: 100%; margin-top: 5px;">
<div style="margin-left:2px;margin-right:2px;">Quoting - <a href="/en-us/profile/450319">kickingf</a></div>
<div style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"><em>
<div style="margin:0px;"></div>
<br />Access to the whole source is not possible, but I could extract just the octree part, on which the tests are performed. <br /></em></div>
</div>
</div>
<br />Ferdinand,<br /><br />Any code that you are willing to share with us would be greatly appreciated.  We need the help of people like you to improve Cilk++.<br /><br />Cheers,<br />Matteo<br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</link>
      <pubDate>Thu, 05 Nov 2009 10:12:35 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
    <item>
      <title>Re: comparison cilk++, openmp, pthreads first results</title>
      <description><![CDATA[ <div style="margin:0px;">
<div id="quote_reply" style="width: 100%; margin-top: 5px;">
<div style="margin-left:2px;margin-right:2px;">Quoting - <a href="/en-us/profile/450888">Matteo Frigo (Intel)</a></div>
<div style="background-color:#E5E5E5; padding:5px;border: 1px; border-style: inset;margin-left:2px;margin-right:2px;"><em>
<div style="margin:0px;"></div>
One issue with Cilk Arts Cilk++ is that it adds some overhead to each function call.<br /></em></div>
</div>
</div>
<br />Btw, Matteo, I'm curious why you add overhead to *each* function call. Why not penalize only spawns? <br />I would expect something along these lines: the Cilk compiler generates a normal version of a function, plus a wrapper that copes with all these __clik_box&lt;&gt;, etc. and then calls the normal function. cilk_spawn uses the wrapped version, while plain calls call the normal version of the function.<br />Are there any fundamental reasons not to do that? Or is it just "not done yet"? Or am I missing something?<br /><br /> ]]></description>
      <link>http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</link>
      <pubDate>Thu, 05 Nov 2009 22:26:56 -0800</pubDate>
      <guid isPermaLink="true">http://software.intel.com/en-us/forums/whatif-alpha-software/topic/69681/</guid>
      <category>Parallel Programming</category>
      <category>ISN General</category>
    </item>
  </channel></rss>