<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>Intel Software Network Blogs &#187; Threading Building Blocks</title>
	<atom:link href="http://software.intel.com/en-us/blogs/category/intel-threading-building-blocks/feed/" rel="self" type="application/rss+xml" />
	<link>http://software.intel.com/en-us/blogs</link>
	<description></description>
	<pubDate>Wed, 25 Nov 2009 17:07:34 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.5</generator>
	<language>en</language>
			<item>
		<title>SP1 for Intel Parallel Studio - service pack worth installing!</title>
		<link>http://software.intel.com/en-us/blogs/2009/11/19/sp1-for-intel-parallel-studio-service-pack-worth-installing/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/11/19/sp1-for-intel-parallel-studio-service-pack-worth-installing/#comments</comments>
		<pubDate>Thu, 19 Nov 2009 23:48:43 +0000</pubDate>
		<dc:creator>James Reinders (Intel)</dc:creator>
		
		<category><![CDATA[Parallel Programming]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<category><![CDATA[automated testing]]></category>

		<category><![CDATA[Intel Parallel Amplifier]]></category>

		<category><![CDATA[Intel Parallel Inspector]]></category>

		<category><![CDATA[Intel Parallel Studio]]></category>

		<category><![CDATA[multi-core]]></category>

		<category><![CDATA[parallelism]]></category>

		<category><![CDATA[Service Packs and Updates]]></category>

		<category><![CDATA[Visual Studio 2010]]></category>

		<category><![CDATA[Windows 7]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/11/19/sp1-for-intel-parallel-studio-service-pack-worth-installing/</guid>
		<description><![CDATA[Intel® Parallel Studio Service Pack 1 is now available, adding support for Windows* 7.
SP1 is well worth downloading and installing - here are some of the reasons:

Parallel Inspector and Parallel Amplifier can be driven (for automating test suites) from the command line now.
Bug fixes - of course - not many issues needed fixing, but you [...]]]></description>
			<content:encoded><![CDATA[<p><strong><a href="http://intel.com/go/parallel">Intel® Parallel Studio Service Pack 1</a></strong><strong> is now available,</strong> adding support for Windows* 7.</p>
<p>SP1 is well worth downloading and installing - here are some of the reasons:</p>
<ol>
<li>Parallel Inspector and Parallel Amplifier can be driven (for automating test suites) from the command line now.</li>
<li>Bug fixes - of course - not many issues needed fixing, but you may appreciate the ones that were found and fixed!</li>
<li>Windows 7 support (Parallel Studio came before Windows 7; now that it is released, we had a few things to update)</li>
<li>TBB 2.2 and other improvements to align with the upcoming Microsoft Visual Studio 2010</li>
</ol>
<p>I'm sure there are more - these are the highlights as I see them.</p>
<p>Download SP1 - you'll be glad you did!</p>
<p>See the <a href="http://software.intel.com/en-us/articles/intel-parallel-studio-release-notes/" target="blank">release notes</a> for more details - skip the main document if you want to read about what is new and useful - read the three individual documents.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/11/19/sp1-for-intel-parallel-studio-service-pack-worth-installing/feed/</wfw:commentRss>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: parallel code, a first attempt</title>
		<link>http://software.intel.com/en-us/blogs/2009/11/13/n-bodies-a-parallel-tbb-solution-parallel-code-a-first-attempt/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/11/13/n-bodies-a-parallel-tbb-solution-parallel-code-a-first-attempt/#comments</comments>
		<pubDate>Sat, 14 Nov 2009 01:44:24 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
		
		<category><![CDATA[Parallel Programming]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<category><![CDATA[lambda functions]]></category>

		<category><![CDATA[n-bodies]]></category>

		<category><![CDATA[OpenMP]]></category>

		<category><![CDATA[parallelism]]></category>

		<category><![CDATA[Supercomputing]]></category>

		<category><![CDATA[vectorization]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/11/13/n-bodies-a-parallel-tbb-solution-parallel-code-a-first-attempt/</guid>
		<description><![CDATA[On the way to composing a first thread-parallel version of n-body code, Robert points out parallelization has already been occurring, using the Intel compiler and its vectorization of simple loops.]]></description>
			<content:encoded><![CDATA[<p>It’s been a busy month preparing for <a href="http://sc09.supercomputing.org/index.php">SuperComputing ‘09</a> and <a href="http://scyourway.supercomputing.org/exhibits/view/19">booth duty</a> (I’ll be hanging out in the Intel booth on Tuesday and Thursday and giving a talk there on Wednesday), and refining materials for a Parallelism Road Show we’re planning for next February and March (more details later).  (Not to mention chorus rehearsals for this year’s <a href="http://portlandrevels.org/revels.php?page=this-year%60s-show">Christmas Revels</a>—oops, I did mention it. ;-)   But finally, after all <a href="http://software.intel.com/en-us/blogs/2009/10/14/n-bodies-a-parallel-tbb-solution-serial-body-drill-down/">this serial optimization</a> I’ve been working through on the n-bodies code, it’s time to go parallel.  Or rather, first take a short side-step to discover that code parallelization has already begun—through <em>vectorization</em>.  I can pull up a compiler report, normally suppressed, by adding the manual switch <code>/Qvec-report:1</code> to the compilation options:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/11/b091102-01.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/11/b091102-01.png" alt="" width="744" height="378" /></a></p>
<p>With this simple change, I notice something new in the compilation logs:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/11/b091102-02.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/11/b091102-02.png" alt="" width="659" height="138" /></a></p>
<p>If I double-click on one of these lines in the Output panel, the system navigates to the corresponding source code lines:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/11/b091102-03.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/11/b091102-03.png" alt="" width="607" height="411" /></a></p>
<p>Note the little marker in the left margin that indicates the lines referred to by that vectorization report line.  This is the ballistic step in the serial code implementation, suggesting that the little loop for setting the vector components of the bodies has been converted to a linear sequence of vector instructions.  Some other time I’ll dig down into the assembly code to demonstrate that this code really has been <em>vectorized</em> (i.e., realized by emitting SIMD code to execute it), but for now let’s move forward and try to make this multi-thread parallel in addition to vector-parallel.</p>
<p>How do I “parallelize” that lower set of loops in the previous code sample?  One simple way would be to add an <a href="http://openmp.org/wp/">OpenMP</a> construct:</p>
<pre>    #pragma omp parallel for
    for (i = 0; i &lt; n; ++i) {
        for (int axis = 0; axis &lt; 3; ++axis) {
            body[i].vel[axis] += body[i].acc[axis] * TIMESTEP;
            body[i].pos[axis] += body[i].vel[axis] * TIMESTEP;
            body[i].acc[axis] = 0.;
        }
    }</pre>
<p>OpenMP has been around for a number of years and operates as a language extension for C, C++ and Fortran.  Compilers enabled to recognize the constructs (such as the Intel® C++ and Fortran Compilers) can use them as hints to direct the compiler generation of parallel code.  Non-complying compilers see these constructs as an unrecognized pragma (or a funny comment in Fortran) and ignore them.  In this case the OpenMP line applies to the line that follows, directing the compiler to create code that divides the outer loop into some collection of chunks, each of which can be dispatched to a separate HW thread.  Each thread processes the chunks assigned to it.  As each thread finishes its work, it waits for the others in its team to complete their work.  All these HW threads will land in a <em>rendezvous</em> or <em>join</em> point until all have arrived, because there’s an implied wait at the end of the parallel <em>for</em>-loop  so that code that follows will not be executed until the preceding code has been completed, just to avoid any potential side effects.  In this particular case, we’re also at the end of the parallel section so only one HW thread would proceed beyond the end of the <em>for</em>-loop, the rest returning to a thread pool to await more work.</p>
<p>With the advent of lambda constructs, described in the <a href="http://en.wikipedia.org/wiki/C%2B%2B0x">C++0x standard</a> and implemented in the Intel C++ Compiler version 11, we can write nearly as compact a version of this parallel construct using Intel® Threading Building Blocks (before lambdas we’d need to use a full C++ function-object, which really breaks up the flow of the source code).  As a lambda construct using TBB, the OpenMP code above would transform into something like this:</p>
<pre>    parallel_for( blocked_range&lt;int&gt; (0,n),
      [] (const blocked_range&lt;int&gt; &amp;r) {
        for (int i = r.begin(); i != r.end(); ++i)
            for (int axis = 0; axis &lt;3; ++axis) {
                body[i].vel[axis] += body[i].acc[axis] * TIMESTEP;
                body[i].pos[axis] += body[i].vel[axis] * TIMESTEP;
                body[i].acc[axis] = 0.;
            }
      });</pre>
<p>Not quite as compact as the OpenMP version, so there must be some other reason to value this.  Otherwise, why embrace the complexity?  The key is flexibility.  TBB offers a rich set of tools that can be used within the context of such a parallel function.  The C++0x lambda function expands that richness with a compactness of expression and flexibility that lets me use TBB with almost the same convenience as OpenMP.  For example, that pair of square brackets leading off the lambda provides flexible control of what variables defined in the <a href="http://en.wikipedia.org/wiki/Scope_(programming)">scope</a> of the call will be available and in what form within the function (more on this later).  The TBB parallel_for will divide the work of this inline body, using the TBB blocked_range as a helper class, making work for as many HW threads as are available.</p>
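<p>As a small illustration of what those square brackets control, here is a sketch that uses only the standard library (no TBB). The function names <code>scaledSum</code> and <code>snapshotDemo</code> are made up for this example.</p>

```cpp
#include <vector>

// Sums the scaled elements of v using a lambda with a mixed capture list.
double scaledSum(const std::vector<double>& v, double scale) {
    double total = 0.0;

    // [scale]  : copy 'scale' into the lambda (read-only inside it).
    // [&total] : refer to the caller's 'total', so updates are visible outside.
    auto accumulate = [scale, &total](double x) { total += x * scale; };

    for (double x : v) accumulate(x);
    return total;
}

// Shows that a by-value capture is a snapshot taken when the lambda is created.
int snapshotDemo() {
    int step = 1;
    auto byValue = [step] { return step; };   // copies step while it is 1
    auto byRef   = [&step] { return step; };  // reads the live variable
    step = 5;
    return byValue() * 10 + byRef();          // 1 * 10 + 5
}
```

<p>One caution for parallel bodies: capturing by reference a variable that several threads write to, like <code>total</code> above, would be a data race if the lambda ran concurrently, so this serial sketch should not be dropped into a parallel loop as it stands.</p>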
<p>Next time: parallel code: first runs</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/11/13/n-bodies-a-parallel-tbb-solution-parallel-code-a-first-attempt/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Memory management challenges in parallel applications</title>
		<link>http://software.intel.com/en-us/blogs/2009/10/28/memory-management-challenges-in-parallel-applications/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/10/28/memory-management-challenges-in-parallel-applications/#comments</comments>
		<pubDate>Wed, 28 Oct 2009 14:18:00 +0000</pubDate>
		<dc:creator>Roman Lygin (Intel)</dc:creator>
		
		<category><![CDATA[Software Engineering]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/10/28/memory-management-challenges-in-parallel-applications/</guid>
		<description><![CDATA[Let me share some recent practical experience with memory management issues when developing a multi-threaded application. This can probably be a rather common case (as recent post by Roman Dementiev and its follow up discussion demonstrates), and I’d be happy if my experience were helpful for others.
Working on CAD Exchanger I am designing one of [...]]]></description>
			<content:encoded><![CDATA[<p>Let me share some recent practical experience with memory management issues when developing a multi-threaded application. This is probably a rather common case (as a <a href="http://software.intel.com/en-us/blogs/2009/08/21/is-your-memory-management-multi-core-ready/">recent post by Roman Dementiev</a> and its follow-up discussion demonstrate), and I’d be happy if my experience were helpful for others.</p>
<p>Working on <a href="http://www.cadexchanger.com">CAD Exchanger</a>, I am making one of its plugins, which converts 3D CAD data between ACIS and Open CASCADE (two modeling kernels), run in parallel. Depending on the model size, the converter has to deal with many small objects allocated on the heap (e.g. 20,000+ objects, each taking 48 bytes plus additional object data such as lists, strings, etc).</p>
<p>The translation works just fine and concurrency analysis with <a href="http://www.intel.com/go/parallel">Intel Parallel Amplifier</a> indicates high concurrency levels. So far, so good. However, I noticed that when translating the same ACIS file over and over again in the same test harness session, translation took longer and longer. Why could that be?</p>
<p>So I launched the Amplifier to collect hotspots and here is what I saw:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/mm-hostpot3.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/mm-hostpot3-300x87.png" alt="" width="300" height="87" class="aligncenter size-medium wp-image-11266" /></a></p>
<p>These two top hotspots relate to the memory manager layer (Standard_MMgrRaw class) which simply forwards calls to malloc/free and new/delete. Trying to root-cause the problem I had to switch to the mode to see direct OS functions (toggling off the button on the Amplifier toolbar) and here is a new screenshot:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/mm-hostpot4a.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/mm-hostpot4a-300x92.png" alt="" width="300" height="92" class="aligncenter size-medium wp-image-11264" /></a></p>
<p>It shows that hotspots are two system functions – RtlpFindAndCommitPages() and ZwWaitForSingleObject() – which are called from memory allocation / deallocation routines. It also shows that the nearest hotspot related to my code (BSplCLib::Bohm()) is just 1/4 of the time consumed by ZwWaitForSingleObject() (0.47s vs 1.81s).</p>
<p>After experimenting with several runs and analyzing how the hotspot profile changes with a growing number of runs, I concluded that the first hotspot is explained by the fact that the ACIS converter creates many tiny objects of different sizes with short life spans (they are destroyed after every conversion). This seems to cause strong memory fragmentation, which forces the system to constantly look for new memory chunks.</p>
<p>The second hotspot (ZwWaitForSingleObject()), which goes through a critical section, is caused by the default mechanism of memory management on Windows, <a href="http://msdn.microsoft.com/en-us/library/ms683476(VS.85).aspx">which uses a lock</a>.</p>
<p>Running the locks&amp;waits analysis also confirms that the memory management lock is the one that hurts concurrency the most.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/mm-lw2.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/mm-lw2-300x175.png" alt="" width="300" height="175" class="aligncenter size-medium wp-image-11265" /></a></p>
<p>All this is caused by the direct use of calloc/malloc/free and new/delete, called tens of thousands of times. It’s worth mentioning that such hotspots did not exist when I used the serial implementation and popped up only when I started using the parallel one. The former used a memory manager (in a 3rd party lib) that allocated memory blocks and did not return them to the system, reusing them when the application requested new blocks. I couldn’t reuse this memory manager because it was not thread-safe and therefore had to switch to another manager that simply forwarded to malloc/free.</p>
<p>So I was almost forced to write my own memory manager that would implement the previous behavior and would be thread-safe and … fast! Challenges are good, but not when you need to rewrite low-level components, which can take a lot of time and require diligent, thorough testing, delaying progress in a project that already receives very limited attention.</p>
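<p>To give an idea of what even the simplest version of such a manager involves, here is a minimal sketch of a block cache that keeps freed blocks for reuse instead of returning them to the system, with one cache per thread so that no lock is needed. It is purely illustrative: it is neither the 3rd party manager mentioned above nor how any particular allocator works, the names <code>BlockCache</code> and <code>threadLocalCache</code> are made up, it handles a single block size only, and it does not check for allocation failure.</p>

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical cache of fixed-size blocks. Freed blocks are kept on a list
// and handed out again, rather than going back to the system allocator.
class BlockCache {
public:
    explicit BlockCache(std::size_t blockSize) : blockSize_(blockSize) {}

    // Return everything to the system only when the cache itself goes away.
    ~BlockCache() {
        for (void* p : free_) std::free(p);
    }

    BlockCache(const BlockCache&) = delete;
    BlockCache& operator=(const BlockCache&) = delete;

    // Reuse a cached block if there is one; otherwise ask the system.
    void* allocate() {
        if (!free_.empty()) {
            void* p = free_.back();
            free_.pop_back();
            return p;
        }
        return std::malloc(blockSize_);
    }

    // Keep the block for later reuse instead of freeing it.
    void release(void* p) {
        if (p) free_.push_back(p);
    }

    std::size_t cached() const { return free_.size(); }

private:
    std::size_t blockSize_;
    std::vector<void*> free_;
};

// One cache per thread: threads never touch each other's free list,
// so allocate() and release() need no lock. 48 is the object size
// quoted in the text above.
inline BlockCache& threadLocalCache() {
    thread_local BlockCache cache(48);
    return cache;
}
```

<p>Even this toy leaves open questions (multiple block sizes, blocks released on a different thread than the one that allocated them, unbounded growth of the cache), which is exactly the kind of work I was hoping to avoid.</p>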
<p>So, I approached my colleagues from the Threading Building Blocks team to check if there was anything TBB could help with. To my surprise, they suggested I try the new release, 2.2. Version 2.2 offers a mechanism to seamlessly replace the system memory manager with the TBB allocator. ‘Seamlessly’ really means it – all I had to do was add a single line of code to a C++ file:</p>
<p><code>#include "tbb/tbbmalloc_proxy.h"</code></p>
<p>The outcome was immediate. Not only did the hotspot profile change completely, removing the OS hotspots (see the comparison mode screenshot below), but the overall speedup (on the entire test case) was about 25%! One line of code, with no need to rewrite anything on my own, saved hours of coding and gave such a return! Just incredible, to say the least. The recently released 2.2 Update 1 includes further improvements which my app now benefits from (more reliable processing of realloc(), bug fixes for debug mode, etc).</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/mm-hostpot3.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/mm-hostpot3-300x87.png" alt="" width="300" height="87" class="aligncenter size-medium wp-image-11266" /></a></p>
<p>My colleagues later explained to me that the TBB allocator runs concurrently (seemingly without any locks inside) and reuses previously allocated blocks in a similar fashion. Thus, it was the entire application (not only its parallel part) which benefited from this substitution. </p>
<p>So, if you are migrating from a serial to a parallel implementation you may encounter something unexpected – memory bottlenecks. If you have become accustomed to using some nice single-threaded memory manager, you may be forced to consider migrating to an alternative. If this is the case, you may want to give the TBB allocator a try and see if it helps in your case.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/10/28/memory-management-challenges-in-parallel-applications/feed/</wfw:commentRss>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: serial body forces one more time</title>
		<link>http://software.intel.com/en-us/blogs/2009/10/23/n-bodies-a-parallel-tbb-solution-serial-body-forces-one-more-time/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/10/23/n-bodies-a-parallel-tbb-solution-serial-body-forces-one-more-time/#comments</comments>
		<pubDate>Fri, 23 Oct 2009 19:58:13 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
		
		<category><![CDATA[Parallel Programming]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<category><![CDATA[c++ parallel programming]]></category>

		<category><![CDATA[Intel Parallel Composer]]></category>

		<category><![CDATA[Multicore Parallel Programming]]></category>

		<category><![CDATA[n-bodies]]></category>

		<category><![CDATA[performance analysis]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/10/23/n-bodies-a-parallel-tbb-solution-serial-body-forces-one-more-time/</guid>
		<description><![CDATA[Forced to revisit the question of accumulating forces one more time, Robert tests addForce(i,j) and discovers that while accelerations are a little faster, it's not much and a much more complicated story than he realized.]]></description>
			<content:encoded><![CDATA[<p>My plan to go parallel this time was <a href="http://software.intel.com/en-us/blogs/2009/10/14/n-bodies-a-parallel-tbb-solution-serial-body-drill-down/">thwarted by concerns</a> that I may still have left some serial performance on the table. So I’ll take one more look (OK, well, no more than three). Leading the contenders was Jim Dempsey’s suggestion that accumulating forces instead of accelerations would save some divides. His numbers did not show a dramatic difference but did suggest summing forces to be ever so slightly faster than accumulating accelerations. My analysis of the equations suggests that even this should be the wrong order, so I took the plunge and wrote <em>addForce</em>(i,j). It’s a simple twist on the original <em>addAcc</em>. Instead of computing separate accelerations for each body, I compute one force:</p>
<div>   // Use the Force, Luke!<br />
    double force = GFORCE * ivdist * ivdist * body[j].mass * body[i].mass;</div>
<p>Then I ensured the vector component accumulations take advantage of the simplification:</p>
<div>    for (int axis = 0; axis &lt; 3; ++axis) {<br />
        double axialForce = force * ud[axis];<br />
        body[j].acc[axis] += axialForce;<br />
        body[i].acc[axis] -= axialForce;<br />
    }</div>
<p>To avoid changes to the body data structure that might affect the experiment, I redefined the <em>acc</em> field to mean <em>accumulator</em> instead of <em>acceleration</em> (cheap trick for a short hack ;-).</p>
<p>With <em>addForce</em> in hand, I needed to make some adjustments to the ballistic step to turn the forces back into accelerations:</p>
<div>    for (i = 0; i &lt; n; ++i) {<br />
        for (int axis = 0; axis &lt; 3; ++axis) {<br />
            body[i].vel[axis] += (body[i].acc[axis] / body[i].mass) * TIMESTEP;<br />
            body[i].pos[axis] += body[i].vel[axis] * TIMESTEP;<br />
            body[i].acc[axis] = 0.;<br />
        }<br />
    }</div>
<p>Oops, adding three divides per body (one for each axis), which gives this result:<br />
<a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-01.png"><img class="alignnone size-full wp-image-11080" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-01.png" alt="" width="741" /></a></p>
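<p>For readers who want to try the two variants side by side, here is a self-contained sketch. The <code>Body</code> layout, the value of <code>GFORCE</code>, the helper <code>inverseDistance</code> and the exact shape of <em>addAcc</em> are assumptions for illustration (the functions here take references rather than indices); only the force expression and the accumulation loop follow the fragments quoted above.</p>

```cpp
#include <cmath>

// Hypothetical stand-ins; the series' real definitions are not shown here.
struct Body {
    double mass;
    double pos[3];
    double acc[3];  // accumulator: acceleration or force, depending on variant
};

const double GFORCE = 6.674e-11;  // assumed value, for illustration only

// Fills d with the vector from body j to body i and returns 1/|d|.
double inverseDistance(const Body& bi, const Body& bj, double d[3]) {
    double distsq = 0.0;
    for (int axis = 0; axis < 3; ++axis) {
        d[axis] = bi.pos[axis] - bj.pos[axis];
        distsq += d[axis] * d[axis];
    }
    return 1.0 / std::sqrt(distsq);
}

// Variant 1: accumulate a separate acceleration for each body of the pair.
void addAcc(Body& bi, Body& bj) {
    double d[3];
    double ivdist = inverseDistance(bi, bj, d);
    double Gdivd = GFORCE * ivdist * ivdist;
    double ai = Gdivd * bj.mass;  // acceleration of i caused by j
    double aj = Gdivd * bi.mass;  // acceleration of j caused by i
    for (int axis = 0; axis < 3; ++axis) {
        double ud = d[axis] * ivdist;  // unit direction from j towards i
        bj.acc[axis] += aj * ud;
        bi.acc[axis] -= ai * ud;
    }
}

// Variant 2: accumulate one shared force; divide by mass in the ballistic step.
void addForce(Body& bi, Body& bj) {
    double d[3];
    double ivdist = inverseDistance(bi, bj, d);
    double force = GFORCE * ivdist * ivdist * bj.mass * bi.mass;
    for (int axis = 0; axis < 3; ++axis) {
        double axialForce = force * d[axis] * ivdist;
        bj.acc[axis] += axialForce;
        bi.acc[axis] -= axialForce;
    }
}
```

<p>Dividing the accumulated force by each body's mass should reproduce the acceleration from the first variant up to rounding, which makes for a handy sanity check before comparing timings.</p>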
<table border="0" width="100%">
<tbody>
<tr>
<td valign="top"><p>In my experiments, the add forces version came in slightly slower than my best add acceleration version. It’s easier to see in the numbers:</p>
<p>As this table shows, the runs accumulating force take longer, as you would expect for a solution that requires more multiplies (to include the extra mass term in the force equation and then to remove it to get to acceleration). <span style="underline;">But wait a minute</span>! There’s something else going on here. What’s with those serial <em>addAcc</em> numbers? I remember lower numbers when I took my first serial run. Maybe there’s more variability in the results than I recall? That’s easy to check. I switched back to the <em>bodies007</em> code and took several more runs.</p></td>
<td><img class="alignnone size-full wp-image-11083" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-02.png" alt="" width="250" height="250" /></td>
</tr>
</tbody>
</table>
<p><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-03.png" alt="" width="698" height="250" /></p>
<p>That looks all pretty consistent across the range of <em>n</em>-values. Yet when I tried the same thing with bodies008:</p>
<p><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-04.png" alt="" width="699" height="254" /></p>
<table border="0" width="100%">
<tbody>
<tr>
<td valign="top">Almost a second slower for 2K bodies, even though the supposed “code under test” didn’t change. Note: <span style="underline;">I’m not even running the <em>addForce</em> code!</span>—just the conditional test in the RAMP test mode (see below). There were not many changes in going from bodies007.cpp to bodies008.cpp so it was pretty easy to isolate the code change that caused most of the slowdown. I was able to get these numbers...</td>
<td><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-05.png" alt="" width="144" height="252" /></td>
</tr>
</tbody>
</table>
<p>…by the simple expedient of commenting out the following code:</p>
<div>                          // Do the single threaded run<br />
//                       if (method &amp; USEFORCE) {<br />
//                           startBodies(n);<br />
//                           stime = tick_count::now();<br />
//                           runSerialForceBodies(n);<br />
//                           etime = tick_count::now();<br />
<br />
//                           elapsed = (etime - stime).seconds();<br />
//                           cout &lt;&lt; "," &lt;&lt; setw(20) &lt;&lt; elapsed;<br />
//                       }</div>
<table border="0" width="100%">
<tbody>
<tr>
<td>This is one of several clauses in a <em>for</em>-loop that selects the values of <em>n</em> for the ramp; commenting out all the variant methods so that the only one remaining is the serial <em>addAcc</em> does even better, though the returns are diminishing:</td>
<td><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-06.png" alt="" width="144" height="232" /></td>
</tr>
</tbody>
</table>
<table border="0">
<tbody>
<tr>
<td>So, given the ramp loop is somehow having an effect on the numbers, let me reverse the scenario and comment out all but the <em>addForce</em> variant:</td>
<td><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-07.png" alt="" width="144" height="232" /></td>
</tr>
</tbody>
</table>
<p>Each of these last two sets is the average of five runs, and by my measure, adding accelerations still wins, though the answer is much murkier than I would hope. What are the gremlins that are plaguing these numbers? I have some hunches that involve optimization and inlining strategies but I can’t yet point my finger at specific problems. There are some tantalizing observations to be made, though.</p>
<p>For example, I could try doing a hot spot analysis to see if that would provide clues about the unexpected overhead. However, that means relaxing function inline optimization (the /Ob1 trick). But what does that do to performance?</p>
<p><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-08.png" alt="" width="513" height="285" /></p>
<table border="0" width="100%">
<tbody>
<tr>
<td><p>Wow. Looks like function inlining is a big performance benefit for the serial acceleration accumulating code, in this case (one sample) having lost 14 seconds computing the interactions of 2K bodies.</p>
<p>Curiously, the serial code accumulating forces appears to take a much smaller hit from the loss of aggressive function inlining. Or, in glass-half-full parlance, it appears to take much less advantage of compiler optimizations. The same appears to be true if you continue relaxing optimizations, specifically looking at the performance of the Debug configuration with this same test (all these tests are using the optimizations available in Intel® Parallel Composer, so your mileage may vary depending on what compiler you use):</p></td>
<td><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-09.png" alt="" width="300" height="285" /></td>
</tr>
</tbody>
</table>
<p>There’s a lot more to be discovered in this rich mine of anomalies, and perhaps when I have some more time, I will delve into it more deeply. For now though, I’ll continue to use the <em>addAcc</em> variant in the experiments going forward. After all, after over half a dozen posts in this series, I haven’t even gone parallel yet!</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/10/23/n-bodies-a-parallel-tbb-solution-serial-body-forces-one-more-time/feed/</wfw:commentRss>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: serial body drill-down</title>
		<link>http://software.intel.com/en-us/blogs/2009/10/14/n-bodies-a-parallel-tbb-solution-serial-body-drill-down/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/10/14/n-bodies-a-parallel-tbb-solution-serial-body-drill-down/#comments</comments>
		<pubDate>Wed, 14 Oct 2009 22:15:35 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
		
		<category><![CDATA[Parallel Programming]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<category><![CDATA[c++ parallel programming]]></category>

		<category><![CDATA[Hot Spot Analysis]]></category>

		<category><![CDATA[Intel® Parallel Amplifier]]></category>

		<category><![CDATA[Intel® Parallel Studio]]></category>

		<category><![CDATA[Intel® Threading Building Blocks]]></category>

		<category><![CDATA[n-bodies]]></category>

		<category><![CDATA[parallel programming]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/10/14/n-bodies-a-parallel-tbb-solution-serial-body-drill-down/</guid>
		<description><![CDATA[Having found the function that consumes the most time, this episode shows the process of drilling down into the hot source and optimizing it BEFORE going parallel.]]></description>
			<content:encoded><![CDATA[<p>Having discovered which function consumes most of the time in the serial algorithm <a href="http://software.intel.com/en-us/blogs/2009/10/05/n-bodies-a-parallel-tbb-solution-serial-body-hot-spots/">last time</a>, there’s still more to discover by narrowing the focus to a specific function of interest. Our function, shown last time and below, is <em>addAcc</em>.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091013-011.png"><img class="alignnone size-full wp-image-10707" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091013-011.png" alt="" width="691" /></a></p>
<p>Expanding the view to show the function in detail is often called <em>drilling down to source</em>. In Intel® Parallel Amplifier I can do this by just double-clicking on the function, <em>addAcc</em>.</p>
<p><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091013-02.png" alt="" width="596" height="494" /></p>
<p>Parallel Amplifier lands on the hottest line in the function and provides easy navigation buttons (just below the <em>Bottom-up</em> button) to explore the other hot spots in order of time taken (max, step-up, step-down, min, respectively). Since I landed on the hottest hot spot, the "navigate-to-a-hotter-spot" buttons are grayed out.</p>
<p>Looks like <em>addAcc</em> has a problem with divides. Division is one of the more expensive arithmetic operations and while I can’t eliminate all of them, I certainly can reduce the number of them.</p>
<p><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091013-03.png" alt="" width="561" height="336" /></p>
<p>Computing the inverse distance once and then multiplying that seems to have had an effect: before the change, the times attributed to the nearby lines amounted to over 1.2 seconds while after, the total is down to 0.78 seconds. I accumulate the values from the adjacent lines because the reported event counts are at best approximate—the tool needs to deal with both the optimized code that may have scattered around the instructions that implement any particular line and phenomena that affect the actual process of determining the location of the instruction pointer, such as <em>event skid</em> (to be addressed in some other post). In fact, probably a significant portion of the 0.78 and 1.2 seconds is actually coming from the square root function that immediately precedes these lines. So I’ll run another ramp of n-bodies and see if my numbers are any better.</p>
<p><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091013-04.png" alt="" width="741" height="416" /></p>
<p>Yes, they are. And now that I have <em>ivdist</em>, it’s worth considering whether I can also use it to replace the divide by <em>distsq</em> with something like this:</p>
<p><code>    double Gdivd = GFORCE * ivdist * ivdist;</code></p>
<p>Sure enough, that change, though not as beneficial as the last one, still has an observable benefit:</p>
<p><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091013-05.png" alt="" width="741" height="416" /></p>
<p>Another hot spot run shows even less time being spent at the previously identified hot spots:</p>
<p><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091013-06.png" alt="" width="561" height="400" /></p>
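<p>As a rough illustration of the two changes above, here is a minimal, self-contained sketch. Only <em>ivdist</em>, <em>distsq</em>, <em>GFORCE</em> and <em>Gdivd</em> come from this series; the struct, the function names, the value given to the constant and the direction-scaling step are assumptions made for the example, not the actual body of <em>addAcc</em>.</p>

```cpp
#include <cassert>
#include <cmath>

// Hypothetical value; the real GFORCE constant is defined elsewhere in the series.
const double GFORCE = 6.674e-11;

struct Vec3 { double x, y, z; };  // hypothetical helper type

// Before: one divide for G/d^2 plus three divides to scale the direction.
Vec3 scaledDirectionWithDivides(double dx, double dy, double dz) {
    double distsq = dx*dx + dy*dy + dz*dz;
    assert(distsq > 0.0);
    double dist = std::sqrt(distsq);
    double Gdivd = GFORCE / distsq;
    return { Gdivd * dx / dist, Gdivd * dy / dist, Gdivd * dz / dist };
}

// After: a single divide produces ivdist; everything else is a multiply.
Vec3 scaledDirectionWithInverse(double dx, double dy, double dz) {
    double distsq = dx*dx + dy*dy + dz*dz;
    assert(distsq > 0.0);
    double ivdist = 1.0 / std::sqrt(distsq);
    double Gdivd = GFORCE * ivdist * ivdist;
    return { Gdivd * dx * ivdist, Gdivd * dy * ivdist, Gdivd * dz * ivdist };
}
```

<p>The two functions agree to within floating-point rounding, so the rewrite trades four divides for one without changing the physics.</p>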
<p>I’ll use this last version as my serial <em>baseline</em>, the benchmark against which I’ll measure my progress in parallelization.  It might change someday as I continue to evaluate alternatives like <a href="http://software.intel.com/en-us/blogs/2009/10/23/n-bodies-a-parallel-tbb-solution-serial-body-forces-one-more-time/">the persistent question about forces</a>. This is in fact part of our recommended practice for migrating serial code into a parallel environment: I start by optimizing the serial version as much as I can, so that the gains I see from the parallel implementation are real and not just multiple HW threads filling in the gaps left by an inefficient serial version.</p>
<p>Next time: <a href="http://software.intel.com/en-us/blogs/2009/11/13/n-bodies-a-parallel-tbb-solution-parallel-code-a-first-attempt/">parallel code, a first attempt</a></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/10/14/n-bodies-a-parallel-tbb-solution-serial-body-drill-down/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Got Multicore Data Parallel Woes?</title>
		<link>http://software.intel.com/en-us/blogs/2009/10/09/got-multicore-data-parallel-woes/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/10/09/got-multicore-data-parallel-woes/#comments</comments>
		<pubDate>Fri, 09 Oct 2009 22:22:45 +0000</pubDate>
		<dc:creator>Rita Turkowski (Intel)</dc:creator>
		
		<category><![CDATA[Cool Software]]></category>

		<category><![CDATA[Financial Services Industry]]></category>

		<category><![CDATA[Gaming]]></category>

		<category><![CDATA[Intel® Software Network 2.0]]></category>

		<category><![CDATA[Media]]></category>

		<category><![CDATA[Open Source]]></category>

		<category><![CDATA[Parallel Programming]]></category>

		<category><![CDATA[Software Engineering]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<category><![CDATA[Visual Computing]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/10/09/got-multicore-data-parallel-woes/</guid>
		<description><![CDATA[Sign up here: http://makebettercode.com/ct_tech/survey.
]]></description>
			<content:encoded><![CDATA[<p>Sign up here: <a href="http://makebettercode.com/ct_tech/survey">http://makebettercode.com/ct_tech/survey</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/10/09/got-multicore-data-parallel-woes/feed/</wfw:commentRss>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: serial body hot spots</title>
		<link>http://software.intel.com/en-us/blogs/2009/10/05/n-bodies-a-parallel-tbb-solution-serial-body-hot-spots/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/10/05/n-bodies-a-parallel-tbb-solution-serial-body-hot-spots/#comments</comments>
		<pubDate>Mon, 05 Oct 2009 23:13:29 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
		
		<category><![CDATA[Parallel Programming]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<category><![CDATA[Amdahl's Law]]></category>

		<category><![CDATA[hot spots]]></category>

		<category><![CDATA[Intel Parallel Amplifier]]></category>

		<category><![CDATA[n-bodies]]></category>

		<category><![CDATA[parallel programming]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/10/05/n-bodies-a-parallel-tbb-solution-serial-body-hot-spots/</guid>
		<description><![CDATA[Robert finds the hot function in the serial n-bodies code, but only after discovering what a good job of function inlining the Intel C++ Compiler does.]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://software.intel.com/en-us/blogs/2009/09/29/n-bodies-a-parallel-tbb-solution-serial-bodies-test-run/">my last venture</a> I got the n-bodies program to compile and ran a test series with the serial algorithm, showing the n-squared nature of the basic problem. I mean to write a parallel version of this (heh, heh, heh) but first I need to know what is taking up the time. By the dictates of <a href="http://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl’s Law</a>, I want to apply the most processors at the place the program is spending most of the time, its <em>hot spots</em>, to do the most good. The most common way to find them is to interrupt the processor regularly and figure out where it is in the program, accumulating these locations to build a picture of where the HW thread (or threads) is/are spending time.  This technique is one of the several used in Intel’s most recent performance analysis tool, called <a href="http://software.intel.com/en-us/intel-parallel-amplifier/">Intel® Parallel Amplifier</a>.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-01.png"><img class="alignnone size-full wp-image-10435" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-01.png" alt="" width="560" /></a></p>
<p>It installs right in Visual Studio as shown above. In order to collect hot spots on the serial algorithm, I switch the debug command to <em>single 256 serial</em>:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-02.png"><img class="alignnone size-full wp-image-10438" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-02.png" alt="" width="428" height="168" /></a></p>
<p>I’ve also turned on symbols in my Release configuration (C/C++ &gt;&gt; General &gt;&gt; Debug Information Format set to <a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-03.png"><img class="alignnone size-full wp-image-10439" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-03.png" alt="" width="145" height="17" /></a> and Linker &gt;&gt; Debugging &gt;&gt; Generate Debug Info set to <a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-04.png"><img class="alignnone size-full wp-image-10440" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-04.png" alt="" width="82" height="18" /></a> on my latest build), then just click on the Profile button, and voilà!</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-05.png"><img class="alignnone size-full wp-image-10441" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-05.png" alt="" width="751" /></a></p>
<p>Huhhhhhh?! I see two seconds plus a quarter spent in main, but where are my functions? Do I get the same result if I try the Debug configuration?</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-062.png"><img class="alignnone size-full wp-image-10457" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-062.png" alt="" width="700" /></a></p>
<p>Oh, there are my functions, <em>runSerialBodies</em> and <em>addAcc</em>, but the run takes over 5 seconds. I don’t want to spend time making Debug code run faster; I want to tune the optimized Release code. However, something about that Release configuration is causing the functions to disappear. Experimenting a little with the configuration settings reveals that the Intel compiler is automatically <em><a href="http://en.wikipedia.org/wiki/Function_inlining">inlining</a></em> the functions into <em>main</em>. Unfortunately, there is apparently no way to represent that inlining in the debug information, so the functions just disappear. By relaxing the optimization a little, I can restore the function hierarchy for analysis at the cost of some extra function call instructions:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-07.png"><img class="alignnone size-full wp-image-10447" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-07.png" alt="" width="584" /></a></p>
<p>Now my hot spot analysis on the Release configuration looks much better:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-08.png"><img class="alignnone size-full wp-image-10449" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-08.png" alt="" width="691" /></a></p>
<p>Most of the time is being spent in the <em>addAcc</em> function, which is being called by <em>runSerialBodies</em> as can be seen in the function call hierarchy graph. Looks like <em>addAcc</em> will be one of my candidates for parallelization.</p>
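<p>To put a rough number on what Amdahl’s Law says about a hot spot like this one, here is a small sketch. The helper name and the 90% figure in the note that follows are made up for illustration; they are not measurements from this run.</p>

```cpp
#include <cassert>

// Amdahl's Law: speedup = 1 / ((1 - p) + p / n), where p is the fraction of
// the serial run time that can be parallelized and n is the number of HW threads.
// amdahlSpeedup is a hypothetical helper name, not part of the n-bodies code.
double amdahlSpeedup(double parallelFraction, int threads) {
    assert(parallelFraction >= 0.0 && parallelFraction <= 1.0);
    assert(threads >= 1);
    return 1.0 / ((1.0 - parallelFraction) + parallelFraction / threads);
}
```

<p>If a function accounted for 90% of the run time and scaled perfectly, <em>amdahlSpeedup(0.9, 4)</em> would come out at about 3.08, and no number of threads could push the result past 10. That is why the hottest function is the place to start.</p>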
<p>Next time: <a href="http://software.intel.com/en-us/blogs/2009/10/14/n-bodies-a-parallel-tbb-solution-serial-body-drill-down/">serial body drill-down</a></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/10/05/n-bodies-a-parallel-tbb-solution-serial-body-hot-spots/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Application for Ct beta program now available on-line</title>
		<link>http://software.intel.com/en-us/blogs/2009/10/02/application-for-ct-beta-program-now-available-on-line/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/10/02/application-for-ct-beta-program-now-available-on-line/#comments</comments>
		<pubDate>Sat, 03 Oct 2009 03:03:38 +0000</pubDate>
		<dc:creator>Rita Turkowski (Intel)</dc:creator>
		
		<category><![CDATA[Cool Software]]></category>

		<category><![CDATA[Financial Services Industry]]></category>

		<category><![CDATA[Gaming]]></category>

		<category><![CDATA[Media]]></category>

		<category><![CDATA[Parallel Programming]]></category>

		<category><![CDATA[Software Engineering]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<category><![CDATA[Visual Computing]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/10/02/application-for-ct-beta-program-now-available-on-line/</guid>
		<description><![CDATA[Hey, check out the newly updated Intel's Ct website. We've updated it so folks interested in the beta, coming out later this year, may apply for beta consideration online. Please visit the website to register. We are receiving and reviewing applicants now for potential inclusion in the Ct beta engagement program. Note that applying does not guarantee acceptance into [...]]]></description>
			<content:encoded><![CDATA[<p class="MsoNormal"><span style="x-small;"><span style="10pt;">Hey, check out the newly updated <a href="http://software.intel.com/en-us/data-parallel">Intel's Ct website</a>. We've updated it so folks interested in the beta, coming out later this year, may apply for beta consideration online. Please visit the website to <a title="http://software.intel.com/en-us/data-parallel" href="http://software.intel.com/en-us/data-parallel"><span style="#800080;">register</span></a>. We are receiving and reviewing applicants now for potential inclusion in the Ct beta engagement program. Note that applying does not guarantee acceptance into the beta.</span></span></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/10/02/application-for-ct-beta-program-now-available-on-line/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Parallel Programming Talk #49 - For Real-time Weather Simulation in Parallel with Simul Software</title>
		<link>http://software.intel.com/en-us/blogs/2009/09/30/parallel-programming-talk-49-for-real-time-weather-simulation-in-parallel-with-simul-software/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/09/30/parallel-programming-talk-49-for-real-time-weather-simulation-in-parallel-with-simul-software/#comments</comments>
		<pubDate>Wed, 30 Sep 2009 23:57:33 +0000</pubDate>
		<dc:creator>Aaron Tersteeg (Intel)</dc:creator>
		
		<category><![CDATA[Parallel Programming]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<category><![CDATA[ParallelProgrammingTalk]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/09/30/parallel-programming-talk-49-for-real-time-weather-simulation-in-parallel-with-simul-software/</guid>
		<description><![CDATA[Hello Parallel Programmers &#38; Intel Software Partners, I'm Aaron Tersteeg. Welcome to Episode 49 of Parallel Programming Talk. Joining me again is Dr. Clay Breshears.

Download the video.

Download an MP3 of the show.
Today on the show we'll be speaking with Roderick Kennedy, President and CEO of Simul. Simul is a software company specializing in innovative, lightweight solutions inspired by [...]]]></description>
			<content:encoded><![CDATA[<p>Hello Parallel Programmers &amp; Intel Software Partners, I'm Aaron Tersteeg. Welcome to Episode 49 of Parallel Programming Talk. Joining me again is Dr. Clay Breshears.</p>
<p><embed src="http://blip.tv/play/g5FLgaSzBAA%2Em4v" type="application/x-shockwave-flash" width="640" height="360" allowscriptaccess="always" allowfullscreen="true"></embed></p>
<p><a href="http://blip.tv/file/get/ISNTV-ParallelProgrammingTalk49RoderickKennedy703.mp4">Download the video.</a></p>
<p><img style="visibility:hidden;width:0px;height:0px;" border=0 width=0 height=0 src="http://counters.gigya.com/wildfire/IMP/CXNID=2000002.0NXC/bT*xJmx*PTEyNTQzNTUwNDIwMTUmcHQ9MTI1NDM1NTA*MzU*MyZwPTQ1MDk3MiZkPSZnPTImbz1mYzY5ODA2YWI5YmM*MzIxOGNjMThiM2M4ZDIyOTQyYyZvZj*w.gif" /><embed src="http://www.blogtalkradio.com/BTRPlayer.swf?file=http%3A%2F%2Fwww%2Eblogtalkradio%2Ecom%2Fplaylist%2Easpx%3Fshow%5Fid%3D688634&#038;autostart=false&#038;bufferlength=5&#038;volume=100&#038;borderweight=1&#038;bordercolor=#999999&#038;backgroundcolor=#FFFFFF&#038;dashboardcolor=#0098CB&#038;textcolor=#FFFFFF&#038;detailscolor=#FFFFFF&#038;playlistcolor=#999999&#038;playlisthovercolor=#333333&#038;cornerradius=10&#038;callback=http://www.blogtalkradio.com/FlashPlayerCallback.aspx?referrer_url=/show.aspx&#038;C1=7&#038;C2=6042973&#038;C3=31&#038;C4=&#038;C5=&#038;C6=" width="210" height="108" quality="high" pluginspage="http://www.adobe.com/go/getflashplayer" type="application/x-shockwave-flash" wmode="transparent" menu="false" allowScriptAccess="always"></embed></p>
<p><a href="http://www.blogtalkradio.com/MulticoreSoftware/2009/09/29/For-Real-time-Weather-Simulation-in-Parallel-with-Simul-Software.mp3?localembed=download">Download an MP3 of the show.</a></p>
<p>Today on the show we'll be speaking with Roderick Kennedy, President and CEO of Simul. Based in Manchester, England, Simul is a software company specializing in innovative, lightweight solutions inspired by physics. Roderick is a specialist in game physics and simulation and has worked in the games industry since 1990, contributing to such titles as DID’s Eurofighter 2000 and Evolution Studios’ World Rally Championship series.</p>
<p><strong>First the News:</strong></p>
<p><a href="http://software.intel.com/en-us/contests/Threading-Challenge-2009/codecontest.php">Intel Threading Challenge PHASE 2</a></p>
<ul>
<li>Problem 1 - "Strassen's Algorithm" winner was iArchitect</li>
<li>Problem 2 - "Knights Tour" is CLOSED for submissions!</li>
<li>Problem 3 - "Graph Coloring" went live Sept 21 and is due October 9th</li>
</ul>
<p><a href="http://media.cs.uiuc.edu/live/upcrc0910/upcrc.asx ">The University of Illinois is presenting a lecture on-line this Friday<br />
</a>XcalableMP: A Performance-Aware Scalable Parallel Programming Language for Distributed Memory System - Beyond PGAS Models, presented by Mitsuhisa Sato, Director for the Center for Computational Sciences, University of Tsukuba. Friday, October 2, 2009 at 2:00 PM (Central Time) at 2405 Siebel Center for Computer Science.</p>
<p>Live video streaming*: <a href="http://media.cs.uiuc.edu/live/upcrc0910/upcrc.asx ">http://media.cs.uiuc.edu/live/upcrc0910/upcrc.asx </a></p>
<p><a href="http://sc09.supercomputing.org/">SC09 – Super Computer Conference</a> is coming up November 14-20 in Portland, OR. The Intel Software Network team will be participating in panels and on the show floor for the whole event.</p>
<p><a href="mailto:ParallelProgrammingTalk@Intel.com">Listener Questions Show</a> is the first Tuesday of each month. October 6th is the next one. If you have a question or idea about the show send it in to <a href="mailto:ParallelProgrammingTalk@Intel.com">ParallelProgrammingTalk@Intel.com</a></p>
<p><a href="http://www.intel.com/idf">Intel Developer Forum</a> was September 22-24. Our friends at <a href="http://softtalkblog.wordpress.com/">Softtalkblog</a> did a great job highlighting many of the great parallel programming sessions. Here are a  few excerpts:</p>
<blockquote><p>In exciting news for developers, Intel CEO Paul Otellini has just announced the Intel Atom Developer Programme here at IDF2009 in San Francisco. The programme is aimed at developers who want to create new apps or port existing ones to Atom, enabling them to enter the vibrant market of internet-enabled devices. Intel will support developers with software development kits, technical support and community resources. Developers will also be able to trade code components (buy and sell), so they can accelerate their development time, and receive income from landmark innovations early, before they are part of an end product.</p></blockquote>
<p>There were many technical sessions on Parallel Programming Technology and tools.</p>
<blockquote><p>In the "Go Parallel" session James Reinders explained the language and library support available for developers today and compared it with what the future holds. Intel is now working on a language extension, based on Cilk++.</p>
<p>Intel Concurrent Collections now offers Linux support as well as a preliminary implementation for Haskell.</p>
<p>Victoria Gromova, one of Intel’s senior software engineers, explained how developers can achieve forward scalability with Threading Building Blocks (TBB). Having been around for some time now, TBB is an Intel Open Source project, dedicated to facilitating parallel programming by providing a library of template classes and functions for C++ developers. In doing so, TBB helps developers achieve the two ‘holy grails’ of parallelism: correctness and performance.</p>
<p>Steve Teixera, product unit manager of parallel development tools at Microsoft, presented a session on the future of parallel programming with Intel Parallel Studio and Microsoft Visual Studio.</p>
<p>As many of you know there are many challenges to building parallel applications. It needs to be much easier, and people need to know how to tune their applications to make them efficient.</p>
<p>Steve outlined some of the ways that Intel and Microsoft have been cooperating to make multicore programming easier. Firstly, the concurrency runtime in Windows 7 helps to avoid resource conflicts between Intel Threading Building Blocks, OpenMP and Microsoft Parallel Pattern Library. Secondly, Intel Threading Building Blocks and Microsoft Parallel Pattern Library now share a common data structure for vector models. Thirdly, the concurrency runtime scheduler in Windows 7 enables load balancing, using work stealing so that a processor will take work from another processor’s queue if it has immediate capacity.</p>
<p>In another session, Mark Davis and Ravi Vemuri, both with Intel, demonstrated how the Intel Parallel Advisor Lite can be used to model parallelism in serial code, to ease the transition towards parallel code. Intel Parallel Advisor Lite is in preview, and we invite you to download it, use it and submit your feedback. You can get a free copy at whatif.intel.com. You will need Intel Parallel Studio, but can use an evaluation edition.</p></blockquote>
<p>There were many other technical presentations and Q&amp;A sessions. Learn more about IDF, watch the videos and download the presentations at <a href="http://www.intel.com/idf">http://www.intel.com/idf</a>.</p>
<p><strong>Today's Show:</strong></p>
<p>Today on the show we'll be speaking with Roderick Kennedy, President and CEO of Simul. Based in Manchester, England, Simul is a software company specializing in innovative, lightweight solutions inspired by physics.</p>
<p>Roderick talked about his company, the market for his technology, and the process they used to multi-thread their application. We discussed his evaluation of OpenMP vs. Threading Building Blocks and his eventual decision to use Threading Building Blocks. Listen or watch the full show to understand their challenges and the impact on Simul Software's weather application.  <a href="http://cache-www.intel.com/cd/00/00/42/59/425967_425967.pdf">Download a case study</a> and <a href="http://www.simul.co.uk">visit the Simul Software web site</a> to learn more about the company and multi-threading's positive impact on application performance.</p>
<p><strong>Coming Up Next on Parallel Programming Talk:</strong></p>
<p>The first Tuesday of the month is our Listener Questions show. October 6th is the next one. If you have a question or idea about the show send it in to <a href="mailto:ParallelProgrammingTalk@Intel.com">ParallelProgrammingTalk@Intel.com</a></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/09/30/parallel-programming-talk-49-for-real-time-weather-simulation-in-parallel-with-simul-software/feed/</wfw:commentRss>
<enclosure url="http://media.cs.uiuc.edu/live/upcrc0910/upcrc.asx" length="207" type="video/x-ms-asf" />
<enclosure url="http://www.blogtalkradio.com/MulticoreSoftware/2009/09/29/For-Real-time-Weather-Simulation-in-Parallel-with-Simul-Software.mp3?localembed=download" length="" type="" />
<enclosure url="http://blip.tv/file/get/ISNTV-ParallelProgrammingTalk49RoderickKennedy703.mp4" length="411017412" type="video/mp4" />
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: serial bodies test run</title>
		<link>http://software.intel.com/en-us/blogs/2009/09/29/n-bodies-a-parallel-tbb-solution-serial-bodies-test-run/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/09/29/n-bodies-a-parallel-tbb-solution-serial-bodies-test-run/#comments</comments>
		<pubDate>Tue, 29 Sep 2009 23:53:38 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
		
		<category><![CDATA[Parallel Programming]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<category><![CDATA[Intel C++ compiler]]></category>

		<category><![CDATA[Microsoft Visual Studio]]></category>

		<category><![CDATA[n-bodies]]></category>

		<category><![CDATA[parallel programming]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/09/29/n-bodies-a-parallel-tbb-solution-serial-bodies-test-run/</guid>
		<description><![CDATA[Wherein Robert attempts to compile his program and remembers eventually to switch to the Intel C++ compiler to accommodate C++0x features used by the program.]]></description>
			<content:encoded><![CDATA[<p>Let’s take the body interaction code I laid out <a href="http://software.intel.com/en-us/blogs/2009/09/25/n-bodies-a-parallel-tbb-solution-realizing-addaccij/">last time</a>, combine it with the other parts laid out previously and run it. Dropping the fleshed out program into a Microsoft Visual Studio* project, I quickly rediscover something:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-011.png"><img class="alignnone size-full wp-image-10244" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-011.png" alt="" width="100%" /></a></p>
<p>Oops, that’s right. bodies007.cpp relies on language extensions available in the Intel® Compiler version 11, some early arrivals from the C++0x standard. Fortunately, it’s pretty easy to switch compilers.</p>
<table border="0" width="100%">
<tbody>
<tr>
<td><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-02.png"><img class="alignnone size-full wp-image-10249" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-02.png" alt="" width="253" height="250" /></a></td>
<td valign="top">With the Intel C++ Compiler installed in Visual Studio from either of the regular distribution packages, the Compiler Professional Edition or Intel Parallel Composer, switching compilers is just a click away.</td>
</tr>
</tbody>
</table>
<table border="0" width="100%">
<tbody>
<tr>
<td><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-03.png"><img class="alignnone size-full wp-image-10251" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-03.png" alt="" width="500" height="196" /></a></td>
<td valign="top">OK, two clicks. I did try to build already, so I’ll let the system clean up the project files.</td>
</tr>
</tbody>
</table>
<p> </p>
<table border="0" width="100%">
<tbody>
<tr>
<td><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-04.png"><img class="alignnone size-full wp-image-10253" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-04.png" alt="" width="203" height="251" /></a></td>
<td valign="top">Voilà!!! The project now uses the Intel C++ compiler.</td>
</tr>
</tbody>
</table>
<p>One more configuration setting to enable C++0x support:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-05.png"><img class="alignnone size-full wp-image-10257" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-05.png" alt="" width="100%" /></a></p>
<p>And now it compiles!</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-06.png"><img class="alignnone size-full wp-image-10259" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-06.png" alt="" width="100%" /></a></p>
<p>Dropping the serial code into the test program I’ve prepared gives access to a simple command processor that allows me to select that kernel for one of several tests. A simple ramp, testing the algorithm with varying values of <em>n</em>, can be launched by putting this in the command line: <em>select serial</em></p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-071.png"><img class="alignnone size-full wp-image-10264" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-071.png" alt="" /></a></p>
<p>Looks like the time it takes to complete the simulation (I’m running 1000 time steps for each body-count) is going up more than four times for every doubling of the number of bodies; in fact, if I plot it on a logarithmic scale, I can see the roughly quadratic growth:<br />
<a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-08.png"><img class="alignnone size-full wp-image-10265" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-08.png" alt="" width="100%" /></a></p>
<p>Clearly this is an algorithm that has plenty of work to divide, if I can just figure out a way to do it.</p>
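<p>The fourfold jump per doubling is about what I would expect if every body interacts with every other body. Here is a minimal sketch of the counting, assuming one interaction per unordered pair; <em>pairInteractions</em> is a name invented for this example and is not part of the test program.</p>

```cpp
#include <cstdint>

// Number of unordered body pairs (i, j) with i < j: n * (n - 1) / 2.
// pairInteractions is a hypothetical helper used only for this illustration.
std::uint64_t pairInteractions(std::uint64_t n) {
    return n < 2 ? 0 : n * (n - 1) / 2;
}
```

<p>Under that assumption, 256 bodies give 32,640 pairs per time step and 512 bodies give 130,816, a ratio just over four.</p>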
<p>Next time: <a href="http://origin-software.intel.com/en-us/blogs/2009/10/05/n-bodies-a-parallel-tbb-solution-serial-body-hot-spots/">serial body hot spots</a></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/09/29/n-bodies-a-parallel-tbb-solution-serial-bodies-test-run/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Simulating do-nothing mutexes -- null_mutex and null_rw_mutex</title>
		<link>http://software.intel.com/en-us/blogs/2009/09/28/simulating-do-nothing-mutexes-null_mutex-and-null_rw_mutex/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/09/28/simulating-do-nothing-mutexes-null_mutex-and-null_rw_mutex/#comments</comments>
		<pubDate>Mon, 28 Sep 2009 19:16:04 +0000</pubDate>
		<dc:creator>Wooyoung Kim (Intel)</dc:creator>
		
		<category><![CDATA[Threading Building Blocks]]></category>

		<category><![CDATA[mutex]]></category>

		<category><![CDATA[null_mutex]]></category>

		<category><![CDATA[null_rw_mutex]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/09/28/simulating-do-nothing-mutexes-null_mutex-and-null_rw_mutex/</guid>
		<description><![CDATA[Early this year a TBB user requested in the forum for the feature that simulates mutexes that do nothing. The user wrote “Lot of times, when we do template meta programming, we need to provide some containers with no mutex (tbb containers) and some containers with a tbb Mutex...[snip]... If we can have a NullMutex [...]]]></description>
			<content:encoded><![CDATA[<p>Early this year a TBB user asked in the forum for a feature that simulates mutexes that do nothing. The user wrote “Lot of times, when we do template meta programming, we need to provide some containers with no mutex (tbb containers) and some containers with a tbb Mutex...[snip]... If we can have a NullMutex feature ... it would be easy to handle such situations.”<br />
 (<a href="http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/63003/">http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/63003/</a>). We agreed that the idea made sense and added null_mutex and null_rw_mutex to the Intel® Threading Building Blocks mutex lineup. (They have been available since Intel® TBB 2.1 Update 3 and were officially added in Intel® TBB 2.2.)  Both mutexes are based on Alexey Kukanov's sketch posted in the forum.  As Alexey noted, their is_recursive and is_fair traits are set to true. The two mutexes really do nothing; they simply simulate successful mutex operations. I hope this answers the question about the status posted here (<a href="http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/66231/reply/86944/">http://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/66231/reply/86944/</a>).</p>
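<p>The kind of template code the forum user describes might look like the sketch below. To stay self-contained it uses the C++ standard library rather than TBB: <em>NoOpMutex</em> and <em>Counter</em> are hypothetical names invented here, and <em>NoOpMutex</em> only mimics the do-nothing idea; it is not the interface of null_mutex.</p>

```cpp
#include <mutex>

// Hypothetical stand-in for a do-nothing mutex: every operation "succeeds"
// without synchronizing anything. Not TBB's null_mutex, just the same idea.
struct NoOpMutex {
    void lock() {}
    void unlock() {}
    bool try_lock() { return true; }
};

// A counter whose locking policy is a template parameter, so the same code
// serves both the shared case and the single-threaded case.
template <typename Mutex>
class Counter {
public:
    void increment() {
        std::lock_guard<Mutex> guard(mutex_);  // does no real work for NoOpMutex
        ++value_;
    }
    long value() const { return value_; }
private:
    Mutex mutex_;
    long value_ = 0;
};
```

<p>With this shape, <em>Counter&lt;std::mutex&gt;</em> can be shared between threads, while <em>Counter&lt;NoOpMutex&gt;</em> suits data that only one thread ever touches, with no second copy of the container code to maintain.</p>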
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/09/28/simulating-do-nothing-mutexes-null_mutex-and-null_rw_mutex/feed/</wfw:commentRss>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: realizing addAcc(i,j)</title>
		<link>http://software.intel.com/en-us/blogs/2009/09/25/n-bodies-a-parallel-tbb-solution-realizing-addaccij/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/09/25/n-bodies-a-parallel-tbb-solution-realizing-addaccij/#comments</comments>
		<pubDate>Fri, 25 Sep 2009 23:42:40 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
		
		<category><![CDATA[Parallel Programming]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<category><![CDATA[n-bodies]]></category>

		<category><![CDATA[parallel programming]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/09/25/n-bodies-a-parallel-tbb-solution-realizing-addaccij/</guid>
		<description><![CDATA[Putting together the function to apply accelerations between a pair of gravitational bodies.]]></description>
			<content:encoded><![CDATA[<p>Having settled the question of whether I should accumulate forces or accelerations <a href="http://software.intel.com/en-us/blogs/2009/09/22/n-bodies-a-parallel-tbb-solution-computing-accelerations-or-forces/">last time</a>, now it’s time to build the accumulation function.</p>
<div>    <span style="#00ff00;">void</span><br />
    <strong>addAcc</strong>(<span style="#00ff00;">int </span>i, <span style="#00ff00;">int </span>j) {</div>
<p><em>i</em> and <em>j</em> are indices selecting elements of the <em>body</em> array. The first task is to compute the distance between them.</p>
<div>        <span style="#00ff00;"><span style="#00ff00;">double </span></span>dx = body[<span style="#ff6600;">i</span>].pos[<span style="#ff6600;">0</span>]-body[<span style="#ff6600;">j</span>].pos[<span style="#ff6600;">0</span>];<br />
        <span style="#00ff00;"><span style="#00ff00;">double </span></span>dy = body[<span style="#ff6600;">i</span>].pos[<span style="#ff6600;">1</span>]-body[<span style="#ff6600;">j</span>].pos[<span style="#ff6600;">1</span>];<br />
        <span style="#00ff00;"><span style="#00ff00;">double </span></span>dz = body[<span style="#ff6600;">i</span>].pos[<span style="#ff6600;">2</span>]-body[<span style="#ff6600;">j</span>].pos[<span style="#ff6600;">2</span>];<br />
        <span style="#00ff00;"><span style="#00ff00;">double </span></span>distsq = dx*dx + dy*dy + dz*dz;</div>
<p> </p>
<p>The Pythagorean theorem in three dimensions gets me the square of the hypotenuse, but before taking the square root, I’ll avoid the singularity:</p>
<div>        <span style="#0000ff;"><strong>if</strong></span> (distsq &lt; MINDIST) distsq = MINDIST;<br />
        <span style="#00ff00;">double </span>dist = sqrt(distsq);</div>
<p> </p>
<p>That is, if the point masses get too close together, act like they’re not. But wait! Why do I even need the square root, if I’m working with gravitation, an inverse-<span style="underline;">squared</span> law? Well, because acceleration is a vector, and I need the distance itself to normalize the direction in the next step.</p>
<div>        <span style="#00ff00;">double </span>ud[<span style="#ff6600;">3</span>];<br />
        ud[<span style="#ff6600;">0</span>] = dx/dist;<br />
        ud[<span style="#ff6600;">1</span>] = dy/dist;<br />
        ud[<span style="#ff6600;">2</span>] = dz/dist;</div>
<p> </p>
<p>Array <em>ud</em> represents the <em>unit vector</em> (length 1 direction vector) pointing from body <em>j </em>to body <em>i</em>. I need just one more thing, the magnitude of those accelerations.</p>
<div>        <span style="#00ff00;">double </span>Gdivd = GFORCE/distsq;<br />
        <span style="#00ff00;">double </span>ai = Gdivd*body[<span style="#ff6600;">j</span>].mass;<br />
        <span style="#00ff00;">double </span>aj = Gdivd*body[<span style="#ff6600;">i</span>].mass;</div>
<p> </p>
<p>All that’s left is to compute the acceleration vector components and apply them to the bodies.</p>
<div>        <span style="#0000ff;"><strong>for </strong></span>(<span style="#00ff00;">int </span>k = 0; k &lt; 3; ++k) {<br />
            body[<span style="#ff6600;">j</span>].acc[<span style="#ff6600;">k</span>] += aj*ud[<span style="#ff6600;">k</span>];<br />
            body[<span style="#ff6600;">i</span>].acc[<span style="#ff6600;">k</span>] -= ai*ud[<span style="#ff6600;">k</span>];<br />
        }<br />
    }</div>
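<p>For reference, the fragments above assemble into the following self-contained sketch. Several pieces are assumptions made only so that it compiles on its own: the layout of Body beyond the pos, mass, and acc members used above, the std::vector that holds the bodies, and the values of GFORCE and MINDIST, which are placeholders in arbitrary simulation units.</p>

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Assumed layout: only pos, mass, and acc appear in the fragments above.
struct Body {
    double mass;
    double pos[3];
    double vel[3];
    double acc[3];
};

// Placeholder values, not taken from the post.
const double GFORCE  = 1.0;    // gravitational constant, simulation units
const double MINDIST = 1.0e-6; // floor applied to the SQUARED distance

std::vector<Body> body;        // assumed container for the bodies

void addAcc(int i, int j) {
    // Separation vector from body j to body i.
    double dx = body[i].pos[0] - body[j].pos[0];
    double dy = body[i].pos[1] - body[j].pos[1];
    double dz = body[i].pos[2] - body[j].pos[2];
    double distsq = dx*dx + dy*dy + dz*dz;

    // Clamp to avoid the singularity when bodies nearly coincide.
    if (distsq < MINDIST) distsq = MINDIST;
    double dist = std::sqrt(distsq);

    // Direction from j to i (unit length unless the clamp fired).
    double ud[3] = { dx/dist, dy/dist, dz/dist };

    // Common factor G/r^2, then one multiply per body.
    double Gdivd = GFORCE / distsq;
    double ai = Gdivd * body[j].mass;
    double aj = Gdivd * body[i].mass;

    for (int k = 0; k < 3; ++k) {
        body[j].acc[k] += aj * ud[k]; // j is pulled toward i
        body[i].acc[k] -= ai * ud[k]; // i is pulled toward j
    }
}
```

<p>A quick sanity check on any pair: mass times acceleration should be equal and opposite for the two bodies, since both come from the same mutual force.</p>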
<p>Next time: <a href="http://software.intel.com/en-us/blogs/2009/09/29/n-bodies-a-parallel-tbb-solution-serial-bodies-test-run/">serial bodies test run</a></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/09/25/n-bodies-a-parallel-tbb-solution-realizing-addaccij/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Presentations at IDF about Software Tools, available for download</title>
		<link>http://software.intel.com/en-us/blogs/2009/09/23/presentations-at-idf-about-software-tools-available-for-download/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/09/23/presentations-at-idf-about-software-tools-available-for-download/#comments</comments>
		<pubDate>Wed, 23 Sep 2009 22:52:43 +0000</pubDate>
		<dc:creator>James Reinders (Intel)</dc:creator>
		
		<category><![CDATA[Intel SW Partner Program]]></category>

		<category><![CDATA[Parallel Programming]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<category><![CDATA[multi-core]]></category>

		<category><![CDATA[parallelism]]></category>

		<category><![CDATA[TBB]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/09/23/presentations-at-idf-about-software-tools-available-for-download/</guid>
		<description><![CDATA[
Today, at Intel's Developer Forum, we have taught many classes on our tools, and have a few left to go.
If you could not join us in San Francisco, the presentations are available online for downloading at intel.com/go/idfsessions.
My talks, including one today with Steve Teixeira of Microsoft, can be found searching for LAST NAME of "Reinders." Today's class was a [...]]]></description>
			<content:encoded><![CDATA[<div>
<p>Today, at <a href="http://www.intel.com/idf">Intel's Developer Forum</a>, we have taught many classes on our tools, and have a few left to go.</p>
<p>If you could not join us in San Francisco, the presentations are available online for downloading at <a href="http://www.intel.com/go/idfsessions">intel.com/go/idfsessions</a>.</p>
<p>My talks, including one today with Steve Teixeira of Microsoft, can be found by searching for <a href="http://www.intel.com/go/idfsessions">LAST NAME of "Reinders."</a> Today's class was a lot of fun: we got to share the good work we are doing together, and Steve introduced me to the concept of "clinging to curly braces" in the course of an engaging Q&amp;A.</p>
<p>Presentations on Intel Software Development Products that will be available online once the talks are done:</p>
<div><span><span>•</span></span><span>SFTS002 – “Go-Parallelism! </span><strong>Intel® Parallel Studio</strong><span> Eases the Onramp for C++ Windows* Development”</span></div>
<div><span><span>•</span></span><span>SFTS003 – “</span><strong>Intel® Concurrent Collections</strong><span> – Parallelization of C++ Programs”</span></div>
<div><span><span>•</span></span><span>SFTS004 – “Design for Forward-Scaling with </span><strong>Intel® Threading Building Blocks</strong><span>”</span></div>
<div><span><span>•</span></span><span>GSPS007 – “Taking Parallel Computing Mainstream with </span><strong>Microsoft Visual Studio</strong><span>”</span></div>
<div><span><span>•</span></span><span>SFTS005 – “The Future of Parallel Programming with </span><strong>Intel® Parallel Studio and Microsoft Visual Studio</strong><span>”</span></div>
<div><span><span>•</span></span><span>SFTS006 – “Simplifying Data Parallel Applications for your Manycore Future” (</span><strong>Ct Technology</strong><span>)</span></div>
<div><span><span>•</span></span><span>SFTS007 – “</span><strong>Intel® Parallel Advisor Lite</strong><span>:<span> </span>The Easy Way to Introduce Threading”</span></div>
<div><span><span>•</span></span><span>SFTS010 – “Money Tree – </span><strong>Optimizing FSI Benchmarks</strong><span> with Intel® Software Tools for Multicore &amp; Manycore”</span></div>
<div>All will be posted by September 25, 2009, once the talks are completed.</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/09/23/presentations-at-idf-about-software-tools-available-for-download/feed/</wfw:commentRss>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: computing accelerations? or forces?</title>
		<link>http://software.intel.com/en-us/blogs/2009/09/22/n-bodies-a-parallel-tbb-solution-computing-accelerations-or-forces/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/09/22/n-bodies-a-parallel-tbb-solution-computing-accelerations-or-forces/#comments</comments>
		<pubDate>Tue, 22 Sep 2009 09:07:21 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
		
		<category><![CDATA[Open Source]]></category>

		<category><![CDATA[Parallel Programming]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<category><![CDATA[n-bodies]]></category>

		<category><![CDATA[parallel programming]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/09/22/n-bodies-a-parallel-tbb-solution-computing-accelerations-or-forces/</guid>
		<description><![CDATA[Robert finally deals with the eternal question, forces or accelerations?  Which is it more efficient to accumulate?]]></description>
			<content:encoded><![CDATA[<p>When considering the parallelization of some piece of code, my first concern is to be sure that the code I start with is optimized for serial execution. It does me little good to write a parallel version that just sops up the latency holes that inefficient code makes available. It may seem to scale well, but some future compiler may come along that does a better job of optimizing that inefficient kernel and suddenly the performance scaling might disappear.</p>
<table border="0" width="100%">
<tbody>
<tr>
<td valign="top">Even better is if I can squeeze those inefficiencies out of the algorithm itself. Previously, before <a href="http://software.intel.com/en-us/blogs/2009/09/14/n-bodies-a-parallel-tbb-solution-computing-accelerations/">defining the interaction loops</a>, I set up a data structure to represent each body, choosing mass, location, and velocity to represent the state, plus a place to accumulate accelerations. However, <a href="http://software.intel.com/en-us/blogs/2009/09/05/n-bodies-a-parallel-tbb-solution-body-data/">Jim Dempsey suggested</a> it might be more efficient to accumulate forces rather than accelerations. Forces are the canonical result of the gravitational equation, but divides are among the most expensive floating-point operations. Better to accumulate the influence of other bodies as forces first, then do one divide to compute the acceleration, right?</td>
<td><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090921-01.png"><img class="alignnone size-full wp-image-9942" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090921-01.png" alt="" width="162" height="141" /></a></td>
</tr>
</tbody>
</table>
<p>But where did that <em>F</em> come from?  We are actually dividing out a mass that was used to compute the original force. What if we never multiplied it in the first place?<br />
<a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090921-02.png"><img class="alignnone size-full wp-image-9943" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090921-02.png" alt="" width="499" height="113" /></a></p>
<p>The G-over-R-squared is a common factor that can be computed once. Two multiplies, which would be required anyway, give the accelerations directly, without any extra divides!   So we'll stick with the plan to add accelerations.</p>
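<p>The algebra can be checked with a small sketch that computes the pair of accelerations both ways. The names PairAcc, viaForce, and direct are hypothetical helpers invented for this illustration; they are not part of the n-bodies code in this series.</p>

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type: accelerations of bodies i and j.
struct PairAcc {
    double ai;
    double aj;
};

// Force route: multiply both masses in, then divide each one back out.
PairAcc viaForce(double G, double mi, double mj, double rsq) {
    double F = G * mi * mj / rsq;   // magnitude of the mutual force
    PairAcc r = { F / mi, F / mj }; // a = F / m costs a divide per body
    return r;
}

// Direct route: compute G/r^2 once, then one multiply per body.
PairAcc direct(double G, double mi, double mj, double rsq) {
    double GdivRsq = G / rsq;       // common factor
    PairAcc r = { GdivRsq * mj, GdivRsq * mi };
    return r;
}
```

<p>Both routes return the same magnitudes up to rounding, so the direct form gives up nothing except the mass divides.</p>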
<p>Next time: <a href="http://software.intel.com/en-us/blogs/2009/09/25/n-bodies-a-parallel-tbb-solution-realizing-addaccij/">realizing <em>addAcc</em>(i,j)</a></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/09/22/n-bodies-a-parallel-tbb-solution-computing-accelerations-or-forces/feed/</wfw:commentRss>
		</item>
		<item>
		<title>tbb::concurrent_vector in TBB 2.2</title>
		<link>http://software.intel.com/en-us/blogs/2009/09/19/tbbconcurrent_vector-in-tbb-22/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/09/19/tbbconcurrent_vector-in-tbb-22/#comments</comments>
		<pubDate>Sat, 19 Sep 2009 12:13:16 +0000</pubDate>
		<dc:creator>Anton Malakhov (Intel)</dc:creator>
		
		<category><![CDATA[Open Source]]></category>

		<category><![CDATA[Parallel Programming]]></category>

		<category><![CDATA[Software Engineering]]></category>

		<category><![CDATA[Threading Building Blocks]]></category>

		<category><![CDATA[concurrent]]></category>

		<category><![CDATA[concurrent container]]></category>

		<category><![CDATA[concurrent_vector]]></category>

		<category><![CDATA[container]]></category>

		<category><![CDATA[exception safety]]></category>

		<category><![CDATA[exceptions]]></category>

		<category><![CDATA[TBB 2.2]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/09/19/tbbconcurrent_vector-in-tbb-22/</guid>
		<description><![CDATA[Concurrent vector was also significantly reworked for TBB 2.2 in all areas: interface, documentation, and internal implementation.
Let's start from the interface. Methods push_back(), grow_by(), and grow_to_at_least() return iterator now. Returning iterator instead of index allows to avoid unnecessary address calculation for just inserted item(s) when a code needs to access it right after insertion:

// typedef [...]]]></description>
			<content:encoded><![CDATA[<p>Concurrent vector was <a title="Concurrent queue changes" href="http://software.intel.com/en-us/blogs/2009/09/04/concurrent-queue-changes-in-intelr-threading-building-blocks-22/">also</a> significantly reworked for <a title="What's new" href="http://software.intel.com/en-us/blogs/2009/08/04/whats-new-in-intel-tbb-22/">TBB 2.2</a> in all areas: interface, documentation, and internal implementation.<br />
Let's start with the interface. Methods push_back(), grow_by(), and grow_to_at_least() now return an iterator. <span id="more-9900"></span>Returning an iterator instead of an index avoids an unnecessary address calculation for the just-inserted item(s) when code needs to access them right after insertion:</p>
<blockquote>
<pre>// typedef tbb::concurrent_vector&lt; std::pair&lt;int,int&gt; &gt; cv_t;
cv_t::iterator it = cv.push_back( item ); // add new item
map[ item.first ] = &amp;( *it ); // store its address in a table</pre>
</blockquote>
<p>For grow_to_at_least(), which previously returned nothing, there is the additional benefit of delimiting the range that was allocated and initialized by this particular call and thread. For example:</p>
<blockquote>
<pre>cv_t cv( 100 ); // initial 100 items
assert( &amp;( *cv.grow_to_at_least(50) ) == &amp;cv[50] ); // doesn't allocate items
assert( &amp;( *cv.grow_to_at_least(250) ) == &amp;cv[100] ); // allocates additional 150 items</pre>
</blockquote>
<p style="justify;">The vector contains 100 items. grow_to_at_least(50) returns an iterator pointing to the item with index 50. But a subsequent grow_to_at_least(250) returns an iterator that points to item #100, which means that an additional 150 (= 250 - 100) items were added by this call.<br />
This matters more in a concurrent environment, because these items are guaranteed to be initialized, not merely allocated.  The same guarantee is already provided for grow_by() and all its items.<br />
If you wonder how items may be allocated but not initialized, please read my blog, which clarifies the semantics of size() and grow_to_at_least(); both were also fixed in 2.2 to allow safe parallel inspection. BTW, tbb::zero_allocator was added to support use cases from this <a title="Delusion of concurrent_vector's size" href="http://software.intel.com/en-us/blogs/2009/04/09/delusion-of-tbbconcurrent_vectors-size-or-3-ways-to-traverse-in-parallel-correctly/">blog</a>.</p>
<p style="justify;">In order to align with std::vector from C++0x, method compact() was renamed to shrink_to_fit(). However, the rename implies no change to the semantics or practical meaning of the method. It still defragments the first segments, as described in another <a title="secrets of memory organization" href="http://software.intel.com/en-us/blogs/2008/07/24/tbbconcurrent_vector-secrets-of-memory-organization/">blog</a> of mine, and it will not set the capacity exactly to the value of size().</p>
<p style="justify;">To close the topic of interface changes, I'll remind you that you can compile old programs with the new TBB as is, just by setting TBB_DEPRECATED=1, as described in the <a title="from TBB2.1 to TBB2.2" href="http://software.intel.com/en-us/blogs/2009/08/05/transitioning-from-intel-tbb-21-to-22/">transition blog</a>.</p>
<p style="justify;">Another part of the changes is hidden in the implementation, and by the fact that exceptions are rare for concurrent_vector. A few long-standing critical exception-safety bugs were fixed to comply with the declared guarantees. However, these fixes missed the 2.2 release and are available only in subsequent <a title="tbb22_20090809oss" href="http://www.threadingbuildingblocks.org/ver.php?fid=142">stable</a> and update releases.<br />
So please consider upgrading to the latest TBB after the 2.2 release in order to avoid a deadlock inside concurrent_vector that may occur, e.g., on an "out of memory" exception in all previous versions.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/09/19/tbbconcurrent_vector-in-tbb-22/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
