<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Blogs &#187; Robert Reed (Intel)</title>
	<atom:link href="http://software.intel.com/en-us/blogs/author/robert-reed/feed/" rel="self" type="application/rss+xml" />
	<link>http://software.intel.com/en-us/blogs</link>
	<description></description>
	<lastBuildDate>Fri, 25 May 2012 22:49:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>n-bodies: a parallel TBB solution: parallel code with parallel_invoke: will it run faster?</title>
		<link>http://software.intel.com/en-us/blogs/2010/07/23/n-bodies-a-parallel-tbb-solution-parallel-code-with-parallel_invoke-will-it-run-faster/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/07/23/n-bodies-a-parallel-tbb-solution-parallel-code-with-parallel_invoke-will-it-run-faster/#comments</comments>
		<pubDate>Sat, 24 Jul 2010 03:19:55 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Intel Parallel Amplifier]]></category>
		<category><![CDATA[n-bodies]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[parallel_invoke]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/07/23/n-bodies-a-parallel-tbb-solution-parallel-code-with-parallel_invoke-will-it-run-faster/</guid>
		<description><![CDATA[Robert finally brings this story to a close by demonstrating that with a sufficient threshold in the partitioning process, the combinatorial subdivision algorithm using parallel_invoke can run circles around the serial implementation of the n-body interaction problem.]]></description>
			<content:encoded><![CDATA[<p><a href="http://software.intel.com/en-us/blogs/2010/07/01/n-bodies-a-parallel-tbb-solution-parallel-code-balanced-recursive-parallelism-with-parallel_invoke/">Earlier in the month </a>I fleshed out a spatially arranged subdivision method I learned from Matteo Frigo but didn’t have time to actually run it and compare against my baselines.  And in the meantime my test machine has been regrooved into a Windows 7 box, so my first order of business is to retest my baselines.  I reran the serial algorithm a few times and averaged the results.  While the raw numbers (using the same compiler and machine but just a different OS) are slightly larger under Windows 7, their magnitude cannot be distinguished in a graph:<a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100723-01.png"><img class="alignnone size-full wp-image-17162" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100723-01.png" alt="" width="659" height="335" /></a></p>
<p>Moving forward with the Windows 7 serial numbers, next I did a run with the recursive parallel technique I described last time:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100723-02.png"><img class="alignnone size-full wp-image-17163" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100723-02.png" alt="" width="659" height="335" /></a></p>
<p>I included the simple parallel run for comparison.  Looks like there’s more work to do.  My “fast” technique is still not measuring up against either the serial version or the buggy simple parallel version.  Time for another hot spot analysis:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100723-03.png"><img class="alignnone size-full wp-image-17164" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100723-03.png" alt="" width="616" height="414" /></a></p>
<p>Curious.  Most of the work is done in addAcc(), yet most of the time is being spent in rect_interact(), which just splits up rectangles.  Maybe I’m being too aggressive at subdividing.  Yup, it turns out this is another load balance problem.  And the test program comes equipped with a grain size control in the form of a parameter, <em>fastgrain</em>. You can experiment with various thresholds, but using a grain of 16 makes a substantial shift in the hot spots:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100723-04.png"><img class="alignnone size-full wp-image-17165" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100723-04.png" alt="" width="603" height="471" /></a></p>
<p>body_interact() is now taking less time than addAcc() (where it had been taking more than four times as much before).  And if I collect a set of runs with fastgrain 16, there’s a big change to the performance graph:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100723-05.png"><img class="alignnone size-full wp-image-17166" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100723-05.png" alt="" width="659" height="335" /></a></p>
<p>These runs start out like the previous parallel algorithms, but by 8 bodies, the curve has already fallen below that of the unsafe, simple parallel method, and by 64 bodies it’s beating the baseline serial numbers.  At 4096 bodies this algorithm is now running over 4x faster than the serial version.  Success!  Could we do even better with a larger grain size?  Maybe marginally, but note in the latter hot spot result that TBB internal code, including the idle spinner, is already taking more time than the code doing real work.  It’s possible that a higher threshold will just result in more idle time.  Something to look into another day.</p>
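<p>The post names the <em>fastgrain</em> parameter but does not show where it enters the recursion. What follows is a minimal serial sketch of one plausible arrangement, not the test program's actual code. The names calls_for, pair_count and call_count are hypothetical, addAcc() is reduced to a counter, and the paired calls that would go to parallel_invoke() are simply made in order. It assumes the threshold is compared against the extent of each triangle and rectangle, and that a triangle too small to split handles its own pairs in a plain loop.</p>

```cpp
#include <cassert>

// Hypothetical stand-ins: the post names fastgrain and addAcc() but does not
// show how the threshold is applied, so this arrangement is an assumption.
static int fastgrain = 16;    // smallest extent still worth subdividing
static long pair_count = 0;   // interactions performed
static long call_count = 0;   // recursive calls made (a proxy for overhead)

static void addAcc(int /*i*/, int /*j*/) { ++pair_count; }

// Rectangle of pairs (i, j) with i in [i0, i1) and j in [j0, j1).
static void rect_interact(int i0, int i1, int j0, int j1) {
    ++call_count;
    int di = i1 - i0, dj = j1 - j0;
    if (di > fastgrain && dj > fastgrain) {
        int im = i0 + di / 2, jm = j0 + dj / 2;
        // In a TBB version each pair of calls would go to parallel_invoke.
        rect_interact(i0, im, j0, jm);
        rect_interact(im, i1, jm, j1);
        rect_interact(i0, im, jm, j1);
        rect_interact(im, i1, j0, jm);
    } else {
        for (int i = i0; i < i1; ++i)
            for (int j = j0; j < j1; ++j)
                addAcc(i, j);
    }
}

// Triangle of pairs (i, j) with lo <= i < j < hi.
static void body_interact(int lo, int hi) {
    ++call_count;
    int d = hi - lo;
    if (d > fastgrain) {
        int k = lo + d / 2;
        body_interact(lo, k);
        body_interact(k, hi);
        rect_interact(lo, k, k, hi);
    } else {
        // A leaf triangle now holds real work, so handle its pairs serially.
        for (int i = lo; i < hi; ++i)
            for (int j = i + 1; j < hi; ++j)
                addAcc(i, j);
    }
}

// Runs the serial walk for n bodies and reports how many calls it took.
static long calls_for(int n, int grain) {
    fastgrain = grain;
    pair_count = 0;
    call_count = 0;
    body_interact(0, n);
    assert(pair_count == static_cast<long>(n) * (n - 1) / 2);
    return call_count;
}
```

<p>Comparing calls_for(4096, 1) with calls_for(4096, 16) gives a rough feel for how much recursion overhead a coarser grain removes while the number of interactions stays at n(n-1)/2.</p>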
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/07/23/n-bodies-a-parallel-tbb-solution-parallel-code-with-parallel_invoke-will-it-run-faster/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: parallel code: balanced recursive parallelism with parallel_invoke</title>
		<link>http://software.intel.com/en-us/blogs/2010/07/01/n-bodies-a-parallel-tbb-solution-parallel-code-balanced-recursive-parallelism-with-parallel_invoke/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/07/01/n-bodies-a-parallel-tbb-solution-parallel-code-balanced-recursive-parallelism-with-parallel_invoke/#comments</comments>
		<pubDate>Fri, 02 Jul 2010 00:37:03 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[Intel]]></category>
		<category><![CDATA[n-bodies]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[parallel_invoke]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/07/01/n-bodies-a-parallel-tbb-solution-parallel-code-balanced-recursive-parallelism-with-parallel_invoke/</guid>
		<description><![CDATA[Having sketched out a new way of looking at the interactions of the n-body problem, Robert realizes the method with some C++ code using parallel_invoke from Intel Threading Building Blocks.]]></description>
			<content:encoded><![CDATA[<p><a href="http://software.intel.com/en-us/blogs/2010/07/01/n-bodies-a-parallel-tbb-solution-parallel-code-a-fresh-look-using-recursive-parallelism/">Last time</a>, after struggling with different lock configurations to reduce synchronization overhead managing the interactions of <em>n</em>-squared bodies, I changed perspectives on the problem by spatially representing the interactions between all the bodies and (re-)discovering in that view a means to group the interactions so that independent threads could work together without having to worry about locking the data.</p>
<p>This time I’ll turn the concept into some sample code, but first a reminder of that partitioning scheme.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100701-01.png"><img class="alignnone size-medium wp-image-16825" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100701-01-300x267.png" alt="" width="300" height="267" /></a></p>
<p>Because the ranges of <em>i</em> and <em>j</em> are disjoint between triangles A and B, I can set two threads loose, one on each, and let them handle the interactions between bodies represented by points in the region.  I can’t touch any of the pairs represented in rectangle C while A and B are being worked, but I can wait until A and B are done, then start C.</p>
<p>So it looks like the algorithm will have two steps, with two different threads of recursion.  Each triangle will be split into two, smaller triangles and an adjacent rectangle; each rectangle will be subdivided into four smaller rectangles.  I can keep splitting until there’s only one point left, but that might prove to be overkill.</p>
<p>I’ll start with the triangle subdivision:</p>
<pre name="code" class="cpp">void
body_interact(int i, int j)
{
    int d = j - i;
    if (d &gt; 1) {
        int k = d/2 + i;
        parallel_invoke([&amp;]() {body_interact(i,k);}, [&amp;](){body_interact(k,j);});
        rect_interact(i, k, k, j);
    }
}</pre>
<p>Here’s a function called body_interact(), which takes a pair of indices representing the range of the incoming triangle.  The function finds the midpoint of the defining interval and uses that point, if it is distinct from the endpoints, to define the next pair of triangles and rectangle.  It schedules calls to itself for each of the triangles, and then calls rect_interact() to handle the adjacent rectangle.  parallel_invoke() does not return until both (or all) of the calls it was given have completed, which guarantees that interactions in the rectangle will not start until the work in the triangles is done.</p>
<p>This triangle code is but a subset of what is needed for the rectangle recursion.  For it I need two midpoints and two parallel recursion calls:</p>
<pre name="code" class="cpp">void
rect_interact(int i0, int i1, int j0, int j1)
{
    int di = i1 - i0; int dj = j1 - j0;

    if (di &gt; 1 &amp;&amp; dj &gt; 1) {
        int im = i0 + di/2;
        int jm = j0 + dj/2;
        parallel_invoke([&amp;]() {rect_interact(i0, im, j0, jm);},
                        [&amp;]() {rect_interact(im, i1, jm, j1);});
        parallel_invoke([&amp;]() {rect_interact(i0, im, jm, j1);},
                        [&amp;]() {rect_interact(im, i1, j0, jm);});
    }

    else {
        for (int i = i0; i &lt; i1; ++i)
            for (int j = j0; j &lt; j1; ++j)
                addAcc(i, j);
    }
}</pre>
<p>Two balanced parallel operations are invoked when there is enough work to subdivide on both axes.  Otherwise, the else condition drops into the actual interaction function, addAcc().  (You might have noticed that body_interact() does not even call addAcc(). Because of the half-open interval notation I’m using, a triangle subdivision that reduces to a single interaction node is in fact pointing at a node representing the interaction of a body with itself, so we can ignore them.)</p>
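<p>One way to gain confidence in the pair of functions above is to run them serially and count how often each pair reaches addAcc(). The sketch below does that under a few assumptions: parallel_invoke() is replaced by plain sequential calls (which preserves the set of pairs visited, though not the timing), addAcc() is a stand-in that only records its arguments, and the helper covers_each_pair_once() is a hypothetical name introduced here.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// visits[i * n + j] counts how often the pair (i, j) reached addAcc().
static std::vector<int> visits;
static int n_bodies = 0;

// Stand-in for the real addAcc(): it only records the pair.
static void addAcc(int i, int j) { ++visits[i * n_bodies + j]; }

static void rect_interact(int i0, int i1, int j0, int j1) {
    int di = i1 - i0, dj = j1 - j0;
    if (di > 1 && dj > 1) {
        int im = i0 + di / 2, jm = j0 + dj / 2;
        rect_interact(i0, im, j0, jm);   // first parallel_invoke pair
        rect_interact(im, i1, jm, j1);
        rect_interact(i0, im, jm, j1);   // second parallel_invoke pair
        rect_interact(im, i1, j0, jm);
    } else {
        for (int i = i0; i < i1; ++i)
            for (int j = j0; j < j1; ++j)
                addAcc(i, j);
    }
}

static void body_interact(int i, int j) {
    int d = j - i;
    if (d > 1) {
        int k = d / 2 + i;
        body_interact(i, k);
        body_interact(k, j);
        rect_interact(i, k, k, j);
    }
}

// True when every pair with i < j was visited exactly once and no other
// cell of the grid (including the diagonal) was visited at all.
static bool covers_each_pair_once(int n) {
    n_bodies = n;
    visits.assign(static_cast<std::size_t>(n) * n, 0);
    body_interact(0, n);
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (visits[i * n + j] != (i < j ? 1 : 0)) return false;
    return true;
}
```

<p>Trying odd and even body counts, for example covers_each_pair_once(7) and covers_each_pair_once(64), exercises the uneven splits that integer division produces.</p>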
<p>With this code, I can go back to the original addAcc(), which updated the acceleration vectors of both interacting bodies without concern about data races and without any locks.  There are still locks, but they are hidden in the library code that handles the synchronization of the rendezvousing threads within the parallel_invoke() call.  But will it run faster?  We’ll find out <a href="http://software.intel.com/en-us/blogs/2010/07/23/n-bodies-a-parallel-tbb-solution-parallel-code-with-parallel_invoke-will-it-run-faster/">next time</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/07/01/n-bodies-a-parallel-tbb-solution-parallel-code-balanced-recursive-parallelism-with-parallel_invoke/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: parallel code: a fresh look using recursive parallelism</title>
		<link>http://software.intel.com/en-us/blogs/2010/07/01/n-bodies-a-parallel-tbb-solution-parallel-code-a-fresh-look-using-recursive-parallelism/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/07/01/n-bodies-a-parallel-tbb-solution-parallel-code-a-fresh-look-using-recursive-parallelism/#comments</comments>
		<pubDate>Thu, 01 Jul 2010 18:33:47 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[n-bodies]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/07/01/n-bodies-a-parallel-tbb-solution-parallel-code-a-fresh-look-using-recursive-parallelism/</guid>
		<description><![CDATA[After beating his head against various configurations of locks, none successful at beating serial performance, Robert switches tactics and tries another way of looking at the n-body interaction problem, sketching out an alternative that uses balanced, recursive parallelism.]]></description>
			<content:encoded><![CDATA[<p>When <a href="http://software.intel.com/en-us/blogs/2010/06/04/n-bodies-a-parallel-tbb-solution-parallel-code-spreading-the-fix-around/">last I had a chance to play with this code</a>, I experimented with using multiple locks to enable multiple simultaneous (and disjoint) interactions between pairs of bodies.  It helped, but performance still didn’t beat the single-threaded baseline.  Overhead in the loop could be reduced by using only one scoped lock instead of two, but that would require an array of locks indexed by <em>i</em> and <em>j</em>.</p>
<pre name="code" class="cpp">    // apply acceleration components using unit vectors
    {
        MyLockType::scoped_lock mylock(ijlocks[i][j]);

        for (int axis = 0; axis &lt; 3; ++axis) {
            body[j].acc[axis] += aj*ud[axis];
            body[i].acc[axis] -= ai*ud[axis];
        }
    }</pre>
<p>Rather than pursuing that right now, let me take a step back and try to look at the problem differently. I can represent all the interactions between pairs of bodies, <em>i</em> and <em>j</em>, in a simple graph:<br />
<a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100630-01.png"><img class="alignnone size-medium wp-image-16780" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100630-01-300x267.png" alt="" width="300" height="267" /></a></p>
<p>The green triangle above covers the area where <em>i</em> ≤ <em>j</em>.  Skipping the points along the diagonal (we don’t need to consider interactions of bodies with themselves), all the other integral points within the triangle can represent unique interactions between pairs of bodies.  Dividing the outer loop between a set of threads is like splitting the triangle along horizontal lines as shown above.  Each sub-region can be handled by a separate thread.</p>
<p>One problem becomes immediately apparent by drawing this diagram: the sub-regions have different sizes.  That means the number of interaction pairs within each varies—the thread that gets the bottom sub-region has a lot more work to do than the one getting the topmost region.  The bottom thread will be working for a long time after the top thread finishes.  This is a great example of what is called a <a href="http://en.wikipedia.org/wiki/Load_balancing_(computing)"><em>load balance</em> </a>problem.</p>
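<p>The imbalance is easy to quantify. As a small sketch (the function name pairs_per_strip is hypothetical, and it assumes the outer loop over <em>i</em> is cut into contiguous, equally sized strips with <em>j</em> running over every index above <em>i</em>), this counts the interaction pairs that land in each strip:</p>

```cpp
#include <cassert>
#include <vector>

// Splits the outer loop range [0, n) into `strips` contiguous chunks of i
// and returns the number of (i, j) pairs with i < j that fall in each chunk.
static std::vector<long> pairs_per_strip(int n, int strips) {
    assert(n >= 0 && strips > 0);
    std::vector<long> counts(strips, 0);
    for (int s = 0; s < strips; ++s) {
        int begin = static_cast<int>(static_cast<long>(n) * s / strips);
        int end = static_cast<int>(static_cast<long>(n) * (s + 1) / strips);
        for (int i = begin; i < end; ++i)
            counts[s] += n - 1 - i;   // body i pairs with every j > i
    }
    return counts;
}
```

<p>With 8 bodies and two strips, one strip holds 22 of the 28 pairs and the other only 6, so one thread would have nearly four times the work of the other.</p>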
<p>A little more thought will net the realization that while each thread in the subdivision above has its own range of <em>i</em> values, they must share <em>j</em> values: any set of interactions of different <em>i</em> that have the same <em>j</em> are <em>collisions</em> that are either races or restricted by locks to the execution of a single thread.  What might help is if I could add some vertical partitions to this graph, and gain some separation between the threads like what happened with the <em>i</em> values.</p>
<p>It turns out there is a way to do this, using a combinatorial algorithm I learned from Matteo Frigo.  Considering the same interaction graph, this time find the midpoint of the diagonal and use it to draw a couple triangles like this:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100630-02.png"><img class="alignnone size-medium wp-image-16781" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100630-02-300x267.png" alt="" width="300" height="267" /></a></p>
<p>The triangles A and B have a particular property of great interest: the ranges of <em>i</em> <span style="text-decoration: underline">and </span><em>j</em> are completely disjoint.  A thread working in the range of A could pick any contained interaction pair and be guaranteed not to interfere with another thread doing the same thing within triangle B.  Two threads could simultaneously cover the interactions represented by the pair of triangles, each covering roughly the same number of interactions and thus load balanced.</p>
<p>That handles two threads.  What about more?  Well, following that great adage, “whatever is worth doing is worth overdoing,” I can repeat that partition on the new triangles:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100630-03.png"><img class="alignnone size-medium wp-image-16782" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100630-03-300x157.png" alt="" width="300" height="157" /></a></p>
<p>Now in this triangle there are eight separate regions, each of which could host a separate thread operating without interfering with the adjacent ones.  However, more and more of our available interaction space is turning dark green, representing as yet unhandled pairs of interactions.  Fortunately these are easy to handle.  In fact, the triangle interactions are a subset of the rectangle interactions.  Consider the biggest rectangle:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100630-04.png"><img class="alignnone size-full wp-image-16783" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/07/B100630-04.png" alt="" width="274" height="267" /></a></p>
<p>I can split this rectangle into four sub-rectangles.  In this arrangement, I can turn two threads loose, each one targeting interactions in one of the two light green rectangles.  The disjoint property holds for those two regions.  When those two threads are both done, I can move one each to the two dark green regions because they also have the disjoint index range property.  And what I can do once I can also repeat here, subdividing these rectangles and generating as much balanced parallel work as I need to occupy all the available threads.</p>
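<p>The disjoint property can be stated as a simple test on index ranges. The sketch below is only an illustration: the Region and Quadrants types and the functions overlaps(), can_run_together() and split() are hypothetical names, and regions are described by half-open ranges of <em>i</em> and <em>j</em>. Two regions are treated as safe to run at the same time only when no body index appears in both, whether it is used as an <em>i</em> or as a <em>j</em>.</p>

```cpp
#include <cassert>

// A region of the interaction graph: i in [i0, i1), j in [j0, j1).
struct Region {
    int i0, i1, j0, j1;
};

// Two non-empty half-open intervals [a0, a1) and [b0, b1) share an index.
static bool overlaps(int a0, int a1, int b0, int b1) {
    return a0 < a1 && b0 < b1 && a0 < b1 && b0 < a1;
}

// Two regions can run concurrently without locks only if no body index
// appears in both, whether it is used as an i or as a j.
static bool can_run_together(const Region& a, const Region& b) {
    return !overlaps(a.i0, a.i1, b.i0, b.i1) &&
           !overlaps(a.i0, a.i1, b.j0, b.j1) &&
           !overlaps(a.j0, a.j1, b.i0, b.i1) &&
           !overlaps(a.j0, a.j1, b.j0, b.j1);
}

// The four quadrants of a rectangle split at the midpoint of each axis.
struct Quadrants {
    Region low_low, high_high, low_high, high_low;
};

static Quadrants split(const Region& r) {
    int im = r.i0 + (r.i1 - r.i0) / 2;
    int jm = r.j0 + (r.j1 - r.j0) / 2;
    return Quadrants{
        Region{r.i0, im, r.j0, jm}, Region{im, r.i1, jm, r.j1},
        Region{r.i0, im, jm, r.j1}, Region{im, r.i1, r.j0, jm}};
}
```

<p>For a rectangle such as Region{0, 8, 8, 16}, the two diagonal quadrants pass can_run_together(), and so do the two off-diagonal ones, while any two quadrants that sit side by side fail it because they share a range of <em>i</em> or of <em>j</em>. That is why the work is done in two phases.</p>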
<p>We’ll look at some sample code for doing this<a href="http://software.intel.com/en-us/blogs/2010/07/01/n-bodies-a-parallel-tbb-solution-parallel-code-balanced-recursive-parallelism-with-parallel_invoke/"> next time</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/07/01/n-bodies-a-parallel-tbb-solution-parallel-code-a-fresh-look-using-recursive-parallelism/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: parallel code: spreading the “fix” around</title>
		<link>http://software.intel.com/en-us/blogs/2010/06/04/n-bodies-a-parallel-tbb-solution-parallel-code-spreading-the-fix-around/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/06/04/n-bodies-a-parallel-tbb-solution-parallel-code-spreading-the-fix-around/#comments</comments>
		<pubDate>Fri, 04 Jun 2010 23:42:43 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[n-bodies]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/06/04/n-bodies-a-parallel-tbb-solution-parallel-code-spreading-the-fix-around/</guid>
		<description><![CDATA[If one lock is a bottleneck to interacting bodies, what about a lock per body?]]></description>
			<content:encoded><![CDATA[<p><a href="http://software.intel.com/en-us/blogs/2010/05/05/n-bodies-a-parallel-tbb-solution-parallel-code-finding-a-fix-for-the-leaky-adds/">Last time </a>I was able to make the n-bodies acceleration code at least thread safe by employing a scoped lock, at a disastrous cost in performance.  If you think about it, it’s a bad way to manage the eight HW threads my test machine has available.  The obvious alternative is to have a lock per body—any thread needing to adjust a pair of bodies would need to acquire each body’s lock before proceeding.  That’s more locking overhead than before—twice as many locks—but enough independent locks that my multiple threads won’t all be stopped by one body interaction.  Making that change, the body structure now looks like this:</p>
<pre name="code" class="cpp">#ifdef ALIGN_BODIES
__declspec(align(128))
#endif
struct bodytype {
    double pos[3];  // body position in three axes
    double vel[3];  // body velocity
    double acc[3];  // body acceleration
    double mass;    // body mass
#ifdef BODY_LOCKS //{
    MyLockType lock; // local access lock
#ifdef ALIGN_BODIES //{
    unsigned char dummy[128-(10*sizeof(double)+sizeof(MyLockType))]; // 10 is the number of doubles as data
#endif //}
#else //}{
#ifdef ALIGN_BODIES //{
    unsigned char dummy[128-(10*sizeof(double))]; // 10 is the number of doubles as data
#endif //}
#endif //}
} body[BODYMAX];</pre>
<p>I’ve added a new conditional, BODY_LOCKS, which when defined creates a lock object of MyLockType in each body object.  And I added the size of that object to the dummy field at the end so that I maintain cache alignment for each of the bodies.  But compiling it...</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/06/B100604-01.png"><img class="alignnone size-full wp-image-16270" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/06/B100604-01.png" alt="" width="729" height="219" /></a></p>
<p>Oops. How many free bytes were there before adding the lock? Hmm.... (128 – 10* sizeof (double)) = (128 – 10* 8) = 48 bytes. So I guess the lock structure must take more than 48 bytes. And if I double the size of the body structure (using 256 instead of 128), that does fix the errors. Does it have any other impact? And for that matter, what is the cost of not padding the bodies into separate cache lines? Taking a few runs to test, we get numbers like these for the serial run (which have an insignificant impact in the overall graph):</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/06/B100604-02.png"><img class="alignnone size-full wp-image-16271" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/06/B100604-02.png" alt="" width="242" height="261" /></a></p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/06/B100604-03.png"><img class="alignnone size-full wp-image-16272" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/06/B100604-03.png" alt="" width="659" height="311" /></a></p>
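<p>As an aside on the padding arithmetic: a hand-computed dummy array breaks as soon as the payload outgrows the block it is subtracted from. A possible alternative, sketched here and not what the test program does, is to let the compiler add the tail padding with the standard C++11 alignas specifier. In this sketch std::mutex stands in for MyLockType, and PaddedBody, kLineBytes and padded_size() are hypothetical names.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

constexpr std::size_t kLineBytes = 128;  // alignment used in the post

// Rounds `bytes` up to the next multiple of `line`: the size a padded body
// ends up with once its payload no longer fits in one aligned block.
constexpr std::size_t padded_size(std::size_t bytes, std::size_t line) {
    return (bytes + line - 1) / line * line;
}

// Hypothetical variant of bodytype: std::mutex stands in for MyLockType, and
// alignas lets the compiler add the tail padding instead of a dummy array.
struct alignas(kLineBytes) PaddedBody {
    double pos[3];   // body position in three axes
    double vel[3];   // body velocity
    double acc[3];   // body acceleration
    double mass;     // body mass
    std::mutex lock; // local access lock
};

static_assert(alignof(PaddedBody) == kLineBytes,
              "each body starts on a 128-byte boundary");
static_assert(sizeof(PaddedBody) % kLineBytes == 0,
              "array elements keep that alignment");
```

<p>With this form the structure grows to the next multiple of 128 bytes on its own when the lock is larger than the free space, instead of producing a negative array size.</p>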
<p>These effects may be different when I  actually turn on multiple threads (and have them deal with bodies split across cache lines) so I’ll take another look then, but for now I’ll go to the 2-cache line body to flesh out the lock-per-body approach to thread safety. The code change to use these locks:</p>
<pre name="code" class="cpp">    // F = G*mi*mj/distsq, but F = ma, so ai = G*mj/distsq
    double Gdivd = GFORCE * ivdist * ivdist;
    double ai = Gdivd*body[j].mass;
    double aj = Gdivd*body[i].mass;

    // apply acceleration components using unit vectors
    {
        MyLockType::scoped_lock locki(body[i].lock);
        MyLockType::scoped_lock lockj(body[j].lock);

        for (int axis = 0; axis &lt; 3; ++axis) {
            body[j].acc[axis] += aj*ud[axis];
            body[i].acc[axis] -= ai*ud[axis];
        }
    }</pre>
<p>Updating accelerations now requires two scoped locks, conveniently called <em>locki</em> and <em>lockj</em>. Also conveniently, the order in which I am acquiring new <em>i</em>s and <em>j</em>s, using a half triangle of the interaction plane (touching each <em>i</em>-<em>j</em> pair once, updating both), means that <em>j</em> is always larger than <em>i</em>.  This gives a hierarchical ordering of the locks: whichever thread is accessing a pair of bodies will always acquire the lock for the smaller-numbered body before the larger-numbered one, which rules out deadlock (it would work equally well the other way, as long as all the threads do it the same way).</p>
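<p>If the caller could not promise that <em>j</em> is larger than <em>i</em>, the same ordering rule can be enforced inside the update itself. Here is a minimal sketch of that idea using the standard library in place of TBB: std::mutex and std::lock_guard stand in for MyLockType and its scoped_lock, and LockedBody, bodies and apply_pair() are hypothetical names with a single accumulator per body.</p>

```cpp
#include <cassert>
#include <mutex>
#include <utility>

// Hypothetical miniature of the body array: one accumulator and one lock each.
struct LockedBody {
    double acc = 0.0;
    std::mutex lock;   // std::mutex stands in for MyLockType
};

constexpr int kBodies = 4;
static LockedBody bodies[kBodies];

// Adds `delta` to body b and subtracts it from body a, always taking the
// lower-numbered lock first so two threads can never wait on each other
// in a cycle.
static void apply_pair(int a, int b, double delta) {
    assert(a != b && a >= 0 && b >= 0 && a < kBodies && b < kBodies);
    int first = a, second = b;
    if (second < first) std::swap(first, second);

    std::lock_guard<std::mutex> lock_first(bodies[first].lock);
    std::lock_guard<std::mutex> lock_second(bodies[second].lock);
    bodies[b].acc += delta;
    bodies[a].acc -= delta;
}
```

<p>The standard library's std::lock() function is another option: it acquires several mutexes together without deadlock, so the code does not have to impose an order of its own.</p>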
<p>Nevertheless, though it is faster than the single, global lock version used before, it still isn’t as fast as doing it serially.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/06/B100604-04.png"><img class="alignnone size-full wp-image-16273" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/06/B100604-04.png" alt="" width="659" height="311" /></a></p>
<p>My use of locks doesn't seem to be getting me far enough.  Maybe <a href="http://software.intel.com/en-us/blogs/2010/07/01/n-bodies-a-parallel-tbb-solution-parallel-code-a-fresh-look-using-recursive-parallelism/">next time I'll take a step back </a>and see if there is a different approach.</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/06/04/n-bodies-a-parallel-tbb-solution-parallel-code-spreading-the-fix-around/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: parallel code: finding a fix for the leaky adds</title>
		<link>http://software.intel.com/en-us/blogs/2010/05/05/n-bodies-a-parallel-tbb-solution-parallel-code-finding-a-fix-for-the-leaky-adds/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/05/05/n-bodies-a-parallel-tbb-solution-parallel-code-finding-a-fix-for-the-leaky-adds/#comments</comments>
		<pubDate>Thu, 06 May 2010 01:13:42 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[Intel]]></category>
		<category><![CDATA[n-bodies]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[scoped locks]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/05/05/n-bodies-a-parallel-tbb-solution-parallel-code-finding-a-fix-for-the-leaky-adds/</guid>
		<description><![CDATA[Wherein Robert does the simplest thing to make the body interaction code safe, and looks at TBB scoped locks in the process.  Sometimes the simplest thing is not the fastest.]]></description>
			<content:encoded><![CDATA[<p><a href="http://software.intel.com/en-us/blogs/2010/02/23/n-bodies-a-parallel-tbb-solution-parallel-code-first-runs-fatal-flaw/">Last time</a> (yeah, it's been a while, and an old joke) I revealed through the use of Intel® Parallel Inspector that the obvious means to empower the interaction code by applying a parallel_for not only didn't run all that much faster, but also was so full of <em>race conditions</em> (places where threads might write and read shared variables in unpredictable orders) that its answers are probably way off.</p>
<pre name="code" class="cpp">    // F = G*mi*mj/distsq, but F = ma, so ai = G*mj/distsq
    double Gdivd = GFORCE * ivdist * ivdist;
    double ai = Gdivd*body[j].mass;
    double aj = Gdivd*body[i].mass;

    // apply acceleration components using unit vectors
    for (int axis = 0; axis &lt; 3; ++axis) {
        body[j].acc[axis] += aj*ud[axis];
        body[i].acc[axis] -= ai*ud[axis];
    }</pre>
<p>The lines revealed as problematic in my previous correctness testing were those comprising the last two calculations, adjusting the acceleration fields in each of the two bodies interacting in the <em>addAcc</em>() call. Since each HW thread must load the current acceleration value into a local register in order to perform the summing operation, two threads may read the same value before either has a chance to update it, leaving after both writes one update overwritten (and lost) by the other.</p>
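<p>A real race is hard to reproduce on demand, but the lost update can be replayed deterministically by spelling out the load, add and store steps by hand. This is a single-threaded simulation, not the n-bodies code: SimThread and the two interleaving functions are hypothetical names used only to show the ordering problem.</p>

```cpp
#include <cassert>

// One simulated hardware thread performing `acc += delta` in the three steps
// a real core uses: load into a register, add, store back.
struct SimThread {
    double reg;     // private register
    double delta;   // amount this thread wants to add

    void load(const double& shared) { reg = shared; }
    void add() { reg += delta; }
    void store(double& shared) const { shared = reg; }
};

// Both threads load before either stores: the second store overwrites the first.
static double racy_interleaving(double start, double d1, double d2) {
    double acc = start;
    SimThread t1{0.0, d1}, t2{0.0, d2};
    t1.load(acc);
    t2.load(acc);      // reads the same stale value t1 read
    t1.add();
    t2.add();
    t1.store(acc);
    t2.store(acc);     // t1's contribution is lost here
    return acc;
}

// Each thread finishes its load-add-store before the other begins.
static double serialized_interleaving(double start, double d1, double d2) {
    double acc = start;
    SimThread t1{0.0, d1}, t2{0.0, d2};
    t1.load(acc); t1.add(); t1.store(acc);
    t2.load(acc); t2.add(); t2.store(acc);
    return acc;
}
```

<p>Starting from 10 and adding 1 and 2, the racy ordering ends at 12 where the serialized ordering ends at 13: one of the two updates has vanished.</p>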
<p>To avoid the update races, a mechanism is needed to ensure that any thread can do both the read and the write before any other thread starts the same operation from another body (or rather, a thread working on the same <em>j</em>-body from a different <em>i</em>-body). The typical mechanism is to employ a <em>lock</em>, a location manipulated by special pieces of code that can guarantee only one thread can <em>own</em> the lock at a time. Using a lock, I can create a <em>critical section</em> (also called a <em>monitor</em>), a region of code where only one thread can operate at a time.</p>
<p>It's very easy to do this in Intel Threading Building Blocks, using <em>scoped locks</em>:</p>
<pre name="code" class="cpp">    typedef tbb::mutex MyLockType;
    MyLockType L;

        // F = G*mi*mj/distsq, but F = ma, so ai = G*mj/distsq
        double Gdivd = GFORCE * ivdist * ivdist;
        double ai = Gdivd*body[j].mass;
        double aj = Gdivd*body[i].mass;

        // apply acceleration components using unit vectors
        {
            MyLockType::scoped_lock lock(L);

            for (int axis = 0; axis &lt; 3; ++axis) {
                body[j].acc[axis] += aj*ud[axis];
                body[i].acc[axis] -= ai*ud[axis];
            }
        }</pre>
<p>Scoped locks are pretty cool because they take advantage of the basic language notion of object lifetimes. In the example above, I create a specific lock type (tbb::mutex) and create an instance of the lock, L. Then in the region where I want to use the lock, I create a separate scoped_lock object, <em>lock</em>, and tie it to object L in the constructor. This <em>acquires</em> the lock. The lifetime of this object is the region of the compound statement in which it is defined. When it <em>passes out of scope</em> (a thread leaves this region), then the secondary object goes away and the lock is <em>released</em>, while the lock object (L) itself sticks around for the next locking opportunity.</p>
<p>What’s cool is that I don’t need explicitly to release the lock, meaning simpler code.  Moreover, if there was something in my for-loop that caused an <em>exception</em> (an interrupt to normal processing), the semantics of the language guarantee that the scoped object will go away and the lock will be released automatically. That's very cool.</p>
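<p>Here is a small sketch of that exception guarantee. It is an assumption-laden stand-in: I use the standard library's <em>std::mutex</em> and <em>std::lock_guard</em> in place of tbb::mutex and its scoped_lock so the example needs nothing beyond a C++11 compiler, and the names <em>accLock</em>, <em>update_that_throws</em> and <em>lock_released_after_exception</em> are hypothetical:</p>

```cpp
#include <mutex>
#include <stdexcept>

std::mutex accLock;  // stand-in for the post's lock object L (hypothetical name)

// Acquire the lock through a scoped guard, then fail before leaving the block.
void update_that_throws() {
    std::lock_guard<std::mutex> guard(accLock);  // acquires accLock
    throw std::runtime_error("simulated failure inside the critical section");
    // no explicit unlock: guard's destructor runs during stack unwinding
}

// Returns true if the lock can be taken again after the exception propagated.
bool lock_released_after_exception() {
    try {
        update_that_throws();
    } catch (const std::runtime_error&) {
        // swallow the simulated failure
    }
    bool free_again = accLock.try_lock();  // would fail if the guard had leaked the lock
    if (free_again) accLock.unlock();
    return free_again;
}
```

<p>If the guard did its job, the second function should report that the lock is free again, even though no line of code ever released it explicitly.</p>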
<p>But that’s about as far as the coolness goes. Viewed from a different angle, this is pretty gross code. I've created a single lock to service however many bodies and threads that may exist when I run it. That means that even in cases where a pair of threads are operating <span style="text-decoration: underline;">on completely different pairs of bodies</span>, they still must wait their turn. That's a lot of serialization, suggesting that the resulting code will be much slower, which it is:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/05/B100503-01.png"><img class="alignnone size-full wp-image-15774" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/05/B100503-01.png" alt="" width="167" height="261" /></a></p>
<p>Well, that's not going to win any performance awards. Representing that versus my previous graph:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/05/B100503-02.png"><img class="alignnone size-full wp-image-15775" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/05/B100503-02.png" alt="" width="659" height="311" /></a></p>
<p>I'll have to do better if I plan to show a beneficial parallelization. At least the code is thread-safe now:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/05/B100503-03.png"><img class="alignnone size-full wp-image-15776" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/05/B100503-03.png" alt="" width="685" height="174" /></a></p>
<p>Or, at least the reported races are not in the bodies code (mostly). Maybe I'll explore that, but first, what happens if the <a href="http://software.intel.com/en-us/blogs/2010/06/04/n-bodies-a-parallel-tbb-solution-parallel-code-spreading-the-fix-around/">threads don't pile up</a> on a single, global lock?</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/05/05/n-bodies-a-parallel-tbb-solution-parallel-code-finding-a-fix-for-the-leaky-adds/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: parallel code: so what does TBB_USE_THREADING_TOOLS do?</title>
		<link>http://software.intel.com/en-us/blogs/2010/04/08/n-bodies-a-parallel-tbb-solution-parallel-code-so-what-does-tbb_use_threading_tools-do/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/04/08/n-bodies-a-parallel-tbb-solution-parallel-code-so-what-does-tbb_use_threading_tools-do/#comments</comments>
		<pubDate>Thu, 08 Apr 2010 23:34:13 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[Intel Parallel Inspector]]></category>
		<category><![CDATA[Intel® Threading Building Blocks]]></category>
		<category><![CDATA[locks]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>
		<category><![CDATA[Threading tools]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/04/08/n-bodies-a-parallel-tbb-solution-parallel-code-so-what-does-tbb_use_threading_tools-do/</guid>
		<description><![CDATA[Take a peek at an example of TBB_USE_THREADING_TOOLS, a compile-time switch used in TBB to hide code that might look suspect to Intel Parallel Inspector.]]></description>
			<content:encoded><![CDATA[<p>Our East coast Parallelism Road Show was a success, and having finally caught up with some of the work that piled up while I was gone, I’ll squeeze enough time at least to add a footnote to a previous rambling.</p>
<p>In <a href="http://software.intel.com/en-us/blogs/2010/02/23/n-bodies-a-parallel-tbb-solution-parallel-code-first-runs-fatal-flaw/">my last bumbling about</a>, I tried defining the TBB_USE_THREADING_TOOLS macro as a stab to find the problem with my Intel® Parallel Inspector analysis of nbodies.  Didn’t seem to do much at the time, so I thought it might be interesting to find out what it really does.  It was easy to find examples of it in the open source.  spin_mutex.h contains a scoped lock constructor:</p>
<pre> 
<code>        //! Construct and acquire lock on a mutex.</code>
<code>        scoped_lock( spin_mutex&amp; m ) {</code>
<code>#if TBB_USE_THREADING_TOOLS||TBB_USE_ASSERT</code>
<code>            my_mutex=NULL;</code>
<code>            internal_acquire(m);</code>
<code>#else</code>
<code>            my_unlock_value = __TBB_LockByte(m.flag);</code>
<code>            my_mutex=&amp;m;</code>
<code>#endif /* TBB_USE_THREADING_TOOLS||TBB_USE_ASSERT*/</code>
<code>        }</code></pre>
<p>There is a bit of getting the cart before the horse to examine the details of a lock before even talking about races in my way-slower-than-expected narrative exploring the effort to parallelize some code, but it seems appropriate as a footnote.</p>
<p>So, what’s going on up there?  In the non-TBB_USE_THREADING_TOOLS case something called __TBB_LockByte is being called with a field of the spin mutex object (probably a byte?), which must be the lock part (a gate where only one thread gets by at a time).  Then the spin mutex object is stashed until later.  If multiple threads tried to do this __TBB_LockByte call at the same time, they might face some contention with each other, and some tool designed to detect those <em>data</em> <em>races</em> might flag this operation as suspect.</p>
<p>On the other hand, when TBB_USE_THREADING_TOOLS is asserted, it looks like the local mutex pointer is set to a safe value and the mutex itself is passed to some other function, <em>internal_acquire</em>(), effectively hiding any lock funny business from our correctness inspection tool.  So that’s what it does.  Maybe after I introduce scoped locks, I’ll come back here and peel another layer, and we can look at the alternate implementations of the lock.</p>
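<p>For a feel of what a lock byte with a scoped wrapper might look like, here is a rough sketch built on <em>std::atomic_flag</em>. To be clear, this is not TBB's implementation: I have not reproduced __TBB_LockByte or internal_acquire, and the class names <em>SpinMutex</em> and <em>ScopedLock</em> are made up for illustration:</p>

```cpp
#include <atomic>

// A toy spin lock: one atomic flag that only one thread at a time can set.
class SpinMutex {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;  // clear means "unlocked"
public:
    // Atomically set the flag; succeed only if it was previously clear.
    bool try_lock() { return !flag_.test_and_set(std::memory_order_acquire); }

    // Spin until the flag is won.
    void lock() { while (!try_lock()) { /* busy-wait */ } }

    void unlock() { flag_.clear(std::memory_order_release); }

    // Acquire in the constructor, release in the destructor.
    class ScopedLock {
        SpinMutex& mutex_;
    public:
        explicit ScopedLock(SpinMutex& m) : mutex_(m) { mutex_.lock(); }
        ~ScopedLock() { mutex_.unlock(); }
        ScopedLock(const ScopedLock&) = delete;
        ScopedLock& operator=(const ScopedLock&) = delete;
    };
};
```

<p>The atomic test-and-set is the "gate": many threads may hammer on the same flag at once, which is exactly the sort of deliberate contention a race detector could mistake for a bug unless it is told otherwise.</p>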
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/04/08/n-bodies-a-parallel-tbb-solution-parallel-code-so-what-does-tbb_use_threading_tools-do/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: parallel code: first run’s fatal flaw</title>
		<link>http://software.intel.com/en-us/blogs/2010/02/23/n-bodies-a-parallel-tbb-solution-parallel-code-first-runs-fatal-flaw/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/02/23/n-bodies-a-parallel-tbb-solution-parallel-code-first-runs-fatal-flaw/#comments</comments>
		<pubDate>Tue, 23 Feb 2010 09:37:53 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[c++ parallel programming]]></category>
		<category><![CDATA[Intel Parallel Inspector]]></category>
		<category><![CDATA[n-bodies]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/02/23/n-bodies-a-parallel-tbb-solution-parallel-code-first-runs-fatal-flaw/</guid>
		<description><![CDATA[Robert tries Intel Parallel Composer update 5 on NBodies and shows (eventually) the data race in the dead simple parallel version of recomputing accelerations.]]></description>
			<content:encoded><![CDATA[<p>Last time when I resumed the exploration of my simple n-body gravitational simulator, I produced some performance numbers and revealed that there is a flaw in the first parallel version of the algorithm.  But then <a href="http://software.intel.com/en-us/intel-parallel-composer/">Intel® Parallel Composer Update 5</a> was released last week, so I updated my tools.  That means I need a new benchmark run to see how the baseline has been affected.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-01.png"><img class="alignnone size-full wp-image-14364" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-01.png" alt="" width="592" height="263" /></a></p>
<p>The rebuilt bodies program is just a touch slower (the 4K body serial average before was 283 sec vs. 278 sec for this parallel version).  In this case the serial number got hit harder so the parallel scaling number actually goes up slightly (from 1.02x to 1.03x) even though the numbers got worse!  Don’t trust parallel scaling numbers—understand the underlying data.</p>
<p>On to parallel correctness.  I know there’s a flaw in the parallel algorithm I’ve proposed, but how to demonstrate it?  If the n-bodies program actually had a means to display a projection of the bodies in motion I might be able to run that long enough to notice glitches in the body motion.  Maybe.  Or maybe not.  I’d like something a little more reliable and mechanical than exhaustive inspection, especially when there’s lots of data.  So let’s look at what we can learn from <a href="http://software.intel.com/en-us/intel-parallel-inspector/">Intel Parallel Inspector</a>.</p>
<p>First thing we need to do is change the command line to select a test for doing data collection.  I’ve been running these ramps using “select serial” and “select par” as the command lines.  These code paths have timing functions and run varying sizes of the problem to collect those times, and generally do way more work than I need to catch a race.  There is another option, the “single <em>n</em>” command that takes as an argument <em>n</em>, the number of bodies to use in the simulation.  I’ll change the debug command line to “single 128 par” to run the test with 128 bodies.  So here goes.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-02.png"><img class="alignnone size-full wp-image-14365" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-02.png" alt="" width="638" height="259" /></a></p>
<p>I’ll pick “ti3”, the third-level thread inspection, in order to find out where my deadlocks or data races exist.  You can see from the estimates shown at the left that this might take a little longer than just running the test instance by itself.   So I hit the “run analysis” button at the bottom of the dialog and off it goes:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-03.png"><img class="alignnone size-full wp-image-14366" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-03.png" alt="" width="658" height="458" /></a></p>
<p>Assuming all goes well and data actually get collected from the run, I next see something like this:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-04.png"><img class="alignnone size-full wp-image-14367" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-04.png" alt="" width="651" height="207" /></a></p>
<p>And hitting the “Interpret Result” button takes me to this:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-05.png"><img class="alignnone size-full wp-image-14368" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-05.png" alt="" width="664" height="443" /></a></p>
<p>At the bottom are some of the observations collected during my program run.  The observations were interpreted when I hit the last button, generating the problem sets visible in the upper pane.  Looks like I have some <em>data races</em> (problematic places in memory where multiple threads may be writing and reading data in an undetermined and potentially harmful order).  I should be able to double-click on a problem set and find an error.  I’ll try P1:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-06.png"><img class="alignnone size-full wp-image-14369" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-06.png" alt="" width="670" height="145" /></a></p>
<p>Oops.  That’s not very helpful.  Oh, but the executable module is irml, not NBodies. I get the same thing with the  next two problem sets.  Moving on to P4, which does not mention irml (which <span style="text-decoration: underline">is</span> a part of TBB—Intel resource management layer?), I try again:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-07.png"><img class="alignnone size-full wp-image-14370" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-07.png" alt="" width="650" height="452" /></a></p>
<p>Hmmmmmmmm….  This doesn’t look like NBodies code either.  This is somewhere in the TBB parallel_for header file.  Parallel Inspector flags this as a Write-after-Write race but it doesn’t seem to give me much that might help me understand the problem.  Maybe this is a <em>false positive</em>, a case where it looks like there might be a race condition but there really isn’t.  The code that unwittingly commits a data race might not look that different from safe and legal mutex code.  Sometimes only the way the code is used distinguishes one case from the other.</p>
<p>To tell the difference, Intel Parallel Studio has a library of function calls that can be used to declare safe operations that might otherwise be considered suspect.  To enable these, TBB compiles conditionally on the TBB_USE_THREADING_TOOLS macro definition and, when it is enabled, substitutes alternate code to hide the questionable code.  But this can cost a little in performance, so it needs to be turned on when you need it, which can be done inside Intel Parallel Studio:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-08.png"><img class="alignnone size-full wp-image-14371" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-08.png" alt="" width="457" height="491" /></a></p>
<p>Under Parallel Composer Select Build Components is the following dialog:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-09.png"><img class="alignnone size-full wp-image-14372" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-09.png" alt="" width="623" height="356" /></a></p>
<p>Oops. Not set.  That’s easy to fix.  Let me rebuild and recollect the ti3 data.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-10.png"><img class="alignnone size-full wp-image-14373" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-10.png" alt="" width="667" height="489" /></a></p>
<p>Well, that doesn’t look much better.  Now irml is a contributing module for all the data race problem sets.  Sure enough, drilling down to source on any of these problem sets is as unsatisfying as were any in the last collection.  So what am I doing wrong? <a href="http://software.intel.com/en-us/blogs/2010/04/08/n-bodies-a-parallel-tbb-solution-parallel-code-so-what-does-tbb_use_threading_tools-do/">And did that TBB_USE_THREADING_TOOLS do anything</a>?</p>
<p>There is one other thing that I can try.  I’ve been using the Release configuration here, following on from the performance runs that started this post.  This has been a problem before when using analysis tools because of the aggressive function inlining that Parallel Composer normally applies.  I have an alternate configuration, Release-with-functions, which has the same settings as Release save the function inlining, which is turned down to /Ob1.  Switching to that configuration, rebuilding the program and collecting ti3 data one more time:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-11.png"><img class="alignnone size-full wp-image-14374" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-11.png" alt="" width="667" height="531" /></a></p>
<p>That doesn’t look much different than the last one.  However, when I double-click on P2 this time, I get this:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-12.png"><img class="alignnone size-full wp-image-14375" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100222-12.png" alt="" width="679" height="659" /></a></p>
<p>Well this looks better.  In fact, the highlighted lines are the known data-race.  My brute parallel code divides the index <em>i</em> among a collection of threads, which may simultaneously try to modify the same body <em>j</em>.  This display shows the two sides engaging in the race, which in this circumstance happens to be from different invocations (threads) of the same code.</p>
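<p>To see why dividing <em>i</em> among threads still leaves bodies contested, here is a small bookkeeping sketch. It assumes the interaction loop visits each pair once with <em>j</em> running from <em>i</em>+1 to <em>n</em> and writes to both bodies of the pair; that loop shape and the helper names <em>bodies_written</em> and <em>contested_bodies</em> are my assumptions for illustration:</p>

```cpp
#include <algorithm>
#include <iterator>
#include <set>

// Which bodies does a thread write if it owns the i-range [i_begin, i_end)?
// Assumes every (i, j) pair with j > i updates both body[i] and body[j].
std::set<int> bodies_written(int i_begin, int i_end, int n) {
    std::set<int> written;
    for (int i = i_begin; i < i_end; ++i) {
        for (int j = i + 1; j < n; ++j) {
            written.insert(i);
            written.insert(j);
        }
    }
    return written;
}

// Bodies written by both halves when the i-range [0, n) is split at `split`.
std::set<int> contested_bodies(int split, int n) {
    std::set<int> first = bodies_written(0, split, n);
    std::set<int> second = bodies_written(split, n, n);
    std::set<int> both;
    std::set_intersection(first.begin(), first.end(),
                          second.begin(), second.end(),
                          std::inserter(both, both.begin()));
    return both;
}
```

<p>With four bodies split down the middle, the two halves own disjoint values of <em>i</em> yet both write to bodies 2 and 3, so partitioning on <em>i</em> alone cannot keep the threads out of each other's way.</p>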
<p>Moreover, when I drilled down to all the other problem sets in this collection, none of them contained the “my_body( my_range );” function call I landed upon before.  Duh-oh!  Of course!  Now I recognize this function call as where parallel_for actually executes the kernel containing the racy code.  It looks like the aggressive inlining normally in play in the Release configuration left the relevant code stripped of symbols.  There had not been enough detail available for Parallel Inspector to navigate closer to the racy lines until I relaxed function inlining.  Backing off inlining may also affect performance, but hopefully not so severely that we’re “Heisenberg-ed” into observations that are substantially wrong.</p>
<p>If you haven’t gotten enough of this, we’re taking the Parallelism road show on the road again in a few weeks (middle of March, snow permitting) up the East coast with several stops from New Jersey to Boston.  Find out more details at this <a href="http://www.programmers.com/PPI_US/PartnerCenter/partners.aspx?name=Parallelism_Techday">Programmer’s Paradise link</a>.</p>
<p>Next time: parallel code: <a href="http://software.intel.com/en-us/blogs/2010/05/05/n-bodies-a-parallel-tbb-solution-parallel-code-finding-a-fix-for-the-leaky-adds/">finding a fix for the leaky adds</a></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/02/23/n-bodies-a-parallel-tbb-solution-parallel-code-first-runs-fatal-flaw/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: parallel code: first runs</title>
		<link>http://software.intel.com/en-us/blogs/2010/02/08/n-bodies-a-parallel-tbb-solution-parallel-code-first-runs/</link>
		<comments>http://software.intel.com/en-us/blogs/2010/02/08/n-bodies-a-parallel-tbb-solution-parallel-code-first-runs/#comments</comments>
		<pubDate>Mon, 08 Feb 2010 23:23:48 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[nbodies]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[serial optimization]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2010/02/08/n-bodies-a-parallel-tbb-solution-parallel-code-first-runs/</guid>
		<description><![CDATA[After getting lost in the twisty passages of compiler inter-procedural optimization, Robert returns to the simple path and shows a practical example of why it's important to optimize your serial code before parallelizing it.]]></description>
			<content:encoded><![CDATA[<p>Shortly after Thanksgiving I started experimenting with the <a href="http://software.intel.com/en-us/blogs/2009/11/13/n-bodies-a-parallel-tbb-solution-parallel-code-a-first-attempt/">ideas for some parallel code</a> to replace <a href="http://software.intel.com/en-us/blogs/2009/10/14/n-bodies-a-parallel-tbb-solution-serial-body-drill-down/">the serial code</a> I’d previously optimized.  However, exceeding my goal of stepping into every hole I could find along the way, I hit a doozy: a case of modified source in a function not executed affecting the execution performance of another function!  After passing the code around among some compiler guys and taking the holiday hit, as I get back to this project it appears to be a case of aggressive optimizers in the compiler. I’m still working on some experiments to understand the interactions of inter-procedural optimization and function inlining, but as those efforts continue to percolate, I’m overdue to take the next step—run this parallel version.</p>
<p>But first, because I’ve made some serial code optimizations and now I’m using Intel® Parallel Composer update 4, I want to take more samples of the serial kernel with the improved <em>addAcc()</em> function.  Using the command, “select serial” I collected several runs of data and took their averages (first column is the number of bodies):<br />
<a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100207-01.png"><img class="alignnone size-full wp-image-13821" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100207-01.png" alt="" width="332" height="242" /></a></p>
<p>Then I did the same thing with the command, “select par” to get the basic parallel kernels for interactions and dynamics discussed previously.  To compare against the serial, first I color-coded the averages (I’m a sucker for Excel conditional formatting): parallel runs that are faster get the green—if they’re more than 10% slower, they see red.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100207-02.png"><img class="alignnone size-full wp-image-13818" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100207-02.png" alt="" width="592" height="242" /></a></p>
<p>OOOooooo…, that’s way more red than I expected.  Plotting a log-log graph of these data:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100207-03.png"><img class="alignnone size-full wp-image-13819" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100207-03.png" alt="" width="659" height="311" /></a></p>
<p>That’s not very impressive parallelism.  It takes around 2048 bodies interacting for my simple parallel kernel to beat the serial code.  It wasn’t like that six months ago when I first compared these kernels:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100207-04.png"><img class="alignnone size-full wp-image-13820" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100207-04.png" alt="" width="698" height="337" /></a></p>
<p>This is an image of the performance of these same kernels from a presentation I put together last summer.  Here the simple parallel kernel looks impressive, sweeping under the serial kernel albeit stuck on the same n-squared growth curve.  What happened?  Serial optimization.  I pulled the data for those serial runs and added it to my previous graph of the new results:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100207-05.png"><img class="alignnone size-full wp-image-13821" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2010/02/B100207-05.png" alt="" width="660" height="312" /></a></p>
<p>As you can see, the old serial code was quite a bit slower and the crossover point for the parallel runs was down around 64 bodies (the new parallel results are a little better than the old: the old 4096 bodies number was around 288 seconds, versus 277 here).  So, even as I was preaching about doing serial code optimization first, the serial code I was using for the previous graph lulled me into a false sense of accomplishment.  No wonder I was disappointed!</p>
<p>But it’s even worse, though this twist I was expecting.  This parallel kernel that is struggling to beat its serial cousin has a fatal flaw, which I’ll explore next time.</p>
<p>Next time: <a href="http://software.intel.com/en-us/blogs/2010/02/23/n-bodies-a-parallel-tbb-solution-parallel-code-first-runs-fatal-flaw/">parallel code: first run’s fatal flaw</a></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2010/02/08/n-bodies-a-parallel-tbb-solution-parallel-code-first-runs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: parallel code, a first attempt</title>
		<link>http://software.intel.com/en-us/blogs/2009/11/13/n-bodies-a-parallel-tbb-solution-parallel-code-a-first-attempt/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/11/13/n-bodies-a-parallel-tbb-solution-parallel-code-a-first-attempt/#comments</comments>
		<pubDate>Sat, 14 Nov 2009 01:44:24 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[lambda functions]]></category>
		<category><![CDATA[n-bodies]]></category>
		<category><![CDATA[OpenMP]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[Supercomputing]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>
		<category><![CDATA[vectorization]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/11/13/n-bodies-a-parallel-tbb-solution-parallel-code-a-first-attempt/</guid>
		<description><![CDATA[On the way to composing a first thread-parallel version of n-body code, Robert points out parallelization has already been occurring, using the Intel compiler and its vectorization of simple loops.]]></description>
			<content:encoded><![CDATA[<p>It’s been a busy month preparing for <a href="http://sc09.supercomputing.org/index.php">SuperComputing ‘09</a> and <a href="http://scyourway.supercomputing.org/exhibits/view/19">booth duty</a> (I’ll be hanging out in the Intel booth on Tuesday and Thursday and giving a talk there on Wednesday), and refining materials for a Parallelism Road Show we’re planning for next February and March (more details later). (Not to mention chorus rehearsals for this year’s <a href="http://portlandrevels.org/revels.php?page=this-year%60s-show">Christmas Revels</a>—oops, I did mention it. ;-) But finally, after all <a href="http://software.intel.com/en-us/blogs/2009/10/14/n-bodies-a-parallel-tbb-solution-serial-body-drill-down/">this serial optimization</a> I’ve been working through on the n-bodies code, it’s time to go parallel. Or rather, first take a short side-step to discover that code parallelization has already begun—through <em>vectorization</em>. I can pull up a compiler report, normally suppressed, by adding the manual switch <code>/Qvec-report:1</code> to the compilation options:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/11/b091102-01.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/11/b091102-01.png" alt="" width="744" height="378" /></a></p>
<p>With this simple change, I notice something new in the compilation logs:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/11/b091102-02.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/11/b091102-02.png" alt="" width="659" height="138" /></a></p>
<p>If I double-click on one of these lines in the Output panel, the system navigates to the corresponding source code lines:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/11/b091102-03.png"><img src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/11/b091102-03.png" alt="" width="607" height="411" /></a></p>
<p>Note the little marker in the left margin that indicates the lines referred to by that vectorization report line. This is the ballistic step in the serial code implementation, suggesting that the little loop for setting the vector components of the bodies has been converted to a linear sequence of vector instructions. Some other time I’ll dig down into the assembly code to demonstrate that this code really has been <em>vectorized</em> (i.e., realized by emitting SIMD code to execute it), but for now let’s move forward and try to make this multi-thread parallel in addition to vector-parallel.</p>
<p>How do I “parallelize” that lower set of loops in the previous code sample? One simple way would be to add an <a href="http://openmp.org/wp/">OpenMP</a> construct:</p>
<pre>    #pragma omp parallel for
    for (i = 0; i &lt; n; ++i) {
        for (int axis = 0; axis &lt; 3; ++axis) {
            body[i].vel[axis] += body[i].acc[axis] * TIMESTEP;
            body[i].pos[axis] += body[i].vel[axis] * TIMESTEP;
            body[i].acc[axis] = 0.;
        }
    }</pre>
<p>OpenMP has been around for a number of years and operates as a language extension for C, C++ and Fortran. Compilers enabled to recognize the constructs (such as the Intel® C++ and Fortran Compilers) can use them as hints to direct the compiler generation of parallel code. Non-complying compilers see these constructs as an unrecognized pragma (or a funny comment in Fortran) and ignore them. In this case the OpenMP line applies to the line that follows, directing the compiler to create code that divides the outer loop into some collection of chunks, each of which can be dispatched to a separate HW thread. Each thread processes the chunks assigned to it. As each thread finishes its work, it waits for the others in its team to complete their work. All these HW threads will land in a <em>rendezvous</em> or <em>join</em> point until all have arrived, because there’s an implied wait at the end of the parallel <em>for</em>-loop so that code that follows will not be executed until the preceding code has been completed, just to avoid any potential side effects. In this particular case, we’re also at the end of the parallel section so only one HW thread would proceed beyond the end of the <em>for</em>-loop, the rest returning to a thread pool to await more work.</p>
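<p>As a rough picture of "dividing the outer loop into chunks," here is a sketch of one simple static split. How a real OpenMP runtime schedules iterations depends on the implementation and the schedule in effect, so treat this only as an illustration; <em>split_range</em> is a hypothetical helper, not an OpenMP call:</p>

```cpp
#include <utility>
#include <vector>

// Divide the iteration space [0, n) into `chunks` contiguous half-open ranges
// whose sizes differ by at most one. Each range could be handed to one thread.
std::vector<std::pair<int, int>> split_range(int n, int chunks) {
    std::vector<std::pair<int, int>> result;
    int base = n / chunks;    // minimum iterations per chunk
    int extra = n % chunks;   // the first `extra` chunks get one more
    int begin = 0;
    for (int c = 0; c < chunks; ++c) {
        int size = base + (c < extra ? 1 : 0);
        result.emplace_back(begin, begin + size);
        begin += size;
    }
    return result;
}
```

<p>Ten iterations over three threads would come out as [0,4), [4,7) and [7,10). Each thread runs its own range independently, and the implied wait at the end of the loop is what brings them back together.</p>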
<p>With the advent of lambda constructs, described in the <a href="http://en.wikipedia.org/wiki/C%2B%2B0x">C++0x standard</a> and implemented in the Intel C++ Compiler version 11, we can write nearly as compact a version of this parallel construct using Intel® Threading Building Blocks (before lambdas we’d need to use a full C++ function-object, which really breaks up the flow of the source code). As a lambda construct using TBB, the OpenMP code above would transform into something like this:</p>
<pre>    parallel_for( blocked_range&lt;int&gt; (0,n),
      [] (const blocked_range&lt;int&gt; &amp;r) {
        for (int i = r.begin(); i != r.end(); ++i)
            for (int axis = 0; axis &lt; 3; ++axis) {
                body[i].vel[axis] += body[i].acc[axis] * TIMESTEP;
                body[i].pos[axis] += body[i].vel[axis] * TIMESTEP;
                body[i].acc[axis] = 0.;
            }
      });</pre>
<p>Not quite as compact as the OpenMP version, so there must be some other reason to value this. Otherwise, why embrace the complexity? The key is flexibility. TBB offers a rich set of tools that can be used within the context of such a parallel function. The C++0x lambda function adds a compactness of expression that lets me use TBB with almost the same convenience as OpenMP. For example, that pair of square brackets leading off the lambda controls which variables defined in the <a href="http://en.wikipedia.org/wiki/Scope_(programming)">scope</a> of the call will be available within the function, and in what form (more on this later). The TBB parallel_for divides the work of this inline body, using the TBB blocked_range as a helper class, making work for as many HW threads as are available.</p>
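<p>To make the role of those square brackets a little more concrete, here is a minimal sketch in plain standard C++, with no TBB involved. The function names <em>scaleByValue</em> and <em>accumulateByReference</em> are made up for this illustration and are not part of the n-bodies program.</p>

```cpp
#include <cassert>
#include <vector>

// [=] copies the enclosing variables into the lambda when it is created,
// so later changes to the originals are not seen inside it.
inline double scaleByValue(double x) {
    double factor = 2.0;
    auto scale = [=](double v) { return v * factor; };  // factor is copied here
    factor = 10.0;                                       // invisible to scale
    return scale(x);
}

// [&] refers to the enclosing variables directly, so the lambda can update them.
inline double accumulateByReference(const std::vector<double> &values) {
    assert(!values.empty());
    double sum = 0.0;
    auto add = [&](double v) { sum += v; };  // sum is captured by reference
    for (double v : values) add(v);
    return sum;
}
```

<p>The by-reference form is fine in this serial sketch, but a variable that every thread writes, like <em>sum</em> here, would be a data race if captured that way inside a parallel body.</p>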
<p>Next time: <a href="http://software.intel.com/en-us/blogs/2010/02/08/n-bodies-a-parallel-tbb-solution-parallel-code-first-runs/">parallel code: first runs</a></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/11/13/n-bodies-a-parallel-tbb-solution-parallel-code-a-first-attempt/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: serial body forces one more time</title>
		<link>http://software.intel.com/en-us/blogs/2009/10/23/n-bodies-a-parallel-tbb-solution-serial-body-forces-one-more-time/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/10/23/n-bodies-a-parallel-tbb-solution-serial-body-forces-one-more-time/#comments</comments>
		<pubDate>Fri, 23 Oct 2009 19:58:13 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[c++ parallel programming]]></category>
		<category><![CDATA[Intel Parallel Composer]]></category>
		<category><![CDATA[Multicore Parallel Programming]]></category>
		<category><![CDATA[n-bodies]]></category>
		<category><![CDATA[performance analysis]]></category>
		<category><![CDATA[TBB]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/10/23/n-bodies-a-parallel-tbb-solution-serial-body-forces-one-more-time/</guid>
		<description><![CDATA[Forced to revisit the question of accumulating forces one more time, Robert tests addForce(i,j) and discovers that while accelerations are a little faster, it's not much and a much more complicated story than he realized.]]></description>
			<content:encoded><![CDATA[<p>My plan to go parallel this time was <a href="http://software.intel.com/en-us/blogs/2009/10/14/n-bodies-a-parallel-tbb-solution-serial-body-drill-down/">thwarted by concerns</a> that I may still have left some serial performance on the table. So I’ll take one more look (OK, well, no more than three). Leading the contenders was Jim Dempsey’s suggestion that accumulating forces instead of accelerations would save some divides. His numbers did not show a dramatic difference but did suggest summing forces to be ever so slightly faster than accumulating accelerations. My analysis of the equations suggests the ordering should be the other way around, so I took the plunge and wrote <em>addForce</em>(i,j). It’s a simple twist on the original <em>addAcc</em>. Instead of computing separate accelerations for each body, I compute one force:</p>
<div>   // Use the Force, Luke!<br />
    double force = GFORCE * ivdist * ivdist * body[j].mass * body[i].mass;</div>
<p>Then I ensured the vector component accumulations take advantage of the simplification:</p>
<div>    for (int axis = 0; axis &lt; 3; ++axis) {<br />
        double axialForce = force * ud[axis];<br />
        body[j].acc[axis] += axialForce;<br />
        body[i].acc[axis] -= axialForce;<br />
    }</div>
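<p>Pulling the two fragments above together, a complete <em>addForce</em> might look something like the sketch below. Several things here are assumptions made so the example stands alone: the layout of <em>Body</em>, the values of <em>GFORCE</em> and <em>MINDIST</em>, and the choice to pass the array as a parameter rather than use the global <em>body</em> array. The unit vector component is also computed inline instead of being stored in <em>ud</em>.</p>

```cpp
#include <cassert>
#include <cmath>

// Assumed layout; the real structure from the series is not shown in this post.
struct Body {
    double pos[3];
    double vel[3];
    double acc[3];   // doubles as the force accumulator in this variant
    double mass;
};

const double GFORCE  = 6.674e-11;  // assumed value
const double MINDIST = 1.0e-6;     // assumed floor on the squared distance

// Accumulate the equal and opposite force between bodies i and j.
inline void addForce(Body *body, int i, int j) {
    assert(i != j);
    double d[3];
    double distsq = 0.0;
    for (int axis = 0; axis < 3; ++axis) {
        d[axis] = body[i].pos[axis] - body[j].pos[axis];
        distsq += d[axis] * d[axis];
    }
    if (distsq < MINDIST) distsq = MINDIST;   // avoid the singularity
    double ivdist = 1.0 / std::sqrt(distsq);  // one divide per pair

    // Use the Force, Luke!
    double force = GFORCE * ivdist * ivdist * body[j].mass * body[i].mass;
    for (int axis = 0; axis < 3; ++axis) {
        double axialForce = force * (d[axis] * ivdist);  // force * ud[axis]
        body[j].acc[axis] += axialForce;
        body[i].acc[axis] -= axialForce;
    }
}
```

<p>Because the same <em>axialForce</em> is added to one body and subtracted from the other, the accumulated forces on any pair sum to zero, which makes a handy sanity check.</p>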
<p>To avoid changes to the body data structure that might affect the experiment, I redefined the <em>acc</em> field to mean <em>accumulator</em> instead of <em>acceleration</em> (cheap trick for a short hack ;-).</p>
<p>With <em>addForce</em> in hand, I needed to make some adjustments to the ballistic step to turn the forces back into accelerations:</p>
<div>    for (i = 0; i &lt; n; ++i) {<br />
        for (int axis = 0; axis &lt; 3; ++axis) {<br />
            body[i].vel[axis] += (body[i].acc[axis] / body[i].mass) * TIMESTEP;<br />
            body[i].pos[axis] += body[i].vel[axis] * TIMESTEP;<br />
            body[i].acc[axis] = 0.;<br />
        }<br />
    }</div>
<p>Oops, that adds three divides per body (one for each axis), which gives this result:<br />
<a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-01.png"><img class="alignnone size-full wp-image-11080" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-01.png" alt="" width="741" /></a></p>
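<p>One variation I did not measure here, offered only as a sketch: the three divides per body can be folded into one by computing <em>TIMESTEP</em> divided by the mass once per body and multiplying by it on each axis. The <em>Body</em> layout and the value of <em>TIMESTEP</em> below are assumptions, and <em>accScale</em> is a name invented for this example. The results can differ from the original in the last bits, since the arithmetic is reordered.</p>

```cpp
#include <cassert>

// Assumed layout and constant; the real definitions are not shown in this post.
struct Body {
    double pos[3];
    double vel[3];
    double acc[3];   // holds accumulated force in the addForce variant
    double mass;
};

const double TIMESTEP = 0.01;  // assumed value

// Ballistic step for accumulated forces: one divide per body
// rather than one per axis.
inline void ballisticStep(Body *body, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        double accScale = TIMESTEP / body[i].mass;  // the only divide
        for (int axis = 0; axis < 3; ++axis) {
            body[i].vel[axis] += body[i].acc[axis] * accScale;
            body[i].pos[axis] += body[i].vel[axis] * TIMESTEP;
            body[i].acc[axis] = 0.;
        }
    }
}
```

<p>Whether this helps in practice depends on the compiler and its floating-point settings, so it would need its own timing runs before drawing any conclusion.</p>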
<table border="0" width="100%">
<tbody>
<tr>
<td valign="top">In my experiments, the add forces version came in slightly slower than my best add acceleration version. It’s easier to see in the numbers: <br />
 As this table shows, the runs accumulating force take longer, as you would expect for a solution that requires more multiplies (to include the extra mass term in the force equation and then to remove it to get to acceleration). <span>But wait a minute</span>! There’s something else going on here. What’s with those serial <em>addAcc</em> numbers? I remember lower numbers when I took my first serial run. Maybe there’s more variability in the results than I recall? That’s easy to check. I switched back to the <em>bodies007</em> code and took several more runs.</td>
<td><img class="alignnone size-full wp-image-11083" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-02.png" alt="" width="250" height="250" /></td>
</tr>
</tbody>
</table>
<p><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-03.png" alt="" width="698" height="250" /></p>
<p>That all looks pretty consistent across the range of <em>n</em>-values. Yet when I tried the same thing with bodies008:</p>
<p><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-04.png" alt="" width="699" height="254" /></p>
<table border="0" width="100%">
<tbody>
<tr>
<td valign="top">Almost a second slower for 2K bodies, even though the supposed “code under test” didn’t change. Note: <span><span><span>I’m not even running the <em>addForce</em> code</span>!</span></span>—just the conditional test in the RAMP test mode (see below). There were not many changes in going from bodies007.cpp to bodies008.cpp so it was pretty easy to isolate the code change that caused most of the slowdown. I was able to get these numbers...</td>
<td><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-05.png" alt="" width="144" height="252" /></td>
</tr>
</tbody>
</table>
<p>…by the simple expedient of commenting out the following code:</p>
<div>                          // Do the single threaded run<br />
//                       if (method &amp; USEFORCE) {<br />
//                           startBodies(n);<br />
//                           stime = tick_count::now();<br />
//                           runSerialForceBodies(n);<br />
//                           etime = tick_count::now();<br />
//                           elapsed = (etime - stime).seconds();<br />
//                           cout &lt;&lt; "," &lt;&lt; setw(20) &lt;&lt; elapsed;<br />
//                       }</div>
<table border="0" width="100%">
<tbody>
<tr>
<td>This is one of several clauses in a <em>for</em>-loop that selects the values of <em>n</em> for the ramp; commenting out all the variant methods so that the only one remaining is the serial <em>addAcc</em> does even better, though the returns are diminishing:</td>
<td><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-06.png" alt="" width="144" height="232" /></td>
</tr>
</tbody>
</table>
<table border="0">
<tbody>
<tr>
<td>So, given the ramp loop is somehow having an effect on the numbers, let me reverse the scenario and comment out all but the <em>addForce</em> variant:</td>
<td><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-07.png" alt="" width="144" height="232" /></td>
</tr>
</tbody>
</table>
<p>Each of these last two sets is the average of five runs, and by my measure, adding accelerations still wins, though the answer is much murkier than I would hope. What are the gremlins that are plaguing these numbers? I have some hunches that involve optimization and inlining strategies but I can’t yet point my finger at specific problems. There are some tantalizing observations to be made, though.</p>
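<p>For anyone who wants to repeat the averaging, here is a minimal sketch of a five-run timing helper. It swaps in the standard library’s <em>std::chrono::steady_clock</em> for the TBB <em>tick_count</em> used in the test program, and <em>averageSeconds</em> is a name invented for this example. The setup step (something like <em>startBodies</em>) runs before each timed region so it does not count toward the result.</p>

```cpp
#include <cassert>
#include <chrono>

// Run setup() then kernel() `runs` times, timing only kernel(), and
// return the mean wall-clock seconds per run.
template <typename Setup, typename Kernel>
double averageSeconds(Setup setup, Kernel kernel, int runs) {
    assert(runs > 0);
    double total = 0.0;
    for (int r = 0; r < runs; ++r) {
        setup();  // untimed, e.g. reset the bodies to their start state
        auto stime = std::chrono::steady_clock::now();
        kernel();
        auto etime = std::chrono::steady_clock::now();
        total += std::chrono::duration<double>(etime - stime).count();
    }
    return total / runs;
}
```

<p>An average hides the spread, so recording the minimum and maximum alongside it would show how much of a difference between variants is just run-to-run noise.</p>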
<p>For example, I could try doing a hot spot analysis to see if that would provide clues about the unexpected overhead. However, that means relaxing function inline optimization (the /Ob1 trick). But what does that do to performance?</p>
<p><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-08.png" alt="" width="513" height="285" /></p>
<table border="0" width="100%">
<tbody>
<tr>
<td>Wow. Looks like function inlining is a big performance benefit for the serial acceleration accumulating code, in this case (one sample) having lost 14 seconds computing the interactions of 2K bodies.<br />
 Curiously, the serial code accumulating forces appears to take a much smaller hit from the loss of aggressive function inlining. Or, in glass-half-full parlance, it appears to take much less advantage of compiler optimizations. The same appears to be true if you continue relaxing optimizations, specifically looking at the performance of the Debug configuration with this same test (all these tests are using the optimizations available in Intel® Parallel Composer, so your mileage may vary depending on what compiler you use):</td>
<td><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091019-09.png" alt="" width="300" height="285" /></td>
</tr>
</tbody>
</table>
<p>There’s a lot more to be discovered in this rich mine of anomalies, and perhaps when I have some more time, I will delve into it more deeply. For now though, I’ll continue to use the <em>addAcc</em> variant in the experiments going forward. After all, after over half a dozen posts in this series, I haven’t even gone parallel yet!</p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/10/23/n-bodies-a-parallel-tbb-solution-serial-body-forces-one-more-time/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: serial body drill-down</title>
		<link>http://software.intel.com/en-us/blogs/2009/10/14/n-bodies-a-parallel-tbb-solution-serial-body-drill-down/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/10/14/n-bodies-a-parallel-tbb-solution-serial-body-drill-down/#comments</comments>
		<pubDate>Wed, 14 Oct 2009 22:15:35 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[c++ parallel programming]]></category>
		<category><![CDATA[Hot Spot Analysis]]></category>
		<category><![CDATA[Intel® Parallel Amplifier]]></category>
		<category><![CDATA[Intel® Parallel Studio]]></category>
		<category><![CDATA[Intel® Threading Building Blocks]]></category>
		<category><![CDATA[n-bodies]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[TBB]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/10/14/n-bodies-a-parallel-tbb-solution-serial-body-drill-down/</guid>
		<description><![CDATA[Having found the function that consumes the most time, this episode shows the process of drilling down into the hot source and optimizing it BEFORE going parallel.]]></description>
			<content:encoded><![CDATA[<p>Having discovered which function consumes most of the time in the serial algorithm <a href="http://software.intel.com/en-us/blogs/2009/10/05/n-bodies-a-parallel-tbb-solution-serial-body-hot-spots/">last time</a>, there’s still more to discover by narrowing the focus to a specific function of interest. Our function, shown last time and below, is <em>addAcc</em>.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091013-011.png"><img class="alignnone size-full wp-image-10707" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091013-011.png" alt="" width="691" /></a></p>
<p>Expanding the view to show the function in detail is often called <em>drilling down to source</em>. In Intel® Parallel Amplifier I can do this by just double-clicking on the function, <em>addAcc</em>.</p>
<p><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091013-02.png" alt="" width="596" height="494" /></p>
<p>Parallel Amplifier lands on the hottest line in the function and provides easy navigation buttons (just below the <em>Bottom-up</em> button) to explore the other hot spots in order of time taken (max, step-up, step-down, min, respectively). Since I landed on the hottest hot spot, the "navigate-to-a-hotter-spot" buttons are grayed out.</p>
<p>Looks like <em>addAcc</em> has a problem with divides. Division is one of the more expensive arithmetic operations and while I can’t eliminate all of them, I certainly can reduce the number of them.</p>
<p><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091013-03.png" alt="" width="561" height="336" /></p>
<p>Computing the inverse distance once and then multiplying by it seems to have had an effect: before the change, the times attributed to the nearby lines amounted to over 1.2 seconds, while after, the total is down to 0.78 seconds. I accumulate the values from the adjacent lines because the reported event counts are at best approximate: the tool has to deal both with optimized code that may have scattered the instructions implementing any particular line and with phenomena that affect the actual process of determining the location of the instruction pointer, such as <em>event skid</em> (to be addressed in some other post). In fact, a significant portion of the 0.78 and 1.2 seconds is probably coming from the square root function that immediately precedes these lines. So I’ll run another ramp of n-bodies and see if my numbers are any better.</p>
<p><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091013-04.png" alt="" width="741" height="416" /></p>
<p>Yes, they are. And now that I have <em>ivdist</em>, it’s worth considering whether I can use it more efficiently, turning the divide by <em>distsq</em> into something like this:</p>
<p><code>    double Gdivd = GFORCE * ivdist * ivdist;</code></p>
<p>Sure enough, that change, though not as beneficial as the last one, still has an observable benefit:</p>
<p><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091013-05.png" alt="" width="741" height="416" /></p>
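<p>The actual source lives in the screenshots, so what follows is only my reconstruction of how <em>addAcc</em> might read with both changes applied: one divide to get <em>ivdist</em>, then multiplies everywhere else. The <em>Body</em> layout, the constant values, and passing the array as a parameter are assumptions made to keep the sketch self-contained.</p>

```cpp
#include <cassert>
#include <cmath>

// Assumed layout and constants; the real definitions are not shown in this post.
struct Body {
    double pos[3];
    double vel[3];
    double acc[3];
    double mass;
};

const double GFORCE  = 6.674e-11;  // assumed value
const double MINDIST = 1.0e-6;     // assumed floor on the squared distance

// Apply the mutual gravitational accelerations of bodies i and j.
inline void addAcc(Body *body, int i, int j) {
    assert(i != j);
    double dx = body[i].pos[0] - body[j].pos[0];
    double dy = body[i].pos[1] - body[j].pos[1];
    double dz = body[i].pos[2] - body[j].pos[2];
    double distsq = dx*dx + dy*dy + dz*dz;
    if (distsq < MINDIST) distsq = MINDIST;
    double ivdist = 1.0 / std::sqrt(distsq);  // the one remaining divide

    // Unit vector from body j to body i, by multiplication.
    double ud[3] = { dx * ivdist, dy * ivdist, dz * ivdist };

    double Gdivd = GFORCE * ivdist * ivdist;  // replaces GFORCE/distsq
    double ai = Gdivd * body[j].mass;
    double aj = Gdivd * body[i].mass;
    for (int k = 0; k < 3; ++k) {
        body[j].acc[k] += aj * ud[k];
        body[i].acc[k] -= ai * ud[k];
    }
}
```

<p>Multiplying by a reciprocal is not bit-for-bit identical to dividing, so tiny differences in the trajectories compared with the original version are to be expected.</p>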
<p>Another hot spot run shows even less time being spent at the previously identified hot spots:</p>
<p><img class="alignnone" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091013-06.png" alt="" width="561" height="400" /></p>
<p>I’ll use this last version as my serial <em>baseline</em>, the benchmark against which I’ll measure my progress in parallelization. It might change someday as I continue to evaluate alternatives like <a href="http://software.intel.com/en-us/blogs/2009/10/23/n-bodies-a-parallel-tbb-solution-serial-body-forces-one-more-time/">the persistent question about forces</a>. This in fact is part of our recommended practice for migrating serial code into a parallel environment: I start by optimizing the serial version as much as I can, so that the benefits I gain through parallel implementation don’t come merely from multiple HW threads filling in the gaps left behind by an inefficient serial version.</p>
<p>Next time: <a href="http://software.intel.com/en-us/blogs/2009/11/13/n-bodies-a-parallel-tbb-solution-parallel-code-a-first-attempt/">parallel code, a first attempt</a></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/10/14/n-bodies-a-parallel-tbb-solution-serial-body-drill-down/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: serial body hot spots</title>
		<link>http://software.intel.com/en-us/blogs/2009/10/05/n-bodies-a-parallel-tbb-solution-serial-body-hot-spots/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/10/05/n-bodies-a-parallel-tbb-solution-serial-body-hot-spots/#comments</comments>
		<pubDate>Mon, 05 Oct 2009 23:13:29 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Amdahl's Law]]></category>
		<category><![CDATA[hot spots]]></category>
		<category><![CDATA[Intel Parallel Amplifier]]></category>
		<category><![CDATA[n-bodies]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/10/05/n-bodies-a-parallel-tbb-solution-serial-body-hot-spots/</guid>
		<description><![CDATA[Robert finds the hot function in the serial n-bodies code, but only after discovering what a good job of function inlining the Intel C++ Compiler does.]]></description>
			<content:encoded><![CDATA[<p>In <a href="http://software.intel.com/en-us/blogs/2009/09/29/n-bodies-a-parallel-tbb-solution-serial-bodies-test-run/">my last venture</a> I got the n-bodies program to compile and ran a test series with the serial algorithm, showing the n-squared nature of the basic problem. I mean to write a parallel version of this (heh, heh, heh) but first I need to know what is taking up the time. By the dictates of <a href="http://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl’s Law</a>, I want to apply the most processors at the place the program is spending most of the time, its <em>hot spots,</em> to do the most good. The most common way to find them is to interrupt the processor regularly and figure out where it is in the program, accumulating these locations to build a picture of where the HW thread (or threads) is/are spending time. This technique is one of the several used in Intel’s most recent performance analysis tool, called <a href="http://software.intel.com/en-us/intel-parallel-amplifier/">Intel® Parallel Amplifier</a>.</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-01.png"><img class="alignnone size-full wp-image-10435" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-01.png" alt="" width="560" /></a></p>
<p>It installs right in Visual Studio as shown above. In order to collect hot spots on the serial algorithm, I switch the debug command to <em>single 256 serial</em></p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-02.png"><img class="alignnone size-full wp-image-10438" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-02.png" alt="" width="428" height="168" /></a></p>
<p>I’ve also turned on symbols in my Release configuration (C/C++ &gt;&gt; General &gt;&gt; Debug Information Format set to <a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-03.png"><img class="alignnone size-full wp-image-10439" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-03.png" alt="" width="145" height="17" /></a> and Linker &gt;&gt; Debugging &gt;&gt; Generate Debug Info set to <a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-04.png"><img class="alignnone size-full wp-image-10440" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-04.png" alt="" width="82" height="18" /></a> on my latest build), then just click on the Profile button, and voilà!</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-05.png"><img class="alignnone size-full wp-image-10441" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-05.png" alt="" width="751" /></a></p>
<p>Huhhhhhh?! I see two seconds plus a quarter spent in main, but where are my functions? Do I get the same result if I try the Debug configuration?</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-062.png"><img class="alignnone size-full wp-image-10457" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-062.png" alt="" width="700" /></a></p>
<p>Oh, there are my functions, <em>runSerialBodies</em> and <em>addAcc</em>, but the run takes over 5 seconds. I don’t want to spend time making Debug code run faster, so I want to tune the optimized Release code. However, something about that Release configuration is causing the functions to disappear. Experimenting a little with the configuration settings reveals that the Intel compiler is automatically <em><a href="http://en.wikipedia.org/wiki/Function_inlining">inlining</a></em> the functions into <em>main</em>. Unfortunately, there is apparently no way to represent that inlining in the debug information, so the functions just disappear. By relaxing the optimization a little, I can restore the function hierarchy for analysis at the cost of some extra function call instructions:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-07.png"><img class="alignnone size-full wp-image-10447" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-07.png" alt="" width="584" /></a></p>
<p>Now my hot spot analysis on the Release configuration looks much better:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-08.png"><img class="alignnone size-full wp-image-10449" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/10/b091002-08.png" alt="" width="691" /></a></p>
<p>Most of the time is being spent in the <em>addAcc</em> function, which is being called by <em>runSerialBodies</em> as can be seen in the function call hierarchy graph. Looks like <em>addAcc</em> will be one of my candidates for parallelization.</p>
<p>Next time: <a href="http://software.intel.com/en-us/blogs/2009/10/14/n-bodies-a-parallel-tbb-solution-serial-body-drill-down/">serial body drill-down</a></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/10/05/n-bodies-a-parallel-tbb-solution-serial-body-hot-spots/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: serial bodies test run</title>
		<link>http://software.intel.com/en-us/blogs/2009/09/29/n-bodies-a-parallel-tbb-solution-serial-bodies-test-run/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/09/29/n-bodies-a-parallel-tbb-solution-serial-bodies-test-run/#comments</comments>
		<pubDate>Tue, 29 Sep 2009 23:53:38 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[Intel C++ compiler]]></category>
		<category><![CDATA[Microsoft Visual Studio]]></category>
		<category><![CDATA[n-bodies]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/09/29/n-bodies-a-parallel-tbb-solution-serial-bodies-test-run/</guid>
		<description><![CDATA[Wherein Robert attempts to compile his program and remembers eventually to switch to the Intel C++ compiler to accommodate C++0x features used by the program.]]></description>
			<content:encoded><![CDATA[<p>Let’s take the body interaction code I laid out <a href="http://software.intel.com/en-us/blogs/2009/09/25/n-bodies-a-parallel-tbb-solution-realizing-addaccij/">last time</a>, combine it with the other parts laid out previously and run it. Dropping the fleshed out program into a Microsoft Visual Studio* project, I quickly rediscover something:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-011.png"><img class="alignnone size-full wp-image-10244" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-011.png" alt="" width="100%" /></a></p>
<p>Oops, that’s right. bodies007.cpp relies on language extensions available in the Intel® Compiler version 11, some early arrivals from the C++0x standard. Fortunately, it’s pretty easy to switch compilers.</p>
<table border="0" width="100%">
<tbody>
<tr>
<td><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-02.png"><img class="alignnone size-full wp-image-10249" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-02.png" alt="" width="253" height="250" /></a></td>
<td valign="top">With the Intel C++ Compiler installed in Visual Studio from either of the regular distribution packages, the Compiler Professional Edition or Intel Parallel Composer, switching compilers is just a click away.</td>
</tr>
</tbody>
</table>
<table border="0" width="100%">
<tbody>
<tr>
<td><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-03.png"><img class="alignnone size-full wp-image-10251" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-03.png" alt="" width="500" height="196" /></a></td>
<td valign="top">OK, two clicks. I did try to build already, so I’ll let the system clean up the project files.</td>
</tr>
</tbody>
</table>
<p> </p>
<table border="0" width="100%">
<tbody>
<tr>
<td><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-04.png"><img class="alignnone size-full wp-image-10253" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-04.png" alt="" width="203" height="251" /></a></td>
<td valign="top">Voilà!!! The project now uses the Intel C++ compiler.</td>
</tr>
</tbody>
</table>
<p>One more configuration setting to enable C++0x support:</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-05.png"><img class="alignnone size-full wp-image-10257" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-05.png" alt="" width="100%" /></a></p>
<p>And now it compiles!</p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-06.png"><img class="alignnone size-full wp-image-10259" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-06.png" alt="" width="100%" /></a></p>
<p>Dropping the serial code into the test program I’ve prepared gives access to a simple command processor that allows me to select that kernel for one of several tests. A simple ramp, testing the algorithm with varying values of <em>n</em>, can be launched by putting this in the command line: <em>select serial</em></p>
<p><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-071.png"><img class="alignnone size-full wp-image-10264" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-071.png" alt="" /></a></p>
<p>Looks like the time it takes to complete the simulation (I’m running 1000 time steps for each body-count) is going up more than four times for every doubling of the number of bodies; in fact, if I plot it on a logarithmic scale, I can see the roughly <em>n</em>-squared growth:<br />
<a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-08.png"><img class="alignnone size-full wp-image-10265" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090928-08.png" alt="" width="100%" /></a></p>
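<p>A little arithmetic backs up that factor of four. Assuming the serial loop visits each pair of bodies once, the work per time step is proportional to n(n-1)/2, and doubling n multiplies that by 4 + 2/(n-1): slightly more than four, settling toward four as n grows. The helper names <em>pairCount</em> and <em>doublingRatio</em> below are made up for this sketch.</p>

```cpp
#include <cassert>

// Number of unordered pairs among n bodies: n(n-1)/2.
inline long long pairCount(long long n) {
    assert(n >= 0);
    return n * (n - 1) / 2;
}

// Growth in pair count when the body count doubles from n to 2n.
inline double doublingRatio(long long n) {
    assert(n > 1);
    return static_cast<double>(pairCount(2 * n)) /
           static_cast<double>(pairCount(n));
}
```

<p>For 256 bodies that is 32640 pairs, and going to 512 bodies multiplies the count by about 4.008. Any growth in run time much beyond that would have to come from something other than the interaction count.</p>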
<p>Clearly this is an algorithm that has plenty of work to divide, if I can just figure out a way to do it.</p>
<p>Next time: <a href="http://origin-software.intel.com/en-us/blogs/2009/10/05/n-bodies-a-parallel-tbb-solution-serial-body-hot-spots/">serial body hot spots</a></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/09/29/n-bodies-a-parallel-tbb-solution-serial-bodies-test-run/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: realizing addAcc(i,j)</title>
		<link>http://software.intel.com/en-us/blogs/2009/09/25/n-bodies-a-parallel-tbb-solution-realizing-addaccij/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/09/25/n-bodies-a-parallel-tbb-solution-realizing-addaccij/#comments</comments>
		<pubDate>Fri, 25 Sep 2009 23:42:40 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
				<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[n-bodies]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/09/25/n-bodies-a-parallel-tbb-solution-realizing-addaccij/</guid>
		<description><![CDATA[Putting together the function to apply accelerations between a pair of gravitational bodies.]]></description>
			<content:encoded><![CDATA[<p>Having settled the question of whether I should accumulate forces or accelerations <a href="http://software.intel.com/en-us/blogs/2009/09/22/n-bodies-a-parallel-tbb-solution-computing-accelerations-or-forces/">last time</a>, now it’s time to build the accumulation function.</p>
<div>    <span>void</span><br />
    <strong>addAcc</strong>(<span>int </span>i, <span>int </span>j) {</div>
<p><em> </em></p>
<p><em>i</em> and <em>j</em> are indices selecting elements of the <em>body</em> array. First task is to compute the distance between them.</p>
<div>        <span><span>double </span></span>dx = body[<span>i</span>].pos[<span>0</span>]-body[<span>j</span>].pos[<span>0</span>];<br />
        <span><span>double </span></span>dy = body[<span>i</span>].pos[<span>1</span>]-body[<span>j</span>].pos[<span>1</span>];<br />
        <span><span>double </span></span>dz = body[<span>i</span>].pos[<span>2</span>]-body[<span>j</span>].pos[<span>2</span>];<br />
        <span><span>double </span></span>distsq = dx*dx + dy*dy + dz*dz;</div>
<p> </p>
<p>Pythagorean Theorem in three dimensions gets me the square of the hypotenuse, but before doing the square root, I’ll avoid the singularity:</p>
<div>        <span><strong>if</strong></span> (distsq &lt; MINDIST) distsq = MINDIST;<br />
        <span>double </span>dist = sqrt(distsq);</div>
<p> </p>
<p>That is, if the point masses get too close together, act like they’re not. But wait! Why do I even need the square root, if I’m working with gravitation, an inverse-<span>squared</span> law? Well, because acceleration is a vector so I need the next step.</p>
<div>        <span>double </span>ud[<span>3</span>];<br />
        ud[<span>0</span>] = dx/dist;<br />
        ud[<span>1</span>] = dy/dist;<br />
        ud[<span>2</span>] = dz/dist;</div>
<p> </p>
<p>Array <em>ud</em> represents the <em>unit vector</em> (a direction vector of length 1) pointing from body <em>j</em> to body <em>i</em>. I need just one more thing: the magnitude of each body’s acceleration.</p>
<div>        <span>double </span>Gdivd = GFORCE/distsq;<br />
        <span>double </span>ai = Gdivd*body[<span>j</span>].mass;<br />
        <span>double </span>aj = Gdivd*body[<span>i</span>].mass;</div>
<p> </p>
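<p>All that’s left is to compute the acceleration vector components and apply them to the bodies. Body <em>j</em> is pulled along <em>ud</em>, toward body <em>i</em>, and body <em>i</em> is pulled the opposite way.</p>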
<div>        <span><strong>for </strong></span>(<span>int </span>k = 0; k &lt; 3; ++k) {<br />
            body[<span>j</span>].acc[<span>k</span>] += aj*ud[<span>k</span>];<br />
            body[<span>i</span>].acc[<span>k</span>] -= ai*ud[<span>k</span>];<br />
        }<br />
    }</div>
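<p>Assembled into one compilable unit, the fragments above might look like the sketch below. The <em>Body</em> layout, the array size, and the values of <em>GFORCE</em> and <em>MINDIST</em> are assumptions made for illustration; the post so far shows only how they are used, not how they are declared.</p>

```cpp
#include <cassert>
#include <cmath>

// Assumed layout: the fragments above only use pos, acc, and mass.
struct Body {
    double pos[3];
    double vel[3];
    double acc[3];
    double mass;
};

// Hypothetical values; the post does not state them.
const double GFORCE  = 6.674e-11;  // gravitational constant (SI units)
const double MINDIST = 1.0e-6;     // floor applied to the SQUARED distance
const int    NBODIES = 2;          // illustrative array size

Body body[NBODIES];

// Add the mutual gravitational accelerations of bodies i and j.
void addAcc(int i, int j) {
    assert(i != j);  // a body exerts no acceleration on itself

    // Separation vector pointing from j to i, and its squared length.
    double dx = body[i].pos[0] - body[j].pos[0];
    double dy = body[i].pos[1] - body[j].pos[1];
    double dz = body[i].pos[2] - body[j].pos[2];
    double distsq = dx*dx + dy*dy + dz*dz;

    // Avoid the singularity when the point masses nearly coincide.
    if (distsq < MINDIST) distsq = MINDIST;
    double dist = std::sqrt(distsq);

    // Unit vector from j to i.
    double ud[3] = { dx/dist, dy/dist, dz/dist };

    // Common factor G/d^2, then one multiply per body.
    double Gdivd = GFORCE/distsq;
    double ai = Gdivd*body[j].mass;
    double aj = Gdivd*body[i].mass;

    for (int k = 0; k < 3; ++k) {
        body[j].acc[k] += aj*ud[k];  // j accelerates toward i
        body[i].acc[k] -= ai*ud[k];  // i accelerates toward j
    }
}
```

<p>A quick sanity check on a sketch like this is that, for any pair, the mass-weighted accelerations cancel: <em>body[i].mass*body[i].acc[k] + body[j].mass*body[j].acc[k]</em> should be zero to within rounding.</p>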
<p>Next time: <a href="http://software.intel.com/en-us/blogs/2009/09/29/n-bodies-a-parallel-tbb-solution-serial-bodies-test-run/">serial bodies test run</a></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/09/25/n-bodies-a-parallel-tbb-solution-realizing-addaccij/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>n-bodies: a parallel TBB solution: computing accelerations? or forces?</title>
		<link>http://software.intel.com/en-us/blogs/2009/09/22/n-bodies-a-parallel-tbb-solution-computing-accelerations-or-forces/</link>
		<comments>http://software.intel.com/en-us/blogs/2009/09/22/n-bodies-a-parallel-tbb-solution-computing-accelerations-or-forces/#comments</comments>
		<pubDate>Tue, 22 Sep 2009 09:07:21 +0000</pubDate>
		<dc:creator>Robert Reed (Intel)</dc:creator>
				<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Parallel Programming]]></category>
		<category><![CDATA[Software Tools]]></category>
		<category><![CDATA[n-bodies]]></category>
		<category><![CDATA[parallel programming]]></category>
		<category><![CDATA[TBB]]></category>
		<category><![CDATA[Threading Building Blocks]]></category>

		<guid isPermaLink="false">http://software.intel.com/en-us/blogs/2009/09/22/n-bodies-a-parallel-tbb-solution-computing-accelerations-or-forces/</guid>
		<description><![CDATA[Robert finally deals with the eternal question: forces or accelerations? Which is more efficient to accumulate?]]></description>
			<content:encoded><![CDATA[<p>When considering the parallelization of some piece of code, my first concern is to be sure that the code I start with is optimized for serial execution. It does me little good to write a parallel version that just sops up the latency holes that inefficient code makes available. It may seem to scale well, but some future compiler may come along that does a better job of optimizing that inefficient kernel and suddenly the performance scaling might disappear.</p>
<table border="0" width="100%">
<tbody>
<tr>
<td valign="top">Even better is if I can squeeze those inefficiencies out of the algorithm itself. Previously, before <a href="http://software.intel.com/en-us/blogs/2009/09/14/n-bodies-a-parallel-tbb-solution-computing-accelerations/">defining the interaction loops</a>, I set up a data structure to represent each body, choosing mass, location, and velocity to represent the state, plus a place to accumulate accelerations. However, <a href="http://software.intel.com/en-us/blogs/2009/09/05/n-bodies-a-parallel-tbb-solution-body-data/">Jim Dempsey suggested</a> it might be more efficient to accumulate forces rather than accelerations. Forces are the canonical result of the gravitational equation, but divides are among the most expensive floating-point operations. Better to accumulate the influence of other bodies as forces first, then do one divide per body to compute the acceleration, right?</td>
<td><a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090921-01.png"><img class="alignnone size-full wp-image-9942" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090921-01.png" alt="" width="162" height="141" /></a></td>
</tr>
</tbody>
</table>
<p>But where did that <em>F</em> come from? We are actually dividing out a mass that was multiplied in to compute the force in the first place. What if we never multiplied by it at all?<br />
<a href="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090921-02.png"><img class="alignnone size-full wp-image-9943" src="http://software.intel.com/en-us/blogs/wordpress/wp-content/uploads/2009/09/b090921-02.png" alt="" width="499" height="113" /></a></p>
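<p>Written out in symbols, the cancellation follows from Newton’s law of gravitation for two bodies. The names m_i, m_j, and r are labels chosen for this sketch (masses of the two bodies and their separation) and may not match the notation in the figure:</p>

```latex
\begin{align*}
F   &= \frac{G\, m_i\, m_j}{r^2} \\
a_i &= \frac{F}{m_i} = \frac{G\, m_j}{r^2} \\
a_j &= \frac{F}{m_j} = \frac{G\, m_i}{r^2}
\end{align*}
```

<p>Each body’s own mass drops out of its acceleration, so only the other body’s mass is needed.</p>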
<p>The G-over-R-squared term is a common factor that can be computed once per pair. Two multiplies, which would be required anyway, then give both accelerations directly, without any extra divides! So we'll stick with the plan to add accelerations.</p>
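<p>One rough way to see that the two bookkeeping schemes agree is the C++ sketch below. It works with scalar magnitudes only, as if every neighbor pulled in the same direction, and <em>Neighbor</em>, <em>accelViaForce</em>, and <em>accelDirect</em> are hypothetical names for illustration rather than part of the n-bodies code.</p>

```cpp
#include <cassert>
#include <vector>

// Hypothetical record: one neighbor's mass and squared distance.
struct Neighbor {
    double mass;
    double distsq;
};

// Force-first: accumulate forces on a body of mass m, then do
// one extra divide at the end to turn force into acceleration.
double accelViaForce(double G, double m, const std::vector<Neighbor>& others) {
    assert(m > 0.0);
    double F = 0.0;
    for (const Neighbor& n : others) {
        F += G * m * n.mass / n.distsq;  // m is multiplied in here...
    }
    return F / m;                        // ...and divided back out here
}

// Acceleration-first: the body's own mass never appears at all.
double accelDirect(double G, const std::vector<Neighbor>& others) {
    double a = 0.0;
    for (const Neighbor& n : others) {
        double Gdivd = G / n.distsq;     // common factor for the pair
        a += Gdivd * n.mass;
    }
    return a;
}
```

<p>Both functions should return the same value up to rounding; the second simply skips the multiply by <em>m</em> and the final divide.</p>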
<p>Next time: <a href="http://software.intel.com/en-us/blogs/2009/09/25/n-bodies-a-parallel-tbb-solution-realizing-addaccij/">realizing <em>addAcc</em>(i,j)</a></p>
]]></content:encoded>
			<wfw:commentRss>http://software.intel.com/en-us/blogs/2009/09/22/n-bodies-a-parallel-tbb-solution-computing-accelerations-or-forces/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

