n-bodies: a parallel TBB solution: parallel code: a fresh look using recursive parallelism

When last I had a chance to play with this code, I experimented with using multiple locks to enable multiple simultaneous (and disjoint) interactions between pairs of bodies.  It helped but performance still didn’t cross the base line using only one thread.  Overhead in the loop could be reduced by using only one scoped lock instead of two, but it would require an array of locks indexed by i, and j.

    // apply acceleration components using unit vectors


        MyLockType::scoped_lock mylock(ijlocks[i][j]);

        for (int axis = 0; axis < 3; ++axis) {

            body[j].acc[axis] += aj*ud[axis];

            body[i].acc[axis] -= ai*ud[axis];



Rather than pursuing that right now, let me take a step back and try to look at the problem differently. I can represent all the interactions between pairs of bodies, i and j, in a simple graph:

The green triangle above covers the area where ij.  Skipping the points along the diagonal (we don’t need to consider interactions of bodies with themselves), all the other integral points within the triangle can represent unique interactions between pairs of bodies.  Dividing the outer loop between a set of threads is like splitting the triangle along horizontal lines as shown above.  Each sub-region can be handled by a separate thread.

One problem becomes immediately apparent by drawing this diagram: the sub-regions have different sizes.  That means the number of interaction pairs within each varies-the thread that gets the bottom sub-region has a lot more work to do than the one getting the topmost region.  The bottom thread will be working for a long time after the top thread finishes.  This is a great example of what is called a load balanceproblem.

A little more thought will net the realization that while each thread in the subdivision above has its own range of i values, they must share j values: any set of interactions of different i that have the same j are collisions that are either races or restricted by locks to the execution of a single thread.  What might help is if I could add some vertical partitions to this graph, and gain some separation between the threads like what happened with the i values.

It turns out there is a way to do this, using a combinatorial algorithm I learned from Matteo Frigo.  Considering the same interaction graph, this time find the midpoint of the diagonal and use it to draw a couple triangles like this:

The triangles A and B have a particular property of great interest: the ranges of iand j are completely disjoint.  A thread working in the range of A could pick any contained interaction pair and be guaranteed not to interfere with another thread doing the same thing within triangle B.  Two threads could simultaneously cover the interactions represented by the pair of triangles, each covering roughly the same number of interactions and thus load balanced.

That handles two threads.  What about more?  Well, following that great adage, “whatever is worth doing is worth overdoing,” I can repeat that partition on the new triangles:

Now in this triangle there are eight separate regions, each which could contain a separate thread operating without interfering with the adjacent ones.  However, more and more of our available interaction space is turning dark green, representing as yet unhandled pairs of interactions.  Fortunately these are easy to handle.  In fact, the triangle interactions are a subset of the rectangle interactions.  Consider the biggest rectangle:

I can split this rectangle into four sub-rectangles.  In this arrangement, I can turn two threads loose, each one targeting interactions in one of the two light green rectangles.  The disjoint property holds for those two regions.  When those two threads are both done, I can move one each to the two dark green regions because they also have the disjoint index range property.  And what I can do once I can also repeat here, subdividing these rectangles and generating as much balanced parallel work as I need to occupy all the available threads.

We’ll look at some sample code for doing this next time.

For more complete information about compiler optimizations, see our Optimization Notice.