n-bodies: a parallel TBB solution: parallel code: balanced recursive parallelism with parallel_invoke

Last time, after struggling with different lock configurations to reduce the synchronization overhead of managing the interactions of n-squared bodies, I changed perspective on the problem: by representing the interactions between all the bodies spatially, I (re-)discovered in that view a way to group the interactions so that independent threads could work together without having to lock the data.

This time I’ll turn the concept into some sample code, but first a reminder of that partitioning scheme.

Because the ranges of i and j are disjoint between triangles A and B, I can set two threads loose, one on each, and let them handle the interactions between the bodies represented by points in their regions.  I can’t touch any of the pairs represented in rectangle C while A and B are being worked on, but I can wait until A and B are done and then start C.

So it looks like the algorithm will have two steps, with two intertwined recursions.  Each triangle will be split into two smaller triangles and an adjacent rectangle; each rectangle will be subdivided into four smaller rectangles.  I could keep splitting until there’s only one point left, but that might prove to be overkill.

I’ll start with the triangle subdivision:


void body_interact(int i, int j)
{
    int d = j - i;
    if (d > 1) {
        int k = d/2 + i;
        parallel_invoke([&]() {body_interact(i,k);}, [&](){body_interact(k,j);});
        rect_interact(i, k, k, j);
    }
}

Here’s a function called body_interact(), which takes a pair of indices representing the range of the incoming triangle.  The function finds the midpoint of the defining interval and uses that point, if it is distinct from the endpoints, to define the next pair of triangles and the rectangle.  It schedules calls to itself for each of the triangles, and then calls rect_interact() to handle the adjacent rectangle.  parallel_invoke() inherently waits for both (or all) of its calls to complete before returning, guaranteeing that interactions in the rectangle will not start until the work in the triangles is done.

This triangle code is but a subset of what is needed for the rectangle recursion, which requires two midpoints and two parallel_invoke() calls:


void rect_interact(int i0, int i1, int j0, int j1)
{
    int di = i1 - i0; int dj = j1 - j0;
    if (di > 1 && dj > 1) {
        int im = i0 + di/2;
        int jm = j0 + dj/2;
        parallel_invoke([&]() {rect_interact(i0, im, j0, jm);},
                        [&]() {rect_interact(im, i1, jm, j1);});
        parallel_invoke([&]() {rect_interact(i0, im, jm, j1);},
                        [&]() {rect_interact(im, i1, j0, jm);});
    }
    else {
        for (int i = i0; i < i1; ++i)
            for (int j = j0; j < j1; ++j)
                addAcc(i, j);
    }
}

Two balanced parallel operations are invoked when there is enough work to subdivide on both axes.  Otherwise, the else branch drops into the actual interaction function, addAcc().  (You might have noticed that body_interact() does not even call addAcc().  Because of the half-open interval notation I’m using, a triangle subdivision that reduces to a single interaction node is in fact pointing at a node representing the interaction of a body with itself, so we can ignore it.)

With this code, I can go back to the original addAcc(), which updated the acceleration vectors of both interacting bodies without concern about data races and without any locks.  There are still locks, but they are hidden in the library code that handles the synchronization of the rendezvousing threads within the parallel_invoke() call.  But will it run faster?  We’ll find out next time.
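The original addAcc() isn’t reproduced in this post, so here is a hedged sketch of what such a symmetric update could look like (the Body layout, the global body vector, G, and the softening constant eps are my assumptions, not code from this series).  The point is that a single call writes to both bodies’ acceleration vectors, which is exactly why the decomposition above must guarantee that no two concurrent tasks share a body index.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Body {
    double pos[3];
    double acc[3] = {0, 0, 0};
    double mass;
};

std::vector<Body> body;             // assumed global body table
const double G   = 6.674e-11;       // gravitational constant
const double eps = 1e-9;            // softening term to avoid division by zero

// One interaction updates BOTH bodies: i is pulled toward j and j toward i.
void addAcc(int i, int j) {
    double d[3], r2 = eps;
    for (int c = 0; c < 3; ++c) {
        d[c] = body[j].pos[c] - body[i].pos[c];
        r2 += d[c] * d[c];
    }
    double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
    for (int c = 0; c < 3; ++c) {
        body[i].acc[c] += G * body[j].mass * d[c] * inv_r3;
        body[j].acc[c] -= G * body[i].mass * d[c] * inv_r3;
    }
}
```

Because the two updates use the same distance terms, the pair of writes conserves momentum (mass-weighted accelerations cancel), and no lock is needed as long as no other task is touching body i or body j at the same time.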

