n-bodies: a parallel TBB solution: serial body forces one more time

My plan to go parallel this time was thwarted by concerns that I may still have left some serial performance on the table. So I'll take one more look (OK, well, no more than three). Leading the contenders was Jim Dempsey's suggestion that accumulating forces instead of accelerations would save some divides. His numbers did not show a dramatic difference, but did suggest that summing forces is ever so slightly faster than accumulating accelerations. My analysis of the equations suggests the ranking should go the other way, so I took the plunge and wrote addForce(i,j). It's a simple twist on the original addAcc. Instead of computing separate accelerations for each body, I compute one force:

    // Use the Force, Luke!
    double force = GFORCE * ivdist * ivdist * body[j].mass * body[i].mass;

Then I ensured the vector component accumulations take advantage of the simplification:

    for (int axis = 0; axis < 3; ++axis) {
        double axialForce = force * ud[axis];
        body[j].acc[axis] += axialForce;
        body[i].acc[axis] -= axialForce;
    }

To avoid changes to the body data structure that might affect the experiment, I redefined the acc field to mean accumulator instead of acceleration (cheap trick for a short hack ;-).
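Putting those pieces together, the whole of addForce(i,j) presumably looks something like the sketch below. The distance math is my reconstruction from the earlier addAcc posts rather than a verbatim copy: I'm assuming ud holds the unit vector pointing from body j toward body i, ivdist is the inverse of their separation, and the Body struct and GFORCE declarations stand in for whatever the bodies code actually defines.

    #include <cmath>

    // Assumed stand-ins for declarations that live elsewhere in the bodies code
    struct Body { double pos[3], vel[3], acc[3], mass; };
    extern Body body[];
    extern const double GFORCE;

    void addForce(int i, int j) {
        // Vector from body j toward body i, and the inverse of its length
        double ud[3], d2 = 0.;
        for (int axis = 0; axis < 3; ++axis) {
            ud[axis] = body[i].pos[axis] - body[j].pos[axis];
            d2 += ud[axis] * ud[axis];
        }
        double ivdist = 1. / std::sqrt(d2);
        for (int axis = 0; axis < 3; ++axis)
            ud[axis] *= ivdist;   // normalize to a unit direction

        // Use the Force, Luke!  One magnitude for the pair...
        double force = GFORCE * ivdist * ivdist * body[j].mass * body[i].mass;

        // ...accumulated equally and oppositely on both bodies
        // (acc here means "accumulator", per the note above)
        for (int axis = 0; axis < 3; ++axis) {
            double axialForce = force * ud[axis];
            body[j].acc[axis] += axialForce;
            body[i].acc[axis] -= axialForce;
        }
    }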

With addForce in hand, I needed to make some adjustments to the ballistic step to turn the forces back into accelerations:

    for (i = 0; i < n; ++i) {
        for (int axis = 0; axis < 3; ++axis) {
            // divide the accumulated force by the mass to recover acceleration
            body[i].vel[axis] += (body[i].acc[axis] / body[i].mass) * TIMESTEP;
            body[i].pos[axis] += body[i].vel[axis] * TIMESTEP;
            body[i].acc[axis] = 0.;
        }
    }
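For contrast, the same step in the addAcc variant needs no divide at all, since acc already holds accelerations; roughly (a sketch, not the verbatim bodies007 code):

    for (i = 0; i < n; ++i) {
        for (int axis = 0; axis < 3; ++axis) {
            // acc[] is already an acceleration, so the mass never appears
            body[i].vel[axis] += body[i].acc[axis] * TIMESTEP;
            body[i].pos[axis] += body[i].vel[axis] * TIMESTEP;
            body[i].acc[axis] = 0.;
        }
    }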

Oops: the force version adds three divides per body (one for each axis), which gives this result:

In my experiments, the add forces version came in slightly slower than my best add acceleration version. It’s easier to see in the numbers: 
As this table shows, the runs accumulating force take longer, as you would expect for a solution that requires extra multiplies (to include the second mass term in the force equation) and then extra divides (to remove it again and get back to acceleration). But wait a minute! There's something else going on here. What's with those serial addAcc numbers? I remember lower numbers when I took my first serial run. Maybe there's more variability in the results than I recall? That's easy to check. I switched back to the bodies007 code and took several more runs.

That all looks pretty consistent across the range of n values. Yet when I tried the same thing with bodies008:

Almost a second slower for 2K bodies, even though the supposed “code under test” didn't change. Note that I'm not even running the addForce code here, just evaluating the conditional test in the RAMP test mode (see below). There were not many changes in going from bodies007.cpp to bodies008.cpp, so it was pretty easy to isolate the code change that caused most of the slowdown. I was able to get these numbers...

…by the simple expedient of commenting out the following code:

                          // Do the single threaded run
//                       if (method & USEFORCE) {
//                           startBodies(n);
//                           stime = tick_count::now();
//                           runSerialForceBodies(n);
//                           etime = tick_count::now();

//                           elapsed = (etime - stime).seconds();
//                           cout << "," << setw(20) << elapsed;
//                       }

This is one of several clauses in a for-loop that steps through the values of n for the ramp. Commenting out all the variant methods, so that the only one remaining is the serial addAcc, does even better, though the returns are diminishing:
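For reference, the overall shape of that ramp loop, reconstructed from the fragment above, is roughly the following sketch. USEFORCE, startBodies, runSerialForceBodies, and the tick_count timing come straight from the commented-out clause; USEACC, runSerialBodies, and the particular n values are stand-ins of mine.

    #include <iostream>
    #include <iomanip>
    #include "tbb/tick_count.h"
    using namespace tbb;
    using namespace std;

    enum { USEACC = 1, USEFORCE = 2 };    // method-selection bits (USEACC is an assumed name)

    void startBodies(int n);              // (re)initialize n bodies
    void runSerialBodies(int n);          // serial addAcc variant (assumed name)
    void runSerialForceBodies(int n);     // serial addForce variant

    void ramp(int method) {
        // Step n through a range of body counts, timing each selected
        // variant at each n and printing one CSV row per n.
        for (int n = 64; n <= 2048; n *= 2) {
            cout << setw(6) << n;

            if (method & USEACC) {        // the serial addAcc run
                startBodies(n);
                tick_count stime = tick_count::now();
                runSerialBodies(n);
                tick_count etime = tick_count::now();
                cout << "," << setw(20) << (etime - stime).seconds();
            }

            if (method & USEFORCE) {      // the serial addForce run
                startBodies(n);
                tick_count stime = tick_count::now();
                runSerialForceBodies(n);
                tick_count etime = tick_count::now();
                cout << "," << setw(20) << (etime - stime).seconds();
            }

            cout << endl;
        }
    }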

So, given the ramp loop is somehow having an effect on the numbers, let me reverse the scenario and comment out all but the addForce variant:

Each of these last two sets is the average of five runs, and by my measure adding accelerations still wins, though the answer is much murkier than I would hope. What gremlins are plaguing these numbers? I have some hunches that involve optimization and inlining strategies, but I can't yet point a finger at the specific problems. There are some tantalizing observations to be made, though.

For example, I could try doing a hot spot analysis to see if that would provide clues about the unexpected overhead. However, that means relaxing function inline optimization (the /Ob1 trick). But what does that do to performance?
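(As a reminder of what the trick does: /Ob1, on the Microsoft-compatible Windows compilers including Intel's, restricts inline expansion to functions explicitly marked for it, whereas the usual Release setting, /Ob2, lets the compiler inline anything it judges profitable. A tiny illustration, not from the bodies code:)

    // Compiled with /O2 /Ob1:
    inline double sq(double x) { return x * x; } // marked inline: still eligible for expansion
    double interact(double d) { return sq(d); }  // unmarked: callers keep the call overhead,
                                                 // so it shows up as a distinct hot spot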

Wow. Looks like function inlining is a big performance benefit for the serial acceleration-accumulating code, which in this case (one sample) lost 14 seconds computing the interactions of 2K bodies.

Curiously, the serial code accumulating forces appears to take a much smaller hit from the loss of aggressive function inlining. Or, in glass-half-full parlance, it appears to take much less advantage of compiler optimizations. The same appears to be true if you continue relaxing optimizations, specifically looking at the performance of the Debug configuration with this same test (all these tests use the optimizations available in Intel® Parallel Composer, so your mileage may vary depending on which compiler you use):

There's a lot more to be discovered in this rich mine of anomalies, and perhaps when I have some more time I will delve into it more deeply. For now, though, I'll continue to use the addAcc variant in the experiments going forward. After all, more than half a dozen posts into this series, I haven't even gone parallel yet!
