n-bodies: a parallel TBB solution: serial body drill-down

Having discovered which function consumes most of the time in the serial algorithm last time, there’s still more to discover by narrowing the focus to a specific function of interest. Our function, shown last time and below, is addAcc.

Expanding the view to show the function in detail is often called drilling down to source. In Intel® Parallel Amplifier I can do this by just double-clicking on the function, addAcc.

Parallel Amplifier lands on the hottest line in the function and provides easy navigation buttons (just below the Bottom-up button) to explore the other hot spots in order of time taken (max, step-up, step-down, min, respectively). Since I landed on the hottest hot spot, the "navigate-to-a-hotter-spot" buttons are grayed out.

Looks like addAcc has a problem with divides. Division is one of the more expensive arithmetic operations and while I can’t eliminate all of them, I certainly can reduce the number of them.

Computing the inverse distance once and then multiplying that seems to have had an effect: before the change, the times attributed to the nearby lines amounted to over 1.2 seconds while after, the total is down to 0.78 seconds. I accumulate the values from the adjacent lines because the reported event counts are at best approximate-the tool needs to deal with both the optimized code that may have scattered around the instructions that implement any particular line and phenomena that affect the actual process of determining the location of the instruction pointer, such as event skid (to be addressed in some other post). In fact, probably a significant portion of the 0.78 and 1.2 seconds is actually coming from the square root function that immediately precedes these lines. So I’ll run another ramp of n-bodies and see if my numbers are any better.

Yes, they are. And now that I have ivdist, it’s worth considering whether I can use it more efficiently to replace the divide by distsq into something like this:

    double Gdivd = GFORCE * ivdist * ivdist;

Sure enough, that change, though not as beneficial as the last one, still has an observable benefit:

Another hot spot run shows even less time being spent at the previously identified hot spots:

I’ll use this last version as my serial baseline, the benchmark against which I’ll compare to measure my progress in parallelization. It might change someday as I continue to evaluate alternatives like the persistent question about forces. This in fact is part of our recommended practice for migrating serial code into a parallel environment: I start by optimizing the serial version as much as I can so that the benefits I gain through parallel implementation are not just because multiple HW threads are just filling in the gaps left behind by reusing an inefficient, serial version.

Next time: parallel code, a first attempt