n-bodies: a parallel TBB solution: serial body drill-down

Having discovered which function consumes most of the time in the serial algorithm last time, there’s still more to discover by narrowing the focus to a specific function of interest. Our function, shown last time and below, is addAcc.

Expanding the view to show the function in detail is often called drilling down to source. In Intel® Parallel Amplifier I can do this by just double-clicking on the function, addAcc.

Parallel Amplifier lands on the hottest line in the function and provides easy navigation buttons (just below the Bottom-up button) to explore the other hot spots in order of time taken (max, step-up, step-down, min, respectively). Since I landed on the hottest hot spot, the "navigate-to-a-hotter-spot" buttons are grayed out.

Looks like addAcc has a problem with divides. Division is one of the more expensive arithmetic operations and while I can’t eliminate all of them, I certainly can reduce the number of them.

Computing the inverse distance once and then multiplying that seems to have had an effect: before the change, the times attributed to the nearby lines amounted to over 1.2 seconds while after, the total is down to 0.78 seconds. I accumulate the values from the adjacent lines because the reported event counts are at best approximate-the tool needs to deal with both the optimized code that may have scattered around the instructions that implement any particular line and phenomena that affect the actual process of determining the location of the instruction pointer, such as event skid (to be addressed in some other post). In fact, probably a significant portion of the 0.78 and 1.2 seconds is actually coming from the square root function that immediately precedes these lines. So I’ll run another ramp of n-bodies and see if my numbers are any better.

Yes, they are. And now that I have ivdist, it’s worth considering whether I can use it more efficiently to replace the divide by distsq into something like this:

    double Gdivd = GFORCE * ivdist * ivdist;

Sure enough, that change, though not as beneficial as the last one, still has an observable benefit:

Another hot spot run shows even less time being spent at the previously identified hot spots:

I’ll use this last version as my serial baseline, the benchmark against which I’ll compare to measure my progress in parallelization. It might change someday as I continue to evaluate alternatives like the persistent question about forces. This in fact is part of our recommended practice for migrating serial code into a parallel environment: I start by optimizing the serial version as much as I can so that the benefits I gain through parallel implementation are not just because multiple HW threads are just filling in the gaps left behind by reusing an inefficient, serial version.

Next time: parallel code, a first attempt

For more complete information about compiler optimizations, see our Optimization Notice.


vu64's picture

The nBodies problem is getting more and more interesting.

jimdempseyatthecove's picture


I took the liberty to write conditional code into Bodies008 (my derivation of Bodies007). I can compile using acceleraton or compile using force (did not tweak code other than for this change in diminsionality). The end result was less improvement than I expected. The tradoff came to

Computing one force using product of two masses (I did't factor this in) but then using that same force in a += and -= as opposed to computing two accelerations and then using seperate results for += and -=

On Q6600 4 core 2.4GHz and 2048 bodies TBB Fast

Force 53.2233s
Acceleration 53.3049s

53.3049 - 53.2233 = .0816 / 53.3049 = 0.153% faster

Hardly worth making code changes.

I did not check to see if the compiler
optimized the unit vector calculation with the Force vector
(same would apply the the acceleration)
I haven't looked at having the body hold the gravity gradient as opposed to mass
This could eliminate one of the product terms. (G*m1*m2 replaced with G1*G2)
however, this will impact you later when you want to extract the mass for other purposes.

n-bodies is a nice simple example for learning experimentation.

Full run data follows (your TBB fast is using your Lambda functions)
Run on Q6600 4-core no HT 2.4GHz

Compiled for Force
Starting n-body run with 4 threads. Using fastpar grainsize 2.

n TBB fast QT invoke QT dist
2 0.00264379 0.00483742 0.000128479
4 0.00595778 0.0137062 0.000438641
8 0.0133869 0.028763 0.00181174
16 0.0282487 0.0633326 0.0814076
32 0.0596842 0.113969 0.0835226
64 0.139366 0.191398 0.108036
128 0.346929 0.407782 0.197438
256 1.05851 1.05835 0.516981
512 3.7096 3.06148 1.82443
1024 14.3676 10.561 6.92065
2048 53.2233 39.1031 27.0872


Compiled for Acceleration
Starting n-body run with 4 threads. Using fastpar grainsize 2.

n TBB fast QT invoke QT dist
2 0.0021168 0.00441518 8.15588e-005
4 0.00461897 0.0111446 0.000345236
8 0.0128084 0.0242005 0.00154754
16 0.0254874 0.0523342 0.0822425
32 0.0532552 0.0975228 0.0970098
64 0.135412 0.1839 0.108085
128 0.399247 0.390436 0.198003
256 1.05756 1.05385 0.552229
512 3.72708 3.09429 1.85823
1024 14.3025 10.711 7.15188
2048 53.3049 39.4705 27.9941

small difference, not as significant as first thought
(hmm... something is biasing the QT startup will have to address this later)



robert-reed (Intel)'s picture

Jim, did you have a chance to read the episode of this series posted as http://software.intel.com/en-us/blogs/2009/09/22/n-bodies-a-parallel-tbb-solution-computing-accelerations-or-forces/ ? There I took up the question of whether to accumulate forces or accelerations and determined that the gravitational force equation can be turned into a pair of acceleration equations by dropping the mass multiply that would require an eventual divide to convert the accumulated partial forces back into an acceleration. Currently my code requires no such division, partial or accumulated, because it never multiplies in the mass in the first place that would eventually need to be divided out. If you're still interested, in keeping with my goal of stepping into every hole I can find along the way in the development of this algorithm, I could take up this challenge and try a version that accumulates forces instead of accelerations, comparing it against the current code, but based on the analysis mentioned above, I fear it will be slower, not faster, particularly since it uses more divides (the current algorithm never divides by mass, only by distance and even that now through the use of a reciprocal multiply).

Check it out and as always, thanks for your support of my efforts.

jimdempseyatthecove's picture


The next serial optimization should come from the realization that you can accumulate the net forces on the objects as opposed to accumulating the net accellerations. Then convert the net force to net accelleration with one division (F/M = A) as opposed to n-1 divisions (using * of inverse). This may eliminate ~31.4ms or 11.37% inside the addAcc (renamed to addForce).

It will be interesting to see if your numbers match my estimates.

Good tutorial, keep it comming.


vu64's picture

Nice to see another post about this problem. The optimization really suprises me. Can you point to other sources explaining about these kinds of optimization (or a book perhaps) ? It is amazing that reducing division makes the code faster. Looking forward to the parallel version.

Add a Comment

Have a technical question? Visit our forums. Have site or software product issues? Contact support.