English | 中文 | Русский | Français
2,856 Posts served
8,606 Conversations started
Having discovered which function consumes most of the time in the serial algorithm last time, there’s still more to discover by narrowing the focus to a specific function of interest. Our function, shown last time and below, is addAcc.
Expanding the view to show the function in detail is often called drilling down to source. In Intel® Parallel Amplifier I can do this by just double-clicking on the function, addAcc.

Parallel Amplifier lands on the hottest line in the function and provides easy navigation buttons (just below the Bottom-up button) to explore the other hot spots in order of time taken (max, step-up, step-down, min, respectively). Since I landed on the hottest hot spot, the "navigate-to-a-hotter-spot" buttons are grayed out.
Looks like addAcc has a problem with divides. Division is one of the more expensive arithmetic operations and while I can’t eliminate all of them, I certainly can reduce the number of them.

Computing the inverse distance once and then multiplying that seems to have had an effect: before the change, the times attributed to the nearby lines amounted to over 1.2 seconds while after, the total is down to 0.78 seconds. I accumulate the values from the adjacent lines because the reported event counts are at best approximate—the tool needs to deal with both the optimized code that may have scattered around the instructions that implement any particular line and phenomena that affect the actual process of determining the location of the instruction pointer, such as event skid (to be addressed in some other post). In fact, probably a significant portion of the 0.78 and 1.2 seconds is actually coming from the square root function that immediately precedes these lines. So I’ll run another ramp of n-bodies and see if my numbers are any better.

Yes, they are. And now that I have ivdist, it’s worth considering whether I can use it more efficiently to replace the divide by distsq into something like this:
double Gdivd = GFORCE * ivdist * ivdist;
Sure enough, that change, though not as beneficial as the last one, still has an observable benefit:

Another hot spot run shows even less time being spent at the previously identified hot spots:

I’ll use this last version as my serial baseline, the benchmark against which I’ll compare to measure my progress in parallelization. It might change someday as I continue to evaluate alternatives like the persistent question about forces. This in fact is part of our recommended practice for migrating serial code into a parallel environment: I start by optimizing the serial version as much as I can so that the benefits I gain through parallel implementation are not just because multiple HW threads are just filling in the gaps left behind by reusing an inefficient, serial version.
Next time: parallel code, a first attempt
| October 16, 2009 9:29 AM PDT
jimdempseyatthecove
|
Robert, The next serial optimization should come from the realization that you can accumulate the net forces on the objects as opposed to accumulating the net accellerations. Then convert the net force to net accelleration with one division (F/M = A) as opposed to n-1 divisions (using * of inverse). This may eliminate ~31.4ms or 11.37% inside the addAcc (renamed to addForce). It will be interesting to see if your numbers match my estimates. Good tutorial, keep it comming. Jim |
| October 16, 2009 10:01 AM PDT
Robert Reed (Intel)
|
Jim, did you have a chance to read the episode of this series posted as http://software.intel.com/en-us/blogs/2009/09/22/n-bodies-a-..... or-forces/ ? There I took up the question of whether to accumulate forces or accelerations and determined that the gravitational force equation can be turned into a pair of acceleration equations by dropping the mass multiply that would require an eventual divide to convert the accumulated partial forces back into an acceleration. Currently my code requires no such division, partial or accumulated, because it never multiplies in the mass in the first place that would eventually need to be divided out. If you're still interested, in keeping with my goal of stepping into every hole I can find along the way in the development of this algorithm, I could take up this challenge and try a version that accumulates forces instead of accelerations, comparing it against the current code, but based on the analysis mentioned above, I fear it will be slower, not faster, particularly since it uses more divides (the current algorithm never divides by mass, only by distance and even that now through the use of a reciprocal multiply). Check it out and as always, thanks for your support of my efforts. |
| October 16, 2009 12:26 PM PDT
jimdempseyatthecove
|
Robert, I took the liberty to write conditional code into Bodies008 (my derivation of Bodies007). I can compile using acceleraton or compile using force (did not tweak code other than for this change in diminsionality). The end result was less improvement than I expected. The tradoff came to Computing one force using product of two masses (I did't factor this in) but then using that same force in a += and -= as opposed to computing two accelerations and then using seperate results for += and -= On Q6600 4 core 2.4GHz and 2048 bodies TBB Fast Force 53.2233s Acceleration 53.3049s 53.3049 - 53.2233 = .0816 / 53.3049 = 0.153% faster Hardly worth making code changes. I did not check to see if the compiler optimized the unit vector calculation with the Force vector (same would apply the the acceleration) I haven't looked at having the body hold the gravity gradient as opposed to mass This could eliminate one of the product terms. (G*m1*m2 replaced with G1*G2) however, this will impact you later when you want to extract the mass for other purposes. n-bodies is a nice simple example for learning experimentation. Full run data follows (your TBB fast is using your Lambda functions) ----------------------------------------------------------------- ------------ Run on Q6600 4-core no HT 2.4GHz Compiled for Force Starting n-body run with 4 threads. Using fastpar grainsize 2. n TBB fast QT invoke QT dist 2 0.00264379 0.00483742 0.000128479 4 0.00595778 0.0137062 0.000438641 8 0.0133869 0.028763 0.00181174 16 0.0282487 0.0633326 0.0814076 32 0.0596842 0.113969 0.0835226 64 0.139366 0.191398 0.108036 128 0.346929 0.407782 0.197438 256 1.05851 1.05835 0.516981 512 3.7096 3.06148 1.82443 1024 14.3676 10.561 6.92065 2048 53.2233 39.1031 27.0872 ======================================= Compiled for Acceleration Starting n-body run with 4 threads. Using fastpar grainsize 2. n TBB fast QT invoke QT dist 2 0.0021168 0.00441518 8.15588e-005 4 0.00461897 0.0111446 0.000345236 8 0.0128084 0.0242005 0.00154754 16 0.0254874 0.0523342 0.0822425 32 0.0532552 0.0975228 0.0970098 64 0.135412 0.1839 0.108085 128 0.399247 0.390436 0.198003 256 1.05756 1.05385 0.552229 512 3.72708 3.09429 1.85823 1024 14.3025 10.711 7.15188 2048 53.3049 39.4705 27.9941 Done. small difference, not as significant as first thought (hmm... something is biasing the QT startup will have to address this later) ----------------------------------------------------------------- ------------ Jim |
| October 16, 2009 12:27 PM PDT
jimdempseyatthecove
| (formatting lost in paste) |
| October 17, 2009 4:07 AM PDT
vu64
| The nBodies problem is getting more and more interesting. |

vu64
610
Status Points:
110