When considering the parallelization of some piece of code, my first concern is to be sure that the code I start with is optimized for serial execution. It does me little good to write a parallel version that just sops up the latency holes that inefficient code makes available. It may seem to scale well, but some future compiler may come along that does a better job of optimizing that inefficient kernel and suddenly the performance scaling might disappear.

Even better is if I can squeeze those inefficiencies out of the algorithm itself. Previously, before defining the interaction loops, I set up a data structure to represent each body, choosing mass, location, and velocity to represent the state, plus a place to accumulate accelerations. However, Jim Dempsey suggested it might be more efficient to accumulate forces rather than accelerations. Forces are the canonical result of the gravitational equation, but divides are among the most expensive floating-point operations. Better to accumulate the influence of other bodies as forces first, then do a single divide per body to compute the acceleration, right?
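For reference, the per-body state described above might look something like this in C. This is my own sketch; the struct and field names are illustrative, not necessarily the ones from the earlier post:

```c
/* Sketch of a per-body record: mass, position, and velocity hold the
   state, and acc accumulates the net acceleration each time step.
   Names are illustrative, not taken from the original code. */
typedef struct {
    double mass;     /* body mass */
    double pos[3];   /* location (x, y, z) */
    double vel[3];   /* velocity */
    double acc[3];   /* accumulated acceleration for this step */
} Body;
```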

But where did that *F* come from? We are actually dividing out a mass that was multiplied in to compute the original force. What if we never multiplied it in to begin with?
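Concretely, for the pull of body *j* on body *i*:

$$F_{ij} = \frac{G\, m_i\, m_j}{r^2}, \qquad a_i = \frac{F_{ij}}{m_i} = \frac{G\, m_j}{r^2}$$

The $m_i$ that the divide would remove is the same $m_i$ the force computation put there; skip both and the divide vanishes.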

The *G*-over-*r*-squared term is a common factor that can be computed once per pair. Two multiplies, which would be required anyway (one by each body's mass), give the two accelerations directly, without any extra divides! So we'll stick with the plan to accumulate accelerations.

Next time: realizing *addAcc*(i,j)