Earlier in the month I fleshed out a spatially arranged subdivision method I learned from Matteo Frigo, but didn't have time to actually run it and compare it against my baselines. In the meantime my test machine has been regrooved into a Windows 7 box, so my first order of business is to retest those baselines. I reran the serial algorithm a few times and averaged the results. While the raw numbers (using the same compiler and machine but j…
Last time, after struggling with different lock configurations to reduce the synchronization overhead of managing the n-squared interactions among the bodies, I changed perspective on the problem: by representing the interactions between all the bodies spatially, I (re-)discovered in that view a way to group the interactions so that independent threads could work together without having to worry about locking the data.
When last I had a chance to play with this code, I experimented with using multiple locks to enable multiple simultaneous (and disjoint) interactions between pairs of bodies. It helped, but performance still didn't cross the baseline set using only one thread. Overhead in the loop could be reduced by using only one scoped lock instead of two, but that would require an array of locks indexed by i and j.
Last time I was able to make the n-bodies acceleration code at least thread safe by employing a scoped lock, at a disastrous cost in performance. If you think about it, a single global lock is a bad way to manage the eight HW threads my test machine has available. The obvious alternative is to have a lock per body: any thread needing to adjust a pair of bodies would acquire each body's lock before proceeding. That's more locking overhead than before (twice as many lock acquisitions per interaction), but enough independent locks that my multiple threads won't all be stopped by one body interaction. Making that change, the body st…
Last time, when I resumed the exploration of my simple n-body gravitational simulator, I produced some performance numbers and revealed a flaw in the first parallel version of the algorithm. But then Intel® Parallel Composer Update 5 was released last week, so I updated my tools. That means I need a new benchmark run to see how the baseline has been affected.
My plan to go parallel this time was thwarted by concerns that I may still have left some serial performance on the table. So I’ll take one more look (OK, well, no more than three). Leading the contenders was Jim Dempsey’s suggestion that accumulating forces instead of accelerations would save some divides. His numbers did not show a dramatic difference but did suggest summing forces to be ever so slightly faster than accumulating accelerations.
Last time I discovered which function consumes most of the time in the serial algorithm; there's still more to discover by narrowing the focus to that specific function of interest. Our function, shown last time and below, is addAcc.
In my last venture I got the n-bodies program to compile and ran a test series with the serial algorithm, showing the n-squared nature of the basic problem. I mean to write a parallel version of this (heh, heh, heh), but first I need to know what is taking up the time.
Let’s take the body interaction code I laid out last time, combine it with the other parts laid out previously, and run it. Dropping the fleshed-out program into a Microsoft Visual Studio* project, I quickly rediscover something: