n-bodies: a parallel TBB solution: parallel code with parallel_invoke: will it run faster?

Earlier in the month I fleshed out a spatially arranged subdivision method I learned from Matteo Frigo but didn’t have time to actually run it and compare against my baselines.  And in the meantime my test machine has been regrooved into a Windows 7 box, so my first order of business is to retest my baselines.  I reran the serial algorithm a few times and averaged the results.  While the raw numbers (using the same compiler and machine but just a different OS) are slightly larger under Windows 7, their magnitude cannot be distinguished in a graph:

Moving forward with the Windows 7 serial numbers, next I did a run with the recursive parallel technique I described last time:



I included the simple parallel run for comparison.  Looks like there’s more work to do.  My “fast” technique is still not measuring up against either the serial version or the buggy simple parallel version.  Time for another hot spot analysis:



Curious.  Where most of the work is done is in accAcc(),  yet most of time is being spent in rect_interact(), which just splits up rectangles.  Maybe I’m being too aggressive at subdividing.  Yup, it turns out this is another load balance problem.  And the test program comes equipped with a grain size control in the form of a parameter, fastgrain. You can experiment with various thresholds, but using a grain of 16 makes a substantial shift in the hot spots:



 Body_interact() is now taking less time than addAcc() (where it had been taking more than four times as much before).  And if I collect a set of runs with fastgrain 16, there’s a big change to the performance graph:



These runs start out like the previous parallel algorithms, but by 8 bodies, the curve has already fallen below that of the unsafe, simple parallel method, and by 64 bodies it’s beating the baseline serial numbers.  At 4096 bodies this algorithm is now running over 4x faster than the serial version.  Success!  Could we do even better with a larger grain size?  Maybe marginally, but note in the latter hot spot result that TBB internal code including the idle spinner are already taking more time than the code doing real work.  It’s possible that a higher threshold will just result in more idle time.  Something to look into another day.

For more complete information about compiler optimizations, see our Optimization Notice.

Comments

thkarcher's picture

Hi Robert,

could you offer a download link to your code? I'd like to run some experiments myself ...

Thanks!

Thomas

robert-reed (Intel)'s picture

As a matter of fact, http://software.intel.com/en-us/blogs/2009/08/19/n-bodies-exploring-a-parallel-tbb-solution-intro-and-peek-ahead/ is a previous blog post of mine that contains an attachment with a version of the code that's not too different from what I was experimenting with here. I just checked to be sure it's still there (it is), so you should be able to pull a reasonable copy from there.