n-bodies: a parallel TBB solution: parallel code: first run’s fatal flaw

Last time, when I resumed the exploration of my simple n-body gravitational simulator, I produced some performance numbers and revealed that there is a flaw in the first parallel version of the algorithm. But then Intel® Parallel Composer Update 5 was released last week, so I updated my tools. That means I need a new benchmark run to see how the baseline has been affected.



The rebuilt bodies program is just a touch slower (before the update, the 4K-body serial average was 283 sec vs. 278 sec for this parallel version). In this case the serial number got hit harder, so the parallel scaling number actually goes up slightly (from 1.02x to 1.03x) even though the numbers got worse! Don’t trust parallel scaling numbers; understand the underlying data.
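For the record, the scaling number is just the ratio of serial run time to parallel run time, so with the before numbers above:

scaling = serial time / parallel time = 283 sec / 278 sec ≈ 1.02x

If an updated compiler slows the serial code more than it slows the parallel code, that ratio creeps upward even though every run got slower.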

On to parallel correctness.  I know there’s a flaw in the parallel algorithm I’ve proposed, but how to demonstrate it?  If the n-bodies program actually had a means to display a projection of the bodies in motion I might be able to run that long enough to notice glitches in the body motion.  Maybe.  Or maybe not.  I’d like something a little more reliable and mechanical than exhaustive inspection, especially when there’s lots of data.  So let’s look at what we can learn from Intel Parallel Inspector.

The first thing we need to do is change the command line to select a test suited to data collection. I’ve been running these ramps using “select serial” and “select par” as the command lines. Those code paths have timing functions and run varying sizes of the problem to collect those times, generally doing way more work than I need to catch a race. There is another option, the “single n” command, which takes as an argument n, the number of bodies to use in the simulation. I’ll change the debug command line to “single 128 par” to run the parallel test with 128 bodies.
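In case it helps to picture the harness, here’s a minimal sketch of that kind of command dispatch; runRamp and runSingle are hypothetical stand-ins, not the actual NBodies source:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-ins for the real timing and simulation entry points.
static void runRamp(const char* mode)          { std::printf("ramp: %s\n", mode); }
static void runSingle(int n, const char* mode) { std::printf("single %d: %s\n", n, mode); }

int main(int argc, char* argv[]) {
    if (argc >= 3 && std::strcmp(argv[1], "select") == 0)
        runRamp(argv[2]);                         // "select serial" / "select par": timing ramps
    else if (argc >= 4 && std::strcmp(argv[1], "single") == 0)
        runSingle(std::atoi(argv[2]), argv[3]);   // "single 128 par": one 128-body run
    return 0;
}
```

So here goes.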



I’ll pick “ti3”, the third-level thread inspection, in order to find out where my deadlocks or data races exist. You can see from the estimates shown at the left that this might take a little longer than just running the test instance by itself. So I hit the “run analysis” button at the bottom of the dialog and off it goes:



Assuming all goes well and data actually get collected from the run, I next see something like this:



And hitting the “Interpret Result” button takes me to this:



At the bottom are some of the observations collected during my program run.  The observations were interpreted when I hit the last button, generating the problem sets visible in the upper pane.  Looks like I have some data races (problematic places in memory where multiple threads may be writing and reading data in an undetermined and potentially harmful order).  I should be able to double-click on a problem set and find an error.  I’ll try P1:



Oops. That’s not very helpful. Oh, but the executable module is irml, not NBodies. I get the same thing with the next two problem sets. Moving on to P4, which does not mention irml (part of TBB, the Intel resource management layer?), I try again:



Hmmmmmmmm…. This doesn’t look like NBodies code either. This is somewhere in the TBB parallel_for header file. Parallel Inspector flags this as a Write-after-Write race, but it doesn’t seem to give me much that might help me understand the problem. Maybe this is a false positive, a case where it looks like there might be a race condition but there really isn’t. The code that unwittingly commits a data race might not look that different from safe and legal mutex code. Sometimes it’s only how the code is used that distinguishes one case from the other.
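For example (a hypothetical, not NBodies code), a hand-rolled handoff on a bare flag reads exactly like a race to a checker, and whether it is actually safe depends on memory-model guarantees the tool can’t see:

```cpp
// Hypothetical illustration: a one-way handoff signaled through a bare flag.
// A race detector sees the write and the read of 'ready' as an unsynchronized
// conflict; only the surrounding usage (and platform guarantees) can say
// whether the pattern is deliberate and benign or a genuine bug.
volatile bool ready = false;
int payload = 0;

void producer() { payload = 42; ready = true; }  // publish, then signal
void consumer() {
    while (!ready) { /* spin */ }                // wait for the signal
    int x = payload; (void)x;                    // consume the payload
}
```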

To tell the difference, Intel Parallel Studio has a library of function calls that can be used to declare safe operations that might otherwise be considered suspect. To enable these, TBB conditionally compiles under the TBB_USE_THREADING_TOOLS macro definition, applying alternate code when it is set so that otherwise questionable-looking operations are vouched for. That instrumentation can cost a little in performance, so it only needs to be turned on when you need it.
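In source terms that’s just a macro defined ahead of the TBB headers; a minimal sketch (the equivalent can also go on the compile line as /DTBB_USE_THREADING_TOOLS=1):

```cpp
// Define before any TBB header so TBB compiles its instrumented paths,
// letting tools like Parallel Inspector recognize TBB's internal
// synchronization instead of flagging it as suspect.
#define TBB_USE_THREADING_TOOLS 1
#include <tbb/parallel_for.h>
```

Turning it on for the whole project can be done inside Intel Parallel Studio: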



Under Parallel Composer > Select Build Components is the following dialog:



Oops. Not set.  That’s easy to fix.  Let me rebuild and recollect the ti3 data.



Well, that doesn’t look much better. Now irml is a contributing module in every one of the data race problem sets. Sure enough, drilling down to source on any of these problem sets is as unsatisfying as any in the last collection. So what am I doing wrong? And did that TBB_USE_THREADING_TOOLS setting do anything?

There is one other thing I can try. I’ve been using the Release configuration here, following on from the performance runs that started this post. That has been a problem before when using analysis tools, because of the aggressive function inlining that Parallel Composer normally applies. I have an alternate configuration, Release-with-functions, which has the same settings as Release except for function inlining, which is turned down to /Ob1.
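The only difference between the two configurations is the inline-expansion switch (the /O2 shown here is illustrative of typical Release settings):

```
Release:                 /O2 /Ob2    (inline any suitable function)
Release-with-functions:  /O2 /Ob1    (inline only functions marked for inlining)
```

Switching to that configuration, rebuilding the program, and collecting ti3 data one more time: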



That doesn’t look much different than the last one.  However, when I double-click on P2 this time, I get this:



Well, this looks better. In fact, the highlighted lines are the known data race. My brute parallel code divides the index i among a collection of threads, which may simultaneously try to modify the same body j. This display shows the two sides engaging in the race, which in this circumstance happens to be between two invocations (threads) of the same code.
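In outline, the racy kernel looks something like this (paraphrased, with addAcc standing in for the force-accumulation routine and the declarations reduced to sketches):

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

struct Body { double pos[3], vel[3], acc[3]; };
extern Body body[];          // simulation state (stand-in declaration)
extern int  n;               // number of bodies
void addAcc(int i, int j);   // adds the i-j gravitational pull into BOTH
                             // body[i].acc and body[j].acc

void accumulateForces() {
    // The i index is divided among worker threads...
    tbb::parallel_for(tbb::blocked_range<int>(0, n),
        [](const tbb::blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i)
                for (int j = i + 1; j < n; ++j)
                    addAcc(i, j);  // ...but each pair also writes body[j].acc,
                                   // and the same j can land on two threads at
                                   // once: the write-after-write race above
        });
}
```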

Moreover, when I drilled down into all the other problem sets in this collection, none of them contained the “my_body( my_range );” function call I landed on before. D’oh! Of course! Now I recognize that function call as the place where parallel_for actually executes the kernel containing the racy code. It looks like the aggressive inlining normally in play in the Release configuration left the relevant code stripped of symbols; there just wasn’t enough detail available for Parallel Inspector to navigate closer to the racy lines until I relaxed function inlining. Backing off inlining may also affect performance, but hopefully not so severely that we’re “Heisenberg-ed” into observations that are substantially wrong.

If you haven’t gotten enough of this, we’re taking the Parallelism road show out again in a few weeks (middle of March, snow permitting), heading up the East Coast with several stops from New Jersey to Boston. Find out more details at this Programmer’s Paradise link.

Next time: parallel code: finding a fix for the leaky adds
