The N-Body problem is a classic example used frequently to demonstrate parallelization and how it improves performance. Originally my mentor at Intel told me that he wanted to create an n-body demo because he didn't have one yet despite it being so common. That's when I told him, "I've already got one."
This project started out as a simulation created for my first grad school class at BYU. My professor asked me to simulate collisions and near-collisions of celestial objects using an accurate physics model, including gravity and correct 3D collisions for rigid (inelastic) objects. As a proof-of-correctness test once I was done with the project, I tested the project by creating a mockup of the solar system, with the sun and all nine (yes, nine) planets at their correct distances from the sun and correct velocities. I represented the solar system members as GLUT spheres with texture maps applied to the surfaces (for fun). My mentor enjoyed the project and decided it was a good place to start for our parallel n-body demo.
When I started parallelizing the project, my first step was to ensure the project ran both in Windows and in Linux. I had to modify the texture mapping code because the functions used to read in the image file were Windows specific. After fixing the texture mapping, I first tried simply to parallelize the calculation loop that computed the forces acting between each pair of planets (a nested for-loop in which the outer loop iterated from i = 0 to N – 2, and the inner loop iterated from j = i + 1 to N – 1). However, I soon discovered that the overhead involved in creating and destroying the threads for every single frame far outweighed any benefit from parallelizing the loop--especially in a system with only ten objects like the solar system. I also discovered that applying texture maps for the planets quickly consumed available memory and took far too much time to process.
My next iteration of the project involved removing the texture maps from the planets and using randomly generated colors instead. I then added some lighting as a visual cue for the user. I also tried to devise a way to create a thread pool or some other means of keeping the threads around between frames instead of creating and destroying them every time (I tried both OpenMP and PThreads for a thread pool, but OpenMP wasn't designed that way, and PThreads is platform-specific). I also remembered the issue I had with Mandelbrot in that only the master thread can draw in OpenGL, so I couldn’t simply parallelize the call to glutMainLoop().
After some thought, I decided first to remove a few unnecessary aspects of the project to help speed things up. I removed the collision detection and demoted the planets from GLUT spheres to individual points. However, I kept the points at 2x2 pixels so they’d still be visible. I also kept the randomly generated colors and the lighting. These changes enabled me to use thousands of objects instead of only dozens, but getting speedup out of the parallelization was still causing problems. The breakthrough came when I figured out that I could run the computations and the rendering simultaneously (data parallelization is more intuitive to me that task parallelization for some reason). I achieved this task parallelization by having one thread (the master thread) run off into the glutMainLoop() function (with the call to the update() function removed from the display() callback), and having another thread call the update() function, which calculated the forces between objects and updated their positions and velocities accordingly.
My final iteration of the project involved having exactly two threads created in main(): the master thread to call glutMainLoop() and one other to call update(). Inside update(), I added an OpenMP parallel block that used N-1 threads (because one thread, the master thread, was already being used to draw the scene), and inside the parallel section I added an infinite while-loop that continuously computed new positions and velocities for the planets. To avoid using critical sections as much because of shared variables, I ended up changing the nested for loops as well. I changed both loops to iterate from 0 to N-1 (skipping iterations where i == j). The original computation used to add the computed force to both planets involved in the computation (to avoid redundant calculations), but it proved faster in the parallelized version to allow the redundant calculations (so each iteration only added the forces to one of the planets in the calculation).
After completing the project, as a final design issue, I again separated out the pieces of code that the three versions of the project shared in common, as I did for Mandelbrot. However, instead of leaving only the main method in each version’s *.cpp file, I also had to include the display() callback and the update() function because they behave slightly differently between versions.
I initially tested the final project on a twelve-core Windows machine, and after the modifications the speed-up was quite apparent. The single-thread version was able to manage between three and four calculation loops per second, and the twelve-thread version pushed it up to over thirty calculation loops per second--a near-linear speedup in the number of threads! Running the demo on the Manycore Testing Lab proved difficult, however, since the demo is graphically oriented. However, using a VNC client instead of Putty + X-Windows I was able to run the project and observe similar results using 32 threads. Again, I can't stress enough how much easier multi-thread programming is on the MTL machine than it is on our super-computer at school with its wait queues, batch jobs, etc.
My final take-away for this project was how to mix task parallelism and data parallelism, which I did both with OpenMP as well as with CILK Plus and TBB. Both parallel versions were quite simple to parallelize (once I got all the afore-mentioned optimizations worked out), and both parallel versions of the project scaled nearly linearly, so I'm quite pleased with the results.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804