English | 中文 | Русский | Français
2,598 Posts served
8,348 Conversations started
In order to simulate multiple bodies in motion as discussed last time, I have to represent them somehow. I can throw a lot of stuff away to start with, by considering them point masses that have some mass at some location but no size. So for each body I’ll need at least a mass, a position, a velocity, and someplace to accumulate the accelerations every other body is going to apply with each simulation step. Position, velocity and acceleration are all vectors so for 3D bodies there’s 10 values, x, y and z for the vectors plus scalar mass, to keep track of.
One more thing: I’m going to have multiple hardware threads working on these bodies and knowing a little about the architecture of modern processors, I want to avoid cache line interactions between adjacent bodies. By putting each body in its own cache-aligned region of address space, I can avoid thrashing of memory between the threads.
![]() |
So here I have a simple struct with 3-component vectors for position, velocity and acceleration and a mass. Various architectures have different cache line sizes (the number of bytes in each cache line), but 128 bytes is sufficient alignment to match any architecture out there so far. I’ve the alignment conditional on compilation so later I can experiment with aligned and unaligned bodies. I’ve done this by using the __declspec(align(128)) at the start of the structure to ensure the first body in the array is cache aligned, and using the element dummy to pad out the size of each body to take up just a full cache line.
Next time: basic simulation loop
| September 9, 2009 8:39 AM PDT
jimdempseyatthecove
|
Sorry about the formatting. Is there a way to post a comment in courier? |
| September 9, 2009 10:59 AM PDT
Robert Reed (Intel)
| Jim, you're a great shill but you're jumping ahead in my story ;-)! I mentioned my intent with this series to try to step into every development hole I could find along the way, just to demonstrate the methods for discovering your foot was in a hole and then extricating it. One of the lessons we teach is to optimize the serial code as a first step towards parallelizing it. You've found one of the inefficiencies I started with, intending to optimize it as part of the story line (my boss actually found another one and during our class prep modified the source to eliminate it). I do appreciate the attention, though, and the knowledge that at least someone is reading what I write. Thanks! |

jimdempseyatthecove
36,397
Status Points:
36,397
Since you have room in your bodytype add
double force[3];
compute the forces in place of acceleratons during body-to-body interactions. Then in your state advancement perform the acc = force / mass to produce the acceleration. This would replace N * (N-1) * 3 divisions with N*3 divisions.
acc[3] could be removed and you could use force instead.
Your n-body sample code is quite good in that it shows a slow way (excessive use of locks), a wrong way (ignoring requirement for partioning and sequencing of accesss to shared data), and a reasonably good way to partition the work amongst multiple cores.
If you will be at IDF I can demonstrate to you a more complex but faster technique.
Using fine grain partitioning (not optimal)
Starting n-body run with 4 threads. Using fastpar grainsize 2.
n TBB fast QT invoke QT dist
2 0.00248071 0.00550271 0.00010005
4 0.0063936 0.0128304 0.000489758
8 0.0144626 0.0228855 0.00228741
16 0.0307905 0.0456299 0.162664
32 0.0670427 0.085205 0.133868
64 0.154066 0.157047 0.118559
128 0.402379 0.36496 0.238831
256 1.28125 1.0509 0.718465
512 4.59968 3.52819 2.62361
1024 17.4712 13.0907 10.3105
2048 68.7657 50.2925 40.7112
Done.
The above done on Q6600. Both the TBB fast and QT invoke are using the parallel_invoke Lambda functions. And the QT dist is using parallel_distribute (available in QuickThread)
However, in picking a more optimal grain size (16) you can achieve a significant drop in scheduling calls (~ 1/(8*8) fewer parallel_invoke) and runtimes of all three techniques come within a margine of noise (+/- 1%).
Starting n-body run with 4 threads. Using fastpar grainsize 16.
n TBB fast QT invoke QT dist
2 0.00223766 0.00572098 9.957e-005
4 0.00589529 0.0118227 0.000503197
8 0.0117899 0.016476 0.00227505
16 0.0208457 0.0277316 0.159016
32 0.0451672 0.0470834 0.129005
64 0.0987306 0.101734 0.114944
128 0.232991 0.300763 0.249709
256 0.78541 0.840992 0.718607
512 2.90074 2.87626 2.66301
1024 10.5774 10.5974 10.3084
2048 41.9569 41.5398 40.8268
Done.
Jim Dempsey