n-bodies: a parallel TBB solution: parallel code, a first attempt

It’s been a busy month preparing for SuperComputing ‘09 and booth duty (I’ll be hanging out in the Intel booth on Tuesday and Thursday and giving a talk there on Wednesday), and refining materials for a Parallelism Road Show we’re planning for next February and March (more details later). (Not to mention chorus rehearsals for this year’s Christmas Revels. Oops, I did mention it. ;-) But finally, after all this serial optimization I’ve been working through on the n-bodies code, it’s time to go parallel. Or rather, first to take a short side-step and discover that parallelization of the code has already begun, through vectorization. I can pull up a compiler report, normally suppressed, by adding the switch /Qvec-report:1 to the compilation options:
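(For anyone building from a plain command line rather than from the IDE's project options, the same switch would simply be appended to the compile step; the source file name here is only a placeholder:)

    icl /c /O2 /Qvec-report:1 nbodies.cpp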



With this simple change, I notice something new in the compilation logs:



If I double-click on one of these lines in the Output panel, the system navigates to the corresponding source code lines:



Note the little marker in the left margin that indicates the lines referred to by that vectorization report line. This is the ballistic step in the serial code implementation, and the report suggests that the little loop setting the vector components of the bodies has been converted to a linear sequence of vector instructions. Some other time I’ll dig down into the assembly code to demonstrate that this code really has been vectorized (i.e., realized by emitting SIMD instructions), but for now let’s move forward and try to make this multi-thread parallel in addition to vector-parallel.

How do I “parallelize” that lower set of loops in the previous code sample? One simple way would be to add an OpenMP construct:

    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int axis = 0; axis < 3; ++axis) {
            body[i].vel[axis] += body[i].acc[axis] * TIMESTEP;
            body[i].pos[axis] += body[i].vel[axis] * TIMESTEP;
            body[i].acc[axis] = 0.;
        }
    }


OpenMP has been around for a number of years and operates as a language extension for C, C++ and Fortran. Compilers that recognize the constructs (such as the Intel® C++ and Fortran Compilers) use them as hints to direct the generation of parallel code. Non-complying compilers see them as an unrecognized pragma (or a funny comment in Fortran) and ignore them. In this case the OpenMP line applies to the for-loop that follows, directing the compiler to create code that divides the outer loop into some collection of chunks, each of which can be dispatched to a separate HW thread. Each thread processes the chunks assigned to it, and as each thread finishes its work, it waits for the others in its team to complete theirs. All these HW threads rendezvous at a join point because there’s an implied barrier at the end of the parallel for-loop: the code that follows will not be executed until the loop is entirely done, just to avoid any potential side effects. In this particular case we’re also at the end of the parallel section, so only one HW thread proceeds beyond the end of the for-loop; the rest return to a thread pool to await more work.
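To see that implied barrier in isolation, here’s a minimal standalone sketch, not part of the n-bodies code, with a made-up array just for illustration; the print statement after the loop can only run once every iteration, on every thread in the team, has completed:

    #include <cstdio>
    #include <omp.h>

    int main() {
        const int n = 16;
        double a[n] = {0.};

        #pragma omp parallel for   // outer loop divided into chunks across the thread team
        for (int i = 0; i < n; ++i)
            a[i] = i * 0.5;        // each HW thread processes the chunks assigned to it
        // implied barrier here: no thread proceeds until all chunks are done

        std::printf("last element: %f, threads available: %d\n",
                    a[n - 1], omp_get_max_threads());
        return 0;
    }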

With the advent of lambda constructs, described in the C++0x standard and implemented in the Intel C++ Compiler version 11, we can write nearly as compact a version of this parallel construct using Intel® Threading Building Blocks (before lambdas we’d need to use a full C++ function-object, which really breaks up the flow of the source code). As a lambda construct using TBB, the OpenMP code above would transform into something like this:

    parallel_for( blocked_range<int>(0, n),
        [](const blocked_range<int> &r) {
            for (int i = r.begin(); i != r.end(); ++i)
                for (int axis = 0; axis < 3; ++axis) {
                    body[i].vel[axis] += body[i].acc[axis] * TIMESTEP;
                    body[i].pos[axis] += body[i].vel[axis] * TIMESTEP;
                    body[i].acc[axis] = 0.;
                }
        });


Not quite as compact as the OpenMP version, so there must be some other reason to value this. Otherwise, why embrace the complexity? The key is flexibility. TBB offers a rich set of tools that can be used within the context of such a parallel function, and the C++0x lambda function adds a compactness of expression that lets me use TBB with almost the same convenience as OpenMP. For example, that pair of square brackets leading off the lambda provides flexible control over which variables defined in the scope of the call will be available inside the function, and in what form (more on this later). The TBB parallel_for divides the work of this inline body, using the TBB blocked_range as a helper class, making work for as many HW threads as are available.
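For comparison, here’s roughly what the pre-lambda, function-object form mentioned above might look like; this is just a sketch, assuming body, n and TIMESTEP are visible at file scope as in the earlier snippets, and the class name is mine:

    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"
    using namespace tbb;

    // The loop body has to be hoisted out into its own class with an operator()
    class AdvanceBodies {   // hypothetical name, for illustration only
    public:
        void operator()(const blocked_range<int> &r) const {
            for (int i = r.begin(); i != r.end(); ++i)
                for (int axis = 0; axis < 3; ++axis) {
                    body[i].vel[axis] += body[i].acc[axis] * TIMESTEP;
                    body[i].pos[axis] += body[i].vel[axis] * TIMESTEP;
                    body[i].acc[axis] = 0.;
                }
        }
    };

    // ...and the call site shows only the range and the functor, not the loop itself:
    parallel_for(blocked_range<int>(0, n), AdvanceBodies());

The loop ends up far away from the call that launches it, which is exactly the break in the flow of the source code that the lambda version avoids.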

Next time: parallel code: first runs

For more complete information about compiler optimizations, see our Optimization Notice.
