n-bodies: a parallel TBB solution: parallel code: spreading the “fix” around

Last time I was able to make the n-bodies acceleration code at least thread safe by employing a scoped lock, at a disastrous cost in performance.  If you think about it, a single global lock is a bad way to manage the eight HW threads my test machine has available.  The obvious alternative is to have a lock per body: any thread needing to adjust a pair of bodies would have to acquire each body's lock before proceeding.  That's more locking overhead than before (two lock acquisitions per interaction instead of one), but enough independent locks that my multiple threads won't all be stopped by one body interaction.  Making that change, the body structure now looks like this:

#ifdef ALIGN_BODIES
__declspec(align(128))
#endif
struct bodytype {
    double pos[3];   // body position in three axes
    double vel[3];   // body velocity
    double acc[3];   // body acceleration
    double mass;     // body mass
#ifdef BODY_LOCKS //{
    MyLockType lock; // local access lock
#ifdef ALIGN_BODIES //{
    unsigned char dummy[128-(10*sizeof(double)+sizeof(MyLockType))]; // 10 is the number of doubles as data
#endif //}
#else //}{
#ifdef ALIGN_BODIES //{
    unsigned char dummy[128-(10*sizeof(double))]; // 10 is the number of doubles as data
#endif //}
#endif //}
} body[BODYMAX];


I've added a new conditional, BODY_LOCKS, which when defined creates a lock object of MyLockType in each body object.  I also accounted for the size of that object in the dummy pad at the end, so that each body stays cache aligned.  But compiling it...



Oops. How many free bytes were there before adding the lock? Hmm: (128 - 10*sizeof(double)) = (128 - 10*8) = 48 bytes. So I guess the lock structure must take more than 48 bytes. If I double the size of the body structure (using 256 instead of 128), that does fix the errors. Does it have any other impact? And for that matter, what is the cost of not padding the bodies into separate cache lines at all? Taking a few runs to test, the differences in the serial runs turn out to have an insignificant impact on the overall graph.
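To make that size arithmetic concrete, here is a minimal sketch of the two-cache-line variant, separate from the project code, with compile-time checks on the layout. DemoLock is a hypothetical stand-in for MyLockType (whose real definition came in an earlier installment); the only thing assumed about it is that it is larger than the 48 spare bytes of a 128-byte body:

// Sketch only: the body structure padded to 256 bytes (two 128-byte cache lines),
// with compile-time checks that the padding arithmetic works out.
struct DemoLock { unsigned char state[64]; };   // hypothetical: a lock wider than 48 bytes

__declspec(align(256))                          // Microsoft/Intel compiler syntax, as above
struct bodytype256 {
    double pos[3], vel[3], acc[3];              // 9 doubles of state
    double mass;                                // plus mass: 10 doubles, 80 bytes
    DemoLock lock;                              // per-body lock
    unsigned char dummy[256 - (10*sizeof(double) + sizeof(DemoLock))];  // pad to 256
};

// With C++11 these checks report the problem directly; on an older compiler the
// negative array size in dummy produces an error instead, as it did for me at 128.
static_assert(10*sizeof(double) + sizeof(DemoLock) <= 256,
              "data plus lock no longer fit in two cache lines");
static_assert(sizeof(bodytype256) == 256,
              "each body should occupy exactly two 128-byte cache lines");

The second check is what the padding is really buying: if every body occupies, and is aligned to, its own cache line(s), two threads updating different bodies never contend for the same line.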





These effects may be different when I actually turn on multiple threads (and have them deal with bodies split across cache lines), so I'll take another look then; but for now I'll go with the 2-cache-line body to flesh out the lock-per-body approach to thread safety. The code change to use these locks:

    // F = G*mi*mj/distsq, but F = ma, so ai = G*mj/distsq
    double Gdivd = GFORCE * ivdist * ivdist;
    double ai = Gdivd*body[j].mass;
    double aj = Gdivd*body[i].mass;

    // apply acceleration components using unit vectors
    {
        MyLockType::scoped_lock locki(body[i].lock);
        MyLockType::scoped_lock lockj(body[j].lock);

        for (int axis = 0; axis < 3; ++axis) {
            body[j].acc[axis] += aj*ud[axis];
            body[i].acc[axis] -= ai*ud[axis];
        }
    }


Updating the accelerations now requires two locks, conveniently called locki and lockj.  Also conveniently, the order in which I visit i's and j's, using a half triangle of the interaction plane (touching each i-j pair once and updating both bodies), means that j is always larger than i.  That gives a hierarchical ordering of the locks: whichever thread is accessing a pair of bodies, it always acquires the lock for the smaller-numbered body before the larger-numbered one, so no two threads can each hold one of the pair's locks while waiting for the other, which is what prevents deadlock (it would work equally well the other way around, as long as all the threads do it the same way).
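Because the half-triangle traversal guarantees i < j, the two scoped locks above are already acquired in a consistent order. For the general case, where a caller could not guarantee that, a sketch like the following orders the acquisitions explicitly; it is separate from the project code, with hypothetical DemoBody/DemoLockType names and a TBB spin_mutex assumed purely for illustration (the series' MyLockType may differ):

// Sketch: acquire two per-body locks in a globally consistent (index) order,
// so two threads working on overlapping pairs can never deadlock.
#include <algorithm>
#include <tbb/spin_mutex.h>

typedef tbb::spin_mutex DemoLockType;            // assumption, not the series' MyLockType

struct DemoBody {
    double acc[3];
    DemoLockType lock;
};

void add_acceleration(DemoBody body[], int i, int j,
                      const double ud[3], double ai, double aj)
{
    // Order the indices so every thread locks the smaller-numbered body first.
    int lo = std::min(i, j);
    int hi = std::max(i, j);

    DemoLockType::scoped_lock lock_lo(body[lo].lock);
    DemoLockType::scoped_lock lock_hi(body[hi].lock);

    for (int axis = 0; axis < 3; ++axis) {
        body[j].acc[axis] += aj*ud[axis];        // accelerations along the unit vector,
        body[i].acc[axis] -= ai*ud[axis];        // applied in opposite directions to i and j
    }
}

With the triangular traversal used here the min/max step is unnecessary, but the rule is the same either way: a single global acquisition order is what keeps a lock-per-body scheme deadlock free.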

Nevertheless, while this version is faster than the single, global-lock version used before, it still isn't as fast as doing the work serially.



My use of locks doesn't seem to be getting me far enough.  Maybe next time I'll take a step back and see if there is a different approach.
