Under the hood: watching TBB task scheduler do recursive range splitting

Wow, time flies around here.  I was thinking it’s been a while since I last looked at my task scheduler experiments: a little vacation here, teaching at the O’Reilly Open Source convention there, and a little customer work stuffed around the edges can put a project completely out of mind. So let me recall and recap:  last time I’d added a thread ID to the threads in the TBB worker pool using the 2.1 feature called task scheduler observer and used it to track which threads handled which sub-ranges of a nested pair of parallel_for calls, both in the face of an outer loop lock and no locks.  The data generated to track progress was substantial, so I wrote a Perl script that could process those log data and generate a rough concurrency map showing which indices of the outer loop are handled by which thread and in what order.

Here’s a sample of the post-processed output of the parallel nested loops without any locks.  We can see greedy behavior of individual threads that take all the work for particular outer loop indices, the most greedy of these here being thread 5, which spends the whole run working on index 34.  Other threads complete their work a little more quickly than that but we can see evidence of their greediness as well: threads 1, 4, 6, 7, and 8 all take a couple of intervals to complete their work. It’s not until around the 10th interval that we see the first example of stealing within the inner loop: thread 4 grabs the rest of the inner indices for outer index 42 and eventually completes the work started by thread 3.  Thread 3 then steals from thread 2.

Also, recognizing that the splitting strategy divides the ranges in half, we can even follow how the threads subdivided the outer loop on this run, at least to start:  half of 50 is 25, and we can see threads 1 and 2 starting on indices 0 and 25, our first split.  Half of 25 is 12 and 25 plus 12 is 37, our thread 3 initial split point.  We can continue deducing such divisions and derive a diagram such as the one shown above.  We also observe that typical granularity with this run is about 3 outer loop indices.  There’s some unexplained bits like why thread 5 lands on loop 34 while thread 6 splits 5’s range rather than going to 40 or 46 or even 12.  But we’re getting a picture, albeit fuzzy in places.

It gets even fuzzier when we add the lock to the outer loop:

We see a lot of the same split points appearing initially, but a lot less overlap between the threads, a lot of gaps where only a couple threads appear to be active.  And it’s a lot harder to explain what’s going on.  Why does thread 5 start on index 31, for example, instead of taking the natural split at 37?  Perhaps thread 3 got to the split first (it got index 37) but was blocked from reporting until after thread 5 got its slot?  We’re getting further a field from the hard data here, so maybe next time we’ll dig deeper.

For more complete information about compiler optimizations, see our Optimization Notice.