Under the hood: Employing hooks to explore the TBB task scheduler

On a quest to understand the TBB scheduler and how it might be used to schedule tasks with order dependencies (i.e., a place where you’d like to block access to an object until it can get built), I’ve been building up tools to take a peek. Last time I showed a technique that uses thread local storage to create a Thread ID, initialized (and the TLS slot allocated) via a TBB task_scheduler_observer. This time I’ll insert some hooks and take a look at that blocked double-nested parallel_for example.
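As a refresher, the mechanism from that previous post can be approximated without TBB at all: a thread-local slot plus an atomic counter. In the original the assignment happens in tbb::task_scheduler_observer::on_scheduler_entry; the sketch below uses plain threads instead, so the names are illustrative, not the post’s actual code.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>
#include <algorithm>

// Sketch of the thread-ID-in-TLS idea from the previous post, minus TBB:
// each thread gets a small integer ID the first time it asks for one.
// In the TBB version this first touch would happen in
// task_scheduler_observer::on_scheduler_entry for each worker thread.
static std::atomic<int> next_id{0};
static thread_local int my_id = -1;

int thread_id() {
    if (my_id < 0)
        my_id = next_id.fetch_add(1);   // first touch: allocate an ID
    return my_id;                       // later calls: stable per-thread value
}
```

The ID is cheap to read after the first call, which is what makes it usable inside a tight inner-loop print.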


Here’s the outer loop of the test program I started with, enhanced by the addition of a second scoped lock, P, which I’ll use to ensure that each line that I print from each thread gets its own line in the output stream.  I’m also passing in the outer loop index so I can use it for identification in the inner loop print.

The inner loop is enhanced by addition of a parameter to receive the outer loop index and thread local storage access code to get the thread ID of whichever thread gets the work.  All those details are streamed away in the scoped_lock print section in the middle.
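The locked print section described above might look something like the following. This is a sketch, not the post’s actual functor: the mutex name and line format are mine, and std::mutex with a lock_guard stands in for the TBB scoped lock P.

```cpp
#include <cassert>
#include <iostream>
#include <mutex>
#include <sstream>
#include <string>

// Hypothetical version of the locked print in the inner-loop functor.
// PrintMutex stands in for whatever mutex backs the scoped lock "P";
// holding it for the whole statement keeps each thread's line intact.
std::mutex PrintMutex;

void print_inner(int tid, int outer, int lo, int hi) {
    std::lock_guard<std::mutex> P(PrintMutex);
    std::cerr << "thread " << tid << " outer " << outer
              << " inner range (" << lo << "," << hi << ")\n";
}
```

The design point is that the entire line is streamed under one lock acquisition, so output from different threads can interleave only at line boundaries, never mid-line.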

On my eight-core machine I gave this debug code a whirl:

And on it went for pages.  The first functor call in the outer loop splits all the way down to the range (0,3) and thread 1 takes on that chunk, starting with loop 0.  Thread 2 starts on range (25,28), and so on.
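Those chunk boundaries are consistent with blocked_range halving the range until pieces drop to the grainsize. The simulation below reproduces both (0,3) and (25,28) if we assume an outer range of (0,50) and a grainsize of 5; both numbers are my guesses to match the observed chunks, not values from the post.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Simulates how tbb::blocked_range splits under a simple partitioner:
// halve [lo,hi) at its midpoint while its size exceeds the grainsize.
// The outer range (0,50) and grainsize 5 used in the test are guesses
// chosen to reproduce the chunks (0,3) and (25,28) from the output.
void split(int lo, int hi, int grain,
           std::vector<std::pair<int,int>>& leaves) {
    if (hi - lo <= grain) {
        leaves.push_back({lo, hi});    // small enough: the body runs here
        return;
    }
    int mid = lo + (hi - lo) / 2;      // blocked_range splits at the midpoint
    split(lo, mid, grain, leaves);
    split(mid, hi, grain, leaves);
}
```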


It’s easy and relatively fast to dump out the stats this way, but it produces lots of redundant information. What I’d like to do is reduce that redundancy and generate a rough concurrency map. At best the relationship between these multiple threads of print calls is approximate. Although I’m writing to “unbuffered” cerr, I can easily demonstrate buffering variability on my supposedly unbuffered stream. In the example above, the second thread appears after the first thread finishes one outer loop’s worth of work. Repeated running with the output directed to whatever stream brings it to my display (pretty serpentine, I must say) usually delivers output from the second thread within 1 to 3 outer loops completed by the first thread. It does vary. If I try the same thing but capture the output to a file:
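The capture command isn’t preserved above; it was presumably something like the following, with nested_for standing in for the test binary’s real name, which the post doesn’t give:

```shell
# "nested_for" is a hypothetical name for the test program.
# The instrumented loops print to stderr, so redirect stream 2 to a file:
./nested_for 2> capture.txt
```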

I counted 25 outer loops completed by the first thread before a print line from a second thread appeared.  Repetitions of the test on this Fedora Core 6 machine using bash produced a print statement from a second thread after around 17 or 18 outer loop executions by the first thread.  Just switching to a pipeline as the output receiver:
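Again the exact command isn’t preserved; with the same hypothetical binary name, a pipeline receiver might look like:

```shell
# Same hypothetical binary, but with a pipe as the receiver; even a
# do-nothing consumer like cat changes the observed buffering behavior.
./nested_for 2>&1 | cat
```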

Caused a substantial “improvement” in the results. In several trials, the second thread produced a print statement after as few as one, but usually three, outer loop executions by the first thread. This pipeline arrangement is probably close to what happens when I don’t specify any destination at all.

Anyway, given this variability in the output, the best we can hope for is a rough diagram of approximate concurrency. If a thread completes (prints) a couple of outer loops before a second thread prints anything, we may be able to conclude that the first outer loop execution on the first thread preceded, and did not overlap, the first outer loop execution of the second thread.

We can do better than that.  The output provides sufficient information to mark lifetimes on the outer loop executions.  We can keep a running summary of the active threads and the loops they’re working on and delay output until making some change would cause the loss of history of one of the threads.

I’ve been meaning to learn Python, but since that hasn’t happened yet, let me fall back on my Perl skills to whip out a filter to do this summarization (interpretation/proof of correctness of the Perl script is left to the reader).

Looking at a summarization of that first file-captured output leaves a very unfavorable impression of double parallel_for concurrency:

We hope that the pipeline example gives us a more favorable result:

The Perl script records the outer loops each worker helps compute.  If it gets a new loop for a worker and hasn’t printed out the last loop assignment, it will print the current assignments of all threads before assigning the new loop.  When the completion of a loop is detected, each of the threads processing that loop is marked for reset, which happens after the next print.  Output variances can be quite dramatic and are easily captured in this form:
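The Perl script itself isn’t reproduced here, but the description above is enough to sketch the core bookkeeping. This is my reconstruction in C++ rather than Perl, it follows only the prose, and it omits the completion-detection/reset step; the event and output formats are invented for illustration.

```cpp
#include <cassert>
#include <map>
#include <ostream>
#include <sstream>

// Stand-in for the summarizer described in the post. Each event says
// "thread T printed a line for outer loop L". If a thread moves to a
// new loop before the current assignments have been printed, we flush
// a snapshot first so that thread's old assignment isn't lost.
class Summarizer {
    std::map<int,int> assign;   // thread id -> outer loop it is working on
    bool dirty = false;         // unprinted changes since the last snapshot?
    std::ostream& out;
public:
    explicit Summarizer(std::ostream& o) : out(o) {}
    void event(int thread, int loop) {
        auto it = assign.find(thread);
        if (it != assign.end() && it->second != loop) {
            if (dirty) flush();          // preserve the old assignment first
            it->second = loop;
        } else if (it == assign.end()) {
            assign[thread] = loop;       // first sighting of this thread
        }
        dirty = true;
    }
    void flush() {                       // one snapshot line of all threads
        for (auto& a : assign)
            out << "t" << a.first << ":loop" << a.second << " ";
        out << "\n";
        dirty = false;
    }
};
```

A run of events then collapses pages of raw prints into one snapshot line per change of assignment.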

There are a few things these samples share: a tendency to take on whole outer loop iterations at the start of the run, but to end the run with the work of a particular outer loop divided across the threads, as can be seen in this raw output of the sample called zpipe:

Oh, and there is the behavior I haven’t talked about heretofore: sometimes, with that lock in the outer loop, the program locks up (it does not terminate). But pull the lock and my little summarizer tool shows a very different behavior:
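One plausible mechanism for that lockup (my reading; the post doesn’t say): while a thread is blocked waiting inside the inner parallel_for, the TBB scheduler can hand it another outer iteration to keep it busy, and that stolen iteration then tries to take the very lock the thread already holds. With a non-recursive lock that is self-deadlock. The sketch below shows the ingredient that makes it fatal, using two std threads so the test can terminate; in the deadlock scenario the second acquire happens on the same thread, so the wait never ends.

```cpp
#include <chrono>
#include <mutex>
#include <thread>

// While one "outer iteration" holds the lock, a second attempt to
// acquire it can only wait. Here the second attempt is on another
// thread and uses a timeout so we can observe the failure; when task
// stealing puts the second attempt on the lock-holding thread itself,
// there is no one left to ever release the lock.
bool lock_is_unavailable_while_held() {
    std::timed_mutex m;
    m.lock();                                    // first outer iteration holds it
    bool second_got_it = true;
    std::thread t([&] {                          // second outer iteration tries
        second_got_it = m.try_lock_for(std::chrono::milliseconds(50));
    });
    t.join();
    m.unlock();
    return !second_got_it;                       // acquisition had to wait/fail
}
```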

You can see more distribution of the outer loops across the threads and far fewer gaps in the concurrency list. But that’s enough for this post. Next time we’ll take up where we left off, and maybe add some task-specific data.
