Under the hood: Learning more about task scheduling

I’m back with another challenge, encountered during my support work for Intel® Threading Building Blocks.  I’ve been working with several TBB users who appreciate the general philosophy of Cilk task scheduling embodied in TBB but have run into some practical challenges applying it to their applications.  Often the issue revolves around the need to block some computations until other computations complete.  It may be that they need to handle either inter-object or intra-object threading: at different times their application may encounter a bunch of objects to run in parallel, or just one big object that could use several threads chewing on it simultaneously.  Or they may have objects to compute on which other objects depend; here it would be great to suspend processing of the dependent objects while the weight of the associated threads is thrown at computing the shared object.

These are tough problems and I don’t have answers for them yet.  But I have a few ideas and I hope in the next few posts to share them and get community feedback that might inspire some solutions.  So bear with me as I stumble about and maybe we can all learn something.

Anyone who has had much exposure to presentations about Intel TBB has probably heard, in one form or another, the motto, “process locally, steal globally.”  The Cilk philosophy embodied in the TBB task scheduler is to keep each active thread working in its own local region of memory, to exploit any data that may already reside in the cache of the processing element running that thread; meanwhile, idle threads should steal work from memory regions as yet untouched by the active thread(s), to avoid interrupting those running threads.
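To make that motto a little more concrete, here’s a toy sketch of the idea.  This is emphatically not TBB’s scheduler, and every name in it is my own invention: each worker thread keeps its own deque of tasks, pops the newest task off its own end, and when it runs dry it steals the oldest task from somebody else’s deque.

// Toy illustration of "process locally, steal globally" -- NOT TBB's scheduler.
// Each worker owns a deque; it pops the newest task from the back of its own
// deque, and an idle worker steals the oldest task from the front of a victim's.
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

struct WorkPool {
    std::deque<std::function<void()>> tasks;
    std::mutex lock;
};

static void worker(int id, WorkPool* mine, WorkPool* victim) {
    for (;;) {
        std::function<void()> task;
        {   // Prefer local work: LIFO pop from our own deque (cache-warm).
            std::lock_guard<std::mutex> g(mine->lock);
            if (!mine->tasks.empty()) {
                task = std::move(mine->tasks.back());
                mine->tasks.pop_back();
            }
        }
        if (!task) {   // Local deque empty: steal the oldest task from the victim.
            std::lock_guard<std::mutex> g(victim->lock);
            if (!victim->tasks.empty()) {
                task = std::move(victim->tasks.front());
                victim->tasks.pop_front();
                std::printf("worker %d stole a task\n", id);
            }
        }
        if (!task) return;   // Nothing local, nothing to steal: done.
        task();
    }
}

int main() {
    WorkPool pools[2];
    // Seed all the work into pool 0 so worker 1 is forced to steal.
    for (int i = 0; i < 8; ++i)
        pools[0].tasks.push_back([i] { std::printf("task %d\n", i); });

    std::thread t0(worker, 0, &pools[0], &pools[1]);
    std::thread t1(worker, 1, &pools[1], &pools[0]);
    t0.join();
    t1.join();
    return 0;
}

Popping newest-first keeps a thread working on data its cache just touched; stealing oldest-first hands the thief the biggest, coldest chunk of work, far from whatever the victim is currently chewing on.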

When one of my customers presented an example using a nested pair of parallel_for statements, I realized that all my knowledge on the subject of scheduling was theoretical: I hadn’t dug down into the guts of this code.  Now might be the time.

First, here’s a sketch of the test code.  You’ll note an outer and an inner loop, with a lock on the outer and some spinning work in the inner.  This is intended to represent the structure of a program operating on a set of objects that block because of some postulated resource contention, with the inner loop representing the work done to compute that shared resource.  You’ll see that the lock is currently commented out.  When running this code on an 8-core machine, it will sometimes lock up.  Some ideas have been bounced around about why, but I’m curious whether I can demonstrate the workings of the problem rather than just speculating about it.  There are also explicit references to the auto_partitioner, but with the advent of the affinity_partitioner, use of the auto_partitioner has been deprecated.  Still, we’ll start with this one and then see if we can tell the difference when looking under the hood.
[Figure: b08050501.JPG]
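Since the screenshot doesn’t reproduce here, the following is only my reconstruction of the shape of that test.  The array sizes, the inner_loop name, and the spin work are placeholders rather than the customer’s actual code; the commented-out lock sits in the outer body, and both loops hand the auto_partitioner to parallel_for.

// Reconstruction of the test's shape, not the customer's actual code.
#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"
#include "tbb/spin_mutex.h"

tbb::spin_mutex outer_mutex;

struct inner_loop {
    void operator()(const tbb::blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); ++i) {
            // Stand-in for the "spinning work" that computes the shared resource.
            volatile double x = 0;
            for (int j = 0; j < 10000; ++j) x += j * 0.5;
        }
    }
};

struct outer_loop {
    void operator()(const tbb::blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); ++i) {
            // tbb::spin_mutex::scoped_lock lock(outer_mutex);  // the commented-out lock
            tbb::parallel_for(tbb::blocked_range<int>(0, 1000),
                              inner_loop(), tbb::auto_partitioner());
        }
    }
};

int main() {
    tbb::task_scheduler_init init;   // default: one thread per core
    tbb::parallel_for(tbb::blocked_range<int>(0, 64),
                      outer_loop(), tbb::auto_partitioner());
    return 0;
}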

So what’s going on here?  As is my wont, I turned to Intel® Thread Profiler for a first look at the processing:

[Figure: b08050502.JPG]

Hrumph!  I can see eight threads cranking at the work, which is good because I’m running on an 8-core processor.  32% of the run time is spent at concurrency level 8.  Most of the time is spent in that serial tail that must represent startup processing, but the whole run is under 0.2 seconds so I suspect that’s a fixed overhead that will be negligible compared to the overall time of real work.  But that’s about all.  Zooming in on the high concurrency zone:
[Figure: b08050503.JPG]

Where the work is done, the test keeps 8 threads busy 94% of the time.  But there’s still no clue about how the task scheduler is dividing the work, though it looks like it gets interesting towards the end of the run.  Guess I’ll have to fall back on an old standby, inserting print statements to expose what is happening in the code:
[Figure: b08050504.JPG]
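The real prints are in the screenshot; against the reconstruction above they would look roughly like this, with each invocation of a body reporting the sub-range it was handed:

// (plus #include <cstdio> at the top)
struct outer_loop {
    void operator()(const tbb::blocked_range<int>& r) const {
        // Report the sub-range this invocation of the body was handed.
        std::printf("outer [%d, %d)\n", r.begin(), r.end());
        for (int i = r.begin(); i != r.end(); ++i) {
            // ... inner parallel_for as before, with a similar print in inner_loop ...
        }
    }
};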

Unfortunately, while I can print out the loop bounds that tell me where I am in the nested executions of the array, I can’t print which thread is handling which range.  The TBB task object that might contain that information lies outside the scope of the outer_loop class.  Stuck.

But not for long.  In the evolving library that is Intel Threading Building Blocks, there is a new feature, available in at least the latest open source release (upgraded to Stable as tbb20_20080408oss), called task_scheduler_observer.  I’ve implemented an observer that lets me identify which thread executes which range.  In my next post I’ll describe it and start exploring the behavior of the scheduler.
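I’ll save the real code for that post, but to give a flavor of it, here is a minimal sketch of the sort of observer I mean.  The class name and the thread-id bookkeeping are my own inventions (and the thread_local shorthand is a modern convenience); the on_scheduler_entry/on_scheduler_exit callbacks and observe() are the documented task_scheduler_observer interface.

// Sketch only: label each thread as it enters the scheduler, so loop-body
// prints can report which thread executed which range.
#include <atomic>
#include <cstdio>
#include "tbb/task_scheduler_init.h"
#include "tbb/task_scheduler_observer.h"

static std::atomic<int> next_id(0);
static thread_local int my_thread_id = -1;   // illustrative bookkeeping, not TBB's

class thread_id_observer : public tbb::task_scheduler_observer {
public:
    // Called in each thread as it enters the task scheduler: hand it a small id.
    void on_scheduler_entry(bool /*is_worker*/) { my_thread_id = next_id++; }
    void on_scheduler_exit(bool /*is_worker*/)  {}
};

int main() {
    tbb::task_scheduler_init init;
    thread_id_observer observer;
    observer.observe(true);   // start receiving entry/exit callbacks
    // ... run the nested parallel_for test from the earlier sketch here ...
    // A loop-body print can then report who ran what, e.g.:
    //   std::printf("thread %d: outer [%d, %d)\n", my_thread_id, r.begin(), r.end());
    return 0;
}

Because on_scheduler_entry fires in every thread as it joins the scheduler, the print statements in the loop bodies can include my_thread_id and tie each range to the thread that executed it.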
