Under the hood: Building hooks to explore TBB task scheduler

Little did I suspect, as I was introducing the topic of blocking in parallel computation in my last post, that it would generate such interest, even though it seemed like a common problem I’d been working on privately with several Intel customers.  Charles Leiserson amplified the pitfalls of employing blocking in a multi-threaded architecture and offered an alternate algorithm using an edge coloring scheme on a graph representing the computation.  Mick Turner offered another example of the problem in Online Analytical Processing (OLAP) databases and speculated on a solution using task continuation and task stealing.  Dmitriy V’jukov, who has been quite vocal on memory fencing and related issues in TBB on the forum, suggested a method to postpone and reschedule the potentially blocking tasks at some later time.

As I stated up front in my previous post, there are lots of questions and I am far from able to claim I have any answers, but it appears we have a wealth of ideas, and as I have the time to explore them I’d like to share them with the community and garner more feedback.  And the problems are formidable.  Consider, for example, the situation where you have more top-level objects with secondary-object (potentially blocking) dependencies than you have processing elements (HW threads, or PEs): the usual Cilk/TBB approach of stealing from the top, even given a scheme where PEs are freed from waiting on blocked threads so they can schedule work elsewhere, would misdirect those PEs toward the big chunks of work at the top rather than sharing the load of the dependent objects; any scheme providing more threads than PEs would end up in the same situation, with everything blocked save the threads computing the dependent object(s).  Similar problems may exist if the object graph contains deep dependency chains, and either topology requires more memory to hold the extra contexts (stacks, dynamic memory allocations, OS resource tables, etc.), running the risk of running out.

Given all these concerns, I want to gain a more practical understanding of how the Cilk-style task scheduling approach implemented in Intel Threading Building Blocks works.  My last post showed my first attempt to use Intel® Thread Profiler to explore such behavior; it showed the parallelism but little of the detail.  Perhaps Intel® Threading Tools User Events with a tailored labeling scheme could provide more insightful detail, but for now I want to talk about a new TBB feature, the task scheduler observer.  I’ve used it to implement a thread ID that I can use within the parallel control structures to report how tasks are divided among the threads.  Doing so requires some Thread Local Storage, which I implement with caution; reread Arch’s blog on the subject to understand the issues.

This solution addresses the “problem” that, within a parallel control structure, there is no connection to the task structure or structures that ultimately compute it.
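For concreteness, the kind of functor in question looks something like the sketch below; the class name, the array it touches, and the per-element work are placeholders of my own, but the blocked_range shape is the usual one:

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

class ApplyFoo {                 // placeholder name for the parallel_for body
    float *const my_a;
public:
    ApplyFoo( float a[] ) : my_a(a) {}
    void operator()( const tbb::blocked_range<size_t>& r ) const {
        // A print statement here could report r.begin() and r.end(),
        // but nothing identifies which task or thread is running this chunk.
        for( size_t i = r.begin(); i != r.end(); ++i )
            my_a[i] += 1.0f;     // placeholder per-element work
    }
};

// invoked as, for example:
//   tbb::parallel_for( tbb::blocked_range<size_t>(0,n), ApplyFoo(a) );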

I can insert print statements in this functor that record the begin and end indices each time it is invoked, but there are no hooks back to the tasks that actually run it.  Such a hook would require more run-time overhead to maintain, so it’s not something we want to have all the time, but it would be useful to have available during development.

We start by subclassing task_scheduler_observer to create an observer object:
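A minimal sketch of such an observer (the class name is a placeholder; the essential pieces are the base class and the call to observe(true), which turns the observer on):

#include "tbb/task_scheduler_observer.h"

class ThreadIdObserver : public tbb::task_scheduler_observer {
public:
    ThreadIdObserver() { observe(true); }               // start observing right away
    virtual void on_scheduler_entry( bool is_worker );  // defined below
    virtual void on_scheduler_exit( bool is_worker ) {} // nothing to do on exit
};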


The parent class offers two virtual functions, on_scheduler_entry and on_scheduler_exit.  When the observer is enabled, each pool thread calls the entry function when it is created as a master or steals a task from another thread, and calls the exit function when it is shut down (as in a task_scheduler_init::terminate() call).

For a thread ID, I want my on_scheduler_entry call to record a unique number for each of the threads that call it, but the first question that comes up is: where to stick it?  I have no handle on any per-thread data structures within TBB, so the obvious approach is to use some sort of thread local storage.  As it happens, there’s a reasonable chunk of code already in the test routines, src/test/test_task_scheduler_observer.cpp.  So I’ll just steal the pertinent bits.
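The pertinent bits amount to something like this sketch (the names are placeholders, but the Windows/pthreads split follows the pattern in that test code):

struct ThreadState {
    int workerID;                            // slot for the thread ID
};

#if _WIN32 || _WIN64
// On Windows the compiler manages a copy of this pointer per thread.
__declspec(thread) static ThreadState *MyState = NULL;

static ThreadState *GetThreadState() {
    if( !MyState )
        MyState = new ThreadState;           // allocate on first use by this thread
    return MyState;
}
#else
#include <pthread.h>
static pthread_key_t TLS_Key;                // created once at startup (see below)

static ThreadState *GetThreadState() {
    ThreadState *s = (ThreadState*)pthread_getspecific( TLS_Key );
    if( !s ) {
        s = new ThreadState;                 // allocate on first use by this thread
        pthread_setspecific( TLS_Key, s );
    }
    return s;
}
#endif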


Here I have practically everything I need, for Windows or Linux.  I start with a structure that contains space for my thread ID, workerID, and code to get a pointer to it for the current thread, allocating it if necessary.  On Windows, the __declspec(thread) attribute handles practically everything.  For pthreads we need one additional bit of glue:
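That bit of glue is a one-time key creation, roughly as follows, done before any of the pool threads touch the storage (for example, early in main):

#if !(_WIN32 || _WIN64)
    pthread_key_create( &TLS_Key, NULL );    // NULL: no destructor for the per-thread data
#endif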

This creates the actual key used to access this thread local storage area for the current thread.  All that’s left is to implement the on_scheduler_entry function:
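A sketch of that function, using the placeholder names introduced above:

#include "tbb/atomic.h"

static tbb::atomic<int> MasterCount;         // zero-initialized at namespace scope

void ThreadIdObserver::on_scheduler_entry( bool /*is_worker*/ ) {
    // Each thread entering the scheduler takes the next number in sequence.
    GetThreadState()->workerID = MasterCount++;
}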


There’s a master counter, declared as an atomic to ensure each thread gets a clean increment, and I just assign its value to the field created earlier.  I don’t need to distinguish the master thread from the worker threads, so my function ignores the value of is_worker.  With that in place, I can complete my modified main function:
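A sketch of what that main function might look like, pulling the pieces above together (the problem size and the parallel_for call are placeholders):

#include "tbb/task_scheduler_init.h"

int main() {
#if !(_WIN32 || _WIN64)
    pthread_key_create( &TLS_Key, NULL );    // set up the pthread TLS key first
#endif
    ThreadIdObserver observer;               // constructor calls observe(true)
    tbb::task_scheduler_init init;           // create the worker thread pool

    const size_t n = 100000;                 // placeholder problem size
    float *a = new float[n]();               // zero-initialized placeholder data
    tbb::parallel_for( tbb::blocked_range<size_t>(0, n), ApplyFoo(a) );
    delete [] a;
    return 0;
}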

With a thread ID in place, next time I’ll start making use of it to explore how things get scheduled.
