TBB 3.0 task scheduler improves composability of TBB-based solutions. Part 2.

Master thread isolation, described in the first part of this blog, was not the only change in the TBB 3.0 scheduler that improves the composability of code parallelized with TBB. Another tightening of the scheduler's guarantees improves a popular usage model described in the TBB Reference Manual as “Letting main thread work while child tasks run”. Here is a short example of what it looks like:



task& r = *new( task::allocate_root() ) empty_task;
r.set_ref_count(2);
task& t = *new( r.allocate_child() ) BackgroundTask;
task::spawn(t); // task t normally starts a task tree to be executed in parallel
ForegroundActivity();
r.wait_for_all(); // wait until BackgroundTask's work completes
task::destroy(r);


Similarly to one of the scenarios that required master thread isolation, ForegroundActivity sometimes has to complete quickly while the amount of work offloaded to BackgroundTask is very large. Even with these requirements the pattern worked perfectly well as long as ForegroundActivity did not execute any parallel algorithms. But if it did, for example when ForegroundActivity was defined as:


void ForegroundActivity() {
    parallel_for( MyRange(), MyBody() );
}


the effect could be utterly unexpected for the programmer. Instead of doing only its own parallel work (represented by the MyRange object), the intervening parallel algorithm would block until all the locally available work belonging to BackgroundTask was finished. And since the amount of work in BackgroundTask is large by definition, this could take an unacceptably long time.

To resolve this issue, the task scheduler in TBB 3.0 now guarantees that a call to any parallel algorithm, or to the task::wait_for_all() and task::spawn_and_wait_for_all() methods, returns as soon as it discovers that the workload associated with its task tree has been processed. Thus in the following slightly modified example


task& r1 = *new( task::allocate_root() ) empty_task;
r1.set_ref_count(2);
task& t1 = *new( r1.allocate_child() ) BackgroundTask;
task::spawn(t1); // start the large background work

task& r2 = *new( task::allocate_root() ) empty_task;
r2.set_ref_count(2);
task& t2 = *new( r2.allocate_child() ) ForegroundTask;
task::spawn(t2);
r2.wait_for_all(); // returns as soon as ForegroundTask's work is done
task::destroy(r2);

r1.wait_for_all();
task::destroy(r1);


the call “r2.wait_for_all()” will return as soon as the parallel work created by t2 has been executed, without attempting to pick up t1's work from the local task pool.
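
For concreteness, ForegroundTask in the example above could be a task whose execute() method runs the foreground parallel algorithm. Here is one possible sketch; the body class, the loop range, and the work done per element are assumptions made for illustration, not part of the original example:


#include "tbb/task.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// Placeholder body for the foreground parallel loop (illustrative only).
struct ForegroundBody {
    void operator()( const tbb::blocked_range<int>& range ) const {
        for ( int i = range.begin(); i != range.end(); ++i ) {
            // ... process element i of the foreground data set ...
        }
    }
};

// One possible definition of ForegroundTask: its execute() runs the
// foreground parallel algorithm, which builds a task tree of its own,
// so r2.wait_for_all() is not tied to BackgroundTask's work.
class ForegroundTask : public tbb::task {
public:
    tbb::task* execute() {
        tbb::parallel_for( tbb::blocked_range<int>(0, 1000), ForegroundBody() );
        return NULL;
    }
};
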

OK, all this seems to be pretty straightforward, so I hope you had no problems grasping the gist of it. The next (and the last) change in the task scheduler semantics is even simpler to explain, though it was trickier to implement.

TBB 2.2 introduced support for cancellation and exception handling, which has become an integral part of the TBB tasking model. The key component of its API and implementation was the new task_group_context class. As the usage models of cancellation and task group contexts were described in great detail in a series of blogs, I will not repeat all that here. What is of interest at the moment is a limitation that was imposed on task group context usage in order to avoid impact on the scalability of TBB parallel algorithms: a context object created by one thread could not be destroyed by another.

This restriction was not burdensome for mainstream TBB use cases, like the following one:


void f () {
    tbb::task_group_context ctx;
    tbb::task& r = *new( tbb::task::allocate_root(ctx) ) MyAlgorithmRoot;
    tbb::task::spawn_root_and_wait(r);
}


as the context object was normally allocated on the call stack, so its life cycle was naturally tied to a single thread. But with time new usage models appeared in which context objects were implicitly trafficked between threads. One example is task_group, a new higher-level tasking API introduced in TBB 2.2 that, in particular, contains a task group context object. Since a task_group can be dynamically allocated or be a member of a dynamically allocated class, nothing prevents it from being destroyed by a thread different from the one that created it. The same risk exists when a programmer bundles a context object into a class instead of using the context locally in a function call.
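
As an illustration, here is a minimal sketch (the class name and the threading code are assumptions made for this example, not part of the original text) of how a task_group can end up being destroyed by a thread other than the one that created it:


#include <thread>
#include "tbb/task_group.h"

// A hypothetical class that bundles a task_group with application data.
class Processor {
    tbb::task_group m_group;   // internally holds a task group context
public:
    void start() {
        m_group.run( []{ /* long-running parallel work */ } );
    }
    ~Processor() {
        m_group.wait();        // whichever thread runs the destructor finishes the group
    }
};

int main() {
    Processor* p = new Processor;          // context created by the main thread
    p->start();
    std::thread t( [p]{ delete p; } );     // context destroyed by a different thread
    t.join();
    return 0;
}


Exactly this kind of implicit ownership transfer is what the relaxed rule in TBB 3.0 makes safe.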

The problem with relaxing this requirement was that all the obvious solutions required either locking on relatively hot parts of the TBB execution flow or some kind of garbage collection. The former has a pernicious effect on scalability, while the performance and scalability impact of the latter was difficult to estimate because it could vary significantly depending on the use case.

Eventually we chose an approach that had already served us well on several occasions: offloading all the costly operations to the cold paths of execution. The hot path here is context destruction by its owner thread when no stray contexts are being destroyed concurrently, and the extra cost on this path is one conditional branch, two stores, and one release fence (a no-op on x86 architectures). Concurrent destruction, and destruction in the presence of contention, uses locks. Fortunately, as I noted above, this is a relatively rare event and thus does not affect the performance and scalability of TBB's mainstream usage models.

Those of you interested in the details of the synchronization can always have a look at the code available from the TBB official site. But whether you are interested or not, the bottom line is that starting from TBB 3.0 you can be sure that however the components you create are used, by yourself or by your customers, task group contexts will not get you into trouble.

That's it for the major changes in the task scheduler behavior in the TBB 3.0 release. Of course there were other fixes and small improvements in its implementation, but the points we touched upon in these blogs are the ones that have a visible effect on the composability of your parallel solutions built with the help of TBB.
