TBB scheduler clandestine evolution

Though Intel® Threading Building Blocks 3.0 Update 4 that introduces a concept of Community Preview feature has just been released, my today's blog will be about something that happened quite long time ago. One of the recent posts on the TBB forum attracted my attention to the issue of information rapidly becoming obsolete. In this case that was the information in the TBB nutshell book.

As any well written programming book that teaches not only specifics of a particular API but also general concepts and approaches, the TBB one remains a popular source of information for the library adopters. But what with the fast TBB evolution, even being just 3 years old has already made some details of the library implementation referenced in the book outdated. Which expectedly confuses its attentive readers from time to time.

Earlier this year I've posted a two part blog describing recent changes in the TBB scheduler behavior that mostly targeted improving library composability. However another significant change in the task scheduler that happened about two years ago (in TBB 2.2) has never been publicized before, probably because it was considered primarily performance and scalability optimization with little impact on observable TBB semantics. This change was elimination of the depth-based stealing limiting heuristic that was employed in the original TBB implementation accompanied with across-the-board stealing kernel replacement. So in this blog I'm going to fill in this information gap by providing basic details about how it worked in the beginning and why it was changed later.

Before delving into the details let’s first refresh the work stealing basics in our memories. Whenever a TBB thread gets a chunk of work, it proactively starts its further division, putting sub-chunks into its local task pool. Let's call this thread the owner. In case of parallel algorithms (like tbb::parallel_for) the depth of this division is controlled by the range argument settings and partitioning policy specified by means of partitioner argument. But whatever the actual depth of this subdivision is, it corresponds to depth-first traversal through the work tree built by divide-and-conquer algorithms, where each next sub-chunk is one level deeper, and normally twice as small as the previous one (at least in terms of iteration space it occupies).

The larger chunks remain closer to the head of the task pool, where thieves try to steal them (essentially expanding the work tree in a breadth-first manner), while smaller chunks at the task pool's tail are processed by the owner as soon as it finishes its work slicing spree. Such a setup minimizes the amount of expensive stealing operations (as the thieves get the largest available lumps of work and thus do not have to steal again for fairly long time), and maximizes cache efficiency because stolen chunks are likely to be already evicted from the victim's cache, while the owner thread keeps processing data hot in its cache.

In practice usually soon after a parallel computation commences, most of the available threads find relatively large pieces of work, and descend deep down the work splitting tree. But many real world workloads are often distributed across the iteration range non-uniformly. And even when work distribution is nearly uniform, environment influence may significantly affect individual threads throughput (e.g. system wide oversubscription, HT or NUMA effects). This means that some threads complete their initial chunk processing earlier than other, and start stealing from laggards. When a lingering thread discovers that all its local work was stolen, it is likely to be still deep down its tree branch. In accordance with how the above theory goes that's just fine, and classic Cilk 5 implementation actually works perfectly in such a situation.

But here is when TBB's depth-restricting heuristic reared its ugly head. While there were large chunks still available on shallow levels in other threads, the poor dawdler could only steal small pieces from the same or deeper level as its own. The following picture illustrates the situation:

This significantly increased stealing frequency resulting in higher overhead, as stealing is usually much more expensive than retrieval from local task pool. And what is even more important, such limitation caused victim's cache thrashing because recently generated tasks were stolen.

Acute readers may have noticed that I've just said that stealing operations are "usually much more expensive" than local retrieval. With the stress on "usually". Adding insult to injure original depth restricted stealing implementation made local task retrieval almost as expensive as stealing because in order to allow stealing from different depth levels the scheduler had to maintain relatively complex data structure illustrated by the following diagram:

Though simple enough, it still precluded any lock-free implementation, which meant that an owner thread had to acquire a lock to access its local task pool.

At last one more drawback of this approach was that even when a thief was allowed to steal from the deepest of filled levels, it had to do it from the head of the corresponding task list where fresher tasks are added, not from the tail containing the oldest one. Honestly speaking this last issue could have been easily fixed even in the scope of the depth restricted framework. But since the tradeoff between a bit better stealing choice and cost of managing extra links was unclear, the simpler alternative was chosen. Be that as it may, this swerved original TBB implementation even further from the ideal work stealing scheme.

Well, now it looks like TBB designers were complete nuts to beget such an abomination. But luckily for TBB users ('cause TBB is really a pretty sane and stable solution) they actually weren't. Because there was a very valid reason (I'd even say pressing necessity) to impose such a restriction. The thing is that Cilk 5 work stealing scheme provides a couple of important guarantees that are indispensable to ensure safety of a parallel computation. It guarantees that in any of the concurrently running threads the call stack does not go deeper than in serial case, and that total memory consumption by P threads does not exceed that of the serial run more than P times. These guarantees let the developers rest assured that as long as they validated serial version of their program and correctly parallelized it, it neither unpredictably blows up the memory footprint nor cause stack overflows.

Essential precondition for providing such guarantees is doing stealing only from empty stack. Cilk was able to ensure this using compiler support. When Cilk's sync operator is executed, and there are tasks stolen and still being processed by other threads, Cilk is able to immediately return from the current function body into the runtime and then continue this body execution on another thread. But TBB is purely library solution that can rely only on the standard C++ compilers. So it was forced to use so called blocking style. This means that when TBB's equivalent of sync operator (one of wait_for_all methods) is executed, TBB cannot unwind the current call stack and has to steal right from the current place on the stack.

If a stolen task in its turn uses wait_for_all() and stealing is not restricted somehow, an infinitely deep recursion may happen as there is always a chance that yet another such task will be stolen driving the call stack deeper and deeper.

Even though tasks of TBB parallel algorithm templates do not use blocking style constructs, the parallel algorithm function still uses one wait_for_all(). Though this allows a standalone parallel algorithm to fulfill Cilk guarantees, in case of nested parallelism (and this is a widespread usage model!) the danger or unbounded recursion returns.

All this said, TBB had no other choice as to resort to a stealing restriction, and picked up the depth based heuristic, in part because it was well investigated and used in early Cilk versions. BTW, it is a heuristic because its ultimate efficiency cannot be proved, while it still allows to achieve desirable practical purposes (at least in the vast majority of cases).

So preventing potential correctness problems in user applications was the primary reason why TBB used such an a priori suboptimal solution. Another one was that despite obvious inefficiencies overall performance and scalability of TBB was not as poor as one might expect based on the above analysis. Even the fact that the owner thread had to take a lock during each of its frequent accesses to its local task pool was not so harmful, because each thread used a lock of its own, and lock contention took place only in case of conflict with a thief. And stealings normally were still not too frequent even though depth restriction increased their rate to some degree. Actually most of the real-world workloads scaled almost perfectly (at least in comparison with the best alternative implementations) on the mainstream hardware of that time (8-32 way parallel).

Why then to change a solution that was well enough? Everyone knows, which road is paved with good intentions after all… Well, with semiconductor companies cramming more and more cores on a die each next year, the requirements to scalability keep growing, and existing research suggests that implementations using locks when accessing local task pools stop scaling well soon after approximately 64 CPUs. And for some workload types, like single producer multiple consumers the downtrend begins much earlier, somewhere between 8 and 16 threads. Besides replacing more complex logic would allow to shave off precious scheduler overhead cycles, allowing to increase efficiency of finer grained workloads handling.

This is why we eventually did just that - replaced depth restricted implementation with the well known, much simpler and more efficient one based on the Cilk's THE protocol. The new implementation uses deque of task pointers as the local task pool, and owner thread accesses it in a wait-free manner most of the time

Only when clash with a thief happens (which is much less frequent event than even stealings) the owner resorts to taking lock. The sole costly operation on part of the owner is full memory fence, but its great advantage is that it is nonblocking and does not affect threads on other CPUs:

But what about blocking style hazards? Certainly replacing owner/thief arbitration protocol should not have eliminated them, should it? Well, they indeed remained, but we employed simpler and at the same time more reliable heuristic to prevent stack overflow. Now TBB allows unrestricted stealing until the thieving thread consumes half of its call stack reserved space. As soon as a thread hits this half stack boundary, it simply stops stealing and just waits until all the tasks stolen from it by other threads are completed, letting it to continue its work and unwind the recursion by one or more steps, and thus re-enable its stealing capability.

Though as I already mentioned switching to the new implementation did not immediately result in significant positive effect on the scalability for most of the “popular” workloads in the concurrency range below 32 threads, some scenarios visibly benefited even at low concurrency levels. For example single producer multiple consumers:

What is more important the replacement shaved off noticeable amount of scheduler overhead, which allowed TBB to absorb almost painlessly the cost of a bunch of new composability features. The following chart demonstrates overhead decrease of TBB 3.0 scheduler (already loaded with a lot of new functionality) relative to much more lightweight 2.1 version that used depth restricted stealing:

Note that these data are collected for microbenchmarks (which are available in the TBB source packages). For most of real world workloads this effect is translated just in a few percent speed up at best. Though I guess nobody is going to complain about it :) .

In conclusion I’d only like to say what a wonderful technique work stealing is, if even with such a suboptimal implementation as pre-2.2 TBB’s one was, it still managed to ensure such strong results in terms of scalability that they allowed TBB to become a leading parallel framework in the industry (according to EDC, by the beginning of 2010, and not counting developers using OS threads) !

For more complete information about compiler optimizations, see our Optimization Notice.