TBB 3.0 task scheduler improves composability of TBB-based solutions. Part 1.

This is the first of two blogs in which I'm going to describe some of the problems that users of TBB 2.2 and earlier came across, and the changes in task scheduler behavior made in the TBB 3.0 release to solve them. In this part we'll talk about issues caused by the lack of isolation between master threads inside the TBB task scheduler. But first, let me share a little background.

Parallel programming is almost as old as the computer industry, so by now multiple parallel languages and frameworks have been developed, a multitude of studies conducted, and an uncountable number of articles published. Though it is beyond any human being to encompass this amount of information, even a quick skim should be enough to notice that the main goals of the vast majority of them are raw performance and maximal hardware utilization, or, in other words, lower parallelization overhead and better scalability of algorithms.

If while reading this you began thinking that I'm going to challenge the common wisdom and try (obviously in vain) to devalue the time-honored priorities of parallel programming, then I have to disappoint you. I just want to draw your attention to a factor that is often overlooked when different approaches to parallelization are compared, and that is, naturally (as the blog title suggests), the composability of the resulting solutions.

But what exactly do I mean by composability? It's actually pretty straightforward. Composability of parallelized applications or components is their ability to retain the same level of efficiency when running side by side with, or inside of, other parallelized components.

Many existing parallel frameworks are designed under the assumption that a parallel application has exclusive access to the machine's hardware and centralized control over all the parallel activities performed by all its threads. Such a "sandbox" approach was quite reasonable in the epoch when the majority of parallel applications belonged to the HPC domain and ran on dedicated servers and clusters.

However, since the beginning of the multicore revolution, more and more parallel applications run on general-purpose systems where multiple parallel workloads compete for hardware resources. What is even more important, more and more reusable components are becoming internally parallelized. Combining such components in a complex application produces various combinations of side-by-side and nested parallelism.

When these components are parallelized with poorly composable frameworks, the result is often severe oversubscription or underutilization.

From its very inception TBB was designed to ensure a high degree of composability for solutions built with it. In particular, the TBB scheduler, based on the work-stealing algorithm, achieves close to optimal hardware utilization even when the availability of hardware resources keeps changing during the computation. Besides, it seamlessly supports nested parallelism.
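To make the nested-parallelism point concrete, here is a minimal sketch (the function names, data layout, and workload are mine, purely for illustration): an outer tbb::parallel_for whose body calls a routine that is itself parallelized with TBB. Both levels draw on the same worker pool, so nesting them does not create extra threads.

    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"
    #include <vector>

    // Hypothetical per-row processing that is itself parallelized with TBB.
    void process_row(std::vector<float>& row) {
        tbb::parallel_for(tbb::blocked_range<size_t>(0, row.size()),
            [&row](const tbb::blocked_range<size_t>& r) {
                for (size_t j = r.begin(); j != r.end(); ++j)
                    row[j] *= 2.0f;              // stand-in for real work
            });
    }

    // Outer parallel loop whose body starts the nested parallel loop above.
    // Both levels share the same worker pool, so nesting does not oversubscribe.
    void process_matrix(std::vector<std::vector<float> >& matrix) {
        tbb::parallel_for(tbb::blocked_range<size_t>(0, matrix.size()),
            [&matrix](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    process_row(matrix[i]);
            });
    }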

This seemed to be enough for building well-composable solutions, first of all because the TBB task scheduler efficiently prevented undersubscription and minimized oversubscription (when used correctly, of course). But the devil, as usual, lurks in the details. This time the details were those of thread cooperation mechanics inside the scheduler. Eventually customer feedback made the problems obvious. To understand them better, let's first have a look at how the scheduler worked before TBB 3.0.

To set up the terminology for the following discussion, a thread belonging to TBB's internal thread pool will be called a "worker", and any other thread will go under the alias of "master" (e.g. the application's main thread, or any thread explicitly created by the programmer).

When the first master initialized the TBB 2.2 task scheduler, the following things happened:
1)    A global "arena" object was instantiated
2)    An internal pool of "worker" threads (or simply workers) was created
3)    The workers registered themselves in the arena

When the master then started a parallel algorithm, it:
4)    Registered itself in the arena
5)    Published the spawned tasks in the arena, making them available for stealing by any other thread in the arena
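In code, a typical TBB 2.2 master looked roughly like the sketch below (the function and data names are illustrative): constructing tbb::task_scheduler_init triggered steps 1-3, and invoking a parallel algorithm performed steps 4 and 5.

    #include "tbb/task_scheduler_init.h"
    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"

    void master_thread_work(float* data, size_t n) {
        // Steps 1-3: the first such construction in the process instantiated
        // the global arena and created the pool of worker threads.
        tbb::task_scheduler_init init;

        // Steps 4-5: the master registered itself in the arena and published
        // the spawned tasks there, making them visible to every other thread.
        tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
            [data](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    data[i] += 1.0f;             // placeholder computation
            });
    }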

While only one master was involved, everything was fine. But in complex, componentized applications it is not uncommon for multiple master threads to start parallel algorithms concurrently. In this case they all register themselves in the same arena and publish their work there (in the form of spawned tasks). And this is what produced the unpleasant side effects of our initial design.

The main issue was caused by the fact that the work shared via the global arena was available not only to TBB workers but also to other master threads. Though in many cases this was just fine, a problem arose when the thread starting a parallel algorithm expected it to complete quickly, e.g. a GUI thread processing a reasonably small set of data as part of its message handling. If there was a concurrently running parallel algorithm processing a large amount of data, the latency-sensitive thread could steal a hefty chunk of that work and get stuck processing it for too long. In an extreme case a master thread could get trapped virtually forever if there was a concurrent thread running an infinitely long tbb::pipeline algorithm (e.g. one processing requests coming from the network).
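Here is a hedged sketch of that scenario (the thread roles and workloads are hypothetical): two masters share the single global arena, and the latency-sensitive one can end up executing the other's tasks while waiting for its own small loop to complete.

    #include "tbb/task_scheduler_init.h"
    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"
    #include "tbb/tbb_thread.h"

    // Long-running bulk processing started by one master thread.
    void batch_master() {
        tbb::task_scheduler_init init;
        tbb::parallel_for(tbb::blocked_range<size_t>(0, 100000000),
            [](const tbb::blocked_range<size_t>& r) {
                volatile double sink = 0;
                for (size_t i = r.begin(); i != r.end(); ++i)
                    sink = sink + i * 0.5;       // stand-in for heavy work
            });
    }

    // Latency-sensitive master, e.g. a GUI thread handling one message.
    void gui_master() {
        tbb::task_scheduler_init init;
        tbb::parallel_for(tbb::blocked_range<size_t>(0, 1000),
            [](const tbb::blocked_range<size_t>& r) {
                volatile double sink = 0;
                for (size_t i = r.begin(); i != r.end(); ++i)
                    sink = sink + i;             // small amount of work
            });
        // In TBB 2.2, while waiting for the small loop above to finish, this
        // thread could steal a hefty chunk of batch_master's tasks from the
        // shared arena and be delayed far longer than its own workload warrants.
    }

    int main() {
        tbb::tbb_thread batch(&batch_master);
        tbb::tbb_thread gui(&gui_master);
        batch.join();
        gui.join();
        return 0;
    }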

Another problem with the old approach was that the first thread to initialize the arena defined the number of workers (and thus the concurrency level) for all subsequently joining masters. Again, in some cases this was all right, especially taking into account that many parallel components use the default settings. But there are numerous cases where such behavior is detrimental.
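A small sketch of that effect (the component names are invented): whichever component constructed tbb::task_scheduler_init first sized the shared worker pool, and a later component asking for more threads did not get them.

    #include "tbb/task_scheduler_init.h"

    // Hypothetical component that scales poorly and limits itself to 2 threads.
    void component_a() {
        tbb::task_scheduler_init init(2);
        // ... run its parallel algorithms ...
    }

    // Hypothetical component that scales well and asks for 16 threads.
    void component_b() {
        tbb::task_scheduler_init init(16);
        // In TBB 2.2, if component_a initialized the scheduler first, the
        // worker pool was already sized for a concurrency of 2, and this
        // larger request had no effect.
        // ... run its parallel algorithms ...
    }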

Nowadays even desktops can easily sport 8 cores, and for servers this number is 32 or more. Unfortunately, software development often cannot keep up with the steady hardware progress. As a result, many parallelized components scale only up to relatively small concurrency levels (4-8). When run on more threads, their performance either levels out or even drops. In any case this wastes hardware resources and electricity, and degrades application throughput.

We cannot ignore such cases, as there are multiple factors constraining the scalability of parallel algorithms. Among them are a large number of dependencies limiting the available parallel slack, the necessity to use locking (especially coarse-grained locking, for example when there are dependencies on non-thread-safe third-party components), and a limited amount of data to process. The last factor leads either to too fine-grained partitioning, resulting in excessively high overhead, or to a lack of parallelism.
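As an illustration of the locking factor (the mutex and the workload below are placeholders), a coarse-grained lock protecting a non-thread-safe third-party call serializes most of each iteration, so adding threads beyond a handful buys almost nothing.

    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"
    #include "tbb/mutex.h"

    tbb::mutex third_party_lock;   // guards a hypothetical non-thread-safe library

    void process_items(int* items, size_t n) {
        tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
            [items](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i) {
                    // The coarse-grained lock serializes the bulk of the work,
                    // so the algorithm stops scaling after a few threads.
                    tbb::mutex::scoped_lock lock(third_party_lock);
                    items[i] = items[i] * 2;     // stand-in for the guarded call
                }
            });
    }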

To address these issues, TBB 3.0 changed the way masters and workers communicate. Now master threads are isolated from each other, and the pool of worker threads is dynamically split between them. Instead of a single global arena instance, each master has an arena of its own and receives no more workers than it requested. Workers can migrate between arenas, but the work published by a master never becomes accessible to other masters.
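Revisiting the earlier two-master scenario under the new scheme (again with invented names and workloads): each master now gets its own arena sized by its own tbb::task_scheduler_init request, its thread only ever executes tasks spawned in that arena, and its concurrency setting no longer depends on who initialized the scheduler first.

    #include "tbb/task_scheduler_init.h"
    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"

    // Assume each of these functions runs in its own (master) thread.

    void low_latency_master(float* data, size_t n) {
        // TBB 3.0: this master gets an arena of its own with the requested
        // concurrency of 4 (the master itself plus up to 3 workers), and its
        // thread only executes tasks spawned within that arena.
        tbb::task_scheduler_init init(4);
        tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
            [data](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    data[i] += 1.0f;             // small, latency-sensitive work
            });
    }

    void throughput_master(float* data, size_t n) {
        // This master's arena is sized independently of the one above, and the
        // tasks it publishes are never visible to low_latency_master's thread.
        tbb::task_scheduler_init init(16);
        tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
            [data](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    data[i] *= 2.0f;             // long-running bulk work
            });
    }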

The original design with the single arena had a good property: due to the randomness of stealing, worker threads eventually got distributed between multiple masters more or less fairly. To preserve this fairness guarantee, another layer was added to the task scheduler design. It now has a global "market" object that keeps track of all active arenas and (re)distributes workers between them.

A good thing about the transition to the new model is that, despite significant changes in the library internals, it affected neither any public APIs nor the way developers should use TBB. Thus you get better composability of your solutions virtually for free.

Oh, and in case anyone is concerned about the possible performance impact of the new layer in the scheduler, rest assured that the execution flow crosses this layer only on a few infrequent occasions (scheduler instance creation/destruction, masters starting/finishing parallel algorithms), and thus almost none of the new code falls on the hot path.

In part 2 I'll write about two other changes in the TBB 3.0 scheduler that help promote composability of TBB-based solutions.
