TBB 3.0 and processor affinity

A week ago I started telling you about a couple of new helpful features in the TBB 3.0 Update 4 task scheduler, and we talked about the support for processor groups – an extension of the Win32 API available in the 64-bit editions of Windows 7. The main purpose of processor groups is to extend Win32 capabilities so that applications can work with more than 64 logical CPUs. By its nature this functionality is pretty close to the classic processor affinity concept, essentially turning the flat processor affinity model into a hierarchical one. First you affinitize a thread to a processor group, and then you can apply a conventional affinity mask to tie the thread to a subset of CPUs in the current group.

Such an obvious relation between the two concepts reminded me of an idea that the TBB team kicked around some time ago. Truth be told, that idea was pushed upon us by an issue one of our customers faced when trying to fuse TBB and MPI on a NUMA system. Their MPI processes were running on multicore NUMA nodes, and they wanted to parallelize computation inside each MPI process by means of TBB. The problem they stumbled upon was that TBB, when initialized by default, instantiated its worker thread pool in accordance with the total system concurrency, disregarding the number of cores on the current NUMA node. Naturally this resulted in huge oversubscription and a performance plunge.

Fortunately, in this particular case the developers had full control over the TBBfied part of the code, so they were able to resolve the issue by specifying the desired number of threads via the tbb::task_scheduler_init constructor. But what if your code relies on task scheduler auto-initialization? Or, even worse, you do not even suspect that TBB is present because it is internally used by a third-party component, which most probably initializes it by default (unless it happens to be configurable through some fancy API)?
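For reference, the explicit-initialization workaround looks roughly like this (a minimal sketch; the thread count of 4 merely stands in for whatever the NUMA node actually offers):

```cpp
#include <tbb/task_scheduler_init.h>
#include <tbb/parallel_for.h>

int main() {
    // Size the TBB worker pool explicitly instead of letting it default
    // to the total machine concurrency; 4 is just a placeholder value.
    tbb::task_scheduler_init init(4);

    tbb::parallel_for(0, 1000, [](int i) {
        // ... per-iteration work ...
    });
    return 0;
}
```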

Though in most of these cases some refactoring of your code to insert explicit scheduler initialization at an appropriate place (e.g. before the aforementioned third-party component initializes TBB) can help, it’s certainly a nuisance. And if the (already notorious) third-party component is a dynamic library that initializes the TBB scheduler during its static object construction phase, you are definitely in trouble. Even if you manage to find a place in your code to sneak in explicit scheduler initialization before the culprit module is initialized, it is going to be a time bomb, ready to explode at any moment in the future when the inter-module dependencies in your application change. Besides, it is very likely that you’d prefer not to mess around with retrieving and analyzing machine topology, especially when it has to be done in a cross-platform manner. It would be much more expedient if an external engine (like the MPI framework) did it for you.

Anyway, even with all the quirky use cases aside, it would no doubt be very convenient if TBB could somehow know that the current process can access only a subset of the available CPUs and adjust its default settings accordingly. Luckily, obtaining such knowledge is quite possible in practice. The only mechanism supported by public OS APIs to impose such a restriction on a process is the processor affinity mask. And it is no wonder that this is exactly what applications use to confine themselves or other applications to a fragment of the system’s computational resources. Thus in the above example with the MPI application, individual processes were affinitized by the framework to the cores constituting the NUMA nodes.

So the only thing TBB has to do, instead of asking the system how many CPUs it has, is to retrieve the current process affinity mask, count the number of set bits in it, and voilà, TBB uses no more worker threads than necessary! And this is exactly what TBB 3.0 Update 4 does. To clarify the statement at the end of my previous blog: TBB’s methods tbb::task_scheduler_init::default_num_threads() and tbb::tbb_thread::hardware_concurrency() return not simply the total number of logical CPUs in the system or the current processor group, but rather the number of CPUs available to the process in accordance with its affinity settings.
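On Linux the idea amounts to something like the following sketch (an illustration only, not TBB’s actual implementation):

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // for sched_getaffinity and CPU_COUNT
#endif
#include <sched.h>
#include <cstdio>

// Count the CPUs the calling thread is allowed to run on.
int affinity_based_concurrency() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0 /* calling thread */, sizeof(mask), &mask) != 0)
        return -1;                 // query failed; fall back to another method
    return CPU_COUNT(&mask);       // number of set bits in the mask
}

int main() {
    std::printf("Affinity-based default concurrency: %d\n",
                affinity_based_concurrency());
    return 0;
}
```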

It’s all been pretty simple so far, hasn’t it? Well, now let’s add a bit of spice to this triteness. First of all, the new functionality works only on Windows, Linux and FreeBSD. You may be tempted to exclaim: “Wait a minute, what about MacOS? TBB has always supported it in the same way as the other OSes!” The problem with MacOS is that, first, it had no affinity support at all until its 10.5 release. Second, even though starting with 10.5 it does support thread affinity, it does so in the traditional Apple way, that is, completely non-traditionally. I’ll return to it later; for now let’s sidestep for a moment and talk a little about hardware affinity (a.k.a. processor affinity) proper.

Actually, what I called the “problem with MacOS” is probably not a MacOS issue at all. Those of you who are well acquainted with TBB likely know that TBB itself does not provide any abstraction for processor affinity. And this is certainly not because MacOS did not support it some time ago (by the way, FreeBSD did not support affinity before its 7.1 release either), but rather because conventional hardware affinity is a double-edged sword. While it allows you to significantly increase performance when used with care, it has a substantial negative impact on application composability.

In fact, to elicit a performance boost with hardware affinity, the parallel algorithm developer needs full control over the whole application to make sure that no two concurrently running parallel computations affinitize their threads to the same cores/CPUs, causing cache thrashing. Besides, extreme care is necessary to properly synchronize concurrently running affinitized parallel algorithms, so that resource partitioning does not result in underutilization holes when one algorithm finishes its work earlier than the other(s) and its share of CPUs remains idle for some time. Unfortunately, such tight control is often impossible because modern software is mostly component based.

Even if developers managed to appropriately structure their application, there is another necessary condition for it to work efficiently: either the application runs alone on the machine, or all the running applications are appropriately configured to use only part of the machine’s CPUs to avoid conflicts. And this essentially means control over the whole machine, which is also difficult to achieve in practice.

Indeed, as years of Intel OpenMP experience demonstrated, processor affinity was misused at least as often as it was beneficial, with most of the success stories coming from the HPC and HTC domains. After the preceding short discussion this should not come as a big surprise, because it is exactly in HPC and HTC applications that developers often have full control over both the whole application and the environment it runs in.

TBB, in its turn, was (and still is) targeted primarily at mainstream computing, with all its application intricacy and runtime environment unpredictability. Thus it was decided that TBB would neither expose interfaces to control processor affinity nor use hardware affinity internally. Instead TBB provides soft task-to-thread affinity, which works great in most situations where cache performance is a concern. However, for those who badly need it, we offer the tbb::task_scheduler_observer class, which can be used to manually affinitize TBB worker threads.
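For completeness, here is a hedged sketch of that route on Linux: an observer that pins each thread entering the TBB scheduler to its own CPU. The round-robin pinning policy is purely illustrative, and error handling is omitted.

```cpp
#include <tbb/task_scheduler_observer.h>
#include <tbb/atomic.h>
#include <pthread.h>
#include <sched.h>

class pinning_observer : public tbb::task_scheduler_observer {
    tbb::atomic<int> next_cpu;   // next CPU index to hand out
public:
    pinning_observer() {
        next_cpu = 0;
        observe(true);           // start receiving scheduler entry/exit callbacks
    }
    /*override*/ void on_scheduler_entry(bool /*is_worker*/) {
        // Runs in the context of every thread that joins the TBB scheduler.
        int cpu = next_cpu.fetch_and_increment();
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
    }
};
```

A single instance of such an observer, created before the parallel work starts, is enough; the callbacks fire for the master thread and every worker.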

OK, now that our digression from the original topic has come to an end, let’s have a quick glance at Apple’s eccentric (or pioneering, pick whichever you like more) approach to hardware affinity. Maybe some of you noticed that I used the term “thread affinity” instead of the usual “processor affinity” when I mentioned MacOS above. That was not accidental, as these are the words MacOS developers use to describe their mechanism. And indeed it seems to be the more apt term, as what MacOS supports is not a traditional affinity mask that ties a thread to a subset of CPUs. Instead, it allows you to define an arbitrary number of affinity groups, and then it tries to schedule threads belonging to different groups so that they run on different L2 caches.
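In terms of the Mach API this looks roughly as follows (a sketch based on the 10.5 thread affinity interface; the tag value is whatever group identifier the application chooses):

```cpp
#include <mach/mach.h>
#include <mach/thread_policy.h>
#include <pthread.h>

// Place the calling thread into the affinity group identified by `tag`.
// Threads sharing a tag are scheduled to share an L2 cache, while threads
// with different tags are steered towards different L2 caches.
void join_affinity_group(integer_t tag) {
    thread_affinity_policy_data_t policy = { tag };
    thread_policy_set(pthread_mach_thread_np(pthread_self()),
                      THREAD_AFFINITY_POLICY,
                      (thread_policy_t)&policy,
                      THREAD_AFFINITY_POLICY_COUNT);
}
```

Note that there is no way to name a particular CPU here; the kernel only promises to keep the groups apart.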

Actually, this affinity-group approach is quite possibly an optimal solution for a non-HPC system, as it cannot leave the OS bound hand and foot even when used most inaptly, or when an unfortunate combination of running components/applications occurs; this prevents a drastic performance hit on both the mis-affinitized application(s) and the whole system. In this respect the MacOS approach is similar to TBB’s task-to-thread replay affinity. TBB relies on the expectation that the OS does not move threads between cores without a grave need, and if the OS finds that it is nevertheless necessary, so be it: the given thread will absorb a one-time migration performance hit, but the rest of the active threads will remain unscathed.

Anyway, the bottom line is that despite the savvy of the MacOS affinity solution, it cannot be used for the purpose of limiting the set of CPUs available to a process, and thus on this OS TBB will always create one worker thread per CPU/core in the whole system.

Well, at long last we are back to the trinity of Windows, Linux and FreeBSD, which all stick to the traditional affinity mask approach. But even here some details vary. The first difference is that Windows and FreeBSD track the process affinity mask separately from individual thread affinity masks, while on Linux only thread affinity masks exist. Correspondingly, on Windows and FreeBSD, when TBB initializes its thread pool, it uses the process affinity mask to calculate the desired concurrency level. This makes it independent of the affinity settings of any particular thread, which in turn makes the results of tbb::task_scheduler_init::default_num_threads() and tbb::tbb_thread::hardware_concurrency() consistent regardless of which threads they were called from and in which sequence (I mean inter-thread real-time order).
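On Windows the process-wide query looks roughly like this (again an illustration of the idea rather than TBB’s code):

```cpp
#include <windows.h>

// Count the CPUs in the process affinity mask; the result is the same
// no matter which thread of the process performs the query.
int affinity_based_concurrency() {
    DWORD_PTR process_mask = 0, system_mask = 0;
    if (!GetProcessAffinityMask(GetCurrentProcess(), &process_mask, &system_mask))
        return -1;                           // query failed
    int count = 0;
    for (DWORD_PTR m = process_mask; m != 0; m >>= 1)
        count += (int)(m & 1);               // count set bits
    return count;
}
```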

Unfortunately, on Linux the concurrency level of TBB will depend on the affinity mask of the first thread to call either of the two APIs mentioned in the previous paragraph. Though for most programs this is not a problem, as TBB is normally initialized in a predictable manner (e.g. by the main application thread), there can be cases where the number of TBB workers in the whole application varies depending on a race between threads concurrently initializing TBB (e.g. when several GUI threads concurrently enqueue TBB tasks). If, for example, there are two threads, one affinitized to two cores and the other to four, and both are going to start a parallel algorithm, then the concurrency level of both algorithms will be either two or four, depending on which thread initiated parallel processing first. And the worker threads will inherit the affinity mask of the thread that caused the TBB thread pool instantiation.

So the caveat for Linux developers is to always instantiate the TBB task scheduler (by means of a tbb::task_scheduler_init scoped object) before setting a custom affinity mask for the given thread.
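In code, the recommended order looks like this minimal sketch (the pinning to CPU 0 is just an example):

```cpp
#include <tbb/task_scheduler_init.h>
#include <pthread.h>
#include <sched.h>

void thread_body() {
    // 1. Instantiate the scheduler before narrowing this thread's affinity,
    //    so that TBB's default concurrency is computed from the wider mask.
    tbb::task_scheduler_init init;

    // 2. Only then pin this particular thread, e.g. to CPU 0.
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);

    // ... run TBB algorithms; the default pool size is unaffected ...
}
```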

And finally, a small disturbance introduced by processor groups into the otherwise straightforward Windows affinity model. As I described earlier, on Windows 7 machines with more than 64 cores TBB uses the new API to obtain the total number of CPUs in the system and uniformly spreads its workers across all the processor groups. On such systems an application always starts confined to a single processor group, and its process affinity mask is set to include all the CPUs in its group. Thus TBB has to treat the full-group affinity mask value as an indication that, if initialized by default, it is to create its thread pool in accordance with the whole machine’s concurrency, rather than just use up the current processor group’s capacity.
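The whole-machine number that TBB falls back to in the full-group-mask case can be obtained via the Windows 7 processor-group API, for example (a sketch):

```cpp
#include <windows.h>

// Total number of active logical processors across all processor groups
// (Windows 7 and later).
int whole_machine_concurrency() {
    return (int)GetActiveProcessorCount(ALL_PROCESSOR_GROUPS);
}
```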

Consequently, you cannot use the process affinity mask to limit TBB’s default concurrency to the size of one processor group. Nor can you use the process affinity mask to set a default concurrency level exceeding one processor group’s capacity, because affinity masks on Windows remain limited to a single processor group (and comprise at most 64 bits).

That’s it about using affinity masks to (externally) control the default concurrency level of the TBB task scheduler. This small piece of functionality can come in quite handy when running TBBfied components or whole apps as part of larger software systems. It can also be helpful when debugging TBB applications, as it allows you to control the size of the TBB thread pool by changing (from outside) the affinity mask of the process being debugged. On the other hand, I think it may be amusing to witness how such simple and straightforward concepts become tricky to implement uniformly when one has to do it in a cross-platform way.
