TBB 3.0, high end many-cores, and Windows processor groups

Though I wrote my previous TBB task scheduler blog just a few days after TBB 3.0 Update 4 had been released, I ignored that remarkable event and instead delved into a past more than two years old. So today I’m going to make up for that slight and talk about a couple of small but quite useful improvements in the TBB scheduler behavior made in the aforementioned update.

But first of all, I have to confess to being overconfident in my ability to grasp the meaning of things at first glance. About a year ago, when I was skimming the list of new Windows* 7 threading APIs, I rashly assumed that processor groups were either a kind of abstraction for NUMA nodes or a way to partition available computational resources. And since TBB’s practical interests lay elsewhere at that moment, I did not bother to read the details, and accordingly remained in my blissful delusion until one of Intel’s application engineers reported that his brand new 80-way Westmere-Ex machine would not utilize more than 64 cores for his application parallelized with TBB.

That was when it became obvious that processor groups are not just yet another abstraction layer over already available lower level functionality, but the indispensable mechanism for harnessing the power of machines sporting more than 64 cores when they run Microsoft* Windows 7. I know that Linux developers must be dumbfounded by this statement: why would one need any fancy API to work with 128 cores instead of, say, 64? And they are fully within their rights, as Linux threading/multiprocessing APIs indeed work transparently with regard to the number of available CPUs.

Unfortunately, it’s not that simple with Windows. Actually, when I mentioned “Windows 7” above, it was not to say that earlier versions of Windows could somehow do without processor groups while working on large multiprocessors. No, in case you never gave it a thought, every Windows version before “7”, and every edition of “7” other than the 64-bit one, was confined to at most 64 CPUs/cores on shared memory machines. In fact the older versions and lower end SKUs had even smaller limits, but 64 turns out to be a sort of magic number putting a hard limit on the capabilities of the classic Win32 API.

On second thought, this 64 watershed should not be such a surprise for Win32 practitioners. Hints that this number is deeply ingrained in Windows internals are scattered here and there across the Win32 vastness. For example, all functions from the (Msg)WaitForMultipleObjects(Ex) family accept an array of no more than 64 handles. And this is not just a transient limitation that can be easily lifted in a future OS version. The return value of these functions can specify the index of the handle that satisfied the wait, and though it is a DWORD, the values starting from 128 are reserved for abandoned mutexes (just have a look at the value of the WAIT_ABANDONED constant).
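To make the return value encoding concrete, here is a minimal portable sketch of how such a wait result can be decoded. The constant values mirror WAIT_OBJECT_0, WAIT_ABANDONED_0, WAIT_TIMEOUT and MAXIMUM_WAIT_OBJECTS from the Windows SDK headers; the names `WaitResult`, `WaitKind` and `decode_wait_result` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Values mirrored from the Windows SDK headers so the sketch compiles anywhere.
const std::uint32_t kWaitObject0        = 0x00000000u; // WAIT_OBJECT_0
const std::uint32_t kWaitAbandoned0     = 0x00000080u; // WAIT_ABANDONED_0 (128)
const std::uint32_t kWaitTimeout        = 0x00000102u; // WAIT_TIMEOUT
const std::uint32_t kMaximumWaitObjects = 64;          // MAXIMUM_WAIT_OBJECTS

enum class WaitKind { Signaled, Abandoned, Timeout, Other };

struct WaitResult {
    WaitKind kind;
    std::uint32_t index; // meaningful only for Signaled and Abandoned
};

// Hypothetical helper: splits a wait return value into its kind and handle index.
WaitResult decode_wait_result(std::uint32_t ret, std::uint32_t handle_count) {
    assert(handle_count >= 1 && handle_count <= kMaximumWaitObjects);
    if (ret < kWaitObject0 + handle_count)
        return {WaitKind::Signaled, ret - kWaitObject0};
    if (ret >= kWaitAbandoned0 && ret < kWaitAbandoned0 + handle_count)
        return {WaitKind::Abandoned, ret - kWaitAbandoned0};
    if (ret == kWaitTimeout)
        return {WaitKind::Timeout, 0};
    return {WaitKind::Other, 0};
}
```

Since the "abandoned" range starts at 128, the "signaled" indices cannot grow past it without colliding, so the handle limit is part of the interface contract rather than a tunable.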

Of course the limitation of these particular functions does not directly preclude the OS from supporting a larger number of CPUs, nor, for that matter, developers from working with more than 64 threads. But having to handle more than 64 objects will definitely complicate your life quite a bit. For example, about ten years ago, when I was implementing an asynchronous procedure for scanning a network to detect available servers, I had to use a cascade of 64 threads, each waiting on 64 socket handles, to achieve acceptably broad coverage of the LAN subnets within a reasonably short (or long) timeout interval. The fact that this inconveniently small limit has held steady throughout the whole past decade suggests that it is no cinch to root it out of the Windows bowels.

Another API whose limitation relates more directly to the number of CPUs recognized by the system is Set/GetProcessAffinityMask. The affinity mask proper is a DWORD_PTR with one bit per CPU, meaning that even in a 64-bit process it can cover only 64 CPUs. Again, from a purely interface standpoint it should not have been a problem to add yet another Ex version of these functions that would accept an array of bitmasks, and the fact that it was not done likely means that this would have required some nontrivial reshuffling of the Windows internals.
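The one-bit-per-CPU layout is easy to demonstrate. The sketch below is portable and does not call the Windows API: `AffinityMask` is a 64-bit stand-in for DWORD_PTR as it appears in a 64-bit process, and `build_affinity_mask` is a hypothetical helper invented for this example.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for DWORD_PTR in a 64-bit process (an assumption made for portability).
typedef std::uint64_t AffinityMask;
const unsigned kMaskBits = 64;

// Hypothetical helper: sets bit i for every CPU index i listed in `cpus`.
// Returns false, leaving `mask` untouched, when some index does not fit.
bool build_affinity_mask(const std::vector<unsigned>& cpus, AffinityMask& mask) {
    AffinityMask result = 0;
    for (unsigned cpu : cpus) {
        if (cpu >= kMaskBits)
            return false; // CPU 64 and above simply has no bit to live in
        result |= AffinityMask(1) << cpu;
    }
    mask = result;
    return true;
}
```

Asking for CPUs 0 and 3 yields the mask 0x9, while asking for CPU 64 fails, which is exactly the wall a single mask hits on a machine with more than 64 logical CPUs.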

So the solution Microsoft came up with was to introduce a new abstraction layer, called processor groups (hey, this is a different link than the one in the beginning, so you may want to follow it too). Group size maxes out at 64 and is defined by the OS startup configuration. Normally all groups but the last comprise 64 logical CPUs each, and the last one includes whatever is left.
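As a rough model of that partitioning rule, the following sketch splits a CPU count into group sizes. It only encodes the "full groups first, remainder last" description given here; the real OS may take other factors into account when forming groups, and `split_into_groups` is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model: every group but the last holds `group_size` logical CPUs,
// and the last one takes whatever is left.
std::vector<unsigned> split_into_groups(unsigned total_cpus, unsigned group_size = 64) {
    assert(group_size >= 1 && group_size <= 64); // a group never exceeds 64 CPUs
    std::vector<unsigned> groups;
    for (unsigned left = total_cpus; left > 0; ) {
        unsigned n = left < group_size ? left : group_size;
        groups.push_back(n);
        left -= n;
    }
    return groups;
}
```

Under this model an 80-way machine ends up with two groups of 64 and 16 logical CPUs, and an application confined to the first group sees only 64 of them.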

By the way, even if you do not have a 128- or 256-core monster, you can change the groupsize value in the boot configuration data store with the BCDEdit utility to trick Windows into believing that it does need processor groups. This may be useful for testing how your application behaves in a multigroup environment on a smaller machine.

Harking back to the above discussion of the innate limitations of the Win32 API, and contrary to what you might expect, this freshly baked processor groups support does not completely unchain you; it just slackens your shackles a little. The thing is that, at least in its current edition, Windows 7 supports a maximum of 4 processor groups (that is, 4 × 64 = 256 logical CPUs). My guess is that it may have something to do with kernel mode data structures still using UCHAR to store the processor index. Though since I’m not a kernel mode programming maven, this is just a surmise, and the limitation may be just a marketing decision, or a little of both.

Anyway, the point of this blog is that once you have a more-than-64-CPU/core machine running Windows 7 x64, you’ll have to do some additional manipulations to allow your application to use all this oomph. The idea behind the processor groups concept is to let all existing APIs work as they did before, which is evidently great for backward compatibility. But everything has its price, and the cost of the preserved compatibility is that existing applications will remain forever confined to one processor group only. Even the GetSystemInfo function returns just the number of logical processors in the current group instead of the total number of CPUs in the system.

Therefore, from now on, to learn the maximal hardware concurrency level you need to use the GetActiveProcessorCount function with the ALL_PROCESSOR_GROUPS flag. And to be able to utilize all the hardware capacity, you need to call GetActiveProcessorGroupCount to retrieve the number of available processor groups, and then manually distribute your threads among them by means of the SetThreadGroupAffinity API. Though this is not too difficult to do, it is still tedious enough, adds to the complexity of the thread management code, and, more importantly, requires dynamic dispatch if you want your application to work on different Windows versions or SKUs (and I bet you do).
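A minimal sketch of that dance might look as follows. It resolves the three functions named above through GetProcAddress so that the binary still loads on pre-Windows 7 systems, and on non-Windows platforms it falls back to std::thread so that the file compiles anywhere. The wrapper names (`find_kernel32_export`, `total_logical_cpus`, `processor_group_count`, `bind_current_thread_to_group`) are hypothetical, the mask computation assumes that the active CPUs of a group occupy its lowest bits, and none of this is meant to reproduce what TBB does internally.

```cpp
#ifdef _WIN32
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0601 // Windows 7: exposes GROUP_AFFINITY and ALL_PROCESSOR_GROUPS
#endif
#include <windows.h>
#endif
#include <thread>

#ifdef _WIN32
// Looks up a kernel32 export at run time; yields NULL on older Windows versions.
static FARPROC find_kernel32_export(const char* name) {
    HMODULE kernel32 = GetModuleHandleA("kernel32.dll");
    return kernel32 ? GetProcAddress(kernel32, name) : NULL;
}
typedef DWORD (WINAPI *GetActiveProcessorCountFn)(WORD);
typedef WORD  (WINAPI *GetActiveProcessorGroupCountFn)(void);
typedef BOOL  (WINAPI *SetThreadGroupAffinityFn)(HANDLE, const GROUP_AFFINITY*, PGROUP_AFFINITY);
#endif

// Number of logical CPUs in all processor groups (never less than 1).
unsigned total_logical_cpus() {
#ifdef _WIN32
    GetActiveProcessorCountFn get_count = reinterpret_cast<GetActiveProcessorCountFn>(
        find_kernel32_export("GetActiveProcessorCount"));
    if (get_count) {
        DWORD n = get_count(ALL_PROCESSOR_GROUPS);
        if (n > 0) return static_cast<unsigned>(n);
    }
    SYSTEM_INFO si;
    GetSystemInfo(&si); // older Windows: there is only one group anyway
    return si.dwNumberOfProcessors > 0 ? static_cast<unsigned>(si.dwNumberOfProcessors) : 1u;
#else
    unsigned n = std::thread::hardware_concurrency(); // may report 0 when unknown
    return n > 0 ? n : 1u;
#endif
}

// Number of processor groups; 1 wherever the concept does not exist.
unsigned processor_group_count() {
#ifdef _WIN32
    GetActiveProcessorGroupCountFn get_groups = reinterpret_cast<GetActiveProcessorGroupCountFn>(
        find_kernel32_export("GetActiveProcessorGroupCount"));
    if (get_groups) {
        WORD g = get_groups();
        if (g > 0) return g;
    }
#endif
    return 1u;
}

// Binds the calling thread to all CPUs of `group`; false if unsupported or failed.
bool bind_current_thread_to_group(unsigned group) {
#ifdef _WIN32
    GetActiveProcessorCountFn get_count = reinterpret_cast<GetActiveProcessorCountFn>(
        find_kernel32_export("GetActiveProcessorCount"));
    SetThreadGroupAffinityFn set_affinity = reinterpret_cast<SetThreadGroupAffinityFn>(
        find_kernel32_export("SetThreadGroupAffinity"));
    if (!get_count || !set_affinity) return false;
    DWORD n = get_count(static_cast<WORD>(group));
    if (n == 0) return false;
    GROUP_AFFINITY ga = GROUP_AFFINITY();
    ga.Group = static_cast<WORD>(group);
    const DWORD mask_bits = sizeof(KAFFINITY) * 8;
    // Assumption: the group's active CPUs occupy its n lowest bits.
    ga.Mask = n >= mask_bits ? ~static_cast<KAFFINITY>(0)
                             : (static_cast<KAFFINITY>(1) << n) - 1;
    return set_affinity(GetCurrentThread(), &ga, NULL) != 0;
#else
    (void)group;
    return false;
#endif
}
```

A thread pool would typically call `bind_current_thread_to_group` once at the start of each worker thread, picking the group from the worker's index.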

The good news is that starting from TBB 3.0 U4 you get support for more than 64 cores on Windows 7 absolutely for free (that is, completely transparently)! In particular, the methods tbb::task_scheduler_init::default_num_threads() and tbb::tbb_thread::hardware_concurrency() now return the total number of logical CPUs in the system, not the number of logical CPUs in the current processor group. And when the number of worker threads exceeds the capacity of the current processor group, workers are spread across other groups to avoid oversubscription.
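To give a feeling for what spreading workers across groups can mean, here is one plausible placement policy, sketched under my own assumptions: groups are filled in order, and the assignment wraps around once there are more workers than logical CPUs. This is an illustration only and not necessarily the policy the TBB scheduler implements; `group_for_worker` is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Hypothetical policy: fill groups in order, wrapping around when the number
// of workers exceeds the total number of logical CPUs.
unsigned group_for_worker(unsigned worker, const std::vector<unsigned>& group_sizes) {
    unsigned total = 0;
    for (unsigned n : group_sizes) total += n;
    assert(total > 0);
    unsigned slot = worker % total; // position among all logical CPUs
    for (unsigned g = 0; g < group_sizes.size(); ++g) {
        if (slot < group_sizes[g]) return g;
        slot -= group_sizes[g];
    }
    return 0; // not reached when total > 0
}
```

With groups of 64 and 16 logical CPUs, workers 0 through 63 land in group 0, workers 64 through 79 in group 1, and worker 80 wraps back to group 0.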

That’s essentially it about the first of two small but useful changes in the TBB task scheduler. With the blog having become a little longer than I expected, I think I’ll stop here for today, and write about another handy new feature next week. See you then!

Update: The machine where the problem was discovered was 80-way Westmere-Ex (not 128-way as in the original post). And the application engineer who came across the problem (Dr. Matthias Hahn) was actually the one who identified the root cause of the issue as the lack of processor groups support in TBB, which saved TBB engineers quite a bit of time.


Dmitry Vyukov

> I'll ping you as soon as we have them.

Thank you. Looking forward.

Andrey Marochko (Intel)

No, I do not have scalability numbers for such large concurrency levels so far. The machine you mentioned has recently been installed at our US site, so I think someone will run scalability benchmarks there soon. I'll ping you as soon as we have them.

Dmitry Vyukov

> starting from TBB 3.0 U4 you get support of more than 64 cores on Windows 7 absolutely for free

Does TBB still scale linearly on such a large number of cores?
I have always been interested in the point at which random scheduling gets stuck on the system's nonuniformity... It would be interesting to see benchmark results for... I think matrix multiplication is a good workload for this, on, say, a 512-hw thread machine like the one used here:
Perhaps you already have such results. Don't you?
