More threads created than expected with task_scheduler_init

It appears that in TBB 3.0, giving tbb::task_scheduler_init() a thread count less than the default number of threads still results in it creating the default number of threads. Debugging the code somewhat, I narrowed it down to market::global_market(), where we have this line:
max_num_workers = max( governor::default_num_threads() - 1, max_num_workers );

Is this line necessary? If so, what do I do if I want to bypass this behaviour? Is recompiling with __TBB_ARENA_PER_MASTER set to 0 a viable alternative?



Hi Edward,
In what scope are you declaring the task_scheduler_init object? A couple of possibilities: some TBB code was reached before that point, so the default number of threads was already in use and the subsequent task_scheduler_init was ignored. Alternatively, the task_scheduler_init is in a scope that you have already exited, in which case TBB goes back to using the default number of threads.

So it would help to see the context in which you are using it.


Yes, more threads are created now, but the extra threads are not used and should just sleep. So my recommendation for you is to do nothing. If you see any serious problem with the new behavior, please tell us. Thanks!

Hi Alexey,

The extra threads use up extra stack space so I would like to NOT have them created. What is the best modification of TBB 3.0 to do this? I tried recompiling with __TBB_USE_ARENA_PER_MASTER set to 0 and it seems to do what I expect. Or, should I just comment out that line which I pointed out above? I'm not sure what side effects that might introduce though.


Aside from the memory issue, as a software developer I'm used to having direct control over the number of threads that exist for debugging purposes. Seeing 8 threads in a debugger when I've instructed tbb to use only 2 is a little disconcerting and I believe it makes it more difficult to track down threading problems (for example, I recently tried debugging a threaded process involving 2 work threads while 6 threads were idle - making it difficult on a first glance to see which are actually being used). I'm also concerned that users inspecting the process might incorrectly assume that it is actually using all the threads that exist in memory.


Assuming an 8 core 2-way hyperthreaded machine, at 2 MByte/stack (our current default for 32-bit machines), that's 32 MByte of virtual address space out of a total of 2048 MByte, about 1.56%. That seems tiny. Though I concur that for machines with larger core counts, perhaps we should be allocating threads lazily.

Setting __TBB_USE_ARENA_PER_MASTER=0 instantiates the TBB 2.2 scheduler behavior. So you lose some of the protection from deadlock (where multiple master threads get entangled doing each other's work) in exchange for the reduced number of stacks. That might be the best workaround for now until we figure out how we should address the issue.

ajclinto has an excellent point about debugging. In fact, come to think of it, I was bothered recently by all the extra threads when I was debugging an example and switching between threads in the debugger.

I'm prototyping changes to src/tbb/private_server.cpp to make thread creation lazy. It seems straightforward so far, though I have not gotten to the "fun" part of the tricky shutdown logic. I'll report how it goes when I know more.

Hi Arch,

Do you know why we have this statement?
max_num_workers = max( governor::default_num_threads() - 1, max_num_workers );

I would have assumed that in a loosely coupled system, there would be no need for this. When re-initializing, I would assume that we just need to cancel all the current worker threads and then start up a new set of workers. Where/why are we relying on having at least default_num_threads()-1 workers?

The RML interface for acquiring threads requires stating the maximum number of threads up front. This requirement greatly simplifies implementation of the RML on top of other thread managers (e.g. Microsoft ConcRT) that have similar requirements.

TBB 3.0 introduced the notion that each master thread (user-created thread) can have a different upper bound on the number of workers. E.g., consider a machine with 8 hardware threads. If master thread X executes "task_scheduler_init init(5)", we allocate 4 workers (5 minus the master). If a master thread Y executes "task_scheduler_init init(8)" later, we allocate 7 workers, some of which are the same as the workers for thread X.

Because of the "up front" requirement of the RML interface, we request at least "governor::default_num_threads() - 1" workers up front, since that is the normal worst case. The max accounts for deliberate oversubscription by the first master thread. Oversubscription requests by later threads are ignored.

max_num_workers is a maximum. The RML does not have to deliver this number of threads, since its purpose is to regulate thread usage. So making the RML lazy about delivering threads seems like a good approach. (I almost have my changes to private_server.cpp working.)

Attached is a patch that addresses the issue. It causes src/tbb/private_server.cpp to lazily allocate threads. Please let me know if it works for you. It passed our unit tests on a few machines that I tried, but has not yet been subjected to nightly testing across all our test platforms.


Download: private_server.patch (6.84 KB)

Hi Arch,

I did some simple tests with the patch and it seems to be working as expected! Thanks very much!


Hi Arch,

I just tried upgrading to the latest stable TBB version, 4.1 update 3. However, it seems that your patch was never incorporated. Is there some reason why we do not want this in the regular release?

$ cat tbb-41-patch-max_workers
diff -urN --strip-trailing-cr tbb40.orig/src/tbb/market.cpp tbb40/src/tbb/market.cpp
--- tbb40.orig/src/tbb/market.cpp       2011-12-15 07:05:00.000000000 -0500
+++ tbb40/src/tbb/market.cpp    2012-03-16 15:43:24.953426500 -0400
@@ -102,9 +102,9 @@
             runtime_warning( "Newer master request for larger stack cannot be satisfied\n" );
     else {
-        max_num_workers = max( governor::default_num_threads() - 1, max_num_workers );
+        max_num_workers = min( governor::default_num_threads() - 1, max_num_workers );
         // at least 1 worker is required to support starvation resistant tasks
-        if( max_num_workers==0 ) max_num_workers = 1;
+        if( max_num_workers<=0 ) max_num_workers = 1;
         // Create the global market instance
         size_t size = sizeof(market);

Hi, Edward

Arch's patch to create worker threads lazily has indeed been incorporated and is available in TBB update releases, including TBB 4.1 U3.

I don't know if the changes to market.cpp you quoted were in the patch Arch uploaded here on 5/20/2010. Those changes were not incorporated because we felt they were not needed for the lazy worker thread creation, and found no compelling reasons to make the changes.

Have you experienced some unexpected behaviors with TBB 4.1 U3?


Sorry, I was mistaken that the quoted patch was in there, although it is related.

The quoted patch is necessary because otherwise there is no way to force the number of worker threads to be less than the number of cores on the system. This is extremely important in server farms, where some central authority schedules processes and needs to impose a resource limit on the scheduled jobs. For debugging purposes, it is also useful to have an easy way to make an application run single-threaded, both for comparison and for ease of stepping through in a debugger.

Since this patch seems to have never been incorporated (I thought I had submitted all our patches), would you consider it? If so, I can officially submit it through the web submission form.


Sorry again, it's been a long time since I last read Arch's post above regarding the use of max() instead of min(). However, if you don't make this change, then I do not think there's an easy switch to use less concurrency than the number of cores, right?

Hi Edward,

While I understand the problem and the motivation for the patch, as the architect* I will not accept it into TBB, because it would preclude some useful scenarios (e.g. resource partitioning between multiple application threads), and would result in backward-incompatible behavior impacting existing applications.

Instead, we will add a mechanism for better control over global TBB settings, including the default number of threads. Most likely, it will be a special "policy" class, whose instances affect TBB behavior for their lifetime. So you will be able to specify the desired global concurrency limit for TBB, which will be treated in the same way as HW concurrency currently is. The work is in progress, and I expect this to be released before the end of the year (though it's not a commitment).

Update: one possible way to limit the desired concurrency that works with TBB now is to set a certain process affinity mask that limits the application to only a subset of available HW threads/cores. TBB respects process affinity when it defines the default number of workers, so if you specify which cores/threads should be used, just as many workers will be created. Not sure if this approach is suitable for your use cases, but I think it's worth mentioning anyway.

* I took over the architect role from Arch some time ago.


Alexey Kukanov (Intel) wrote:

Instead, we will add a mechanism for better control over global TBB settings, including the default number of threads. Most likely, it will be a special "policy" class, whose instances affect TBB behavior for their lifetime. So you will be able to specify the desired global concurrency limit for TBB, which will be treated in the same way as HW concurrency currently is. The work is in progress, and I expect this to be released before the end of the year (though it's not a commitment).

What you describe sounds like a good approach to achieving the same end goal. I'm looking forward to when I don't need this patch anymore! I'll keep the patch in the meantime, since it has been working for us for over 3 years now.

Thanks, Alexey!

A suggested modification to Alexey's suggestion is to create a class/struct with a ctor that sets the process affinity mask to the desired number. Create a static object that, when constructed, sets the desired process affinity mask; then ensure that this static object is loaded first by the linker so that its ctor runs prior to any other static object ctor that might instantiate the TBB thread pool. Once in main, after a TBB init, call an additional member function on the static object to restore the process affinity mask to what it was at the start of the application (or some other value if you so desire).

Jim Dempsey

That's a good idea, Jim, thanks for suggesting it. I think it may already be possible to do with task_scheduler_observer; I need to check.
