Deadlock on tbbmalloc process shutdown


I'm currently using tbb40_278oss compiled with Visual Studio 2008 SP1. On occasion, I have an app which hangs upon exit. The only unusual thing I can think of is that my main() function has a stack object which performs concurrent deallocation via tbb::parallel_for. The memory being explicitly deallocated in the body just uses the regular delete, which goes into the CRT (de)allocator. No TBB threads are created until the stack object's destructor is called as main() finishes. During the stack object's destruction, tbb::parallel_for() creates tasks (and the threads, for the first time), which results in tbb::scalable_aligned_malloc() being called. Is this type of situation supported by TBB?
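A minimal sketch of the shape of that setup, with no TBB dependency: std::thread stands in for tbb::parallel_for, and the names ParallelDeleter and g_freed are made up for illustration. Unlike TBB workers, these threads are joined inside the destructor, so this shows the structure only and is not expected to reproduce the hang.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical counter so the effect of the destructor can be observed.
std::atomic<std::size_t> g_freed{0};

// Hypothetical stand-in for the stack object in main().
struct ParallelDeleter {
    std::vector<int*> items;

    explicit ParallelDeleter(std::size_t n) {
        items.reserve(n);
        for (std::size_t i = 0; i < n; ++i) items.push_back(new int(0));
    }

    // Worker threads are created here for the first time, while main() is finishing.
    ~ParallelDeleter() {
        const std::size_t workers = 4;
        std::vector<std::thread> pool;
        for (std::size_t w = 0; w < workers; ++w) {
            pool.emplace_back([this, w, workers] {
                // Each worker frees a strided slice with plain delete.
                for (std::size_t i = w; i < items.size(); i += workers) {
                    delete items[i];
                    ++g_freed;
                }
            });
        }
        for (std::thread& t : pool) t.join();
    }
};
```

Declaring `ParallelDeleter d(1000);` as a local in main() gives the same ordering as described: the first worker threads appear only while main() is unwinding, immediately before process exit.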

Here's the stack trace from where it has been deadlocked for over 2 hours:

ntdll.dll!NtYieldExecution()  + 0xa bytes    
KernelBase.dll!SwitchToThread()  + 0x1d bytes    
tbbmalloc.dll!rml::internal::removeBackRef(rml::internal::BackRefIdx backRefIdx={...})  Line 250 + 0x3f bytes
tbbmalloc.dll!rml::internal::FreeBlockPool::releaseAllBlocks()  Line 1349
tbbmalloc.dll!rml::internal::ExtMemoryPool::release16KBCaches()  Line 512 + 0x16 bytes
tbbmalloc.dll!DllMain(HINSTANCE__ * hInst=0x0000000000000001, unsigned long callReason=0x00000000, void * __formal=0x0000000000000000)  Line 228 + 0x47 bytes
tbbmalloc.dll!__DllMainCRTStartup(void * hDllHandle=0x00000000042a3f80, unsigned long dwReason=0x042a3eb0, void * lpreserved=0x000000006b8b93dc)  Line 546 + 0xd bytes
ntdll.dll!LdrShutdownProcess()  + 0x1d1 bytes
ntdll.dll!RtlExitUserProcess()  + 0x90 bytes
msvcr90.dll!doexit(int code=0x00000000, int quick=0x00000000, int retcaller=0x00000000)  Line 644 + 0x11 bytes

What is happening here is that removeBackRef() is called with backRefIdx = { master = 0xffff, largeObj = 1, offset = 0x7fff } and hangs forever waiting to obtain currBlock->blockMutex. In the debugger, currBlock->blockMutex appears to have a value of 1 when I trace through the disassembly, so it seems to have been accidentally left locked at some point. At the point where it hangs, all threads except the main thread have already been destroyed.

The other possibility, of course, is that there is some corruption going on, but that's not easy to debug with TBB because of its own allocator. I've already tried running everything in debug, to no avail. I find myself wishing there were some way to globally force TBB to use the CRT debug allocator.

Any thoughts, ideas, or pointers appreciated!



Assuming you do not have a static object calling TBB...
What happens if you place a tbb init at the start of main?

>> currBlock->blockMutex has a value of 1

If this occurs during an early exit of your program, then perhaps you can place a data change breakpoint on the blockMutex and record all the places where lock and unlock are called from. Note, do not record the locations inside TBB, rather record the locations within your application (which may be several stack levels up).
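If a data breakpoint is impractical, a software analogue of the same idea is to wrap the lock and log who takes and releases it. The sketch below is purely illustrative: TracedMutex and LockEvent are hypothetical names, nothing here is part of TBB, and applying it to blockMutex would mean patching the tbbmalloc sources.

```cpp
#include <mutex>
#include <string>
#include <vector>

// Hypothetical diagnostic record: one entry per lock or unlock call.
struct LockEvent {
    std::string site;  // caller-supplied label, e.g. "file:line" in the application
    bool acquired;     // true for lock, false for unlock
};

// Hypothetical wrapper that remembers which call site locked and unlocked.
class TracedMutex {
public:
    void lock(const std::string& site) {
        inner_.lock();
        log_.push_back({site, true});   // logged while inner_ is held
    }
    void unlock(const std::string& site) {
        log_.push_back({site, false});  // logged before inner_ is released
        inner_.unlock();
    }
    // Call only when no other thread is using the mutex.
    // Returns the site of a lock with no matching unlock, or "" if balanced.
    std::string unmatched_lock_site() const {
        std::string holder;
        for (const LockEvent& e : log_) holder = e.acquired ? e.site : std::string();
        return holder;
    }
private:
    std::mutex inner_;
    std::vector<LockEvent> log_;
};
```

Because lock and unlock events on one mutex strictly alternate, the last logged event is enough to tell whether the mutex was left held and by which site.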

Jim Dempsey


Thanks for the report! Sure, the TBB allocator must support such usage.

Could you please try a recent TBB allocator, i.e. the one from TBB 4.1u3? There is no need to rebuild your application; replacing the DLL is enough. If the issue remains, please report stack traces for all threads in the application, as this might be helpful.

As for disabling the TBB allocator, it can be done by removing the allocator DLL from the directory where the TBB DLL that your application uses is placed.

Are you by chance explicitly terminating TBB thread(s) by using _exit() or some other such way?

Thanks for the suggestions, Alexandr and Jim!

Following on the advice to look at when currBlock->blockMutex is locked/unlocked, I've now explored the problem some more. I should mention that I'm on Windows 7 64-bit and have everything (including TBB) compiled using Visual Studio 2008.

I now have a theory of why the app might be hanging, but let me first give the current findings:

  • In the main() function, it runs single threaded, but makes use of TBB concurrent data structures. These cause tbbmalloc to be used. In fact, the first newBackRef() call I find actually originates during static initialization time.
  • During the stack object destructors in main(), we spawn the TBB threads and do various work in parallel
  • A separate native "timer" thread is also spawned while these destructors run. The timer thread creates a tbb::task_scheduler_init object with the default "number of cores" value, then runs in an infinite loop, incrementing an atomic counter at regular time intervals. Since the task_scheduler_init object calls __TBB_InitOnce::add_ref() and its destructor is never called, governor::release_resources() will never be called: the reference count never reaches 0. But that actually doesn't matter because mallocProcessShutdownNotification() gets called before tbb_main's DllMain() gets its DLL_PROCESS_DETACH notification.
  • In private_work::run(), create_one_job() will call into newBackRef(), where we do a scoped lock of currBlock->blockMutex. And we do this repeatedly in the TBB worker threads.
  • After main()'s stack objects are all destroyed, the ExitProcess() system call is issued. The rest of the details below occur before ExitProcess() finishes.
  • Now all the threads are destroyed without fail. Surprisingly, neither tbb's nor tbbmalloc's DllMain() gets DLL_THREAD_DETACH notifications.
  • private_work::run() just quits; it never even makes it to the myclient.cleanup(j) call.
  • Now the DllMain()s get DLL_PROCESS_DETACH notifications. tbbmalloc gets it before tbb. So at this point, mallocProcessShutdownNotification() gets called, which makes calls to removeBackRef().
  • At some point after this, the process completes.

Ok, so now for the pet theory. I think that newBackRef() must be running with currBlock->blockMutex acquired when the thread is destroyed by the OS. So all the threads except for the main thread are destroyed, then mallocProcessShutdownNotification() is called to free up the resources but then blocks forever on the currBlock->blockMutex that will never be released.
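The theory can be modelled in a few lines of portable C++. This is a simplified sketch under assumptions: SpinLock is a made-up stand-in for blockMutex, the "killed" worker is modelled as a function that returns without unlocking, and the shutdown path is given a bounded number of attempts so the demonstration terminates (the code in the stack trace appears to yield and retry without a bound).

```cpp
#include <atomic>
#include <thread>

// Simplified spin lock; an assumption standing in for blockMutex.
struct SpinLock {
    std::atomic<bool> held{false};
    bool try_lock() { return !held.exchange(true, std::memory_order_acquire); }
    void unlock() { held.store(false, std::memory_order_release); }
};

// Models a worker destroyed by the OS between lock and unlock:
// the function returns with the lock still held.
void worker_killed_mid_critical_section(SpinLock& m) {
    while (!m.try_lock()) std::this_thread::yield();
    // ... the thread is torn down here; unlock() never runs ...
}

// Models the shutdown path with a bounded retry count.
// Returns true if the lock could be acquired, false if it stayed held.
bool shutdown_can_acquire(SpinLock& m, int max_attempts) {
    for (int i = 0; i < max_attempts; ++i) {
        if (m.try_lock()) {
            m.unlock();
            return true;
        }
        std::this_thread::yield();  // yield and retry, like the SwitchToThread() frame
    }
    return false;
}
```

Once the worker has "died" holding the lock, no number of retries on the main thread can succeed, because the only code that would have released it no longer exists.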

Other details:

  • I run my test case in a loop and the problem only occurs in about 1 in 2000 invocations. For builds which do not reproduce, I've reached as high as 30,000 invocations before I gave up and declared that it cannot be reproduced there. I could not reproduce in a debug build.
  • Renaming the tbbmalloc.dll that is used causes the problem to go away. In this situation, both release and debug builds fail to reproduce the problem.

More thoughts, ideas, comments? :) I haven't had a chance to try or look at the newer tbb41 code yet. If any of these details seem like they might already be fixed, please let me know. :) Tomorrow, I'll perhaps try an instrumented tbb release build to see if I can narrow down what happens when we get the hang. The above details were just me playing with the debug build looking for problem areas. Thanks!

PS. As a hacky workaround, does it make sense for me to patch in some code for Windows only? I haven't tried hard to reproduce on Linux or OSX but I haven't seen this occur in any of our non-Windows continuous integration servers. Since we know that mallocProcessShutdownNotification() is only called from the main thread, single threaded on Windows, some options come to mind:

  • Don't call mallocProcessShutdownNotification() at all. Is this necessary for TBB to shut down properly?
  • If we encounter a locked backRef, just skip it altogether. For these situations, the currBlock is probably in an inconsistent state.
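The second option could look roughly like the sketch below. It is not tbbmalloc code: Block and release_unlocked_blocks are hypothetical names, the atomic flag is a stand-in for blockMutex, and it is only reasonable when the shutdown path is known to be single-threaded, as assumed above.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a back-reference block; not tbbmalloc's real layout.
struct Block {
    std::atomic<bool> blockMutex{false};  // true means held
    bool released = false;
};

// Releases every block whose lock can be taken immediately and skips the rest.
// Returns the number of blocks skipped.
std::size_t release_unlocked_blocks(std::vector<Block>& blocks) {
    std::size_t skipped = 0;
    for (Block& b : blocks) {
        if (b.blockMutex.exchange(true, std::memory_order_acquire)) {
            // Already held, presumably by a thread that no longer exists.
            // The block may be in an inconsistent state, so leave it alone.
            ++skipped;
            continue;
        }
        b.released = true;  // placeholder for the real cleanup work
        b.blockMutex.store(false, std::memory_order_release);
    }
    return skipped;
}
```

The trade-off is that skipped blocks are simply leaked, which is harmless at process exit because the OS reclaims the memory anyway.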

What happens when you launch your time ticker thread with _beginthread(yourTimerFunction)?

Jim Dempsey


jimdempseyatthecove wrote:
What happens when you launch your time ticker thread with _beginthread(yourTimerFunction)?

Why do you think that might make a difference? Unfortunately, I changed the thread code to use _beginthreadex() for a minimal conversion and the bug still reproduces.

Then place a breakpoint on ExitProcess() and look at the state of the system. It may be hard to catch the state of the additional threads at the moment of the breakpoint (there is a time interval between when the break occurs and when the debugger stops the additional threads). You might notice something out of the ordinary (before you start getting errors).

Jim Dempsey

Yes, that's what I already did, which led me to the theory that there's a window in the shutdown process in which a TBB worker thread can be destroyed while a blockMutex is acquired, leading to the hang when tbbmalloc shuts down.

Does anyone know whether there are any guarantees that this cannot or should not happen?

FWIW, the newer TBB version seems to fix the problem although given the rare nature of it, I can't be totally sure. The code around there definitely has changed though; the loop in hardCachesCleanup() has been moved to a different place. So I guess I'll have to just get cracking and update to the newer TBB version for good.

Would appreciate some confirmation that the tbbmalloc does guarantee that if we happen to destroy a thread with blockMutex acquired in newBackRef(), then it is still safe to call removeBackRef() without hanging.

>>Would appreciate some confirmation that the tbbmalloc does guarantee that if we happen to destroy a thread with blockMutex acquired in newBackRef(), then it is still safe to call removeBackRef() without hanging.

It would seem to me to be inherently unsafe to destroy a thread that holds an acquired mutex. It would be better for the thread-destruction process to wait until all mutexes held by the thread have been released. This may not be possible in all cases, so the objects protected by a mutex would have to be written to tolerate a thread tear-down, but I do not think this feature is available. The lock (mutex) would need a ctor supplied with a functor to be used when a thread is torn down while holding the lock. This may be a non-trivial coding task. RTM/TSX may be helpful in this regard.

Jim Dempsey

Jim, I agree with you in general, but as a user of TBB, I do not think I'm doing anything unusual. If my theory is correct, then all the app needs to do to cause this type of hang is to simply use TBB as intended, i.e. just make some tbb::parallel_for() calls in main(). The TBB threads will start, make tbbmalloc allocations in the worker threads, and if the process exits too fast, we end up hanging in tbbmalloc's process shutdown code.
