Possible problem in scalable allocator

I only briefly skimmed through the code in tbbmalloc/MemoryAllocator.cpp, and I was not trying to model the situation, but it looks like something that can cause unbounded memory consumption in scalable_malloc.

It's about the returnEmptyBlock() function. It's called from scalable_free() when we are freeing our own block, and it's called from mallocThreadShutdownNotification(). But it is NOT called on the 'remote free' path (neither from freePublicObject() nor from getPublicFreeListBlock()).
Assume the following situation. Thread 1 allocated a big amount of memory (a large number of small objects), then thread 2 frees them. No free blocks will be returned to the global pool; all memory will remain cached in thread 1. Then 2 scenarios are possible.
(1) Thread 1 doesn't make any further calls to scalable_malloc (assume it is the 'main' thread, which has done the 'initialization' and then blocks waiting for program completion). Memory still remains cached and unused.
(2) Thread 1 makes some further calls to scalable_malloc. Memory can be reused. But if thread 1 again doesn't call scalable_free() itself, free blocks will not be returned to the global pool.
And unbounded memory consumption can occur if such a pattern repeats recurrently. Assume thread 1 allocates memory, and thread 2 frees it. Then thread 2 allocates memory, and thread 3 frees it. Then thread 3 allocates memory, and thread 4 frees it. And so on.

The simple solution is to provide a function scalable_compact_thread_heap(), which must be called by the user episodically or periodically (if he notices some problems). The hard solution is automatic reclamation of excessive unused memory.

Am I missing something? Or is the problem not worth solving? What do you think?

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net

"I only briefly skimmed through the code in tbbmalloc/MemoryAllocator.cpp, I was not trying to model the situation, but it looks like thing which can cause unbounded memory consumption in scalable_malloc." See getPublicFreeListBlock() and privatizePublicFreeList(): scalable_malloc() will not get new memory while publicly freed objects exist.

Quoting - Raf Schietekat

"I only briefly skimmed through the code in tbbmalloc/MemoryAllocator.cpp, I was not trying to model the situation, but it looks like thing which can cause unbounded memory consumption in scalable_malloc." See getPublicFreeListBlock() and privatizePublicFreeList(): scalable_malloc() will not get new memory while publicly freed objects exist.

I understand that the thread will not allocate new memory while there are free items in its blocks.

I am talking about a different thing. 2 possible scenarios:

(1) the thread itself frees the last item in its own block

(2) another thread frees the last item in the thread's block

Behavior differs between these scenarios. In (1), free blocks go to the global pool; in (2), free blocks stay in the thread's cache. Imho, (2) can lead to problems sometimes.


"I am talking about different thing." Sorry for assuming it might be so simple. I did more than skim myself, but I haven't examined everything either. Even though the required scenario seems unlikely, my friend Murphy always tells me that that should not lull one into a false sense of security. However, each thread is also occupying a stack, so the thread's maximum allocations would have to be significant relative to the stack size, otherwise we're talking about a situation where the program would otherwise just skip through. But I think I should have passed on this one.

Quoting - Raf Schietekat

"I am talking about different thing." Sorry for assuming it might be so simple. I did more than skim myself, but I haven't examined everything either. Even though the required scenario seems unlikely, my friend Murphy always tells me that that should not lull one into a false sense of security. However, each thread is also occupying a stack, so the thread's maximum allocations would have to be significant relative to the stack size, otherwise we're talking about a situation where the program would otherwise just skip through. But I think I should have passed on this one.

Yes, this is definitely a rare scenario. But on the other hand, we are talking about a general-purpose memory allocator...

For example, maybe what I've described causes problems here:

http://software.intel.com/en-us/forums/showthread.php?t=61624

Who knows?

And yes, I am talking about large amounts of memory cached. 2 free blocks stay in the thread cache anyway.

A thread stack can be around 1 MB (Mr. Joe The Programmer can actually create 10,000 threads with 64 KB stacks). And at the same time a thread can allocate hundreds of megabytes.


Yes, in order for a block emptied by a "foreign" (aka remote) thread to be moved to the global pool, it should first pass through reclamation of objects by the owning thread. This is by design, to keep all "foreign" activity on the block clearly separate from the local hot path. Returning an empty block remotely would require changing the list of blocks in the bin. The list is doubly linked, and blocks can be removed from the middle (see outOfTLSBin), so I doubt any non-blocking schemes would succeed without incurring significant redesign with unknown performance impact; so the practical solution is to lock two subsequent blocks in the list on removal, and also lock a block on traversal or any other operation that requires dereferencing pointers to its neighbours. Also, letting a remote thread understand that the block is empty would require atomic operations on the counter of allocated objects; yet another penalty.

So while you are correct that there are corner case scenarios leading to memory blowup, I do not see how they could be resolved without performance penalty on the hot local path, and I do not feel convinced that these rare corner cases deserve the performance cost.

Quoting - Alexey Kukanov (Intel)

So while you are correct that there are corner case scenarios leading to memory blowup, I do not see how they could be resolved without performance penalty on the hot local path, and I do not feel convinced that these rare corner cases deserve the performance cost.

And what do you think about providing scalable_compact_thread_heap(), which could be called by the user manually? Or about periodic background compacting?


Quoting - Dmitriy V'jukov

And what do you think about providing scalable_compact_thread_heap(), which could be called by the user manually? Or about periodic background compacting?

I think the function would be mostly useless; I bet no user would be keen on calling this function at unknown times. Second, it could only collect unused blocks in the calling thread, because otherwise it would have to impose all the same synchronization needs I described above - so users would need to call this in every active thread.

As for periodic background compacting - it sounds like a poor man's garbage collector, in the sense that it has none of the benefits of real garbage collection but unfortunately has some of its shortcomings, such as rather significant overhead at random times. Though if the points to call this function internally are well defined and already have rather big overhead, it might be worth looking into.

Quoting - Alexey Kukanov (Intel)

I think the function would be mostly useless; I bet no user would be keen on calling this function at unknown times. Second, it could only collect unused blocks in the calling thread, because otherwise it would have to impose all the same synchronization needs I described above - so users would need to call this in every active thread.

As for periodic background compacting - it sounds like a poor man's garbage collector, in the sense that it has none of the benefits of real garbage collection but unfortunately has some of its shortcomings, such as rather significant overhead at random times. Though if the points to call this function internally are well defined and already have rather big overhead, it might be worth looking into.

I agree that periodic background compacting might not be such a great idea, but if it could be done well I'd probably use it.

However, I think the most important point is that, as Dmitriy points out, the border-case problem does exist, and that's ok in my book, but it MUST definitely be documented in the TBB guide and perhaps the tutorial docs (can't remember if the tutorial doc mentions allocators, but if it does then a mention there is necessary). Although the case is rare, the poor programmer that runs into this problem will be at a total loss, since not many people would suspect the general allocator of causing such an issue, thus possibly costing the poor soul days of painstaking debugging (if he's lucky enough to notice the error during development).

Also, considering the border case as it stands, I could imagine a scenario where someone uses some threads to create objects (say using factories) and then hands them over to be administrated by other threads (a good design would require objects to be returned to the factory's sink, but too many times I've seen these types of objects simply get deleted). So it's not as unlikely a scenario in real life, since this usage is directly related to a well-used design pattern, and without any notice in the docs there is no way to forecast the issue or take it into consideration when designing the system, so it'll likely show up when fixing the issue will cost a lot (lots of refactoring to be done, and possible damages and downtimes).

What would be the cost of having all local memory reachable from other threads, using a spin mutex on all local memory, and writing an atomic time stamp each time control passes through local-memory administration? That way the local thread could do reclamation after timeout1 and/or N other operations, and other threads could intervene after timeout2 before getting another 1-MB chunk, i.e., the mutex would almost never be locked by another thread.

Quoting - Alexey Kukanov (Intel)
As for periodic background compacting - it sounds like a poor man's garbage collector, in the sense that it has none of the benefits of real garbage collection but unfortunately has some of its shortcomings, such as rather significant overhead at random times. Though if the points to call this function internally are well defined and already have rather big overhead, it might be worth looking into.

I thought more about it... Some relief could be provided by reducing the (already low) probability of memory blowup, if the search for objects is extended with the following two rules:

  1. For a given request size, if the search through existing local blocks was unsuccessful, switch to the next bigger allocation size, and try to reclaim some remotely freed memory there (before going to the global list of empty blocks).
  2. If there are no more empty blocks in the global storage, do full reclamation for every allocation size in the local heap (before requesting more memory via an OS system call).

Does that make sense to you? The intuition behind the proposed actions is that their overhead should be roughly equal to what the next step (in parentheses) would introduce. Note that the actions are still local to the thread heap, and the protocol of remote deallocation does not change.

Quoting - Dmitriy V'jukov

I only briefly skimmed through the code in tbbmalloc/MemoryAllocator.cpp, and I was not trying to model the situation, but it looks like something that can cause unbounded memory consumption in scalable_malloc.

It's about the returnEmptyBlock() function. It's called from scalable_free() when we are freeing our own block, and it's called from mallocThreadShutdownNotification(). But it is NOT called on the 'remote free' path (neither from freePublicObject() nor from getPublicFreeListBlock()).
Assume the following situation. Thread 1 allocated a big amount of memory (a large number of small objects), then thread 2 frees them. No free blocks will be returned to the global pool; all memory will remain cached in thread 1. Then 2 scenarios are possible.
(1) Thread 1 doesn't make any further calls to scalable_malloc (assume it is the 'main' thread, which has done the 'initialization' and then blocks waiting for program completion). Memory still remains cached and unused.
(2) Thread 1 makes some further calls to scalable_malloc. Memory can be reused. But if thread 1 again doesn't call scalable_free() itself, free blocks will not be returned to the global pool.
And unbounded memory consumption can occur if such a pattern repeats recurrently. Assume thread 1 allocates memory, and thread 2 frees it. Then thread 2 allocates memory, and thread 3 frees it. Then thread 3 allocates memory, and thread 4 frees it. And so on.

The simple solution is to provide a function scalable_compact_thread_heap(), which must be called by the user episodically or periodically (if he notices some problems). The hard solution is automatic reclamation of excessive unused memory.

Am I missing something? Or is the problem not worth solving? What do you think?

Hi,

Just wanted to confirm your observation....

We just started testing TBB for our application.

We are experiencing the unbounded memory growth as explained. The application has a thread pool (around 100+ threads); it allocates memory in many threads, and all the memory pointers are stored in one place. At the end of a transaction it goes and deletes all the memory from the stored location.

We have traffic of 4-5 transactions/sec.

When we enabled TBB, it started with 4 GB and then grew to 30 GB and then crashed. It took almost 250 K transactions for it to crash.

Without TBB, another standard test took 15 GB and is running fine.

Thank you,

Sadbhaw

Quoting - Alexey Kukanov (Intel)

I thought more about it... Some relief could be provided by reducing the (already low) probability of memory blowup, if the search for objects is extended with the following two rules:

  1. For a given request size, if the search through existing local blocks was unsuccessful, switch to the next bigger allocation size, and try to reclaim some remotely freed memory there (before going to the global list of empty blocks).
  2. If there are no more empty blocks in the global storage, do full reclamation for every allocation size in the local heap (before requesting more memory via an OS system call).

Does that make sense to you? The intuition behind the proposed actions is that their overhead should be roughly equal to what the next step (in parentheses) would introduce. Note that the actions are still local to the thread heap, and the protocol of remote deallocation does not change.

Ok, my previous understanding might have been mistaken, but if I've got it right now, step 1 is a sure winner; step 2 is something that would need to be measured.

I'm not sure if it's faster or not than calling malloc (not necessarily the OS), and even the OS allocation depends on the platform and what malloc does (for example, I once made the Hoard allocator the standard allocator and had TBB call it, while Hoard then calls the OS when it deems necessary; with the double indirection I had no performance gains and took it out, but my point is you cannot safely know what crazy things a simple call to malloc will do).

Another way to reclaim memory in step 2 is how the Lua (VM) does GC, for example: you make a few steps and bail out when you're ok, instead of collecting everything. And the next time you need memory, you continue from where you left off.

Robert, thanks for the feedback.

Quoting - robert.jay.gould
Another way to reclaim memory in step 2 is how the Lua (VM) does GC, for example: you make a few steps and bail out when you're ok, instead of collecting everything. And the next time you need memory, you continue from where you left off.

This is what the allocator does now, with the only difference that the search is limited to just one bin. Step 1 will extend it to a neighboring bin. As for step 2, well, it is unclear without measurements whether full reclamation through all bins will have significant impact or not. If it does, then I agree the search can be performed until a single empty block is found.

Quoting - robert.jay.gould

I agree that periodic background compacting might not be such a great idea, but if it could be done well I'd probably use it.

However I think that the most important point is that as Dmitriy points out the border-case problem does exist, and that's ok in my book, but it MUST definitelybe documented in the tbb guide and perhaps the tutorial docs (can't remember if the tutorial doc mentions allocators but if it does then a mention there isnecessary). Although the case is rare, the poor programmer that runs into this problem will be at a total loss, since not many people would feel like the general allocator is causing such an issue, thus possibly causing the poor soul days of painstaking debugging (if he's lucky enough to notice the error during development).

Also considering the border-case as it stands, I could imagine a scenario where someone uses some threads to create objects (say using factories) and then handing them over to be administrated by other threads (a good design would require objects to be returned to the factory's sink, but too many times I've seen these types of objects simply get deleted), so it's not as an unlikely scenario in real life since this usage is directly related to a well used design pattern, and without any notice in the docs there is no way toforecastthe issue or take it into consideration when designing the system, so it'll likely show up when fixing the issue will cost a lot (lots of refactoring to be done, and possible damages and downtimes).

The jury sitting in my brain is still out on the issue of whether it requires fixing or not. But these are good points, Robert.

If the factory thread constantly allocates objects, then all memory will be reclaimed and reused; there will be no problems. Before getting a block from the global pool, a thread checks whether there are outstanding remote deallocations.

Problems arise when a thread allocates a huge amount of memory, other threads free that memory, AND the first thread doesn't make any further allocations of the same size. So it must be some unusual situation that triggers unbounded memory consumption.


Quoting - Alexey Kukanov (Intel)

I think the function would be mostly useless; I bet no user would be keen on calling this function at unknown times. Second, it could only collect unused blocks in the calling thread, because otherwise it would have to impose all the same synchronization needs I described above - so users would need to call this in every active thread.

Yes, it would have to be called from every thread (only by applications that observe some problems). Server software (which is most susceptible to the problem because of its long-running-ness (24x7)) has natural points to call such a function - after every N processed requests. Also, the function can be called before blocking for a long time.

Well, it is not a very beautiful solution, but it can be not so bad from a technical point of view.


Quoting - Dmitriy V'jukov

Yes, it would have to be called from every thread (only by applications that observe some problems). Server software (which is most susceptible to the problem because of its long-running-ness (24x7)) has natural points to call such a function - after every N processed requests. Also, the function can be called before blocking for a long time.

Well, it is not a very beautiful solution, but it can be not so bad from a technical point of view.

Another possible solution is to provide a special macro (something like TBB_MALLOC_LOW_FRAGMENTATION_MODE) which will switch the allocator to a low memory consumption mode... to a mode more careful about memory consumption. The allocator will stay distributed, local allocation and local free paths will stay relatively lightweight, but threads will try harder to get rid of unused memory (possibly imposing overheads even on the fast path... but, well, being blazingly fast is not the only ultimate goal of a general-purpose memory allocator, right? :) ).

The good news is that if the user is satisfied with the current behaviour and doesn't define TBB_MALLOC_LOW_FRAGMENTATION_MODE, then he won't get any additional overheads.

I think that the authors of the following posts would be totally satisfied with such a solution:

http://software.intel.com/en-us/forums/showthread.php?t=61624

http://software.intel.com/en-us/forums/showpost.php?p=69500


Quoting - Alexey Kukanov (Intel)

Does that make sense to you? The intuition behind the proposed actions is that their overhead should be roughly equal to what the next step (in parentheses) would introduce. Note that the actions are still local to the thread heap, and the protocol of remote deallocation does not change.

If the problem requires fixing, then I think it's a step in the right direction. Though I think it will be helpful to first better identify real application patterns that can cause bad behaviour. Because, for example, if the thread that allocates a huge amount of memory gets blocked (for example, waiting for user input in a console, or taken out of circulation in the thread pool), then such a solution won't help...


Quoting - sadbhaw

Hi,

Just wanted to confirm your observation....

We just started testing TBB for our application.

We are experiencing the unbounded memory growth as explained. The application has a thread pool (around 100+ threads); it allocates memory in many threads, and all the memory pointers are stored in one place. At the end of a transaction it goes and deletes all the memory from the stored location.

We have traffic of 4-5 transactions/sec.

When we enabled TBB, it started with 4 GB and then grew to 30 GB and then crashed. It took almost 250 K transactions for it to crash.

Without TBB, another standard test took 15 GB and is running fine.

Thank you,

Sadbhaw

How much memory does a thread allocate during transaction processing?

In what order are threads extracted from the thread pool? Is it possible that some threads get "starved" in the thread pool, i.e. remain inactive there for too long?


Quoting - Alexey Kukanov (Intel)

As for periodic background compacting - it sounds like a poor man's garbage collector, in the sense that it has no benefits of real garbage collection but unfortunately has some shortcomings of it, such as rather significant overhead at random time. Though might be if the points to call this function internally are well defined and do already have rather big overhead, it might be worth looking into.

A poor man's GC can be not as bad for a native non-managed environment as one might think. The point is to get the best of reactive manual memory management and the best of GC. Reactive manual memory management will provide promptness and the usual behaviour for the native world, and GC will provide more guarantees for various corner cases. I believe that the GC can run really infrequently (for example, once every 1-10 seconds) and impose really low overheads on fast paths. It will just be watching that everything is going smoothly and aptly.

The negative side is that even poor man's GC requires substantial amount of coding and redesign...


Quoting - Dmitriy V'jukov

A poor man's GC can be not as bad for a native non-managed environment as one might think. The point is to get the best of reactive manual memory management and the best of GC. Reactive manual memory management will provide promptness and the usual behaviour for the native world, and GC will provide more guarantees for various corner cases. I believe that the GC can run really infrequently (for example, once every 1-10 seconds) and impose really low overheads on fast paths. It will just be watching that everything is going smoothly and aptly.

The negative side is that even poor man's GC requires substantial amount of coding and redesign...

Let me see here. You describe the problem as follows:

Thread A allocates a shi%load of objects (many MB worth).
Thread B allocates a shi%load of objects (many MB worth).
Thread C allocates a shi%load of objects (many MB worth).
Thread D allocates a shi%load of objects (many MB worth).
Thread E allocates a shi%load of objects (many MB worth).
Threads A-E do not allocate any more...

Thread F frees all of those objects onto the remote list.

BAM!!!!

All objects leaked for the duration of the program! OH SHI%!

;^(...
Is that right? Well, AFAICT, the vZOOM allocator "attempts" to solve this potential problem by putting an upper bound on leaked memory. It can do this with a single-word CAS in the remote free portion, but let me show you how it does it with DWCAS:

/* very quick pseudo-code w/ "some/a lot" of detail omitted for brevity */


#define REMOTE_FREE_THRESHOLD 1024 /* whatever */


struct block {
  struct block* next;
};


struct remote_free_anchor {
  struct block* head;
  int32_t count;
};


struct slab {
  struct per_thread* owner;
  struct remote_free_anchor flist;
  /* [...] */
};


void free(void* mem) {
  struct block* block = mem;
  struct per_thread* thread = pthread_getspecific(...);
  struct slab* slab = get_slab_from_block(block);
  if (slab->owner == thread) {
    /* [...] local fast path */
  } else {
    /* forget remote cache for now... */
    struct remote_free_anchor xchg, cmp = slab->flist;
    do {
      block->next = cmp.head;
      if (cmp.count > REMOTE_FREE_THRESHOLD) {
        /* detach the whole remote free list */
        xchg.head = NULL;
        xchg.count = 0;
      } else {
        xchg.head = block;
        xchg.count = cmp.count + 1;
      }
    } while (! DWCAS(&slab->flist, &cmp, &xchg)); /* DWCAS reloads cmp on failure */
    if (cmp.count > REMOTE_FREE_THRESHOLD) {
      /* we detached the list; hand it (including this block) to the global heap */
      while (block) {
        struct block* next = block->next;
        coalesce_into_global_heap(block);
        block = next;
      }
    }
  }
}

AFAICT, there can be an upper bound on leaked objects per-size-bin... Also, vZOOM has per-bin thresholds... Any thoughts?

Sorry, I made up my mind: I refuse to go into any more impl detail. It's a trade secret.

One hint...

the threshold must be ___equal___ to the total number of blocks in a given segregated slab size... It's not a fixed global/#define.
TBB can create a robust solution themselves...
:^|

Quoting - CHRIS M THOMASSON

Sorry, I made up my mind: I refuse to go into any more impl detail. It's a trade secret.

One hint...

the threshold must be ___equal___ to the total number of blocks in a given segregated slab size... It's not a fixed global/#define.
TBB can create a robust solution themselves...
:^|

Or use periodic/episodic calls to `mflush()' indeed. Remember my example, Dmitriy, wrt reclaiming remote cache? I not only left out the block coalescing, but I also left out several parameters to `mflush' which can render it basically analogous to your proposed "compact" function having to be called on a periodic/episodic basis. I don't care about the performance of `mflush()'. It's going to be slower than an old dog with four broken legs compared to the thread-local fast path...

I thought about automating these moments by default, but, IMVHO of course, it's not really worth it at all. Letting the user have some fairly fine-grain options seems to be a much better solution. Automatic remote-free-threshold reclamation is in the form of a #define option while building your application. Of course, some options can be highly granular indeed, such as my `mflush()' being able to reclaim remote cache on a per-thread, per-size basis...

Implementing a robust and scalable general-purpose allocator can be TOTAL CRAP!!!!

:^/

BTW, I think Hoard does indeed suffer from the problem you describe; so does StreamFlow. vZOOM suffers from it without the proper options set, or with an application that does not set global options, knows it will execute the pattern that causes the problem, and chooses not to explicitly invoke `mflush' on a periodic/episodic basis with the proper parameters set. YIKES!

;^(...

Quoting - Dmitriy Vyukov

I only briefly skimmed through the code in tbbmalloc/MemoryAllocator.cpp, and I was not trying to model the situation, but it looks like something that can cause unbounded memory consumption in scalable_malloc.

It's about the returnEmptyBlock() function. It's called from scalable_free() when we are freeing our own block, and it's called from mallocThreadShutdownNotification(). But it is NOT called on the 'remote free' path (neither from freePublicObject() nor from getPublicFreeListBlock()).
Assume the following situation. Thread 1 allocated a big amount of memory (a large number of small objects), then thread 2 frees them. No free blocks will be returned to the global pool; all memory will remain cached in thread 1. Then 2 scenarios are possible.
(1) Thread 1 doesn't make any further calls to scalable_malloc (assume it is the 'main' thread, which has done the 'initialization' and then blocks waiting for program completion). Memory still remains cached and unused.
(2) Thread 1 makes some further calls to scalable_malloc. Memory can be reused. But if thread 1 again doesn't call scalable_free() itself, free blocks will not be returned to the global pool.
And unbounded memory consumption can occur if such a pattern repeats recurrently. Assume thread 1 allocates memory, and thread 2 frees it. Then thread 2 allocates memory, and thread 3 frees it. Then thread 3 allocates memory, and thread 4 frees it. And so on.

The simple solution is to provide a function scalable_compact_thread_heap(), which must be called by the user episodically or periodically (if he notices some problems). The hard solution is automatic reclamation of excessive unused memory.

Am I missing something? Or is the problem not worth solving? What do you think?

Quoting - Dmitriy Vyukov

Hi,

I override operators new and delete as follows:

void* operator new (size_t size) throw (std::bad_alloc)
{
    void* ptr = NULL;
    if (size == 0)
        size = 1;
    if ((ptr = scalable_malloc(size)) != NULL)
        return ptr;
    throw std::bad_alloc();
}

void* operator new [] (size_t size) throw (std::bad_alloc)
{
    return operator new (size);
}

void* operator new (size_t size, const std::nothrow_t&) throw ()
{
    void* ptr = NULL;
    if (size == 0)
        size = 1;
    if ((ptr = scalable_malloc(size)) != NULL)
        return ptr;
    return NULL;
}

void* operator new [] (size_t size, const std::nothrow_t&) throw ()
{
    return operator new (size, std::nothrow);
}

void operator delete (void* ptr) throw ()
{
    if (ptr != NULL)
        scalable_free(ptr);
}

void operator delete [] (void* ptr) throw ()
{
    operator delete (ptr);
}

void operator delete (void* ptr, const std::nothrow_t&) throw ()
{
    if (ptr != NULL)
        scalable_free(ptr);
}

void operator delete [] (void* ptr, const std::nothrow_t&) throw ()
{
    operator delete (ptr, std::nothrow);
}

My application works perfect with these operators in debug mode, but fails in release mode. I get the message:

Microsoft Visual Studio C Runtime Library has detected a fatal error in ParallelTest.exe.
Press Break to debug the program or Continue to terminate the program.

Application was stopped in the following point:

/***
*dbghook.c - Debug CRT Hook Functions
*
* Copyright (c) Microsoft Corporation. All rights reserved.
*
*Purpose:
* Allow users to override default alloc hook at link time.
*
*******************************************************************************/

#include

#ifdef _DEBUG

#include
#include
#include
#include
#include

_CRT_ALLOC_HOOK _pfnAllocHook = _CrtDefaultAllocHook;

/***
*int _CrtDefaultAllocHook() - allow allocation
*
*Purpose:
* allow allocation
*
*Entry:
* all parameters ignored
*
*Exit:
* returns TRUE
*
*Exceptions:
*
*******************************************************************************/
int __cdecl _CrtDefaultAllocHook(
int nAllocType,
void * pvData,
size_t nSize,
int nBlockUse,
long lRequest,
const unsigned char * szFileName,
int nLine
)
{
return 1; /* allow all allocs/reallocs/frees */
}

#endif /* _DEBUG */

int _debugger_hook_dummy;

#ifdef _M_IA64
#undef _CRT_DEBUGGER_HOOK
#define _CRT_DEBUGGER_HOOK __crt_debugger_hook
#endif /* _M_IA64 */

__declspec(noinline)
void __cdecl _CRT_DEBUGGER_HOOK(int _Reserved)
{
/* assign 0 to _debugger_hook_dummy so that the function is not folded in retail */
(_Reserved);
_debugger_hook_dummy = 0; // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
}

Why?

Rhetorical question (from memory, and if this is not it I have to pass): do you have the same runtime library version for all components of the program, typically multithreaded DLL, and consistently debug or release?

Quoting - Raf Schietekat

Rhetorical question (from memory, and if this is not it I have to pass): do you have the same runtime library version for all components of the program, typically multithreaded DLL, and consistently debug or release?

I discovered that this error occurs when you use MS VS2008. If you build the project using MS VS2005, then there is no problem.
