scalable_malloc fails to allocate memory while there is much memory available.

Hi,

We are using tbbmalloc to manage our system memory. We use scalable_malloc/scalable_free to allocate/deallocate system memory. Everything worked fine until we ran into the following case:
1. Keep allocating 1 MB textures using DX10 until the allocation fails. Note that some of the system memory will be consumed by doing this.
2. Release all the textures allocated in step #1.
3. Create some objects on the heap. scalable_malloc returns null, while I believe there is a lot of system memory available at this point. When we replace scalable_malloc with malloc, the memory can be allocated.

Does anyone have any idea why?

Thanks,
Wallace


Quoting - wzpstbb
Hi,

We are using tbbmalloc to manage our system memory. We use scalable_malloc/scalable_free to allocate/deallocate system memory. Everything worked fine until we ran into the following case:
1. Keep allocating 1 MB textures using DX10 until the allocation fails. Note that some of the system memory will be consumed by doing this.
2. Release all the textures allocated in step #1.
3. Create some objects on the heap. scalable_malloc returns null, while I believe there is a lot of system memory available at this point. When we replace scalable_malloc with malloc, the memory can be allocated.

Does anyone have any idea why?

Thanks,
Wallace

By the way, the case is run on Vista 32-bit, 4 GB RAM, with an NVIDIA 8800 GTS card that has 640 MB of video memory.

- Wallace

TBB uses a pool concept. It probably uses aligned_malloc to acquire a multi-megabyte pool when a scalable malloc would otherwise fail. Once a pool (or some pools) is allocated, scalable malloc draws from this pool until it runs out, and then it allocates another large pool. Scalable free returns memory to a pool, and pools are "never" returned to the C/C++ heap. malloc internally does something similar to expand its heaps in virtual memory. For each memory allocator (malloc and TBB), once virtual memory is allocated it is not returned until exit (under the assumption that your program will allocate/free again and again using the same allocator).

When using a combination of malloc and TBB scalable malloc, you can run into allocation failures when memory gets fragmented, not only within a heap or pool but also among the heap and pool(s).

A palliative measure (on Windows) might be to enable the Low Fragmentation Heap (search MSDN for LFH).

Alternatively, before your step 1), add a step 0) that uses the TBB scalable allocator to allocate a working set of memory, then scalable-frees this memory. After that, your step 1) will have a reduced upper bound for allocations, and step 3) will have a working set available.

Jim Dempsey

www.quickthreadprogramming.com

Quoting - wzpstbb
Hi,

We are using tbbmalloc to manage our system memory. We use scalable_malloc/scalable_free to allocate/deallocate system memory. Everything worked fine until we ran into the following case:
1. Keep allocating 1 MB textures using DX10 until the allocation fails. Note that some of the system memory will be consumed by doing this.
2. Release all the textures allocated in step #1.
3. Create some objects on the heap. scalable_malloc returns null, while I believe there is a lot of system memory available at this point. When we replace scalable_malloc with malloc, the memory can be allocated.

Does anyone have any idea why?

Thanks,
Wallace

Most likely, DX10 and the MS CRT share the same memory pool, which the TBB allocator cannot use. Once virtual memory is exhausted, it is all hoarded in that pool, so the TBB allocator does not succeed in its attempts to map some more memory.
To prove or disprove that, a reproducing test case would be helpful.

Quoting - Alexey Kukanov (Intel)

Most likely, DX10 and the MS CRT share the same memory pool, which the TBB allocator cannot use. Once virtual memory is exhausted, it is all hoarded in that pool, so the TBB allocator does not succeed in its attempts to map some more memory.
To prove or disprove that, a reproducing test case would be helpful.

Thanks for the quick reply.

That sounds like the cause. I have attached a reproducing test case. Please pay attention to Tutorial01.cpp, lines 223-270. I am using VS 2008 and the DirectX SDK (March 2009).

- Wallace.

Attachment: Tutorial01.zip (4.36 MB)

Sadly, I am unable to reproduce your situation; i.e., when I run your test case, scalable_malloc returns a memory block successfully.

Also, I see that on the second and subsequent calls to Render(), zero 2D textures can be allocated. Is this expected behavior? Is the memory really released? Could it be connected with the scalable_malloc behavior you observe?

There is the GlobalMemoryStatusEx function to report the available virtual memory size. Could you check the available size before the first scalable_malloc call?

If the reason is lack of virtual address space, as Alexey supposes above, a VirtualAlloc for a 1 MB block can fail in place of the first scalable_malloc call (that is what scalable_malloc does internally on the first call).

Quoting - Alexandr Konovalov (Intel)

Sadly, I am unable to reproduce your situation; i.e., when I run your test case, scalable_malloc returns a memory block successfully.

Also, I see that on the second and subsequent calls to Render(), zero 2D textures can be allocated. Is this expected behavior? Is the memory really released? Could it be connected with the scalable_malloc behavior you observe?

There is the GlobalMemoryStatusEx function to report the available virtual memory size. Could you check the available size before the first scalable_malloc call?

If the reason is lack of virtual address space, as Alexey supposes above, a VirtualAlloc for a 1 MB block can fail in place of the first scalable_malloc call (that is what scalable_malloc does internally on the first call).

That's weird! I can easily reproduce the problem with the test case I attached. I ran the case on a Vista 32-bit OS, 4 GB RAM, and an Nvidia 8800 GTS, which has 640 MB of video memory.

Yes, the textures are allocated and then released immediately. Therefore, it is expected that on the second and subsequent calls to Render(), zero 2D textures can be allocated.

I will have a try with the GlobalMemoryStatusEx function to see the available virtual memory size. If the reason is lack of virtual address space, is there a solution for it?

Thanks,
- Wallace

Quoting - wzpstbb
I will have a try with the GlobalMemoryStatusEx function to see the available virtual memory size. If the reason is lack of virtual address space, is there a solution for it?

Yes. I ran into what I think is the same problem. A program that legitimately mallocs or scalable_mallocs and then frees up everything allocated still eventually runs out of virtual address space (not memory).
You will see that the dwAvailVirtual reported by GlobalMemoryStatusEx does not go back up after the free.
The program has effectively "stepped on" a huge range of addresses, even though the memory was given back.

I did find a solution, which I think will work for you.
Instead of malloc or scalable_malloc use the following two functions for alloc and free.
They DO give back the virtual address SPACE as well as the memory.

data = (byte *)VirtualAlloc(NULL, BLOCK_SIZEP, MEM_COMMIT, PAGE_READWRITE);
and
VirtualFree(data, 0, MEM_RELEASE);

This did totally solve this sticky problem for us.
Good luck!
Mitch

In most cases, VirtualAlloc shouldn't be the allocation method of choice, for at least two reasons: it is much slower than malloc, and it operates with relatively large blocks - a range in the address space must first be reserved in 64 KB chunks, then committed in 4 KB pages. Basically, it is suitable for building custom memory pools on top of, but not as a substitute for malloc.

Alexey is right. Use VirtualAlloc only for these large textures, not as a general malloc replacement. In this case, since textures are 1 MB allocations, the size is not an issue. The speed was not an issue for us. Write a very simple loop that does the following a few dozen times: { display available virtual memory, allocate 1000 textures, free 1000 textures }. Each pass through the loop gets a gigabyte of memory and address space. The next loop pass also gets 1 GB of memory and address space. If you don't use the VirtualAlloc/VirtualFree mechanism, the addresses of the textures will keep crawling throughout the full 2-4 GB address range and the displayed available VM will go down with each loop pass. This can also be seen if you use the Task Manager to view the VM usage.

Depending on your speed requirements and how often your program allocates textures, you may or may not be able to use this solution. After re-reading your post, I think there may be a better solution, as follows.

Consider address space fragmentation. There is a big difference between doing 100 allocations for 100 textures and allocating a single array of 100 textures. The latter requires the address space to be contiguous, which may quickly become impossible after many allocs/de-allocs. Changing your code to do the less efficient separate allocation per texture will get around this address space fragmentation problem and could solve your problem without resorting to VirtualAlloc.

Though we were taught that a demand paging system pretty much solves your memory issues, its designers never anticipated nor dealt with the address SPACE management issues.

Address space fragmentation may in fact be your main problem.
I'm curious whether your code allocates textures in blocks and, if so, whether it's easy to change that and see if the problem totally goes away.

Mitch

Wallace,
I just took a look at your Tutorial01.cpp. You do realize that since your program tries to allocate 65k 1 MB textures, that is 65 gigabytes of virtual memory. The limit for any one process is either 2 GB (XP), 3 GB (XP with the /3GB switch), or 4 GB with a 64-bit OS. Thus you are always using up your whole virtual address space before you do the scalable_malloc.
Yes, you are giving it all back, but scalable_malloc needs to start with some virtual address space of its own.
If you insert a call to GlobalMemoryStatus before the texArray alloc and again right after the texArray delete, you will see that dwAvailVirtual has gone down and NOT been restored.

Since you are doing separate 1 MB texture allocations until all of memory is used, your problem is not the address space fragmentation I mentioned above, but the former problem. If you can change the alloc methods CreateTexture2D() and Release() to use VirtualAlloc/VirtualFree, I'm pretty sure Tutorial01.cpp will work.

Quoting - turks
Wallace,
I just took a look at your Tutorial01.cpp. You do realize that since your program tries to allocate 65k 1 MB textures, that is 65 gigabytes of virtual memory. The limit for any one process is either 2 GB (XP), 3 GB (XP with the /3GB switch), or 4 GB with a 64-bit OS. Thus you are always using up your whole virtual address space before you do the scalable_malloc.
Yes, you are giving it all back, but scalable_malloc needs to start with some virtual address space of its own.
If you insert a call to GlobalMemoryStatus before the texArray alloc and again right after the texArray delete, you will see that dwAvailVirtual has gone down and NOT been restored.

Since you are doing separate 1 MB texture allocations until all of memory is used, your problem is not the address space fragmentation I mentioned above, but the former problem. If you can change the alloc methods CreateTexture2D() and Release() to use VirtualAlloc/VirtualFree, I'm pretty sure Tutorial01.cpp will work.

Hi turks,

Thanks for exploring the issue. And sorry for my late reply; I was on vacation last week.

But I don't want to discard the TBB allocator. As Alexey suggested, VirtualAlloc/VirtualFree is much slower than scalable_malloc and is not recommended for common allocations.

Alexey, can the TBB allocator resolve this issue? That is, is it possible for the TBB allocator to share the virtual address space with DX10 and the CRT?

- Wallace

Quoting - wzpstbb
Alexey, can the TBB allocator resolve this issue? That is, is it possible for the TBB allocator to share the virtual address space with DX10 and the CRT?

I'd tell you upfront if that was possible; unfortunately it is not - at least not without TBB source changes.

The way for the TBB allocator to use the same pool as malloc would be to call malloc instead of VirtualAlloc; that's a relatively easy change one could make with the TBB sources. But that's only half the work, or even less, because the TBB allocator is also "greedy" and in most cases does not return memory back. And finding a good balance between keeping memory blocks to speed up future allocations and returning them back to be more cooperative with other memory managers is a challenge with some ambiguous tradeoffs.

Possibly the best thing I can suggest is to pre-allocate enough memory with the scalable allocator before allocating textures.

Can TBB expose some interfaces for releasing the virtual address space? Then we could ask TBB to release the virtual address space when the other allocator fails to allocate memory.

Wallace,

I suggest you adapt your 32-bit strategy to:

1. Keep allocating 1 MB textures using DX10 until the allocation fails.
1.a) Determine how to prorate memory into three general sections: 1 MB textures, TBB pool, malloc pool.
1.b) Keep the decided-upon number of 1 MB textures in your own private pool of textures.
2. Release all the non-reserved textures allocated in step #1.
3. Remove (if present) the overload of new/delete/malloc/free (via the TBB header) and then explicitly use the scalable memory allocator in those sections of code that are suitable for scalable allocation.
4. Use the private pool of previously allocated 1 MB texture buffers as you require 1 MB textures. When you run out, adapt your code to run with this limited number of textures.

Jim Dempsey

www.quickthreadprogramming.com

Hi Jim,

My sample app is just for reproducing the problem. Our use case is much more complex. In our system, we overload the global new/delete operators. We use scalable_malloc/scalable_free to serve memory requests from new/delete. It is difficult for us to predict how much memory will be allocated through TBB. Usually the memory consumed by TBB is constant. However, at some point the memory allocated through TBB can be extremely high. We want TBB to reserve a medium amount of virtual address space instead of the maximum. So it would be great if TBB had some APIs for us to release some virtual address space during off-peak periods.

Thanks,
Wallace

Wallace,

Did you try the TBB allocator from TBB 4.0 (preferably 4.0 update 2)? In 4.0 we made significant changes in controlling virtual address space, and there is hope that the changes can help in your case.

Scalable allocators tend not to be friendly towards returning memory once it is allocated to the (an) allocator pool. You may have some success with returning the entire pool or none. By this I mean you might have some success by adding (using) a feature whereby you can scope-instantiate a new scalable allocator pool. When you exit that scope, that pool evaporates, and the prior scalable allocator pool is reactivated. This would have the requirement that objects allocated in the nested layer do not persist as you pop out of that scope. This technique might be a can of worms if you are not careful.

The better strategy might be to not overload default new/delete. Instead overload specific object new/delete with one using the scalable allocator. i.e. only those objects with high flux are subject to scalable allocation.

An intermediary technique would be to overload a specific object's new/delete with versions NOT using the scalable allocator, but rather a concurrent_queue. On 'new', pop an item from the queue; if the queue is empty, then malloc. On 'delete', push the object pointer into the queue. When tight on memory, you can pull items from the queue and free them... however... depending on your program flow you might not get the peak level back again.

If this becomes a problem (out of memory), I suggest you rethink your application such that it cannot make more allocations than memory permits.

Example (made up example with similar issues):

Problem: Make a parallel anti-virus scan

Pseudo code:

program
for each file
enqueue fileTask
end program

The problem with the above is that you could end up with 500,000 fileTasks, plus the buffering requirements, plus the tasks spawned by each fileTask.

A better route would be

program
useParallelPipeline(withSomeTokens);
end program

Where the parallel_pipeline pulls in the next file only when a token is available. This puts an upper limit of concurrent file processing at 'withSomeTokens' number of tokens (say 10) instead of an unbounded number of files (say 500,000).

Your problem does not necessarily deal with files, but it may have a large number of things to process, which in your current design apparently experiences a congestion point where an excessive amount of allocations is required. The point of the programming change is to restrict the peak allocations. Note that your program may run faster on a restricted number of buffers. The thread performing the (excessive) allocations will be available for processing current allocations while waiting for the next input token.

Jim Dempsey

www.quickthreadprogramming.com

Quoting Alexandr Konovalov (Intel)

Wallace,

Did you try TBB allocator from TBB 4.0 (preferable, 4.0 update 2)? For 4.0 we made significant changes in controlling virtual address space, and there is a hope that the changes can help in your case.

The test case created by Wallace in 2009 uses TBB version 2.1 (some TBB headers are included in the VS project):

tbb_stddef.h

...
#define TBB_VERSION_MAJOR 2
#define TBB_VERSION_MINOR 1
...

It really makes sense to try the latest version, 4.0, of TBB.

Thank you for all the answers.

One reason we decided to override the global new/delete is that we want to handle all the low-memory/out-of-memory situations ourselves. In addition, dropping the overridden global new/delete operators would have a big impact on our clients. We will update to TBB 4.0 in our next release.

Wallace

Quoting wzpstbb: Thank you for all the answers.

One reason we decided to override the global new/delete is that we want to handle all the low-memory/out-of-memory situations ourselves.

[SergeyK] Why don't you use the 'set_new_handler' function to set an error handling function for
cases when a 'new' operator fails?

In addition, dropping the overridden global new/delete operators would have a big impact on our clients. We will update to TBB 4.0 in our next release.

Wallace

We want to act on both low-memory and out-of-memory situations. Basically we would page some resources out to disk. The paging strategy is less aggressive in low-memory situations than in out-of-memory situations.

Thanks,
Wallace

Handling out-of-memory situations? Sounds ambitious. You would have to rearrange things without being able to make any other dynamic allocation, not even an implicit one. Better get your low-memory handling right to steer clear of this situation!

How about digging into the source code and redirecting TBB's attempts to allocate the big chunks of memory from which it serves its own clients? You'll need to be able and willing to adapt to changes in the implementation, of course, because it won't be supported. Allocate several more chunks than needed as a buffer to put into use when your own code cannot get more memory, triggering low-memory handling but still serving the request; beyond that, still serve the request but initiate a clean shutdown instead. Does that make sense?

I agree with Raf "Better get your low-memory handling right to steer clear of this situation!"

As soon as you enter the operational realm of low/out-of-memory (OOM) situations, most corrective measures are short-lived; i.e., you aggravate the situation, whereby you run out of memory sooner.

Placing memory in reserve might buy you some time, but it also means you reach OOM sooner.
Under some circumstances, the reserve might be a good option.

As an example: I have a simulation program that I use for simulating space elevators. Simulation runs can take weeks. While I do not experience OOM, should I encounter OOM two weeks into a run, I'd be rather upset. I do experience crashes (the simulation blows up due to infinities, etc.). To combat the crash, the program periodically checkpoints itself. Should a crash occur, I can restart from the checkpoint, then resume running while monitoring the model to find out what caused the crash (e.g., too large an integration step size at a critical point in the simulation).

Well, in your case, combining this with Raf's suggestion (clean shutdown): when you reach OOM with reserve memory, release the reserve memory and set a flag to indicate "make a checkpoint and restart as soon as possible".

Issues:

You will have to write code for checkpoint and restart (assuming you have not done this already).
The additional code may aggravate your oom situation.

You will have to run some stress tests to determine the working size of the reserve memory block.
The additional memory reserved may aggravate your oom situation.

An alternative is to rework your code such that it will not reach OOM (for all permitted initial conditions).

Before you take these corrective measures, you might consider reworking your code such that it is stack-conservative; i.e., allocate large-ish objects from the heap as opposed to the stack. If your current program has stack requirements of tens of MB, you will likely find that those tens of MB are actually used by only one thread, or fewer than all threads, at the same time. This may be an un-utilized reservoir of memory. (Remember to reduce the linker and/or reserve stack values.)

Jim Dempsey

www.quickthreadprogramming.com

Quoting Raf Schietekat: Handling out-of-memory situations?...

That's a common problem with 32-bit Windows platforms. It is impossible to allocate more than 2 GB
of memory for an application. But data sets are growing in size, and more memory is needed for processing!

Note: It is assumed that a 32-bit Windows platform doesn't support AWE (Address Windowing Extensions).
