Memory management challenges in parallel applications

Let me share some recent practical experience with memory management issues encountered while developing a multi-threaded application. This is probably a rather common case (as a recent post by Roman Dementiev and its follow-up discussion demonstrate), and I'd be happy if my experience were helpful to others.

Working on CAD Exchanger, I am designing one of its plugins, which converts 3D CAD data between ACIS and Open CASCADE (two modeling kernels), to run in parallel. Depending on the model size, the converter has to deal with numerous small objects allocated on the heap (e.g. 20,000+ objects, each taking 48 bytes, plus additional object data such as lists, strings, etc.).
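To make that pattern concrete, here is a minimal sketch of the kind of allocation churn described above; the TopoEntity type, its fields, and convertModel() are purely illustrative, not actual CAD Exchanger or ACIS code.

#include <cstddef>
#include <string>
#include <vector>

struct TopoEntity                   // small fixed-size part on the order of tens of bytes...
{
    int         id;
    int         type;
    double      bbox[4];
    std::string name;               // ...plus variable-size attached data
};

void convertModel(std::size_t numEntities)
{
    std::vector<TopoEntity*> entities;
    entities.reserve(numEntities);
    for (std::size_t i = 0; i < numEntities; ++i)
        entities.push_back(new TopoEntity());   // tens of thousands of tiny heap allocations

    // ... the actual translation work would happen here ...

    for (TopoEntity* e : entities)  // everything is destroyed after the conversion,
        delete e;                   // so the heap sees constant allocate/free churn
}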

The translation works just fine, and concurrency analysis with Intel Parallel Amplifier indicates high concurrency levels. So far, so good. However, I noticed that when translating the same ACIS file over and over again in the same test harness session, the translation took longer and longer. Why could that be?

So I launched the Amplifier to collect hotspots, and here is what I saw:

[Screenshot: Amplifier hotspot profile]

These two top hotspots relate to the memory manager layer (the Standard_MMgrRaw class), which simply forwards calls to malloc/free and new/delete. To root-cause the problem, I had to switch to the mode that shows OS functions directly (by toggling off the corresponding button on the Amplifier toolbar), and here is a new screenshot:

[Screenshot: Amplifier hotspot profile with OS functions shown]

It shows that the top hotspots are two system functions, RtlpFindAndCommitPages() and ZwWaitForSingleObject(), which are called from the memory allocation/deallocation routines. It also shows that the nearest hotspot related to my own code (BSplCLib::Bohm()) takes only about a quarter of the time consumed by ZwWaitForSingleObject() (0.47s vs. 1.81s).

After experimenting with several runs and analyzing how the hotspot profile changes as the number of runs grows, I concluded that the first hotspot is explained by the fact that the ACIS converter creates many tiny objects of different sizes with a short life span (they are destroyed after every conversion). This seems to cause severe memory fragmentation, which forces the system to constantly search for new memory chunks.

The second hotspot (ZwWaitForSingleObject()), which goes through a critical section, is caused by the default Windows memory management mechanism, which uses a lock and therefore serializes allocations coming from multiple threads.
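A rough illustration of that effect (a hypothetical micro-benchmark, not the converter itself): several threads doing nothing but small, short-lived allocations spend most of their time waiting on the allocator's internal lock instead of scaling.

#include <cstdlib>
#include <thread>
#include <vector>

// Each worker hammers the default heap with small, varied, short-lived blocks.
void allocationChurn(int iterations)
{
    for (int i = 0; i < iterations; ++i)
    {
        void* p = std::malloc(48 + (i % 7) * 16);   // varied small sizes
        std::free(p);                               // freed immediately
    }
}

int main()
{
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back(allocationChurn, 1000000);
    for (std::thread& w : workers)
        w.join();                                   // the threads largely serialize on the heap lock
    return 0;
}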

Running the locks & waits analysis also confirms that the memory management lock is the one most adversely affecting concurrency.

[Screenshot: Amplifier locks & waits analysis]

All this is caused by the direct use of calloc/malloc/free and new/delete, called tens of thousands of times. It is worth mentioning that these hotspots did not exist in the serial implementation and popped up only when I moved to the parallel one. The serial version used a memory manager (in a 3rd-party library) that allocated memory blocks and did not return them to the system, reusing them when the application requested new blocks. I could not keep that memory manager because it was not thread-safe, and I therefore had to switch to another one that simply forwards to malloc/free.
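For illustration only, here is a crude sketch of that kind of pooling manager (this is not the 3rd-party implementation, and it is simplified to a single block size): freed blocks go onto a free list for reuse instead of back to the OS, and since the free list is updated without any synchronization, it cannot be shared between threads.

#include <cstdlib>

class PoolingMemMgr
{
public:
    void* Allocate(std::size_t size)
    {
        if (size == kBlockSize && myFreeList)      // reuse a cached block
        {
            Node* node = myFreeList;
            myFreeList = node->next;               // unsynchronized update
            return node;
        }
        return std::malloc(size < sizeof(Node) ? sizeof(Node) : size);
    }

    void Free(void* ptr, std::size_t size)
    {
        if (size == kBlockSize)                    // keep the block for later reuse
        {
            Node* node = static_cast<Node*>(ptr);
            node->next = myFreeList;               // data race if called concurrently
            myFreeList = node;
            return;
        }
        std::free(ptr);
    }

private:
    static const std::size_t kBlockSize = 48;
    struct Node { Node* next; };
    Node* myFreeList = nullptr;
};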

So I was almost forced to write my own memory manager that would implement the previous behavior while being thread-safe and... fast! Challenges are good, but not when you need to rewrite low-level components, which can take a lot of time and require diligent, thorough testing, delaying progress in a project that already receives very limited attention.

So I approached my colleagues from the Threading Building Blocks team to check whether there was anything TBB could help with. To my surprise, they suggested trying the new 2.2 release. Version 2.2 offers a mechanism to seamlessly replace the system memory manager with the TBB allocator. 'Seamlessly' really means it: all I had to do was add a single line of code to a C++ file:

#include "tbb/tbbmalloc_proxy.h"
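For context, a minimal sketch of what this looks like in practice, assuming TBB 2.2+ with the tbbmalloc_proxy library available at link and run time (the rest of the file is illustrative): no other source changes are needed, since existing new/delete and malloc/free calls are intercepted by the proxy.

#include "tbb/tbbmalloc_proxy.h"    // the single added line

#include <cstdlib>
#include <vector>

int main()
{
    // Ordinary allocations below are transparently redirected to the
    // TBB scalable allocator once the proxy is in place.
    std::vector<int*> blocks;
    for (int i = 0; i < 1000; ++i)
        blocks.push_back(new int[16]);
    for (int* p : blocks)
        delete[] p;

    void* raw = std::malloc(256);
    std::free(raw);
    return 0;
}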

The outcome was immediate. Not only did the hotspot profile change completely, removing the OS hotspots (see the comparison-mode screenshot below), but the overall speedup (on the entire test case) was about 25%! One line of code, nothing to rewrite on my own, hours of coding saved, and such a return! Incredible, to say the least. The recently released 2.2 Update 1 includes further improvements that my app now benefits from (more reliable handling of realloc(), bug fixes for the debug mode, etc.).

[Screenshot: Amplifier comparison of hotspot profiles before and after switching to the TBB allocator]

My colleagues later explained to me that the TBB allocator runs concurrently (seemingly without any locks inside) and reuses previously allocated blocks in a similar fashion. Thus, it was the entire application (not only its parallel part) that benefited from this substitution.
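If replacing the allocator for the whole process is not an option, the TBB scalable allocator can also be used selectively. A small sketch (MyEntity is a hypothetical type, and the sizes are illustrative):

#include "tbb/scalable_allocator.h"

#include <vector>

struct MyEntity { double coords[6]; };   // illustrative small object

int main()
{
    // A container that draws its storage from the TBB scalable allocator.
    std::vector<MyEntity, tbb::scalable_allocator<MyEntity> > entities;
    entities.resize(20000);

    // A C-style interface is available for raw blocks as well.
    void* block = scalable_malloc(48);
    scalable_free(block);
    return 0;
}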

So, if you are migrating from a serial to a parallel implementation, you may encounter something unexpected: memory bottlenecks. If you are accustomed to some nice single-threaded memory manager, you may be forced to consider an alternative. If this is the case, you may want to give the TBB allocator a try and see whether it helps in your case.
