Is your memory management multi-core ready?

Recently I got a workload that could not scale beyond a few cores. This particular application uses one thread per user, so in theory, on an 8-core machine, 8 concurrent users should fully utilize the machine, giving an 8x speedup compared to a sequential run.

It did not happen. At most two cores were utilized, and the query throughput speedup was even smaller.


I talked to the developers of this application, then spent some time looking at the source code and guessing what the reason could be. Too much sequential code in the processing? Exclusive locks? I knew that the relevant portion of the code was protected by a read-write lock, and the queries were read-only, so it should scale. I used Intel® Thread Profiler, which has helped me a lot previously. It showed a perfect picture: with 8 users there were 8 busy threads, halting very rarely on a few locks. So it should scale!

Then I used my best friend VTune™ to see what the top hot functions in the code were: mmap, munmap and memset. This was a real surprise, because the application was known not to use memory-mapped files. I located the piece of code that called memset; it was the initialization of freshly allocated memory blocks. Then I understood where the mmaps/munmaps were coming from. It turned out that the glibc memory allocation function (malloc) uses the Linux memory-mapping mechanism to allocate bigger chunks of memory (at least on my Linux distribution with the stock kernel). The Linux mmap mechanism allocates new memory by mapping new regions into the virtual address space of the application. The virtual-to-physical address mappings are created later, lazily, upon a page fault, when the virtual memory pages are touched by the process for the first time (e.g. with memset). See page 11 of this document for details.

This explains the huge number of context switches shown by the vmstat utility: page faults cause switches from user space to kernel space. Intel® Thread Profiler had no chance to show the possible locks in the page fault handlers, because these handlers execute in the Linux kernel (Intel® Thread Profiler operates in user space). Locks in the Linux page fault handlers have been discussed here.
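(As an aside: glibc exposes a knob for this behaviour. The sketch below, which assumes a glibc where mallopt() honours M_MMAP_THRESHOLD and uses an arbitrary 1 MB threshold, shows how one could keep blocks of this size out of the mmap path; it merely illustrates the mechanism and is not something the measurements below rely on.)

#include <malloc.h>   /* mallopt(), M_MMAP_THRESHOLD -- glibc-specific */
#include <stdlib.h>
#include <string.h>

int main() {
    /* Raise the size above which glibc malloc switches to mmap, so that
       200 KB blocks are served from the heap and reused after free()
       instead of being munmap'ed and page-faulted in again. Whether this
       helps depends on the glibc version and its tuning defaults. */
    mallopt(M_MMAP_THRESHOLD, 1024 * 1024);

    const size_t size = 200 * 1024;
    void *p = malloc(size);
    memset(p, 0, size);
    free(p);
    return 0;
}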

To verify my hunch that memory management via the standard malloc does not scale as one could hope, I wrote a simple program that does just this in every thread:

#include <stdlib.h>
#include <string.h>

static size_t size;               /* block size, set to 200*1024 below */

void * worker(void *) {
    int iterations = 300000;
    while (iterations--) {
        void * p = malloc(size);  /* allocate a block ...              */
        memset(p, 0, size);       /* ... touch it (forces the pages in) */
        free(p);                  /* ... and give it back immediately   */
    }
    return NULL;
}

My test spawned x worker threads and then waited until they all finished.
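The driver around worker() was nothing more than the usual create/join loop. A minimal sketch (not my exact harness; the thread count is hard-coded here and the build line is just an example):

#include <pthread.h>

void * worker(void *);            /* the worker from the listing above */

int main() {
    const int nthreads = 8;       /* "x" in the text above */
    pthread_t threads[nthreads];

    for (int i = 0; i < nthreads; ++i)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; ++i)
        pthread_join(threads[i], NULL);   /* wait until all workers are done */
    return 0;
}

/* build e.g.: g++ -O2 -pthread malloc_test.cpp -o malloc_test */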

Here are the results for 200 KB blocks (size = 200*1024):
1 thread: 21 seconds runtime, at most 1 core is busy (via top utility)
2 threads: 53 seconds runtime, at most 1.4 cores are busy
4 threads: 105 seconds runtime, at most 1.8 cores are busy
8 threads: 325 seconds runtime, at most 2 cores are busy

It looked like the problem had been reproduced. It was the memory management that did not scale.

I knew that Intel® Threading Building Blocks (Intel® TBB) has a scalable memory allocator, and a drop-in replacement recently became available in version 2.2 that also supports large object allocations efficiently. To use it with a legacy application, one just has to define a couple of environment variables:

export LD_LIBRARY_PATH=absolute_path_to_tbb_libs:$LD_LIBRARY_PATH
export LD_PRELOAD=libtbbmalloc_proxy.so.2:libtbbmalloc.so.2:$LD_PRELOAD
run_your_app

You can verify that the allocator library is actually loaded with "pmap app_process_id | grep tbb".

Now come the results with the Intel® TBB malloc library:
1 thread: 6 seconds runtime, at most 1 core is busy
2 threads: 6 seconds runtime, at most 2 cores are busy
4 threads: 10 seconds runtime, at most 4 cores are busy
8 threads: 14 seconds runtime, at most 8 cores are busy

Looking at the vmstat output, one could observe that the number of context switches was back to normal. For this small test routine the Intel® TBB malloc speedup was an amazing 23x :-) My original workload ran about 7x faster with the new allocator, utilizing up to 8 cores.

This is of course no guarantee that an arbitrary multithreaded application will profit from the Intel® TBB allocator at this scale. I have seen concurrent workloads that were sped up by only a few percent, but I still find this encouraging enough to give the Intel® TBB allocator a try.

Feedback and questions are very welcome.

Roman

VTune is a trademark of Intel Corporation in the U.S. and other countries


Comments

Hi,

I understand why the context-switching activity slows things down, but what is the root cause of it?
Is it because the original malloc() has some locks, so that it becomes a bottleneck that allows only one processor at a time to do malloc()?

Thanks!



Roman et al:
Just made a post on this issue - memory management in parallel applications - the way I came across it myself. I hope the techniques I used to identify it will be helpful for others - http://software.intel.com/en-us/blogs/2009/10/28/memory-management-challenges-in-parallel-applications/


@Tom: For my workloads the peak memory consumption with Intel TBB malloc was comparable to glibc malloc. I do not know whether, for your workload, the TBB malloc will be better than the glibc malloc with respect to memory fragmentation. Instead of using your own memory manager, you could try calling the TBB memory allocator API directly:

void* scalable_calloc ( size_t nobj, size_t size );
void scalable_free( void* ptr );
void* scalable_malloc( size_t size );
void* scalable_realloc( void* ptr, size_t size );

Header: #include "tbb/scalable_allocator.h"
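For example, the inner loop of my test program could be rewritten against this API roughly as follows (a sketch only; it assumes linking against libtbbmalloc, e.g. with -ltbbmalloc):

#include "tbb/scalable_allocator.h"   /* scalable_malloc(), scalable_free() */
#include <string.h>

void * worker_tbb(void *) {
    const size_t size = 200 * 1024;       /* same block size as in the test above */
    int iterations = 300000;
    while (iterations--) {
        void *p = scalable_malloc(size);  /* thread-friendly allocation */
        memset(p, 0, size);
        scalable_free(p);
    }
    return NULL;
}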


I work on CAD applications where we often use 64 GB of memory. We stopped using malloc directly years ago due to poor memory fragmentation; we now malloc large 1 MB chunks and manage the memory and free lists ourselves. We use a lock in the critical part of our memory manager and it scales quite well. I wonder whether the new malloc is better at handling memory fragmentation, which I think would affect locality as well?


Great post. I'm having a similar problem that I was about to walk away from, and this gives me some ideas for fixing it. Thanks.



Dmitriy,

if memset is replaced with code that performs an expensive computation on a local variable (so that no memory traffic is expected), the running time of the test stays at about 6 seconds with TBB malloc (for 1, 2, 4 and 8 threads). This means the TBB allocator scales well; as you pointed out, it is rather memset that hits its maximum bandwidth limit.
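Roughly, the modified worker looked like the sketch below (not the exact code; the amount of busy work is arbitrary):

/* The allocation and free stay as before (and still go through the TBB
   proxy), but the freshly allocated block is not touched; instead some
   busy work is done on a local variable, so that almost no memory
   traffic is generated. */
void * worker_compute(void *) {
    int iterations = 300000;
    volatile double x = 1.0;              /* volatile keeps the loop from being optimized away */
    while (iterations--) {
        void *p = malloc(200 * 1024);
        for (int i = 0; i < 10000; ++i)   /* stand-in for the "expensive computation" */
            x = x * 1.0000001 + 0.5;
        free(p);
    }
    return NULL;
}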


It's unclear why performance degrades with the TBB allocator: is it because of scalability problems in the allocator itself, or rather because of memory bus saturation caused by memset()? It would be interesting to see the performance numbers if you replaced memset() with something less suspicious (like a local increment or something like that).

All about lock-free algorithms, multicore, scalability, parallel computing and related topics:
http://www.1024cores.net


@Roger: Thanks for the note. You are right, the most recent glibc 2.10 has scalability improvements for malloc. However, I do not know whether these enhancements also apply to large block allocations, which is what matters most for my application. I could not test it because I only have SUSE SLES10 SP2 (with glibc 2.4) and SLES11 (with glibc 2.9). I hope I can test the new glibc in the next SLES11 service pack.

