Is your memory management multi-core ready?

Recently I got a workload that could not scale beyond a few cores. This particular application uses one thread per user, so in theory, on an 8-core machine, 8 concurrent users should fully utilize the machine and give an 8x speedup compared to a sequential run.

That did not happen. At most two cores were utilized, and the query throughput speedup was even smaller.


I talked to the developers of this application and then spent some time looking at the source code, guessing what the reason could be. Too much sequential code in the processing? Exclusive locks? I knew that the relevant portion of the code was protected by a read-write lock, and the queries were read-only, so it should scale. I used Intel® Thread Profiler, which had helped me a lot in the past. It showed a perfect picture: with 8 users there were 8 busy threads, halting only very rarely on a few locks. So it should scale!

Then I used my best friend VTune™ to see what the top hot functions in the code were. They were mmap, munmap and memset. This was a real surprise to me, because the application was known not to use memory-mapped files. I located the piece of code that called memset: it was the initialization of freshly allocated memory blocks. Then I understood where the mmaps/munmaps were coming from. It turned out that the glibc memory allocation function (malloc) uses the Linux memory-mapping mechanism to allocate bigger chunks of memory (at least on my Linux distribution with the stock kernel). The Linux mmap mechanism allocates new memory by adding a new region to the virtual address space of the application. Virtual-to-physical address mappings are created lazily, upon a page fault, when the virtual memory pages are touched by the process for the first time (e.g. with memset). See page 11 of this document for details. This is why there was a huge number of context switches (shown by the vmstat utility): page faults cause switches from user space to kernel space. Intel® Thread Profiler had no chance to show the possible locks in the page fault handlers because these handlers execute in the Linux kernel (Intel® Thread Profiler operates in user space). Locks in the Linux page fault handlers have been discussed here.
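As a side note (I did not pursue this route here), glibc exposes the threshold above which malloc falls back to mmap via mallopt. A minimal, untested sketch of raising that threshold, so that blocks of a few hundred KB would be served from the regular heap instead of fresh mmap'ed pages, could look like this:

#include <malloc.h>   /* mallopt, M_MMAP_THRESHOLD (glibc-specific) */

int main(void) {
    /* Raise the mmap threshold to 1 MB so that e.g. 200 KB allocations
       come out of the normal heap instead of being mmap'ed (and then
       page-faulted in) on every request. */
    mallopt(M_MMAP_THRESHOLD, 1024 * 1024);
    /* ... run the allocation-heavy code here ... */
    return 0;
}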

To verify my hunch that memory management via the standard malloc does not scale as one would hope, I wrote a simple program that does just that in every thread:

#include <stdlib.h>   /* malloc, free */
#include <string.h>   /* memset */

static size_t size = 200 * 1024;   /* block size; 200 KB in the runs below */

void *worker(void *arg) {
    int iterations = 300000;
    while (iterations--) {
        void *p = malloc(size);
        memset(p, 0, size);   /* touch the memory so the pages get faulted in */
        free(p);
    }
    return NULL;
}

My test spawned x worker threads and then waited until they were all finished.
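Here is a minimal sketch of such a harness, assuming the worker() function from the snippet above and POSIX threads (error handling omitted):

#include <pthread.h>
#include <stdlib.h>

void *worker(void *arg);   /* defined above */

int main(int argc, char **argv) {
    int i, x = (argc > 1) ? atoi(argv[1]) : 8;   /* number of worker threads */
    pthread_t *threads = malloc(x * sizeof(pthread_t));
    for (i = 0; i < x; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (i = 0; i < x; i++)   /* wait until all workers are finished */
        pthread_join(threads[i], NULL);
    free(threads);
    return 0;
}

(Compile with something like gcc -O2 test_malloc.c -o test_malloc -lpthread; test_malloc.c is just a placeholder name.)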

Here are results for 200KB blocks (size=200*1024):
1 thread: 21 seconds runtime, at most 1 core is busy (via top utility)
2 threads: 53 seconds runtime, at most 1.4 cores are busy
4 threads: 105 seconds runtime, at most 1.8 cores are busy
8 threads: 325 seconds runtime, at most 2 cores are busy

It looked like the problem had been reproduced. It was the memory management that did not scale.

I knew that Intel® Threading Building Blocks (Intel® TBB) has a scalable memory allocator, and a drop-in version that also handles large object allocations efficiently recently became available with version 2.2. To use it with a legacy application, you only need to set a couple of environment variables:

export LD_LIBRARY_PATH=absolute_path_to_tbb_libs:$LD_LIBRARY_PATH
export LD_PRELOAD=libtbbmalloc_proxy.so.2:libtbbmalloc.so.2:$LD_PRELOAD
run_your_app

You can verify that the allocator library has been loaded with "pmap app_process_id | grep tbb".
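If you can rebuild the application, the TBB documentation also describes a link-time alternative to LD_PRELOAD: linking against the proxy library directly. For the small test program above that could look roughly like this (paths and file names are placeholders):

gcc -O2 test_malloc.c -o test_malloc -lpthread -L absolute_path_to_tbb_libs -ltbbmalloc_proxy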

Here are the results with the Intel® TBB malloc library:
1 thread: 6 seconds runtime, at most 1 core is busy
2 threads: 6 seconds runtime, at most 2 cores are busy
4 threads: 10 seconds runtime, at most 4 cores are busy
8 threads: 14 seconds runtime, at most 8 cores are busy

Looking at the vmstat output, one could see that the number of context switches was back to normal values. For this small test routine the Intel® TBB malloc speedup was an amazing 23x :-) My original workload ran about 7x faster with the new allocator, utilizing up to 8 cores.
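If you want to reproduce the vmstat observation yourself, run it with an interval while the test is executing; the cs column reports context switches per second and the in column interrupts per second:

vmstat 1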

This is of course no guarantee that an arbitrary multithreaded application will profit from the Intel® TBB allocator at this scale. I have seen concurrent workloads that were sped up by only a few percent, but I still find that encouraging enough to give the Intel® TBB allocator a try.

Feedback and questions are very welcome.

Roman

VTune is a trademark of Intel Corporation in the U.S. and other countries

Details on compiler optimization can be found in our Optimization Notice.