Scalable allocation of large (8MB) memory regions on NUMA architectures

I am currently using a TBB flow graph in which a) a parallel filter processes an array (in parallel with offsets) and puts the processed results into an intermediate vector (allocated on the heap; mostly the vector will grow up to 8MB). These vectors are then passed to nodes which postprocess the results based on their characteristics (determined in a)). Because of synchronized resources, there can only be one such node per characteristic. The prototype we wrote works well on UMA architectures (tested on single-CPU Ivy Bridge and Sandy Bridge machines). However, the application does not scale on our NUMA architecture (4-CPU Nehalem-EX). We pinned the problem down to memory allocation and created a minimal example in which a parallel pipeline just allocates memory from the heap (via malloc of an 8MB chunk, followed by a memset of the 8MB region; similar to what the initial prototype would do) up to a certain total amount of memory. Our findings are:

- On a UMA architecture the application scales up linearly with the number of threads used by the pipeline (set via task_scheduler_init)

- On the NUMA architecture when we pin the application to one socket (using numactl) we see the same linear scale-up

- On the NUMA architecture, when we use more than one socket, the time our application runs increases with the number of sockets (a negative linear "scale-up")

To us this smells like heap contention. What we have tried so far is to substitute the glibc allocator with Intel's TBB scalable allocator. However, the initial performance on a single socket is worse than with glibc; on multiple sockets performance does not get worse, but it does not get any better either. We saw the same effect with tcmalloc and the Hoard allocator.

My question is whether anyone has experienced similar issues. Stack allocation is not an option for us, as we want to keep the heap-allocated vectors even after the pipeline has run.
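For reference, a minimal sketch of the kind of reproducer described above (this is not the attached prototype.cpp; names such as kChunkSize, kTotalBytes and the 16-token limit are made up for illustration):

#include <tbb/pipeline.h>
#include <tbb/task_scheduler_init.h>
#include <cstdlib>
#include <cstring>

int main(int argc, char* argv[]) {
    const std::size_t kChunkSize  = 8u * 1024u * 1024u;            // 8 MB per chunk
    const std::size_t kTotalBytes = 4ull * 1024u * 1024u * 1024u;  // allocate 4 GB in total
    const int nThreads = argc > 1 ? std::atoi(argv[1]) : tbb::task_scheduler_init::automatic;

    tbb::task_scheduler_init init(nThreads);

    std::size_t allocated = 0;
    tbb::parallel_pipeline(/*max_number_of_live_tokens=*/16,
        // Serial input stage: decide whether another chunk is needed.
        tbb::make_filter<void, std::size_t>(tbb::filter::serial_in_order,
            [&](tbb::flow_control& fc) -> std::size_t {
                if (allocated >= kTotalBytes) { fc.stop(); return 0; }
                allocated += kChunkSize;
                return kChunkSize;
            })
        // Parallel stage: allocate and touch the chunk; nothing is freed, as in the test.
      & tbb::make_filter<std::size_t, void>(tbb::filter::parallel,
            [](std::size_t size) {
                void* p = std::malloc(size);
                std::memset(p, 0, size);   // first touch triggers the page faults
            }));
    return 0;
}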

Update: I attached perf stats for the various executions with numactl. Interleaving/localalloc has no effect whatsoever (the QPI bus is not the bottleneck; we verified that with PCM, QPI link load is at 1%).

Update 2: I also added a chart depicting the results for glibc, tbbmalloc, and tcmalloc.

Attachments:
perfstats.txt (4.28 KB)
chart.png (8.09 KB)

Hello,

Was the tbbmalloc taken from 4.1 update 1?

--Vladimir

The perf stats are for 4.0; just ran the prototype with tbb41_20121003oss (4.1 update 1) and got the same result though.

>>...My question is if someone experienced similar issues. Stack-allocation is not an option for us as we want to keep the heap-allocated
>>vectors even after the pipeline ran...

I reported some problems with 'scalable_allocator' a couple of months ago and got no clear response from the TBB team. Take a look at this thread:

Forum topic: The memory manager cannot access sufficient memory to initialize; exiting
Web-link: http://software.intel.com/en-us/forums/topic/277187

Quote:

Tobias M. wrote:

The perf stats are for 4.0; just ran the prototype with tbb41_20121003oss (4.1 update 1) and got the same result though.


OK, thanks. We'll take a look and come back.
--Vladimir

Thanks. If you need the code sample, ping me.

Tobias,

Yes, could you please send the reproducer?

Attached prototype.cpp (as .txt, since the forum does not accept .cpp). Compiled with g++-4.7 -std=c++11 -march=native -O3.

Attachments:

prototype.txt (1.33 KB)

Thanks, got it!

Just to check my understanding: are you interested in raw allocation and memory-touching performance, without subsequent releasing? And during performance measurement you use something like "numactl --membind=0 -- ./foo", i.e. without CPU binding?

Yes, I'm interested in multithreaded raw allocation and memory touching of, say, 8MB chunks without releasing the memory within the pipeline (the memory is freed only when the application is closed). For performance measurements I used numactl --cpunodebind=0 ... numactl --cpunodebind=0-3 (we have 4 CPUs). Memory binding, interleaving (--interleave=all) and --localalloc have no effect here. With CPU binding we see the following: 1 CPU (speed x), 2 CPUs (speed ~ x/2), 3 CPUs (speed ~ x/3) [speed = the higher the better]. So what we see is that performance worsens with the number of CPUs.

Our system configuration for reference: 4 socket Nehalem-EX (4 X7560 CPUs) with a 5520/5500/X58 chipset and 1 TB main memory (256 GiB per socket).


A few comments on your test program:

First off, the program is only performing allocation. Scalable allocators are tuned for reallocation of nodes previously returned after allocation. IOW, the first allocation may suffer some additional overhead, for the benefit of reduced return and reallocation times.

Secondly, in parallel_pipeline designs you pass a token containing context, meaning not only can buffer pointers be passed through the stages of the pipeline, but they can also persist through recycling of the token (with context) from the back end of the pipeline to its front end. IOW, once allocated, keep the 8MB buffer (pointer) inside the token, as sketched below. Only when a larger buffer is required would your code free and then reallocate.

Thirdly, run your performance tests longer than 1.66 seconds.
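A minimal sketch of the buffer-recycling idea from the second comment (not taken from the prototype; since parallel_pipeline does not expose its tokens directly, a small pool with one buffer per in-flight token, here via a concurrent_bounded_queue, stands in for keeping the buffer "inside the token"; kChunkSize, kTokens and kPasses are made-up names):

#include <tbb/pipeline.h>
#include <tbb/concurrent_queue.h>
#include <cstdlib>
#include <cstring>

int main() {
    const std::size_t kChunkSize = 8u * 1024u * 1024u;  // 8 MB per buffer
    const std::size_t kTokens    = 16;                  // max live tokens == number of buffers
    const std::size_t kPasses    = 1000;                // pipeline iterations

    // One 8 MB buffer per token, allocated once and reused for every pass.
    tbb::concurrent_bounded_queue<char*> pool;
    for (std::size_t i = 0; i < kTokens; ++i)
        pool.push(static_cast<char*>(std::malloc(kChunkSize)));

    std::size_t produced = 0;
    tbb::parallel_pipeline(kTokens,
        tbb::make_filter<void, char*>(tbb::filter::serial_in_order,
            [&](tbb::flow_control& fc) -> char* {
                if (produced++ == kPasses) { fc.stop(); return NULL; }
                char* buf;
                pool.pop(buf);                    // reuse an already-faulted buffer
                return buf;
            })
      & tbb::make_filter<char*, void>(tbb::filter::parallel,
            [&](char* buf) {
                std::memset(buf, 0, kChunkSize);  // only the first pass pays the page faults
                pool.push(buf);                   // hand the buffer back for the next pass
            }));

    // Buffers are kept until the application exits, as in the original design.
    return 0;
}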

Jim Dempsey

www.quickthreadprogramming.com

I didn't feel that I had sufficient information to say something sensible like Jim may have (depending on what the actual details are). The original question is about a flow graph, but the example code is about a parallel_pipeline, which is not quite the same thing. Will the data be scattered, or can good cache locality be expected? If the latter, I would allocate at the start and deallocate at the end (here my advice differs from Jim's second comment, because the tasks that hold the tokens may be stolen, so keeping the memory regions with the tokens may be counterproductive) and let the scalable allocator automatically deliver excellent performance after an initial warm-up period that should be ignored in any useful benchmark (here I concur with Jim's first and third comments). This depends on how well these characteristics now hold for allocations of about 8 MB, compared to an earlier implementation that merely delegated them to malloc(); otherwise it might be better to roll your own pool based on TLS, but I haven't investigated this (yet). And again, I don't see the full picture yet.

Raf,

>>because the tasks that hold the tokens may be stolen, so keeping the memory regions with the tokens may be counterproductive

Not so, Raf. TBB does not have preemption; tasks run to completion. Threads (and the task they are currently running) can get preempted by threads of other applications (or oversubscribed threads of the same application). When the last filter completes in a parallel_pipeline, the token containing the buffer pointer(s) gets recycled and is either immediately re-consumed by the thread exiting the pipeline or enqueued and made available for any available thread to use (the choice depends on the implementation). I do agree with you that unless the parallel_pipeline is using all threads, the buffer pointers should NOT be in TLS (i.e. place them inside the token).

Jim Dempsey

www.quickthreadprogramming.com

Hmm, it's been a while since I've been fiddling with the implementation of a pipeline (back when tasks still had depth!). Perhaps something changed in this regard as well, perhaps I just forgot? I've had another look (release 4.1), and after the last stage a stage_task does bypass the scheduler to wrap around to the beginning, so locality would be preserved.

However, unless it's documented, a new implementation might change that again.

And also now I'm confused about how you would even "keep the 8MB buffer (pointer) inside the token", because a token is just an invisible and therefore inaccessible implementation issue, isn't it? You set the maximum number of tokens, but that's it as far as tokens go. I'll have another look later, because now I'm really starting to get worried. :-)

I see that the problem is identical to what I experienced a couple of months ago. There is a performance problem with scalable_malloc after some number of allocations has been completed by the function. As soon as at least 1.5GB of memory is allocated, the performance decreases and a regular CRT-based (heap-based) malloc outperforms scalable_malloc.

Here are a couple more notes related to the test case provided.

>>...However, the application does not scale on our NUMA architecture (4 CPU Nehalem-EX). We pinned the problem down to memory
>>allocation and created a minimal example in which we have a parallel pipeline that just allocates memory from the heap (via malloc
>>of a 8MB chunk, then memset the 8MB region...

For example, on a Windows platform scalable_malloc doesn't allocate memory from the heap; it uses the Win32 API function VirtualAlloc. VirtualAlloc is called with the MEM_COMMIT flag, which initializes the allocated memory to 0. So, a call to:
...
memset( buffer, 0, chunkSize );
...
looks useless, and I would try not to do it. What if the memset is "responsible" for that performance decrease? Just comment it out and run your tests again.

Raf, if you're going to look at it could you verify what happens when the test is executed on a Linux platform? Does it allocate the memory from the heap?

>>...For us this smells like heap contention...

As I already mentioned, on a Windows platform scalable_malloc doesn't allocate memory from the heap. Just to clear up as many uncertainties as possible, I recommend creating a test case that simply verifies the scalability of scalable_malloc, without using any additional TBB features / classes or calls to any CRT mem-functions. What if there is actually a problem with the parallel_pipeline functionality?
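A minimal sketch of such an isolated test (assuming std::thread is acceptable for spawning the workers, so that neither the TBB scheduler nor CRT memory functions are involved; the pages are touched with plain stores instead of memset):

#include <tbb/scalable_allocator.h>
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

int main(int argc, char* argv[]) {
    const std::size_t kChunkSize   = 8u * 1024u * 1024u;  // 8 MB per allocation
    const std::size_t kChunksTotal = 512;                 // 4 GB overall
    const unsigned    nThreads     = argc > 1 ? std::max(1, std::atoi(argv[1])) : 1;

    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t) {
        workers.emplace_back([=] {
            // Each thread allocates its share and never frees it,
            // mirroring the behaviour of the pipeline test.
            for (std::size_t i = 0; i < kChunksTotal / nThreads; ++i) {
                char* p = static_cast<char*>(scalable_malloc(kChunkSize));
                // Touch every 4 KB page so the measurement includes the page faults.
                for (std::size_t off = 0; off < kChunkSize; off += 4096)
                    p[off] = 0;
            }
        });
    }
    for (std::thread& w : workers) w.join();

    const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                        std::chrono::steady_clock::now() - start).count();
    std::printf("%u threads: %lld ms\n", nThreads, static_cast<long long>(ms));
    return 0;
}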

>>And also now I'm confused about how you would even "keep the 8MB buffer (pointer) inside the token", because a token is just an invisible and therefore inaccessible implementation issue, isn't it?

The token would be hidden in your base class; the buffer or buffer pointers would be inside the derived class. Be mindful that the number of buffers == the number of tokens. Also, if (when) using affinity pinning, perform the allocation and first touch inside the parallel compute filter. This may require a priming NULL buffer that passes once through the (serial) input filter and a NOP through the remainder of the pipeline on the first pass.

Jim Dempsey

www.quickthreadprogramming.com

"The token would be hidden in your base class, the buffer or buffer pointers would be inside the derived class."

Base class and derived class of what?

Tobias,

Could you provide an update on the status of the problem, please? Also, I don't think it is possible for many of us to reproduce the problem on a NUMA system; I personally don't have access to one.

>>...
>>- On the NUMA architecutre when we use more than one socket, the time our application runs increases with the number of sockets
>>(negative linear scale-"up")
>>...

Is there any possibility that the test application always allocates the 8MB memory blocks from memory local to the CPUs selected for processing?

>>...2 CPUs (speed ~ x/2), 3 CPUs (speed ~ x/3)...

What if these CPUs simply "fight" for access to the same local memory? Could it be the case?

Does TBB have any control of NUMA related features, like access to local or remote memories?

I could be wrong with my assumptions, but it looks like the performance drop could be caused by a different problem, not related to TBB, and you need to take that into consideration.

Best regards,
Sergey

A short status update and a partial answer to the described issue: the calls to malloc or scalable_malloc are not the bottleneck; the bottleneck is rather the page faults triggered by memsetting the allocated memory. There is no difference between glibc malloc and other scalable allocators such as Intel's TBB scalable_malloc: for allocations larger than a specific threshold (usually 1MB if nothing is freed; configurable via mallopt) the memory is allocated by an anonymous mmap. Initially all pages of the mapping point to a kernel-internal page that is pre-zeroed and read-only. When we memset the memory, this triggers an exception (mind that the kernel page is read-only) and a page fault; a new page is zeroed at that time. Small pages are 4KB, so this happens 2048 times for the 8MB buffer we allocate and write. What I measured is that these page faults are not so expensive on single-socket machines, but get more and more expensive on NUMA machines with multiple CPUs.
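A minimal sketch of how these page faults can be observed (assuming getrusage() is an acceptable way to count minor faults; roughly 2048 faults are expected for the first touch of an 8MB buffer with 4KB pages):

#include <sys/resource.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

static long minor_faults() {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main() {
    const std::size_t kChunkSize = 8u * 1024u * 1024u;  // 8 MB
    void* buf = std::malloc(kChunkSize);  // large request -> served by an anonymous mmap

    long before = minor_faults();
    std::memset(buf, 0, kChunkSize);      // first touch: one fault per 4 KB page
    long after = minor_faults();
    std::printf("minor faults for first memset: %ld\n", after - before);

    before = minor_faults();
    std::memset(buf, 0, kChunkSize);      // second touch: pages are already mapped
    after = minor_faults();
    std::printf("minor faults for second memset: %ld\n", after - before);

    std::free(buf);
    return 0;
}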

Solutions I have come up with so far:

- Use huge pages: helps but only delays the problem

- Use a preallocated and pre-faulted (either memset or mmap + MAP_POPULATE) memory region (memory pool) and allocate from there: helps, but one does not necessarily want to do that (see the sketch after this list)

- Address this scalability issue in the Linux kernel
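A minimal sketch of the second option (the pool size, the bump-pointer handout and the names are assumptions for illustration; a real pool would need a thread-safe handout, e.g. an atomic offset):

#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t kChunkSize = 8u * 1024u * 1024u;  // 8 MB chunks
    const std::size_t kPoolSize  = 256 * kChunkSize;    // 2 GB pool, sized up front

    // MAP_POPULATE pre-faults every page at mmap time, so the later writes in
    // the pipeline no longer pay the page-fault cost on the hot path.
    void* pool = mmap(NULL, kPoolSize, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (pool == MAP_FAILED) { std::perror("mmap"); return 1; }

    // Hand out chunks by bumping an offset; nothing is ever freed, matching
    // the prototype, which releases memory only at application exit.
    std::size_t offset = 0;
    char* chunk = static_cast<char*>(pool) + offset;
    offset += kChunkSize;
    (void)chunk;  // ... pass 'chunk' to the pipeline instead of calling malloc ...

    munmap(pool, kPoolSize);
    return 0;
}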

I'd like to understand how much memory you actually need to process your array of data.

>>...There is no difference between glibc malloc and other scalable allocators such as Intel's TBB scalable_malloc...

This is really strange, because by design they have different implementations and scalable_malloc has to be faster. I haven't had a chance to look at the Linux code for scalable_malloc, and I think you need to take a look at it more closely.

I finally decided to use malloc because scalable_malloc didn't give me any performance advantage when allocating memory blocks larger than 1.5GB (and this is what I need). My actual goal is even larger blocks (up to 2.8GB with the Win32 API function VirtualAlloc) on 32-bit Windows platforms that support the Address Windowing Extensions (AWE) technology, since it allows allocating up to 3GB of memory for a 32-bit application. Unfortunately, it is not applicable in your case since you're using Linux.

>>...When we memset the memory, this triggers an exception (mind the kernel page is read-only) and a page fault...

Absolutely the same happens on Windows platforms and it doesn't surprise me.

>>...Address this scalability issue in the Linux kernel...

I can't say anything here.

Sergey,

Large blocks (in your case 1.5GB) tend to be allocated once and kept around for as long as a block of that size will be used (or later re-used) by your application. For allocations like this, you want to bypass the scalable allocator, because there is no performance benefit in using a scalable allocator in this circumstance. Note that once allocated (on a memory-limited system) you would not want to return this allocation, as it may later get fragmented. You will want to defer returning this buffer until a buffer of this size will no longer be required later in the application.

A typical scalable allocator will have a slab size wired in (some may have several slab sizes), where a slab is a large block (4MB or larger) which is allocated by a non-scalable allocator (e.g. VirtualAlloc) and then, subsequent to that allocation, may be (on demand) subdivided into pools of specific-sized nodes (16, 32, 64, ... bytes), where the creation of a specific-sized node pool is deferred until there is demand for a node of that size. Any allocation larger than the slab size will either fall into a higher slab-size category or else bypass the scalable allocator and go directly to the non-scalable allocator. In your case (1.5GB) the request bypassed the scalable allocator memory, i.e. it did not come out of a slab, and went "directly" to VirtualAlloc. "Directly" still involves some scalable-allocator overhead to classify the size request and to package it for eventual return. Your application's allocations of these large blocks are more efficiently performed by specific code that bypasses the scalable allocator.

Jim Dempsey

www.quickthreadprogramming.com

>>>What if these CPUs simply "fight" for access to the same local memory? Could it be the case?>>>

Probably NUMA-related memory distances, coupled with threads being executed on different nodes and forced to access non-local memory, could be responsible for the performance degradation related to memory accesses.
When the number of nodes is greater than 1, some performance penalty is to be expected. IIRC the penalty is measured in units of "NUMA distance", normalized so that every access to local memory has a cost of 10 (the normalized value, i.e. 1.0).
When the process accesses off-node (remote) memory, from the NUMA point of view some penalty is added because of the overhead of moving data over the NUMA interconnect; accessing a neighbouring node can add up to 0.4, so the total penalty can reach 1.4.
More information can be found in the ACPI documentation.

>>...More information can be found in ACPI documentation...

Iliya, please post the link to these docs. Thanks in advance.

>>>Iliya, please post the link to these docs. Thanks in advance.>>>

I searched through the ACPI standard, but the information contained there is very scarce. I only found a structure describing the distance table and some explanation regarding the NUMA-distance calculation. Look for the SLIT and SRAT data structures.

Here is the link: http://www.acpi.info/spec.htm
Here is a NUMA topology diagram, albeit for the Magny-Cours architecture: http://code.google.com/p/likwid-topology/wiki/AMD_MagnyCours8
Very interesting information regarding NUMA performance degradation: http://communities.vmware.com/thread/391284
Also follow this link: https://docs.google.com/viewer?a=v&q=cache:K06wsPrSIFYJ:cs.nyu.edu/~lern...
Also read this article: http://kevinclosson.wordpress.com/2009/08/14/intel-xeon-5500-nehalem-ep-...

Later today I will search through the chipset specification, which I believe could contain more valuable information. I will post the links.

Got it. Thanks, Iliya.

>>>Got it. Thanks, Iliya.>>>
No problem:)

I did a preliminary search through the chipset documentation and, sadly, NUMA was mentioned only once.
Here is the link: http://www.intel.com/content/www/us/en/chipsets/server-chipsets/server-c...
It seems that no valuable information regarding the NUMA API implementation can be found there, beyond the Linux numa header files.
