How to control NUMA locality of ALLOCATE'd memory in OpenMP programs?

For an OpenMP program I want to enforce the use of NUMA-local memory with threads that are permanently bound to the same CPU. These threads do frequent ALLOCATEs and DEALLOCATEs.

It seems like kmp_malloc() is the only way of enforcing NUMA locality (the same as thread locality in this context) for memory that is frequently allocated and freed. The first confusing issue is that kmp_malloc() is not mentioned in the ifort documentation although it is in the libraries.

Fortran ALLOCATE does call malloc(). Even if malloc() along with proper array initialization gets me local memory at the first invocation, that does not help for long. After some time of frequent malloc() and free() you end up with a fragmented bag of local and remote memory pages. malloc() has no knowledge of locality. Please correct me if I'm wrong.

Now I don't see a supported way of forcing ALLOCATE to use kmp_malloc() instead of malloc(). Apparently the only way might be intercepting malloc() using LD_PRELOAD. That's not a clean way of doing things.

While MALLOC is available as a Fortran intrinsic, that is not the case for KMP_MALLOC. A more desirable solution could be an environment variable that switches ALLOCATE from malloc() to kmp_malloc(). Any other ideas? Highly appreciated.

ALLOCATE should not control the NUMA placement of the allocation. That placement should occur when the memory is first touched, e.g. when it is initialized.

I speculate that the Fortran 2008 BLOCK construct would permit ALLOCATABLE objects to be declared inside parallel regions, thus making them entirely private in the OpenMP sense. It's not clear to me if this is what you are looking for.
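
For illustration, a minimal sketch (untested; it assumes the compiler accepts a BLOCK construct inside a parallel region) of what that might look like:

!$OMP PARALLEL
BLOCK
   REAL, ALLOCATABLE :: work(:)
   ALLOCATE(work(100000))    ! each thread gets its own private allocation
   work = 0.0                ! first touch done by the owning thread
   ! ... use work ...
   DEALLOCATE(work)          ! released before the BLOCK ends
END BLOCK
!$OMP END PARALLEL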

Fortran now has some portability functions/subroutines such that an allocation can be performed in C/C++ and then the appropriate array descriptor can be created. You might consider doing your NUMA allocations in this manner. You may still have a NUMA pool contamination problem should the thread that returns the memory not reside on the same NUMA node. Your C/C++ code could prepend a "which NUMA node" tag in a similar way as you do for aligned malloc.

Jim

www.quickthreadprogramming.com

I think you're looking for something like hwloc (Portable Hardware Locality), an associated project of the Open MPI group.

http://www.open-mpi.org/projects/hwloc/

I think what you have to ensure is that the allocating thread is on the core you want, using something like the hwloc library.

Ricardo Reis 'Non Serviam' @ http://www.lasef.ist.utl.pt @ http://www.radiozero.pt @ http://rreis.tumblr.com @ http://www.flickr.com/photos/rreis

Jim, yes, this is what I'm looking for. Could you please point me to those portability functions you mentioned? Thanks a lot.

Michael

Best Reply

Michael,

Look under C/C++ interoperability (not portable) functions.

Principally C_F_POINTER

Something like this untested code:

USE, INTRINSIC :: ISO_C_BINDING
REAL, POINTER :: ARRAY(:,:,:)
TYPE(C_PTR) :: CallocatedArray
...
! YourCMalloc: your own C routine that performs the NUMA-aware allocation and returns a C pointer
CallocatedArray = YourCMalloc(nX * nY * nZ * SIZEOF(ARRAY(1,1,1)))
if (.not. C_ASSOCIATED(CallocatedArray)) call Oops()
CALL C_F_POINTER(CallocatedArray, ARRAY, (/ nX, nY, nZ /))

Don't forget to return the allocated memory when you are done with it.

Jim Dempsey

www.quickthreadprogramming.com

Excellent. This is it. Thanks !

And don't skip over Jim's point that Linux uses a lazy paging scheme. malloc() simply reserves virtual address space - physical pages are allocated on first touch. Watch which thread does the first touch.

Thus, you can still ALLOCATE your data in the main thread, just don't touch (e.g. initialize) it there. Then drop into an OpenMP region and have each thread touch its share of the array. This will guarantee locality IFF your subsequent OpenMP regions assign the data distribution to the threads in the same manner as the initial touch.

Do you understand this? For example, what you DON'T want to do is touch the data using a column distribution of a 2-D array, and in a subsequent OMP region go across the rows.
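
To make that concrete, here is a minimal sketch (assuming static scheduling, the same loop bounds in both regions, and a problem size n defined elsewhere) of a first-touch initialization that matches the later access pattern:

integer :: j
real(8), allocatable :: a(:,:)
...
allocate(a(n,n))                      ! no physical pages are committed yet

!$omp parallel do schedule(static)    ! first touch: each thread initializes
do j = 1, n                           ! the columns it will later work on
   a(:,j) = 0.0d0
end do
!$omp end parallel do

!$omp parallel do schedule(static)    ! later work uses the same column
do j = 1, n                           ! distribution, so the pages stay local
   a(:,j) = 2.0d0 * a(:,j)
end do
!$omp end parallel do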

And for the current generation of Nehalem Core i7/i5 systems, remote memory access may be 1.7x higher in latency than access through the local memory controller, and maybe half the bandwidth, but it is still WAY faster than the previous-generation FSB technology.

And of course, Windows is different - it physically allocates pages on malloc(), so you would have to replace your malloc() with calls to VirtualAlloc to get lazy page allocation.

And for Mac OS, forget it all. The underlying OS won't pin threads, although it typically does not migrate threads unless you oversubscribe threads to cores.

Yes indeed, the first touch policy would work fine if you malloc just once, touch it, then use it for the rest of the thread's lifetime. However, in my case there are frequent ALLOCATEs and DEALLOCATEs with varying chunk sizes within the parallel regions. A deallocated chunk cannot be "un-touched" afaik, so when it is reallocated you have no guarantee of locality.

This is why I'm planning to use Jim's suggestion to invoke kmp_malloc() instead of malloc(). Calling kmp_malloc() from within a parallel region gives me memory from the thread-local heap. In addition, I'm binding threads to CPUs right from the start, so thread locality implies NUMA locality.
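
For reference, a hedged sketch of how kmp_malloc()/kmp_free() could be called from Fortran (untested; the interfaces are written from the C prototypes of the Intel OpenMP runtime, and n stands for the number of elements to allocate):

USE, INTRINSIC :: ISO_C_BINDING
INTERFACE
   FUNCTION kmp_malloc(nbytes) BIND(C, NAME='kmp_malloc')
      USE, INTRINSIC :: ISO_C_BINDING
      INTEGER(C_SIZE_T), VALUE :: nbytes
      TYPE(C_PTR) :: kmp_malloc
   END FUNCTION
   SUBROUTINE kmp_free(ptr) BIND(C, NAME='kmp_free')
      USE, INTRINSIC :: ISO_C_BINDING
      TYPE(C_PTR), VALUE :: ptr
   END SUBROUTINE
END INTERFACE
...
REAL(8), POINTER :: buf(:)
TYPE(C_PTR) :: cp
...
!$OMP PARALLEL PRIVATE(cp, buf)
cp = kmp_malloc(INT(n, C_SIZE_T) * 8_C_SIZE_T)   ! allocation from the calling thread's heap
CALL C_F_POINTER(cp, buf, (/ n /))
! ... use buf ...
CALL kmp_free(cp)
!$OMP END PARALLEL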

Maybe I'll try this first with a STREAM type of test case. Give me a few days ...

How do you bind the thread to the core? From the outside or from the inside?

Ricardo Reis 'Non Serviam' @ http://www.lasef.ist.utl.pt @ http://www.radiozero.pt @ http://rreis.tumblr.com @ http://www.flickr.com/photos/rreis

STREAM is a good testbed.

I don't want to be a spoiler, but let me give a hint for STREAM: add and triad may start to saturate the bandwidth of a single local memory controller with more than 3 threads. Thus, if you have 2 sockets and want to get maximum results for add and triad on 4 threads, use 3 local and 1 remote. Copy and scale will be fine with 4 threads on one socket through the one memory controller.

STREAM is a fun testbed for these sorts of thought experiments. It will be more interesting to see how your real-world app does with 4 threads on a socket with the enhancements you describe to ensure you're getting memory off the local controller.

this sounds like a good experiment. Keep us posted.

Ricardo,

the easiest way is to set KMP_AFFINITY to a value like "compact" or to an explicit list like "proclist=[1,3,5,7,0,2,4,6],explicit". This is an Intel feature.

With other compilers it's less easy. While the numactl command is o.k. to control process binding, it is insufficient for thread binding. So you might end up using the libnuma API from inside the program.
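
For example, a hedged sketch of binding the calling thread to a NUMA node from inside the program (untested; the interface is written from the libnuma C prototype int numa_run_on_node(int node), and my_node is a placeholder for the node you want the thread to run on):

INTERFACE
   FUNCTION numa_run_on_node(node) BIND(C, NAME='numa_run_on_node')
      USE, INTRINSIC :: ISO_C_BINDING
      INTEGER(C_INT), VALUE :: node
      INTEGER(C_INT) :: numa_run_on_node
   END FUNCTION
END INTERFACE
...
!$OMP PARALLEL
! my_node: the NUMA node chosen for this thread (e.g. derived from the thread number)
if (numa_run_on_node(my_node) /= 0) print *, 'numa_run_on_node failed'
!$OMP END PARALLEL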

Michael

I guess most of us assumed the OP was using some form of affinity. Evidently, NUMA locality is destroyed when a thread hops to a different CPU, regardless of what precautions are taken with allocation and first touch.

KMP_AFFINITY is not as easy a solution as it ought to be. With newer CPUs, particularly with HT enabled, it's necessary to check manually whether proclist needs adjusting; automatic options such as compact don't always work. The documentation warns against assuming the physical option will work.

The widely used OpenMP and MPI implementations all have useful affinity options; unfortunately, they are all different. The difficulty of making explicit-threading affinity work effectively on a variety of platforms is among the reasons for choosing OpenMP and MPI parallelism (and auto-parallel schemes, which are implemented on top of OpenMP and MPI). TBB also does affinity, but in its released form it isn't compatible with the other threading options I mentioned.

The verbose option, e.g. KMP_AFFINITY=physical,0,verbose, sends diagnostics to the screen which you may use to see whether the assignment of threads to logicals is satisfactory.

If you are performing a large number of allocate/deallocate then consider using a NUMA aware scalable memory allocator. You may need to roll your own. I have one that works under Windows but you could make your own specialty pool allocator. There is a TBB allocator that is scalable but to my limited knowledge it is not NUMA aware.

Simple things can be done. Do it in C++ as it is easier to fake out the Fortran side.

An example of this is:

For each thread

maintain a linked list of previously returned similar sized objects (e.g. in cache line sized units).
always have an object returned (freed) to the thread that allocated it (this can be done on a per-NUMA-node basis instead, but save this enhancement for later)

allocation checks the thread's local list first, falls back to heap allocation second

deallocation returns to local list

theft: when allocation fails (local list empty, heap consumed), take objects away from other threads' lists.

That is the simple description; a minimal sketch follows below.
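
A minimal sketch of the per-thread free-list idea (written in Fortran for consistency with the rest of the thread rather than the C++ suggested above; it assumes one fixed object size and omits the theft tier):

module thread_pool
   implicit none
   integer, parameter :: BLOCK_WORDS = 1024      ! one fixed object size in this sketch
   type block_t
      real(8) :: data(BLOCK_WORDS)
      type(block_t), pointer :: next => null()
   end type
   type(block_t), pointer :: free_head => null() ! head of this thread's free list
   !$omp threadprivate(free_head)
contains
   function pool_get() result(p)
      type(block_t), pointer :: p
      if (associated(free_head)) then            ! local list first ...
         p => free_head
         free_head => p%next
      else
         allocate(p)                             ! ... heap allocation second
      end if
      p%next => null()
   end function
   subroutine pool_put(p)
      type(block_t), pointer :: p
      p%next => free_head                        ! return to the local list
      free_head => p
   end subroutine
end module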

Jim

www.quickthreadprogramming.com

thanks for the pointers. I remember that in NUMA machines one of the alternatives is hybrid codes: one MPI process per CPU, n OpenMP threads per core/CPU

Ricardo Reis 'Non Serviam' @ http://www.lasef.ist.utl.pt @ http://www.radiozero.pt @ http://rreis.tumblr.com @ http://www.flickr.com/photos/rreis

With more than 1 MPI process per node, you still need an affinity scheme in the MPI to keep the threads from each MPI process on a single CPU. This is a relatively recent development in MPI support. A common method is for the MPI to check the OMP_NUM_THREADS setting so as to affinitize each process to an appropriate set of cores.

sorry, but I'm quite mixed up on that last part; just checking if I got you right.

I meant: bind each MPI process to a multicore CPU (socket) for the duration of the run.

The OpenMP threads spawned by each MPI process will then naturally reside on that process's CPU. I think the MPI process shouldn't spawn more threads than the number of cores of its CPU, so OMP_NUM_THREADS should be limited to the number of cores in each CPU. Or you can feed the program a hierarchy table of the machine in some way and set the number of threads from inside the program (using omp_set_num_threads). Maybe using that hwloc library mentioned earlier (or do you know another, friendlier way of getting the machine layout? Maybe writing it to a file to set things up? We are working on an MPI-GPU code right now and that's on the table, because the machine has multiple GPUs and the device numbers are not consecutive...)

cheers,

Ricardo Reis 'Non Serviam' @ http://www.lasef.ist.utl.pt @ http://www.radiozero.pt @ http://rreis.tumblr.com @ http://www.flickr.com/photos/rreis

Ricardo,

regarding the hybrid MPI+OpenMP approach, you are right. MPI should not spawn the threads of a single rank across more than one socket or more than one shared-L3-cache domain (which is the same thing for most processors these days). Otherwise you get ugly performance side effects due to false cache line sharing. I have worked on the hybrid matter quite a bit. You may have a look at this paper:

http://www.hp.com/techservers/hpccn/hpccollaboration/ADCatalyst/downloads/RiedmannMultiCoreLM_R10.pdf

Regarding analysis of the machine layout - that's available in

/sys/devices/system/cpu/cpu*/topology

While it's not always easy to extract exactly what you want - at least the information is all there.
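
For example, a small sketch (untested) of picking up the socket number of cpu0 from Fortran using one of those files:

integer :: pkg, ios
open(10, file='/sys/devices/system/cpu/cpu0/topology/physical_package_id', &
     status='old', action='read', iostat=ios)
if (ios == 0) then
   read(10, *) pkg                ! socket (package) that cpu0 belongs to
   close(10)
   print *, 'cpu0 is on socket', pkg
end if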

Cheers,

Michael

Thanks for the pointers Michael, didn't know about that. I'll download and read.

cheers!

Ricardo Reis 'Non Serviam' @ http://www.lasef.ist.utl.pt @ http://www.radiozero.pt @ http://rreis.tumblr.com @ http://www.flickr.com/photos/rreis

Finally, here's some feedback. It turned out that kmp_malloc() does not behave as expected, as it is not NUMA aware. Instead, I ended up using numa_alloc_onnode() and numa_free(), which are part of libnuma-1.0.3.

The downside of this approach is that ALLOCATE and DEALLOCATE must be replaced with a custom API, and that numa_free() must be given the size of the array to be freed, which implies some extra bookkeeping. That has to do with the mmap() allocation mechanism used in libnuma instead of brk(). I also suggest using numa_alloc_onnode() only for large arrays, as the underlying mmap() and munmap() system calls are rather expensive and thus not suited for frequent calling as in C++ style code.

Another topic here is to create a generic API for all types of arrays w.r.t. elementary type and number of dimensions.

Testing with a modified STREAM benchmark on a 2-socket Nehalem box exposed a 30-50% performance gap between best-case and worst-case allocation, so this stuff really matters.

The simplified way to allocate from Fortran looks like below. This approach is portable; it can be used with PGI 10 and gfortran 4.3 as well.

Enjoy,

Michael

USE, INTRINSIC :: iso_c_binding
INTERFACE

   ! C prototype: void *numa_alloc_onnode(size_t size, int node);
   FUNCTION numa_alloc_onnode(nbytes, node) BIND(C)
      USE, INTRINSIC :: ISO_C_BINDING
      INTEGER(8), VALUE :: nbytes
      INTEGER(4), VALUE :: node
      TYPE(C_PTR) :: numa_alloc_onnode
   END FUNCTION

END INTERFACE
...
REAL*8, dimension(:), pointer :: array
TYPE(c_ptr) :: cp
...
! allocate nbytes on the given NUMA node and map the result onto a Fortran pointer
cp = numa_alloc_onnode(nbytes, node)
if (.not. c_associated(cp)) then
   print *,'numa_alloc_onnode (',nbytes,') failed'
   stop
end if
call C_F_POINTER(cp, array, (/ dim1 /))
...
! From now on ARRAY can be used like any normal FORTRAN array, except for the deallocation.
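
And for completeness, a hedged sketch of the matching deallocation (untested; the interface is written from the libnuma C prototype void numa_free(void *start, size_t size), and the size has to be remembered from the allocation - the extra bookkeeping mentioned above):

INTERFACE
   SUBROUTINE numa_free(ptr, nbytes) BIND(C)
      USE, INTRINSIC :: ISO_C_BINDING
      TYPE(C_PTR), VALUE :: ptr
      INTEGER(8), VALUE :: nbytes
   END SUBROUTINE
END INTERFACE
...
call numa_free(cp, nbytes)    ! cp and nbytes kept from the numa_alloc_onnode() call
nullify(array)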

Michael,

Thanks for your follow-up. All too often when someone posts a query and receives several suggestions, then derives their solution, they tend not to report back. So thank you.

Some things for you to consider to embellish your fundamental technique.

1) Consider adding a flavor for aligned allocations. Often aligned allocations are helpful for Small Vector processing of arrays.

2) Consider layering your allocation inside a pool of pools (this is what I do with the QuickThread scalable allocator). Same or similar sized allocations occur frequently within an application. numa_alloc_onnode() is a generalized function that may tend to conserve memory by using a single pool per node. Application performance can be improved greatly by expending some additional RAM in exchange for a reduced number of locks, critical sections, and node-tree searches. Using a pool-of-pools structure can greatly reduce allocation time.

Jim Dempsey

www.quickthreadprogramming.com

Jim, yes indeed, the next logical step would be putting a fast multi-arena allocator (maybe one instance per thread) on top of numa_alloc_onnode(), which would even allow intercepting malloc() and having a transparent solution without introducing a special API. You may argue whether or not intercepting malloc() is a legal solution ...

For C++ programs that is mandatory. For most of my classic Fortran style programs, which do all their allocations in a central place during startup (just recently migrated from COMMON blocks to ALLOCATABLEs), it is o.k. to make just a few changes as outlined above.

Anyway - your suggestions have helped develop this quite a bit. Thanks.
Michael

From my experience in writing one of these, the most efficient (fastest) method uses a 3-tiered approach:

per-thread
per-node
per-system

Here the deallocate supplies not only the pointer but also the size of the memory block being returned.
These are nested pools. Per-thread pool access is without locks. Per-thread to/from per-node transfers use locks, but only one lock per store/fetch of a pool (not per node). A larger pool size means larger memory requirements but less overhead - a tradeoff between size and speed.

On an 8-thread system, the pool-based system is about 2.8x faster than malloc, and appears to have a linear scaling factor of 0.532225 (measured on a Core i7 920).

Jim Dempsey

www.quickthreadprogramming.com
