verifying first-touch memory allocation

Is anyone aware of a basic tool for verifying first-touch memory allocation on a NUMA platform such as Xeon EP?

The usual expectation is that pinning each MPI process to a single CPU should make this happen automatically (barring running out of memory, etc.), unless a non-NUMA BIOS option has been selected.

Likewise, in OpenMP, initializing data with a parallel access pattern consistent with the way the data will later be used should result in allocation local to each CPU rather than in remote memory.

For this to work, apparently, MPI or OpenMP libraries have to coordinate with the BIOS.

It seems there should be a way to determine the address ranges that are local to each CPU on a shared-memory platform, and to test where each thread's first-touch allocations actually land.

As you might guess, I'm looking to verify suspected performance problems which seem to indicate that threads within MPI ranks pinned to certain CPUs are consistently using remote memory.


Hey Tim,

Have you looked at NumaTop? Assuming you are using Linux...

It seems like, if you have each thread malloc a big array and repeatedly run through it, something like numatop should be able to show the local vs. remote stats pretty easily. I've never actually used NumaTop... I'm just pretty sure the guys who created it know what they are doing.


Thanks, that looks like an interesting option.  It requires building a custom kernel with PEBS latency counters, and the step "build kernel as usual" (it says that verbatim in the man page) looks a bit daunting.

I was able to build a running kernel (including access to this forum page) according to the numatop instructions as best I understood.  However, numatop says "CPU is not supported."  Not surprisingly, numatop would at a minimum need to be rebuilt for it to run on the Intel(R) Xeon Phi(TM).

GUI tools such as the Red Hat system monitor are still present but show fewer cores than they did under Red Hat (where they didn't see all the cores either).

/proc/cpuinfo still looks OK.

The developers confirmed that it's sufficient to add the CPU model number to the list in order to make numatop accept it.

Firefox has particularly bad memory locality; probably no surprise there.

My application showed around 50% remote memory accesses when running just one MPI process (OpenMP threaded across all cores) but good locality when running an even number of processes.  I must look elsewhere for problems.

Standard Linux systems track whether they were able to provide pages according to the NUMA policy requested.

You can dump the stats before and after your run using
      cat /sys/devices/system/node/node0/numastat

The output looks like:
      numa_hit 672421856
      numa_miss 632409
      numa_foreign 185449
      interleave_hit 269407187
      local_node 672420899
      other_node 633366

I find the naming a bit confusing, and typically have to run test cases using numactl with various processor and memory binding options to remind myself what they mean.

John D. McCalpin, PhD
"Dr. Bandwidth"
