The traditional SMP (Symmetric Multi-Processor) memory subsystem treated all memory accesses from any processor the same. To speed up some transactions, NUMA (Non-Uniform Memory Access) splits the memory subsystem so that each processor has both local and remote memory, and accesses to local memory are faster than accesses to remote memory. In addition, Intel® QuickPath Interconnect (Intel® QPI) technology provides faster point-to-point connections between processors and memory. With the combination of NUMA and Intel® QPI, a machine has faster local memory accesses and greater bandwidth than was previously possible with an SMP memory subsystem. This performance comes at a price, however: while some memory accesses under NUMA are faster, others are slower. If an application does not exploit NUMA effectively, performance may drop because of the additional overhead of remote memory accesses.
On multi-socket machines based on the Intel® Core™ i7 processor, NUMA is typically available as a BIOS boot option, but effective use also requires OS support and some user-level intervention.
Windows* OS support for NUMA started with Windows* Server 2003, and Linux* OS support requires kernel 2.6 or higher.
This document discusses a few isolated tips and hints for NUMA performance and may be expanded over time.
How to activate NUMA
NUMA may not be enabled by default. Also, depending on the application and usage model, NUMA may not improve performance. However, it can improve performance in cases where the code is designed to exploit it.
NUMA must be activated through both the appropriate BIOS settings and OS settings. The OS requirements were mentioned above. Since BIOS settings vary from machine to machine, we can only suggest consulting the BIOS documentation. This means that in some cases NUMA must be set at boot time, not at run time.
If a NUMA API is available on Linux*, you can find details on how to use it with "man numa". This is the Linux* NUMA policy library (libnuma).
Routines like numa_available(), numa_max_node(), numa_get_interleave_mask(), etc., may be used to get a feel for how the system is configured.
There are also routines for allocating memory on specific NUMA nodes, which may be useful.
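As a minimal sketch (assuming libnuma and its header numa.h are installed; link with -lnuma), the example below queries the NUMA configuration and allocates a buffer on a particular node. The buffer size and node number are arbitrary illustrative values, not recommendations.

/* numa_query.c - minimal libnuma sketch (compile: gcc numa_query.c -lnuma) */
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>

int main(void)
{
    /* numa_available() returns -1 if the kernel has no NUMA support */
    if (numa_available() == -1) {
        printf("NUMA is not available on this system\n");
        return 1;
    }

    /* Highest node number; a two-socket NUMA system typically reports 1 */
    printf("Highest NUMA node: %d\n", numa_max_node());

    /* Allocate 64 MB of memory local to node 0 (example values only) */
    size_t size = 64UL * 1024 * 1024;
    void *buf = numa_alloc_onnode(size, 0);
    if (buf == NULL) {
        printf("numa_alloc_onnode failed\n");
        return 1;
    }

    /* ... use the buffer, for example pass it to an Intel MKL routine ... */

    numa_free(buf, size);
    return 0;
}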
Tips and situations
Each CPU in NUMA mode has local and remote memory, where local memory accesses run faster. In some cases, NUMA does not affect Intel MKL performance; in others, the effect can be dramatic. We consider one dramatic case to illustrate how to get the greatest performance.
Situation 1: NUMA is enabled on a Linux Kernel 2.6 or later and a user runs the Intel MKL LINPACK Benchmark.
The performance may be lower than expected, depending on how the memory is physically configured. The application may be placing all of its data on the same memory node, so the cores on the other nodes pay a performance penalty on every access. A better solution is to distribute the memory equally across the nodes. Change the runme_* scripts so that the xlinpack lines are preceded by "numactl --interleave=all". This forces the application to interleave its memory across the nodes. See "man numactl" for additional information on numactl scheduling and memory placement policies.
Note that performance may also vary depending on the HT (Intel® Hyper-Threading Technology) setting. If HT is off, use the interleaved policy above; otherwise, use 'numactl --localalloc'.
Situation 2: Exploiting NUMA with a first touch policy
This situation is the most common. It relates to how users set up their data and has nothing to do with how a library like Intel MKL works internally. The idea is to allocate and initialize the data in a specific way before making Intel MKL calls.
The user should allocate the data on page-size boundaries and have each thread that the application will use touch its pages first (divided the same way the application will divide the work among the threads). Under the first-touch policy, the node of the thread that first touches a page determines where that page is physically placed. Note that Intel MKL uses OpenMP*, so initializing the data with the same OpenMP* threads and pinning those threads to specific cores allows Intel MKL to work with the memory layout the user had in mind. OpenMP* thread pinning is a topic for a separate KB article.
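Below is a minimal sketch of this first-touch approach, assuming OpenMP* and posix_memalign(); the matrix dimension and the suggested follow-on Intel MKL call are placeholder examples only.

/* first_touch.c - page-aligned allocation with parallel first touch
 * (compile with, e.g., gcc -fopenmp first_touch.c) */
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <unistd.h>
#include <omp.h>

int main(void)
{
    size_t n = 4096;                    /* example matrix dimension */
    size_t bytes = n * n * sizeof(double);
    long page = sysconf(_SC_PAGESIZE);  /* allocate on a page boundary */

    double *a;
    if (posix_memalign((void **)&a, (size_t)page, bytes) != 0)
        return 1;

    /* First touch: each OpenMP thread initializes the rows it will later
     * work on, so those pages are placed on that thread's local node. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            a[i * n + j] = 0.0;

    /* ... call Intel MKL routines on 'a' here, with the same number of
     * OpenMP threads pinned to the same cores (e.g. via KMP_AFFINITY) ... */

    free(a);
    return 0;
}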
On Windows*, searching for "NUMA Support" on http://msdn.microsoft.com can give programmers ideas on how to organize their data for NUMA, as well as on setting things like affinity masks. Windows* does not appear to have an equivalent of "numactl" or the Linux* NUMA policy library.