A traditional Symmetric Multi-Processor (SMP) memory subsystem treats every memory access from every processor the same way. NUMA (Non-Uniform Memory Access) instead partitions the memory subsystem so that each processor has both local and remote memory, and accesses to local memory are faster than accesses to remote memory. In addition, Intel® QuickPath Interconnect (Intel® QPI) technology provides fast point-to-point links between sockets. Combining NUMA with Intel® QPI gives a machine faster local memory accesses and greater aggregate bandwidth than a flat SMP memory subsystem could deliver. This performance comes at a price, however: while local accesses become faster, remote accesses become slower. If an application does not exploit NUMA effectively, the overhead of remote memory accesses can actually reduce performance.
On multi-socket machines based on the Intel® Xeon® processor, NUMA is typically available as a BIOS boot option, but using it effectively also requires OS support and some user-level intervention. Windows* OS support for NUMA started with Windows* Server 2003, and Linux* OS support requires kernel 2.6 or higher.
This document will discuss a few isolated tips and hints for NUMA performance and may expand over time.
How to activate NUMA
NUMA may not be enabled by default. Also, depending on the application and its usage model, NUMA may not improve performance; it helps in the cases where the code is designed to exploit it.
NUMA must be activated through both the appropriate BIOS settings and OS settings. The OS requirements were mentioned above. Since BIOS settings vary from machine to machine, we can only suggest consulting the BIOS documentation for your system. In practice this means NUMA often must be enabled at boot time, not at run time.
Checking NUMA settings on Linux
To check NUMA capability on Linux, use the “numactl --show” command:
$> numactl --show
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
cpubind: 0 1
nodebind: 0 1
membind: 0 1
In this case, the system has 2 NUMA nodes, enumerated in the cpubind, nodebind, and membind rows above. In addition, “numactl --hardware” shows detailed NUMA information such as per-node memory sizes and relative node distances.
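If numactl is not installed, the same basic information can be read directly from sysfs. The sketch below simply counts the node directories the kernel exposes; it assumes a Linux kernel built with NUMA support, which is the norm for distribution kernels:

```shell
# Count the NUMA nodes the kernel exposes; each node appears as
# /sys/devices/system/node/nodeN. A single-socket machine, or one with
# NUMA disabled in the BIOS, typically shows just node0.
ls -d /sys/devices/system/node/node* 2>/dev/null | wc -l
```

On the two-node system shown above, this prints 2.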
Checking NUMA settings on Windows
To check NUMA capability on Windows, inspect the Task Manager. If NUMA is enabled, the available NUMA nodes are listed on the Performance tab.
On both Linux and Windows, the default memory allocation policy is “local”, meaning that memory is allocated on the NUMA node where the process is running. This is the optimal setting if all active cores are within the same NUMA node. To restrict the execution of a binary to a specific NUMA node, the “numactl --cpunodebind” command can be used. The following example runs the binary on NUMA node 1:
$> numactl --cpunodebind=1 <your binary>
Another way to restrict the execution of a binary to certain cores is the KMP_AFFINITY environment variable supported by the Intel® OpenMP* runtime; see the Thread Affinity Interface section of the Intel® Compiler Documentation for details. Similarly, the process pinning feature of the Intel® MPI Library can be used to map MPI processes to specific NUMA nodes; see the Intel MPI Library Reference Manual for details.
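As a rough sketch, a compact pinning could look like the following. Here “your_binary” is a placeholder, and granularity=fine,compact,1,0 is just one common choice; it places one OpenMP thread per physical core on adjacent cores, which tends to keep a small thread count on a single socket:

```shell
# Pin OpenMP threads with the Intel OpenMP runtime's KMP_AFFINITY
# variable: type "compact" with permute=1, offset=0 fills adjacent
# physical cores first, one thread per core.
export KMP_AFFINITY=granularity=fine,compact,1,0
echo "KMP_AFFINITY=$KMP_AFFINITY"
# ./your_binary   # placeholder for the actual OpenMP application
```

Other affinity types such as “scatter” spread threads across sockets instead, which can be preferable for bandwidth-bound codes.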
Linux offers another memory allocation policy, called interleaved mapping. If a program will use cores across multiple NUMA nodes, interleaved memory mapping is worth considering: memory is then allocated on the NUMA nodes in a round-robin fashion. Interleaved mapping can be requested with the “numactl --interleave” command. The following example interleaves memory allocation between NUMA nodes 0 and 1 for the binary execution:
$> numactl --interleave=0,1 --cpunodebind=0,1 <your binary>
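To verify which policy the kernel actually applied, the per-mapping policy can be inspected through /proc/&lt;pid&gt;/numa_maps. This sketch assumes a Linux kernel built with NUMA support (the file is absent otherwise); when the process is launched under “numactl --interleave=0,1”, the mappings are reported as “interleave:0-1” instead of the usual “default”:

```shell
# Show the memory policy of the first few mappings of our own process.
# Each line is: <address> <policy> <per-node page counts ...>
head -n 3 /proc/self/numa_maps
```

This is also a convenient way to confirm that a long-running application picked up the intended policy: substitute its PID for “self”.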
On Windows* OS, there isn’t a command equivalent to numactl, and when NUMA is enabled the only memory allocation policy is “local”. For applications that need interleaved memory mapping across the nodes of a multi-socket machine, NUMA has to be disabled.
NUMA allows faster local memory access, but remote memory access carries a performance penalty. In this article we described how to check the NUMA settings on a system and presented two ways to interact with NUMA: pinning a process to a specific NUMA node to minimize remote memory access penalties, and employing interleaved memory mapping for an application that is not NUMA-aware. Intel® MKL users are encouraged to experiment with the different NUMA usage models to find the optimal settings for their applications.