Optimizing Software Applications for NUMA: Part 5 (of 7)

3.2. Data Placement Using Implicit Memory Allocation Policies

In the simplest case, many operating systems transparently provide support for NUMA-friendly data placement. When a single-threaded application allocates memory, the operating system assigns memory pages to the physical memory of the requesting thread's node (CPU package), ensuring that the data is local to the thread and that access performance is optimal.
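Placement can be verified at runtime. The following is a minimal sketch, assuming Linux with libnuma (link with -lnuma): it touches a freshly allocated buffer, then asks the kernel, via get_mempolicy(), which node backs the first page.

#define _GNU_SOURCE
#include <numa.h>
#include <numaif.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    size_t size = 4 * 1024 * 1024;
    char *buf = malloc(size);
    memset(buf, 0, size);   /* touch the pages so they are committed */

    int node = -1;
    /* Ask which node backs the first page of the buffer. */
    get_mempolicy(&node, NULL, 0, buf, MPOL_F_NODE | MPOL_F_ADDR);
    printf("first page is on node %d; thread runs on node %d\n",
           node, numa_node_of_cpu(sched_getcpu()));

    free(buf);
    return 0;
}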

Alternatively, some operating systems defer the page assignment until the first memory access (often called a "first-touch" policy).[2] To understand the advantage here, consider a multi-threaded application whose start-up sequence includes memory allocations by a main control thread, followed by the creation of various worker threads, followed by a long period of application processing or service. While it may seem reasonable to place memory pages local to the requesting thread, they are in fact more effectively placed local to the worker threads that will access the data. For this reason, the operating system observes the first access to each page and commits the page assignment based on the accessing thread's node.
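Under a first-touch policy, the allocation itself can stay in the control thread as long as the first write to each region comes from its eventual consumer. A sketch under that assumption (Linux with libnuma and pthreads; link with -lnuma -lpthread); the node pinning and slice sizes are illustrative:

#include <numa.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NWORKERS 4
#define SLICE    (16 * 1024 * 1024)

static char *data;   /* one shared array, NWORKERS * SLICE bytes */

static void *worker(void *arg)
{
    long id = (long)arg;
    /* Pin this worker to a node (illustrative round-robin). */
    numa_run_on_node((int)(id % (numa_max_node() + 1)));
    /* First touch: initializing our slice commits its pages locally. */
    memset(data + id * SLICE, 0, SLICE);
    /* ... long-running processing on this slice ... */
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];
    long i;

    if (numa_available() < 0)
        return 1;

    data = malloc((size_t)NWORKERS * SLICE);  /* no pages committed yet */
    for (i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);

    free(data);
    return 0;
}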

These two policies together illustrate the importance of an application programmer being aware of the NUMA context of the program's deployment. If the page placement policy is based on first access, the programmer can exploit this by including a carefully designed data access sequence at startup that gives the operating system "hints" about optimal memory placement. If the page placement policy is based on requester location, the programmer should ensure that memory allocations are made by the thread that will subsequently access the data, and not by an initialization or control thread acting as a provisioning agent.
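Under a requester-location policy, the fix is to move the allocation itself into the worker. A minimal sketch with pthreads (the thread count and buffer size are arbitrary):

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NWORKERS 4

static void *worker(void *arg)
{
    (void)arg;
    size_t size = 16 * 1024 * 1024;
    /* Allocate in the thread that will use the data: under a
     * requester-location policy the pages land on this node. */
    char *local = malloc(size);
    memset(local, 0, size);
    /* ... long-running processing on 'local' ... */
    free(local);
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}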

Multiple threads that access the same data are best co-located on the same node, so that the memory allocations of one thread, placed local to that node, benefit them all. The same consideration applies to prefetching schemes, which improve application performance by generating data requests in advance of actual need: such threads must place data local to the actual consumer threads for the NUMA architecture to provide its characteristic performance speedup.
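When the prefetching thread cannot run on the consumers' node, placement can be requested explicitly. A sketch using libnuma's numa_alloc_onnode(); the function name prefetch_block and the consumer_node parameter are hypothetical:

#include <numa.h>
#include <string.h>

/* 'prefetch_block' and 'consumer_node' are hypothetical names. */
char *prefetch_block(size_t size, int consumer_node)
{
    /* Place the pages on the consumers' node, not the caller's. */
    char *buf = numa_alloc_onnode(size, consumer_node);
    if (buf != NULL)
        memset(buf, 0, size);  /* fill the data ahead of actual need */
    return buf;                /* caller releases with numa_free(buf, size) */
}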

It should be noted that once the physical memory of a node is fully consumed, memory requests from threads on that node will typically be fulfilled by sub-optimal allocations on a remote node. The implication for memory-hungry applications is to correctly size the memory needs of each thread and to ensure local placement with respect to the accessing thread.
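On Linux, libnuma can report how much physical memory remains on a node before a large node-local allocation is attempted. A sketch (the helper node_has_room is hypothetical):

#include <numa.h>
#include <stdio.h>

/* 'node_has_room' is a hypothetical helper name. */
int node_has_room(int node, long long needed)
{
    long long free_bytes = 0;
    long long total = numa_node_size64(node, &free_bytes);
    printf("node %d: %lld of %lld bytes free\n", node, free_bytes, total);
    /* If this fails, local allocations may spill to a remote node. */
    return free_bytes >= needed;
}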

For situations where a large number of threads will randomly share the same pool of data from all nodes, the recommendation is to stripe the data evenly across all nodes. Doing so spreads the memory access load and avoids bottleneck access patterns on a single node within the system.[3]
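Linux exposes this directly through an interleaved allocation policy. A sketch using libnuma's numa_alloc_interleaved() (the helper name is hypothetical):

#include <numa.h>
#include <stddef.h>

/* 'alloc_shared_pool' is a hypothetical helper name. */
void *alloc_shared_pool(size_t size)
{
    /* Pages are distributed round-robin across all allowed nodes,
     * spreading the access load for randomly shared data. */
    return numa_alloc_interleaved(size);  /* release with numa_free(ptr, size) */
}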

References:
[2] Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 8.8, "Affinities and Managing Shared Platform Resources," March 2009.
[3] Lameter, Christoph. "Local and Remote Memory: Memory in a Linux/NUMA System," June 2006.
