Best known methods and information related to using NUMA with the Intel® Math Kernel Library
NUMA
Getting Acquainted with NUMA
During the Acceler8 contest there were discussions of vectorization, loop unrolling, and various micro-optimizations. But many people, like me, probably ran into the same problem: whatever optimizations they applied, the program sped up on a computer with a small number of cores, while on the MTL its speed only kept falling. It then became clear that the load on RAM had to be reduced somehow, otherwise no progress was possible.
Optimizing Software Applications for NUMA: Part 6 (of 7)
3.3 Data Placement Using Explicit Memory Allocation Directives
Another approach to data placement in NUMA-based systems is to make use of system APIs that explicitly configure the location of memory page allocations. An example of such APIs is the libnuma library for Linux.[1]
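As a minimal sketch of explicit placement, the snippet below calls libnuma from Python through ctypes. The libnuma functions used (numa_available, numa_max_node, numa_alloc_onnode, numa_free) are part of the library; the wrapper name alloc_on_node and its return convention are assumptions made for this illustration, and the code simply returns None on systems without libnuma or NUMA support.

```python
import ctypes
import ctypes.util


def alloc_on_node(size, node):
    """Try to allocate `size` bytes on NUMA node `node` via libnuma.

    Hypothetical helper: returns (lib, address) on success, or None when
    libnuma, NUMA support, or the requested node is not available.
    """
    path = ctypes.util.find_library("numa")
    if path is None:
        return None  # libnuma is not installed
    try:
        lib = ctypes.CDLL(path)
    except OSError:
        return None

    # Declare the C signatures so that pointers are not truncated to int.
    lib.numa_available.restype = ctypes.c_int
    lib.numa_max_node.restype = ctypes.c_int
    lib.numa_alloc_onnode.restype = ctypes.c_void_p
    lib.numa_alloc_onnode.argtypes = [ctypes.c_size_t, ctypes.c_int]
    lib.numa_free.restype = None
    lib.numa_free.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

    if lib.numa_available() < 0:
        return None  # the kernel or hardware has no NUMA support
    if node < 0 or node > lib.numa_max_node():
        return None  # no such node on this machine
    ptr = lib.numa_alloc_onnode(size, node)
    if not ptr:
        return None
    return lib, ptr


if __name__ == "__main__":
    size = 1 << 20
    result = alloc_on_node(size, 0)
    if result is None:
        print("libnuma or NUMA support not available")
    else:
        lib, ptr = result
        ctypes.memset(ptr, 0, size)  # touch the pages so they are backed
        lib.numa_free(ptr, size)
        print("allocated and freed 1 MiB on node 0")
```

Memory obtained this way must be released with numa_free and the same size, not with the ordinary free.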
Optimizing Software Applications for NUMA: Part 5 (of 7)
3.2. Data Placement Using Implicit Memory Allocation Policies
In the simple case, many operating systems transparently provide support for NUMA-friendly data placement. When a single-threaded application allocates memory, the operating system will simply assign memory pages to the physical memory associated with the requesting thread's node (CPU package), thus ensuring that it is local to the thread and access performance is optimal.
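The implicit policy can be illustrated with a small sketch, assuming Linux with its default local-allocation behavior, where a page is placed on the node of the thread that first writes to it. The helper names parse_cpulist, node_cpus, and allocate_local are hypothetical; the sysfs path is the standard Linux location for per-node CPU lists, and the code falls back to a plain allocation where that information is missing.

```python
import mmap
import os


def parse_cpulist(text):
    """Expand a sysfs CPU list such as '0-3,8-11' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def node_cpus(node):
    """Return the CPUs that belong to a NUMA node, or an empty set."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    try:
        with open(path) as f:
            return parse_cpulist(f.read())
    except OSError:
        return set()  # no sysfs NUMA information on this system


def allocate_local(size, node=0):
    """Pin the caller to `node`, then allocate and first-touch a buffer."""
    if hasattr(os, "sched_setaffinity"):
        usable = node_cpus(node) & os.sched_getaffinity(0)
        if usable:
            os.sched_setaffinity(0, usable)
    buf = bytearray(size)
    # Write one byte per page from this thread, so that under a
    # first-touch policy the pages are placed on the thread's node.
    for offset in range(0, size, mmap.PAGESIZE):
        buf[offset] = 0
    return buf
```

The point of the sketch is the ordering: the thread is placed first and the memory is touched afterwards, so no explicit placement call is needed.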
Optimizing Software Applications for NUMA: Part 4 (of 7)
3. Strategies for NUMA Optimization
Two key notions in managing performance within the NUMA shared memory architecture are processor affinity and data placement.
3.1. Processor Affinity
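Processor affinity means restricting a thread to a chosen set of CPUs so that the scheduler does not migrate it away from them. A minimal sketch on Linux, using Python's os.sched_setaffinity; the helper name run_pinned is an assumption for illustration, and on platforms without an affinity API it just runs the function unpinned.

```python
import os


def run_pinned(cpu, func, *args):
    """Run func(*args) with the calling thread restricted to one CPU.

    The previous affinity mask is restored afterwards.
    """
    if not hasattr(os, "sched_setaffinity"):
        return func(*args)  # no affinity API on this platform
    previous = os.sched_getaffinity(0)
    os.sched_setaffinity(0, {cpu})
    try:
        return func(*args)
    finally:
        os.sched_setaffinity(0, previous)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        first = min(os.sched_getaffinity(0))  # a CPU we are allowed to use
    else:
        first = 0
    print(run_pinned(first, sum, range(1_000_000)))
```

Restoring the old mask keeps the pinning local to the work that benefits from it.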
Optimizing Software Applications for NUMA: Part 3 (of 7)
2. NUMA Advantages and Risks
Optimizing Software Applications for NUMA: Part 1 (of 7)
1. The Basics of NUMA
NUMA, or Non-Uniform Memory Access, is a shared memory architecture that describes the placement of main memory modules with respect to processors in a multiprocessor system. Perhaps the best way to understand NUMA is to compare it with its cousin UMA, or Uniform Memory Access.
In the UMA memory architecture, all processors access shared memory through a bus (or another type of interconnect) as seen in the following diagram:

