Optimizing Software Applications for NUMA: Part 4 (of 7)

3. Strategies for NUMA Optimization

Two key notions in managing performance within the NUMA shared memory architecture are processor affinity and data placement.

3.1. Processor Affinity

Affinity refers to the persistence of association with a particular resource instance, despite the availability of another instance for the same purpose. Consider the case of processor affinity. Today’s complex operating systems assign application threads to processor cores using a scheduler. A scheduler will take into account system state and various policy objectives (e.g., “balance load across cores” or “aggregate threads on a few cores and put remaining cores to sleep”) and match application threads to physical cores accordingly. A given thread will execute on its assigned core for some period of time and then wait as other threads are given the chance to execute. If another core becomes available, the scheduler may choose to migrate the thread to ensure timely execution and meet its policy objectives.

Thread migration from one core to another poses a problem for the NUMA shared memory architecture because of the way it disassociates a thread from its local memory allocations. That is, a thread may allocate memory on node 1 at startup as it runs on a core within the node 1 package. But when the thread is later migrated to a core on node 2, the data stored earlier becomes remote and memory access time significantly increases.
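The node a thread currently occupies can be queried on Linux with the `getcpu` system call. The sketch below is a minimal Linux-only illustration (it assumes the default first-touch placement policy, under which pages are placed on the node of the thread that first writes them); the function name `current_node` is ours, not a system API.

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

/* Report the NUMA node the calling thread is currently running on.
   Under Linux's default first-touch policy, pages this thread allocates
   and initializes now are placed on this node; if the scheduler later
   migrates the thread to a core on another node, those pages become
   remote and accesses to them slow down. */
int current_node(void) {
    unsigned cpu = 0, node = 0;
    if (syscall(SYS_getcpu, &cpu, &node, NULL) != 0)
        return -1;  /* getcpu not available */
    return (int)node;
}
```

Comparing the value returned before and after a blocking wait is a simple way to observe whether the scheduler has moved a thread across nodes.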

Enter processor affinity. Using a system API, or by modifying an OS data structure (e.g., an affinity mask), a specific core or set of cores can be associated with an application thread. The scheduler will then observe this affinity in its scheduling decisions for the lifetime of the thread. For example, a thread may be configured to run only on cores 0 through 3, all of which belong to quad-core CPU package 0. Henceforth, the scheduler will choose among these alternatives without migrating the thread to another package.
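On Linux, the system API in question is `sched_setaffinity`. The following sketch pins the calling thread to cores 0 through `ncores - 1` and reads the mask back to confirm it; the assumption that cores 0-3 form one quad-core package is for illustration only, and the function name `pin_to_package` is ours.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Restrict the calling thread to cores [0, ncores) and return the number
   of cores actually granted, or -1 on error. The kernel intersects the
   requested mask with the set of online cores, so the result may be
   smaller than ncores on a machine with fewer cores. */
int pin_to_package(int ncores) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    for (int cpu = 0; cpu < ncores; cpu++)
        CPU_SET(cpu, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)  /* 0 = calling thread */
        return -1;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    return CPU_COUNT(&mask);
}
```

After this call, the scheduler will only ever place the thread on one of the granted cores, so its memory allocations stay local to that package.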

Exercising processor affinity ensures that memory allocations remain local to the thread(s) that need them. Several downsides, however, should be noted. In general, processor affinity may significantly harm system performance by restricting scheduler options and creating resource contention where better resource management would otherwise have been possible. For example, affinity restrictions may prevent the scheduler from assigning waiting threads to unutilized cores during a particular interval. Or, low-priority threads may adversely impact high-priority threads due to affinity restrictions that prevent adjustments through the use of additional cores. Processor affinity restrictions may even hurt the application itself when additional execution time on another node would have more than compensated for the slower memory access time.

Such downsides imply the need to think carefully about whether processor affinity solutions are right for a particular application and shared system context. Note, finally, that processor affinity APIs offered by some systems support priority “hints” and affinity “suggestions” to the scheduler in addition to explicit directives. Such suggestions may secure near-optimal performance in the common case yet avoid constraining scheduling options during periods of high resource contention.
