Process and Thread Affinity for Intel® Xeon Phi™ Processors

The Intel® MPI Library and OpenMP* runtime libraries can create affinities between processes or threads, and hardware resources. This affinity keeps an MPI process or OpenMP thread from migrating to a different hardware resource, which can have a dramatic effect on the execution speed of a program.

Hardware Threading

The Intel® Xeon Phi™ processor (code-named Knights Landing) supports up to four hardware thread contexts per core. Two cores that share a single level 2 (L2) cache make up one tile, as shown in Figure 1.


Figure 1: An Intel® Xeon Phi™ processor x200 tile has two cores, four vector processing units, a 1 MB L2 cache shared by the two cores, and a cache home agent.

Additional hardware threads help hide latencies: while one hardware thread is stalled, the core can execute another. The optimal number of hardware threads to use per core or per tile depends on the application; some applications may benefit from executing only one thread per tile. All examples in this paper assume an Intel Xeon Phi processor with 68 cores arranged in 34 tiles.

OpenMP Thread Affinity

OpenMP separates allocating hardware resources from pinning threads to those resources.

Intel compilers support both the OpenMP 4 affinity settings (as of version 13.0) and the Intel OpenMP runtime extensions. The following table lists the settings used to allocate hardware resources and to pin OpenMP threads to those resources.

 

                                         OpenMP* 4 Affinity   Intel OpenMP Runtime Extensions
Allocate hardware threads                OMP_PLACES           KMP_PLACE_THREADS
Pin OpenMP threads to hardware threads   OMP_PROC_BIND        KMP_AFFINITY

Thread Affinity Using Intel OpenMP Runtime Extensions

KMP_PLACE_THREADS controls allocation of hardware resources. An OpenMP application may be assigned a number of cores and a number of threads per core. The letter C indicates cores, and T indicates threads. For example, 68c,4t specifies four threads per core on 68 cores, and 34c,2t specifies two threads per core on 34 cores.

KMP_AFFINITY controls how OpenMP threads are bound to resources. Common choices are COMPACT, SCATTER, and BALANCED. The granularity can be set to CORE or THREAD. The affinity choices are illustrated in Figure 2, Figure 3, and Figure 4.


Figure 2: KMP_AFFINITY=compact


Figure 3: KMP_AFFINITY=balanced


Figure 4: KMP_AFFINITY=scatter

A full explanation of KMP_PLACE_THREADS and KMP_AFFINITY is available in the Thread Affinity Interface section of the Intel compiler documentation.

The following examples demonstrate how to run an OpenMP application with a specific number of threads per tile or per core on an Intel Xeon Phi processor x200, using the Intel OpenMP runtime extensions on a Linux* system.

1 thread per tile

KMP_AFFINITY=proclist=[0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66],explicit

1 thread per core

KMP_PLACE_THREADS=1T KMP_AFFINITY=compact

2 threads per core

KMP_PLACE_THREADS=2T KMP_AFFINITY=compact

3 threads per core

KMP_PLACE_THREADS=3T KMP_AFFINITY=compact

4 threads per core

KMP_PLACE_THREADS=4T KMP_AFFINITY=compact

Tip: Use the KMP_AFFINITY VERBOSE modifier to see how threads are mapped to OS processors. This modifier also shows how the OS processors map to physical cores.
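As an additional check, the following small C program (a sketch that is not part of the original examples; the output format is arbitrary) prints the OS processor each OpenMP thread runs on, so the mapping reported by the VERBOSE modifier can be confirmed from inside an application. Compile it with, for example, icc -qopenmp and run it with any of the settings above.

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>   /* sched_getcpu() */
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* Each OpenMP thread reports the OS processor it is currently running on. */
        printf("OpenMP thread %3d of %3d on OS processor %3d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}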

 

The same settings work when undersubscribing the cores.

1 thread per tile

KMP_AFFINITY=proclist=[0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66],explicit OMP_NUM_THREADS=4

1 thread per core

KMP_PLACE_THREADS=1T KMP_AFFINITY=compact OMP_NUM_THREADS=8

2 threads per core

KMP_PLACE_THREADS=2T KMP_AFFINITY=compact OMP_NUM_THREADS=16

3 threads per core

KMP_PLACE_THREADS=3T KMP_AFFINITY=compact OMP_NUM_THREADS=24

4 threads per core

KMP_PLACE_THREADS=4T KMP_AFFINITY=compact OMP_NUM_THREADS=32

Thread Affinity Using OpenMP 4 Affinity

Version 4 of the OpenMP standard introduced affinity settings controlled by the OMP_PLACES and OMP_PROC_BIND environment variables. OMP_PLACES specifies hardware resources. The value can be either an abstract name describing a list of places or (less commonly) an explicit list of places; the abstract names used here are CORES and THREADS. OMP_PROC_BIND controls how OpenMP threads are bound to those resources. Common values for OMP_PROC_BIND include CLOSE and SPREAD.
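As an illustration of the explicit form (the OS processor numbers below are arbitrary and serve only as an example), each place is written as a brace-enclosed set of OS processor numbers:

OMP_PLACES="{0,1,2,3},{4,5,6,7}"

This defines two places of four hardware threads each; the abstract names are usually more convenient and are used in the examples that follow.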

The following examples show how to run an OpenMP threaded application with one thread per tile, or with one to four hardware threads per core, using OpenMP 4 affinity.

1 thread per tile

OMP_PROC_BIND=spread OMP_PLACES=threads OMP_NUM_THREADS=34

1 thread per core

OMP_PROC_BIND=spread OMP_PLACES=threads OMP_NUM_THREADS=68

2 threads per core

OMP_PROC_BIND=spread OMP_PLACES=threads OMP_NUM_THREADS=136

3 threads per core

OMP_PROC_BIND=spread OMP_PLACES=threads OMP_NUM_THREADS=204

4 threads per core

OMP_PROC_BIND=close OMP_PLACES=threads

 

These examples show how to undersubscribe the cores using OpenMP 4 affinity.

1 thread per tile

OMP_PROC_BIND=spread OMP_PLACES="threads(32)" OMP_NUM_THREADS=4

1 thread per core

OMP_PROC_BIND=spread OMP_PLACES="threads(32)" OMP_NUM_THREADS=8

2 threads per core

OMP_PROC_BIND=close OMP_PLACES="cores(8)" OMP_NUM_THREADS=16

3 threads per core

OMP_PROC_BIND=close OMP_PLACES="cores(8)" OMP_NUM_THREADS=24

4 threads per core

OMP_PROC_BIND=close OMP_PLACES=threads OMP_NUM_THREADS=32

Nested Thread Affinity Using OpenMP 4 Affinity Settings

When an application has more than one level of OpenMP threading, additional per-level values are specified for OMP_NUM_THREADS and OMP_PROC_BIND. The following example executes nested threads using one hardware thread per core. For additional hardware threads, increase the second value of OMP_NUM_THREADS to 4, 6, or 8.

OMP_NESTED=1

OMP_MAX_ACTIVE_LEVELS=2

KMP_HOT_TEAMS_MODE=1

KMP_HOT_TEAMS_MAX_LEVEL=2

OMP_NUM_THREADS=34,2

OMP_PROC_BIND=spread,spread

OMP_PLACES=cores
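The settings above correspond to a program with two nested parallel regions. The following minimal C sketch (illustrative only; the printed output is arbitrary) shows that shape: with OMP_NUM_THREADS=34,2 the outer region runs 34 threads, one per tile, and each outer thread creates an inner team of 2, one thread per core.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* Outer parallel region: first value of OMP_NUM_THREADS (34 above). */
    #pragma omp parallel
    {
        int outer = omp_get_thread_num();

        /* Inner parallel region: second value of OMP_NUM_THREADS (2 above). */
        #pragma omp parallel
        {
            printf("outer thread %2d, inner thread %d, nesting level %d\n",
                   outer, omp_get_thread_num(), omp_get_level());
        }
    }
    return 0;
}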

MPI Library Affinity

MPI library affinity is controlled by the environment variable I_MPI_PIN_PROCESSOR_LIST. The value may be an explicit list of logical processors or a processor set defined by keywords. Common keywords include ALL, ALLCORES, GRAIN, and SHIFT.

  • ALL specifies all logical processors, including the hardware threads.
  • ALLCORES specifies the physical cores.
  • GRAIN specifies the pinning granularity.
  • SHIFT specifies the granularity of the round-robin scheduling in GRAIN units.

The following are examples of how to run an MPI executable with one rank per tile, and one, two, or four ranks per core.

1 rank per tile

mpirun -perhost 34 -env I_MPI_PIN_PROCESSOR_LIST all:shift=cache2

1 rank per core

mpirun -perhost 68 -env I_MPI_PIN_PROCESSOR_LIST allcores

2 ranks per core

mpirun -perhost 136 -env I_MPI_PIN_PROCESSOR_LIST all:grain=2,shift=2

4 ranks per core

mpirun -perhost 272 -env I_MPI_PIN_PROCESSOR_LIST all

Tips:

  • Set I_MPI_DEBUG to 4 or higher to see how ranks are mapped to OS processors.
  • The Intel MPI cpuinfo utility shows how the OS processors map to physical cores and caches.

Intel® MPI Library Interoperability with OpenMP

Intel® MPI and OpenMP affinity settings may be combined for hybrid execution. When all cores are used, specifying one to four hardware threads per core is straightforward, as the following examples show, using either the Intel OpenMP runtime extensions or OpenMP 4 affinity.

Intel MPI/OpenMP affinity examples using Intel OpenMP runtime extensions, for Intel Xeon Phi processors:

1 thread per core

mpirun -env KMP_PLACE_THREADS 1T -env KMP_AFFINITY compact

2 threads per core

mpirun -env KMP_PLACE_THREADS 2T -env KMP_AFFINITY compact

3 threads per core

mpirun -env KMP_PLACE_THREADS 3T -env KMP_AFFINITY compact

4 threads per core

mpirun -env KMP_PLACE_THREADS 4T -env KMP_AFFINITY compact

 

Intel MPI/OpenMP affinity examples using OpenMP 4 affinity, for Intel Xeon Phi processors:

1 thread per tile

mpirun -env OMP_PROC_BIND spread -env OMP_PLACES threads -env OMP_NUM_THREADS 8

1 thread per core

mpirun -env OMP_PROC_BIND spread -env OMP_PLACES threads -env OMP_NUM_THREADS 17

2 threads per core

mpirun -env OMP_PROC_BIND spread -env OMP_PLACES threads -env OMP_NUM_THREADS 34

4 threads per core

mpirun -env OMP_PROC_BIND close -env OMP_PLACES threads

 

Intel MPI also provides an environment variable, I_MPI_PIN_DOMAIN, for use with executables that launch both MPI ranks and OpenMP threads. The variable defines a number of non-overlapping subsets (domains) of logical processors and binds one MPI rank to each domain. An explicit domain binding is especially useful for undersubscribing the cores. The following examples run a hybrid MPI/OpenMP executable on fewer than the 68 cores of the Intel Xeon Phi processor.

1 thread per core on 2 quadrants

mpirun -perhost 2 -env I_MPI_PIN_DOMAIN 68 -env KMP_PLACE_THREADS 1T -env KMP_AFFINITY compact

12 ranks, 1 rank per tile, 2 threads per core

mpirun -perhost 12 -env I_MPI_PIN_DOMAIN 8 -env KMP_PLACE_THREADS 2T -env KMP_AFFINITY compact

Tip: When I_MPI_PIN_DOMAIN is set, I_MPI_PIN_PROCESSOR_LIST is ignored.
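To verify that domains and thread placement behave as intended, a small hybrid C program such as the following sketch (not part of the original article; the output format is arbitrary) can report the rank, thread, and OS processor for every thread. Compile it with, for example, mpiicc -qopenmp and launch it with the mpirun settings shown above.

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>   /* sched_getcpu() */
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* MPI_THREAD_FUNNELED is sufficient because only the main thread calls MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* Every thread of every rank reports where it is pinned. */
        printf("rank %3d, thread %3d, OS processor %3d\n",
               rank, omp_get_thread_num(), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}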

Future

The Intel MPI library and Intel OpenMP runtime extensions will be extended in 2016 and 2017 to simplify placing processes and threads on Intel Xeon Phi processor x200 NUMA domains.

Conclusion

The Intel MPI Library and the OpenMP runtimes provide mechanisms to bind MPI ranks and OpenMP threads to specific processors. Our examples showed how to experiment with different core and hardware thread configurations on the Intel® Xeon Phi™ processor x200 (code-named Knights Landing). Following the examples, we can discover whether an application performs best using one to four hardware threads per core, and we can look for optimal combinations of MPI ranks and OpenMP threads.

More Information

Intel® Fortran Compiler User and Reference Guide, Thread Affinity Interface, https://software.intel.com/en-us/compiler_15.0_ug_f

Intel® C++ Compiler User and Reference Guide, Thread Affinity Interface, https://software.intel.com/en-us/compiler_15.0_ug_c

OpenMP* 4.0 Complete Specifications, http://openmp.org

Intel® MPI Library Developer Reference for Linux* OS, Process Pinning, https://software.intel.com/en-us/intel-mpi-library/documentation

Intel® MPI Library Developer Reference for Linux* OS, Interoperability with OpenMP* API, https://software.intel.com/en-us/mpi-refman-lin-html

Using Nested Parallelism In OpenMP, https://software.intel.com/en-us/videos/using-nested-parallelism-in-openmp

Beginning Hybrid MPI/OpenMP Development, https://software.intel.com/en-us/articles/beginning-hybrid-mpiopenmp-development


Comments

Gregg S. (Intel):

The 2017 Intel MPI Library brings us I_MPI_PIN_DOMAIN=numa, which matches MPI domains to SNC2 or SNC4 NUMA domains.  (Note this only works when the number of MPI ranks matches the number of NUMA domains.)

mpirun -n 4 -env I_MPI_PIN_DOMAIN numa

[0] MPI startup(): Rank    Pid      Node name  Pin cpu

[0] MPI startup(): 0       28501    knl     {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221}     

[0] MPI startup(): 1       28504    knl {18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239}                                                                             

[0] MPI startup(): 2       28507    knl {36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255}             

[0] MPI startup(): 3       28510    knl {52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271}

NUMA node0 CPU(s):     0-17,68-85,136-153,204-221

NUMA node1 CPU(s):     18-35,86-103,154-171,222-239

NUMA node2 CPU(s):     36-51,104-119,172-187,240-255

NUMA node3 CPU(s):     52-67,120-135,188-203,256-271

Gregg S. (Intel):

Recently I was asked how an application can switch from MPI parallelism on all cores to OpenMP parallelism on all cores, and back.

Suppose an application wants to run OpenMP threads from MPI rank 0, pinning MPI ranks and rank 0 threads to the same cores.  KMP_AFFINITY=norespect is the solution, allowing OpenMP to ignore the affinity mask set by MPI.

1 hardware thread per core

mpirun -perhost 68 -env I_MPI_PIN_PROCESSOR_LIST allcores -env KMP_PLACE_THREADS 1T -env KMP_AFFINITY norespect,compact

2 hardware threads per core

mpirun -perhost 136 -env I_MPI_PIN_PROCESSOR_LIST all:grain=2,shift=2 -env KMP_PLACE_THREADS 2T -env KMP_AFFINITY norespect,compact

4 hardware threads per core

mpirun -perhost 272 -env I_MPI_PIN_PROCESSOR_LIST all -env KMP_PLACE_THREADS 4T -env KMP_AFFINITY norespect,compact

Gregg S. (Intel):

These commands come directly from my actual work on systems with Intel Xeon Phi x200 processors. For example, here is my full command for launching Sandia National Laboratories' MiniFE on a cluster with Intel Xeon Phi 7250 processors (which have 68 cores): mpirun -hosts $hosts -perhost 34 -env KMP_PLACE_THREADS 2C,2T -env KMP_AFFINITY compact ./miniFE.x

We can cover more complex situations here in the comments.  I'll add a couple examples now for (1) sub-NUMA clustering, and (2) MPI parallelism and OpenMP parallelism, each taking turns using all cores.

This article is too theoretical and generic. When it comes to practical applications, the two sections about OpenMP and MPI thread affinity settings are not useful.
