Interoperability with OpenMP API

I_MPI_PIN_DOMAIN

Intel® MPI Library provides an additional environment variable to control process pinning for hybrid MPI/OpenMP* applications. This environment variable defines a number of non-overlapping subsets (domains) of logical processors on a node, and a set of rules for binding MPI processes to these domains: one MPI process per domain. See Figure 3.2-1 below.

Each MPI process can create a number of child threads that run within the corresponding domain. The process threads can freely migrate from one logical processor to another within that domain.

If the I_MPI_PIN_DOMAIN environment variable is defined, then the I_MPI_PIN_PROCESSOR_LIST environment variable setting is ignored.

If the I_MPI_PIN_DOMAIN environment variable is not defined, then MPI processes are pinned according to the current value of the I_MPI_PIN_PROCESSOR_LIST environment variable.

The I_MPI_PIN_DOMAIN environment variable has the following syntax forms:

  • Domain description through multi-core terms <mc-shape>

  • Domain description through domain size and domain member layout <size>[:<layout>]

  • Explicit domain description through bit mask <masklist>

The following tables describe these syntax forms.

Multi-core Shape

I_MPI_PIN_DOMAIN=<mc-shape>

Figure 3.2-1 Domain Example

<mc-shape>

Define domains through multi-core terms.

core

Each domain consists of the logical processors that share a particular core. The number of domains on a node is equal to the number of cores on the node.

socket | sock

Each domain consists of the logical processors that share a particular socket. The number of domains on a node is equal to the number of sockets on the node. This is the recommended value.

numa

Each domain consists of the logical processors that share a particular NUMA node. The number of domains on a machine is equal to the number of NUMA nodes on the machine.

node

All logical processors on a node are arranged into a single domain.

cache1

Logical processors that share a particular level 1 cache are arranged into a single domain.

cache2

Logical processors that share a particular level 2 cache are arranged into a single domain.

cache3

Logical processors that share a particular level 3 cache are arranged into a single domain.

cache

The largest domain among cache1, cache2, and cache3 is selected.

Note

If Cluster on Die is disabled on a machine, the number of NUMA nodes equals the number of sockets. In this case, pinning for I_MPI_PIN_DOMAIN=numa is equivalent to pinning for I_MPI_PIN_DOMAIN=socket.

Explicit Shape

I_MPI_PIN_DOMAIN=<size>[:<layout>]

<size>

Define a number of logical processors in each domain (domain size)

omp

The domain size is equal to the OMP_NUM_THREADS environment variable value. If the OMP_NUM_THREADS environment variable is not set, each node is treated as a separate domain.

auto

The domain size is defined by the formula size=#cpu/#proc, where #cpu is the number of logical processors on a node, and #proc is the number of the MPI processes started on a node

<n>

The domain size is defined by a positive decimal number <n>

<layout>

Ordering of domain members. The default value is compact

platform

Domain members are ordered according to their BIOS numbering (platform-dependent numbering)

compact

Domain members are located as close to each other as possible in terms of common resources (cores, caches, sockets, and so on). This is the default value

scatter

Domain members are located as far away from each other as possible in terms of common resources (cores, caches, sockets, and so on)
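The three <size> forms above can be modeled with a short Python sketch. The function name resolve_domain_size and its parameters are hypothetical and not part of Intel MPI Library; the use of integer division for auto is an assumption, since the formula above does not say how a non-integer quotient is handled.

```python
def resolve_domain_size(size_spec, n_cpu, n_proc, omp_num_threads=None):
    """Return the number of logical processors per domain.

    size_spec:       "omp", "auto", or a positive decimal string such as "4"
    n_cpu:           logical processors on the node (#cpu)
    n_proc:          MPI processes started on the node (#proc)
    omp_num_threads: value of OMP_NUM_THREADS, or None if it is not set
    """
    if size_spec == "omp":
        # With OMP_NUM_THREADS unset, the whole node is treated as one domain.
        if omp_num_threads is None:
            return n_cpu
        return int(omp_num_threads)
    if size_spec == "auto":
        # size = #cpu / #proc (integer division is an assumption of this sketch).
        return n_cpu // n_proc
    size = int(size_spec)
    if size <= 0:
        raise ValueError("domain size must be a positive decimal number")
    return size


print(resolve_domain_size("auto", n_cpu=8, n_proc=4))
```

For a node with 8 logical processors and 4 MPI processes, auto resolves to a domain size of 2, which matches the auto:scatter example later in this topic.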

Explicit Domain Mask

I_MPI_PIN_DOMAIN=<masklist>

<masklist>

Define domains through the comma separated list of hexadecimal numbers (domain masks)

[m1,...,mn]

For <masklist>, each mi is a hexadecimal bit mask defining an individual domain. The following rule is used: the ith logical processor is included in the domain if the ith bit of the mask is set to 1. All remaining processors are put into a separate domain. BIOS numbering is used.

Note

To ensure that your configuration in <masklist> is parsed correctly, use square brackets to enclose the domains specified by the <masklist>. For example: I_MPI_PIN_DOMAIN=[0x55,0xaa]
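To check a <masklist> before using it, the bit rule can be reproduced in a few lines of Python. This is only an illustrative model: parse_masklist is a hypothetical helper, and the number of logical processors on the node is passed in by hand rather than detected.

```python
def parse_masklist(masklist, n_cpu):
    """Expand a bracketed list of hexadecimal masks into lists of logical processors."""
    masks = [int(m, 16) for m in masklist.strip("[]").split(",")]
    domains = []
    covered = 0
    for mask in masks:
        # Logical processor i belongs to the domain if bit i of the mask is 1.
        domains.append([i for i in range(n_cpu) if (mask >> i) & 1])
        covered |= mask
    # Processors not named by any mask form one additional domain.
    rest = [i for i in range(n_cpu) if not (covered >> i) & 1]
    if rest:
        domains.append(rest)
    return domains


print(parse_masklist("[0x55,0xaa]", n_cpu=8))
# → [[0, 2, 4, 6], [1, 3, 5, 7]]
```

With a single mask such as [0x0f] on the same 8-processor node, the sketch returns a second domain holding the remaining processors 4 through 7.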

Note

These options are available for both Intel® and non-Intel microprocessors, but they may perform more optimizations for Intel microprocessors than for non-Intel microprocessors.

Note

To pin OpenMP* processes or threads inside the domain, use the corresponding OpenMP feature (for example, the KMP_AFFINITY environment variable for Intel® compilers).

Note

The following configurations have effectively the same result as not applying pinning:

  • I_MPI_PIN_DOMAIN=auto with a single process running on a node (for example, due to I_MPI_PERHOST=1)

  • I_MPI_PIN_DOMAIN=node

If you do not want the process to be migrated between sockets on a multi-socket platform, specify the domain size as I_MPI_PIN_DOMAIN=socket or smaller.

You can also use I_MPI_PIN_PROCESSOR_LIST, which produces a single-CPU process affinity mask for each rank (the affinity mask is expected to be adjusted automatically in the presence of IBA* HCA).

See the following model of a symmetric multiprocessing (SMP) node in the examples:

Figure 3.2-2 Model of a Node

The figure above represents the SMP node model with a total of 8 cores on 2 sockets. Intel® Hyper-Threading Technology is disabled. Core pairs of the same color share the L2 cache.

Figure 3.2-3 mpirun -n 2 -env I_MPI_PIN_DOMAIN socket ./a.out

In Figure 3.2-3, two domains are defined according to the number of sockets. Process rank 0 can migrate among all cores on socket 0. Process rank 1 can migrate among all cores on socket 1.

Figure 3.2-4 mpirun -n 4 -env I_MPI_PIN_DOMAIN cache2 ./a.out

In Figure 3.2-4, four domains are defined according to the number of shared L2 caches. Process rank 0 runs on cores {0,4} that share an L2 cache. Process rank 1 runs on cores {1,5} that share an L2 cache as well, and so on.
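The socket and cache2 examples both amount to grouping logical processors by a shared resource. The Python sketch below models that grouping for the 8-core node. The TOPOLOGY table and the domains_by function are hypothetical: the L2 pairing follows the {0,4}, {1,5} pattern described for Figure 3.2-4, while the assignment of cores to sockets is an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical model of the node: 8 cores, 2 sockets, Hyper-Threading disabled.
# cache2 pairs are {0,4}, {1,5}, {2,6}, {3,7}; the socket split is assumed.
TOPOLOGY = {
    cpu: {
        "node": 0,
        "socket": (cpu % 4) // 2,
        "cache2": cpu % 4,
        "core": cpu,
    }
    for cpu in range(8)
}


def domains_by(topology, shape):
    """Group logical processors that share the resource named by `shape`."""
    groups = defaultdict(list)
    for cpu in sorted(topology):
        groups[topology[cpu][shape]].append(cpu)
    return [groups[key] for key in sorted(groups)]


print(domains_by(TOPOLOGY, "cache2"))
# → [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Under the same model, grouping by socket yields two domains and grouping by node yields a single domain that contains all eight cores.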

Figure 3.2-5 mpirun -n 2 -env I_MPI_PIN_DOMAIN 4:platform ./a.out

In Figure 3.2-5, two domains with size=4 are defined. The first domain contains cores {0,1,2,3}, and the second domain contains cores {4,5,6,7}. Domain members (cores) have consecutive numbering as defined by the platform option.
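The platform layout in this example is a split of the BIOS-numbered processors into consecutive blocks. A minimal Python sketch follows; platform_domains is a hypothetical name, and letting the last block be shorter when the processor count is not a multiple of the size is an assumption, not documented behavior.

```python
def platform_domains(n_cpu, size):
    """Split logical processors 0..n_cpu-1 into consecutive domains of `size`."""
    return [
        list(range(start, min(start + size, n_cpu)))
        for start in range(0, n_cpu, size)
    ]


print(platform_domains(8, 4))
# → [[0, 1, 2, 3], [4, 5, 6, 7]]
```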

Figure 3.2-6 mpirun -n 4 -env I_MPI_PIN_DOMAIN auto:scatter ./a.out

In Figure 3.2-6, the domain size is 2 (8 CPUs divided by 4 processes) and the layout is scatter. Four domains {0,2}, {1,3}, {4,6}, {5,7} are defined. Domain members do not share any common resources.

Figure 3.2-7 setenv OMP_NUM_THREADS 2
mpirun -n 4 -env I_MPI_PIN_DOMAIN omp:platform ./a.out

In Figure 3.2-7, the domain size is 2 (defined by OMP_NUM_THREADS=2) and the layout is platform. Four domains {0,1}, {2,3}, {4,5}, {6,7} are defined. Domain members (cores) have consecutive numbering.
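When the launch is driven from a script rather than an interactive shell, the same hybrid configuration can be assembled in Python. This is a sketch only: hybrid_launch is a hypothetical helper, the mpirun arguments simply mirror the command shown for Figure 3.2-7, and the actual launch is left commented out so that the snippet runs without an MPI installation.

```python
import os


def hybrid_launch(n_ranks, omp_threads, program="./a.out"):
    """Return (argv, env) for a hybrid MPI/OpenMP run with omp-sized domains."""
    # OMP_NUM_THREADS is set in the launcher's environment, as setenv does above.
    env = dict(os.environ, OMP_NUM_THREADS=str(omp_threads))
    argv = [
        "mpirun", "-n", str(n_ranks),
        "-env", "I_MPI_PIN_DOMAIN", "omp:platform",
        program,
    ]
    return argv, env


argv, env = hybrid_launch(4, 2)
print(" ".join(argv))
# To launch for real: import subprocess; subprocess.run(argv, env=env, check=True)
```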

Figure 3.2-8 mpirun -n 2 -env I_MPI_PIN_DOMAIN [0x55,0xaa] ./a.out

In Figure 3.2-8 (the example for I_MPI_PIN_DOMAIN=<masklist>), the first domain is defined by the 0x55 mask. It contains all cores with even numbers {0,2,4,6}. The second domain is defined by the 0xaa mask. It contains all cores with odd numbers {1,3,5,7}.
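Going the other way, a <masklist> value can be built from the desired core sets by setting one bit per core. The helper name cores_to_masklist in the Python sketch below is hypothetical; the output uses the bracketed form recommended earlier in this topic.

```python
def cores_to_masklist(domains):
    """Build a bracketed hexadecimal mask list from lists of logical processors."""
    masks = []
    for cores in domains:
        mask = 0
        for cpu in cores:
            mask |= 1 << cpu  # bit i stands for logical processor i
        masks.append(hex(mask))
    return "[" + ",".join(masks) + "]"


print(cores_to_masklist([[0, 2, 4, 6], [1, 3, 5, 7]]))
# → [0x55,0xaa]
```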

I_MPI_PIN_ORDER

Set this environment variable to define the mapping order for MPI processes to domains as specified by the I_MPI_PIN_DOMAIN environment variable.

Syntax

I_MPI_PIN_ORDER=<order>

Arguments

<order>

Specify the ranking order

range

The domains are ordered according to the processor's BIOS numbering. This is a platform-dependent numbering

scatter

The domains are ordered so that adjacent domains have minimal sharing of common resources

compact

The domains are ordered so that adjacent domains share common resources as much as possible. This is the default value

spread

The domains are ordered consecutively so that, where possible, adjacent domains do not share common resources

bunch

The processes are mapped proportionally to sockets and the domains are ordered as close as possible on the sockets

Description

The optimal setting for this environment variable is application-specific. If adjacent MPI processes prefer to share common resources, such as cores, caches, sockets, or FSB, use the compact or bunch values. Otherwise, use the scatter or spread values. Use the range value as needed. For detailed information and examples of these values, see the Arguments table and the Examples section of I_MPI_PIN_ORDER in this topic.

The options scatter, compact, spread, and bunch are available for both Intel® and non-Intel microprocessors, but they may perform more optimizations for Intel microprocessors than for non-Intel microprocessors.
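A job script that sets these variables programmatically may want to reject a misspelled order value before launching. The Python sketch below checks against the five values listed in the Arguments table; pin_env and PIN_ORDERS are hypothetical names, and the check is a convenience of this sketch rather than something the library requires.

```python
PIN_ORDERS = ("range", "scatter", "compact", "spread", "bunch")


def pin_env(domain, order="compact"):
    """Return the pinning variables as a dict, validating the order value."""
    if order not in PIN_ORDERS:
        raise ValueError(
            "I_MPI_PIN_ORDER must be one of: " + ", ".join(PIN_ORDERS)
        )
    return {"I_MPI_PIN_DOMAIN": str(domain), "I_MPI_PIN_ORDER": order}


print(pin_env(2, "scatter"))
# → {'I_MPI_PIN_DOMAIN': '2', 'I_MPI_PIN_ORDER': 'scatter'}
```

The returned dictionary can be merged into the environment passed to the launcher.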

Examples

For the following configuration:

  • A two-socket node with four cores per socket and an L2 cache shared by each corresponding core pair.

  • Four MPI processes to run on the node with the settings below.

Compact order:

I_MPI_PIN_DOMAIN=2
I_MPI_PIN_ORDER=compact

Figure 3.2-9 Compact Order Example

Scatter order:

I_MPI_PIN_DOMAIN=2
I_MPI_PIN_ORDER=scatter

Figure 3.2-10 Scatter Order Example

Spread order:

I_MPI_PIN_DOMAIN=2
I_MPI_PIN_ORDER=spread

Figure 3.2-11 Spread Order Example

Bunch order:

I_MPI_PIN_DOMAIN=2
I_MPI_PIN_ORDER=bunch

Figure 3.2-12 Bunch Order Example

For more complete information about compiler optimizations, see our Optimization Notice.