Intel® MPI Library and Process Pinning on Xeon Phi™

Overview

When running a hybrid Intel MPI/threaded program on an Intel® Xeon Phi™ coprocessor (in either native or symmetric mode), thread placement is just as important as on a standard Intel® Xeon® processor, if not more so.

Process Pinning vs. Thread Affinity

Process pinning defines a set of processors on which a program is allowed to run. The threads of a program are allowed to utilize these processors in whatever manner the program specifies. Thread affinity defines how individual threads of a program are to be pinned. Thread affinity works within process pinning. For more information on thread affinity on Xeon Phi™, read OpenMP* Thread Affinity Control.
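
The interaction can be sketched at launch time: process pinning gives each rank a domain, and thread affinity then places individual threads within that domain. A minimal sketch, assuming the Intel OpenMP* runtime's KMP_AFFINITY variable; the executable name hybrid_app is a placeholder, not from this article:

```shell
# Sketch: I_MPI_PIN_DOMAIN pins each rank to a domain of logical processors;
# KMP_AFFINITY (Intel OpenMP* runtime) then controls how that rank's threads
# are placed within the domain. hybrid_app is a hypothetical executable.
mpirun -n 4 -host mic0 \
    -genv I_MPI_PIN_DOMAIN omp \
    -genv OMP_NUM_THREADS 4 \
    -genv KMP_AFFINITY compact \
    ./hybrid_app
```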

Approach

The Intel® MPI Library provides runtime control over where a process is pinned. When a process is pinned, the threads used by that process are restricted to the cores available to the process. The environment variable I_MPI_PIN_DOMAIN tells the Intel MPI Library how to pin processes. The following tables show what the different values of I_MPI_PIN_DOMAIN do for process pinning; each table corresponds to a different pinning form.

Multi-core Shape - I_MPI_PIN_DOMAIN=<mc-shape>

<mc-shape>     Define domains through multi-core terms.
core           Each domain consists of the logical processors that share a particular core. The number of domains on a node is equal to the number of cores on the node.
socket | sock  Each domain consists of the logical processors that share a particular socket. The number of domains on a node is equal to the number of sockets on the node. This is the recommended value.
node           All logical processors on a node are arranged into a single domain.
cache1         Logical processors that share a particular level 1 cache are arranged into a single domain.
cache2         Logical processors that share a particular level 2 cache are arranged into a single domain.
cache3         Logical processors that share a particular level 3 cache are arranged into a single domain.
cache          The largest domain among cache1, cache2, and cache3 is selected.
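
As a sketch of how a multi-core shape is typically combined with an OpenMP* thread count on the coprocessor (the rank count and the executable name hybrid_app are illustrative assumptions, not from this article):

```shell
# Hypothetical native launch: one rank per physical core, with each rank's
# 4 OpenMP* threads confined to that core's 4 hardware threads.
# The rank count (60) and hybrid_app are placeholders for this sketch.
mpirun -n 60 -host mic0 \
    -genv I_MPI_PIN_DOMAIN core \
    -genv OMP_NUM_THREADS 4 \
    ./hybrid_app
```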

Explicit Shape - I_MPI_PIN_DOMAIN=<size>:<layout>

<size>         Define the number of logical processors in each domain (domain size).
omp            The domain size is equal to the OMP_NUM_THREADS environment variable value. If the OMP_NUM_THREADS environment variable is not set, each node is treated as a separate domain.
auto           The domain size is defined by the formula size = #cpu/#proc, where #cpu is the number of logical processors on a node, and #proc is the number of MPI processes started on a node.
<n>            The domain size is defined by a positive decimal number <n>.
<layout>       Ordering of domain members. The default value is compact.
platform       Domain members are ordered according to their BIOS numbering (platform-dependent numbering).
compact        Domain members are located as close to each other as possible in terms of common resources (cores, caches, sockets, etc.). This is the default value.
scatter        Domain members are located as far away from each other as possible in terms of common resources (cores, caches, sockets, etc.).
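
The explicit form can be sketched the same way; here is a hypothetical launch using <size>:<layout> (the rank count, domain size, and hybrid_app are assumptions for illustration):

```shell
# Hypothetical launch using the explicit <size>:<layout> form: each of the
# 8 ranks gets a domain of 16 logical processors, packed as closely as
# possible. With OMP_NUM_THREADS=16 set, this behaves like omp:compact.
mpirun -n 8 -host mic0 \
    -genv I_MPI_PIN_DOMAIN 16:compact \
    -genv OMP_NUM_THREADS 16 \
    ./hybrid_app
```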

Explicit Domain Mask - I_MPI_PIN_DOMAIN=<masklist>

<masklist>     Define domains through a comma-separated list of hexadecimal numbers (domain masks).
[m1,...,mn]    Each mi value defines one separate domain. The following rule is used: the ith logical processor is included in the domain if the ith bit of mi is set to 1. All remaining processors are put into a separate domain. BIOS numbering is used.
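
Hand-writing hexadecimal masks is error-prone; a mask covering a contiguous run of logical processors can be derived with ordinary shell arithmetic. A minimal sketch (the width and starting processor shown here match the 4-processors-per-domain masks used later on this page):

```shell
# Derive a domain mask covering w consecutive logical processors starting at
# logical processor s: set w bits, then shift them up by s bits.
# w=4, s=1 covers logical processors 1-4, skipping processor 0.
w=4
s=1
printf '0x%X\n' $(( ((1 << w) - 1) << s ))   # prints 0x1E
```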

These options define where the threads of a rank are allowed to run. Pinning is especially important on a coprocessor because of its large number of cores coupled with its reduced memory resources compared to a standard processor.

Differences Between Xeon® and Xeon Phi™ Pinning

The main difference between pinning on a Xeon Phi™ coprocessor and on a standard Xeon® system is the core numbering. On the coprocessor, core number 0 is reserved for the operating system. If you are explicitly defining the process pinning, keep this in mind: you will want to start with core 1. Logical processor 0 also resides on the same physical core as the final three logical processors, rather than with logical processors 1-3. The multi-core shaping and explicit pinning methods automatically account for this, and where reasonable, these are the preferred methods. Using a masklist requires explicitly skipping core 0; the final example on this page shows how this can be done.

Examples

The example program used here is a trivial MPI/OpenMP* hybrid "hello world" program run with 4 ranks. To see the process pinning, set I_MPI_DEBUG=4; this tells the Intel MPI Library to display process pinning information (along with additional diagnostics). Only the pinning output and I_MPI_PIN_DOMAIN values are shown. These examples were run with:

mpirun -n 4 -host mic0 -genv I_MPI_PIN_DOMAIN <value> ./hybrid_hello

I_MPI_PIN_DOMAIN=core

Rank Pin cpu
0 {1, 2, 3, 4}
1 {5, 6, 7, 8}
2 {9, 10, 11, 12}
3 {13, 14, 15, 16}

I_MPI_PIN_DOMAIN=socket

Rank Pin cpu
0 {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243}
1 {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243}
2 {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243}
3 {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243}

I_MPI_PIN_DOMAIN=[0001E,001E0,01E00,1E000] (Note that the leading zeros can be omitted in the masklist.)

Rank Pin cpu
0 {1, 2, 3, 4}
1 {5, 6, 7, 8}
2 {9, 10, 11, 12}
3 {13, 14, 15, 16}