Intel® MPI Library: Compatibility among Intel® Xeon® Processors, Intel® Xeon Phi™ Coprocessors and Intel® Xeon Phi™ Processors

1. Introduction

As the Message Passing Interface (MPI) parallel programming paradigm has become a de facto standard for parallel computing on distributed-memory systems, the Intel® MPI Library has become increasingly important in supporting a wide range of Intel products. As a result, users need to be aware of usage differences when working with a particular Intel product.

The first part of this document provides an overview of the similarities and differences seen when using Intel® MPI Library on three different microarchitectures: the Intel® Xeon® Processors, Intel® Xeon Phi™ Coprocessors and Intel® Xeon Phi™ Processors.

The second part of this document focuses on the Intel Xeon Phi Processor and shares some helpful best practices for using the Intel MPI Library with this processor.

2. Similarities and differences when using the Intel® MPI Library

The current Intel® MPI Library implements the Message Passing Interface version 3.0 specification (MPI 3.0). The library is optimized for each of the Intel platforms. From the user's standpoint, the library can be used without regard to the particular hardware, provided the following items are kept in mind.

Compiler Flag

To use the Intel MPI Library, users need to establish the proper environment and compile with mpiicc for C, mpiicpc for C++, or mpiifort for Fortran:

$ source <compilerinstalldir>/bin/compilervars.sh intel64
$ source <installdir>/bin64/mpivars.sh
$ mpiicc application.c
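Here, application.c can be any C source that uses MPI. As a point of reference, a minimal program (a hypothetical sketch, not the test program shipped with the library) might look like this:

/* application.c: minimal MPI program used in the compile examples (hypothetical sketch) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}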

To get better execution performance from the underlying Intel® hardware, users may compile their applications with a hardware-specific flag. For the Intel Xeon Processor, the -xHost flag generates an executable that takes advantage of the optimal instructions available on the compilation host processor:

$ mpiicc -xHost application.c

The Intel® C/C++ Compiler does not execute directly on the Intel Xeon Phi Coprocessor. When compiling on an Intel Xeon Processor for the coprocessor, use the flag -mmic to build your application:

$ mpiicc -mmic application.c

When compiling on the Intel Xeon Processor for the Intel Xeon Phi Processor, use the flag -xMIC-AVX512 to build your application:

$ mpiicc -xMIC-AVX512 application.c

When compiling on the Intel Xeon Phi Processor for the Intel Xeon Phi Processor, you may use either the -xMIC-AVX512 or the -xHost flag. Note that the -xHost option targets the highest instruction set available on the compilation host processor.

Binary Compatibility

MPI programs built with the option -xCOMMON-AVX512 can run both on Intel Xeon processors that support Intel® AVX-512 instructions and on Intel Xeon Phi processors. MPI programs built for the Intel Xeon processor or the Intel Xeon Phi processor cannot run on an Intel Xeon Phi coprocessor, and vice versa.
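For example, to build a single binary that can run on both:

$ mpiicc -xCOMMON-AVX512 application.c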

Programming Models

The Intel Xeon Processor, the Intel Xeon Phi Coprocessor, and the Intel Xeon Phi Processor run MPI programs natively. That is, after building an MPI program for that specific platform, the corresponding executable begins execution directly on that platform.

Since the Intel Xeon Phi Coprocessor connects to a Xeon host, it offers more options:

  • Offload model: the MPI ranks reside on the Intel Xeon Processor host and offload work to the Intel Xeon Phi Coprocessor(s). The offload model can be implemented using offload directives, Intel® Math Kernel Library (Intel® MKL) implicit offload, or Intel MKL explicit offload (a sketch using offload directives follows this list).

  • Symmetric model: MPI ranks reside on both the Intel Xeon machine and the Intel Xeon Phi Coprocessor. In this model, the Intel Xeon Processor and the Intel Xeon Phi Coprocessor together form a heterogeneous set of compute nodes.
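As a minimal illustration of the offload-directive approach (a hedged sketch; the array names and sizes are illustrative, and error handling is omitted), each MPI rank running on the Intel Xeon Processor host can offload a loop to coprocessor 0:

/* offload_example.c: sketch of the offload model; every MPI rank runs on the
   Xeon host and offloads a compute loop to Intel Xeon Phi coprocessor 0. */
#include <mpi.h>
#include <stdio.h>
#define N 1024

int main(int argc, char *argv[])
{
    float a[N], b[N];
    int i, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < N; i++)
        a[i] = (float)(i + rank);

    /* The loop below executes on the coprocessor: a is copied in, b is copied out. */
    #pragma offload target(mic:0) in(a) out(b)
    for (i = 0; i < N; i++)
        b[i] = 2.0f * a[i];

    printf("rank %d: b[1] = %f\n", rank, b[1]);
    MPI_Finalize();
    return 0;
}

Compile this with mpiicc on the host (without -mmic); the compiler generates both the host code and the coprocessor code for the offloaded region.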

When executing directly on the Intel Xeon Phi Coprocessor, users can launch the executable either from the host or directly from the coprocessor. If launching from the host, users need to enable the I_MPI_MIC environment variable:

$ export I_MPI_MIC=on

This environment variable setting is useful even with a native launch on the Intel Xeon Phi coprocessor, since it selects settings optimized for the Intel Xeon Phi coprocessor.
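For example, to launch a natively built coprocessor binary from the host (assuming the coprocessor is reachable under the hostname mic0 and the -mmic binary is accessible at the same path on the coprocessor, for example over NFS):

$ export I_MPI_MIC=on
$ mpirun -host mic0 -n 8 ./a.out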

Threading Models

The three architectures support Intel MKL, Intel® Threading Building Blocks, Intel® Cilk™ Plus, OpenMP*, and Pthreads*.

Note that in an Intel Xeon system, each core appears as two logical cores when hyperthreading is enabled, whereas the number of available hardware threads in the Intel Xeon Phi Coprocessor and the Intel Xeon Phi Processor is equal to four times the number of available cores.

The following table summarizes the differences:

                              | Intel® Xeon® Processor | Intel® Xeon Phi™ Coprocessor | Intel® Xeon Phi™ Processor
Compiler flag                 | -xHost | -mmic | -xHost or -xMIC-AVX512
Compatible with Intel® Xeon®  |        | No    | Yes
Programming models            | Native | Native, Offload, Symmetric | Native
Threading models              | Intel® Math Kernel Library (Intel® MKL), Intel® Threading Building Blocks, Intel® Cilk™ Plus, OpenMP*, and Pthreads* | same as Intel® Xeon® | same as Intel® Xeon®
Maximum number of threads     | 2 times the number of cores | 4 times the number of cores | 4 times the number of cores
Instruction set               | SSE, MMX, Intel® Advanced Vector Extensions (Intel® AVX), Intel® Advanced Vector Extensions 2 (Intel® AVX2) | Intel® Initial Many Core Instructions (Intel® IMCI) | SSE, MMX, Intel® AVX, Intel® AVX2, Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
Memory type                   | DDR4 | GDDR5 | DDR4 and MCDRAM

3. Working with Intel® Xeon Phi™ Processor

The Intel Xeon Phi Processor differs from the other two architectures in that it has high-bandwidth memory, configurable clustering of its cores (the "cluster modes"), and new Instruction Set Architecture (ISA) instructions. To get the most out of these features, it is important to understand thread affinity and how it can be modified to suit an application. This section shares some best practices for dealing with high-bandwidth memory and affinity on the Intel Xeon Phi Processor.

3.1 Using High-Bandwidth Memory on Intel Xeon Phi Processor

There are two ways to allocate high-bandwidth memory: explicit calls to the memkind library or enabling AutoHBW. The Intel MPI Library itself does not use the memkind library to allocate high-bandwidth memory; if you choose to use the memkind library, it is up to your MPI application to allocate the high-bandwidth memory, as in the sketch below.
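The hbwmalloc interface of the memkind library provides hbw_malloc/hbw_free for explicit allocation. The following is a minimal sketch (the file name and buffer size are illustrative); link with -lmemkind:

/* hbw_example.c: sketch of explicit MCDRAM allocation with the hbwmalloc
   interface from the memkind library (link with -lmemkind). */
#include <hbwmalloc.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    const size_t n = 1 << 20;   /* illustrative buffer size */
    int rank, have_hbw;
    double *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* hbw_check_available() returns 0 when high-bandwidth memory is present;
       fall back to regular DDR memory otherwise. */
    have_hbw = (hbw_check_available() == 0);
    buf = have_hbw ? (double *)hbw_malloc(n * sizeof(double))
                   : (double *)malloc(n * sizeof(double));
    if (buf == NULL) {
        fprintf(stderr, "rank %d: allocation failed\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... use buf as a compute or communication buffer ... */

    if (have_hbw)
        hbw_free(buf);
    else
        free(buf);
    MPI_Finalize();
    return 0;
}

$ mpiicc -xMIC-AVX512 hbw_example.c -lmemkind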

Alternatively, you can use AutoHBW to automatically allocate high-bandwidth memory on an Intel Xeon Phi Processor system without modifying or recompiling your source code. The article Using The AutoHBW Library with Jemalloc and Memkind shows how to create a script that allows you to use AutoHBW with an MPI program.

For example, on a 72-core Intel Xeon Phi Processor equipped with MCDRAM, the command "numactl -H" shows that this machine has two NUMA nodes: node 0 has 288 (72 x 4) OS processors (CPUs) and 96 GB of memory, and node 1 has 8 GB of MCDRAM. You can use AutoHBW to verify whether an MPI program benefits from allocating MCDRAM before changing the code.

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287
node 0 size: 98200 MB
node 0 free: 92912 MB
node 1 cpus:
node 1 size: 8192 MB
node 1 free: 7901 MB
node distances:
node   0   1
  0:  10  31
  1:  31  10

3.2 OpenMP Affinity on Intel Xeon Phi Processor

You can set affinity on the Intel Xeon Phi Processor by using either Intel's KMP affinity environment variables or the OpenMP 4.0 thread affinity environment variables.

With KMP affinity, the KMP_PLACE_THREADS environment variable controls thread placement: it defines the number of cores to be used and the number of threads per core. The KMP_AFFINITY environment variable indicates the type of thread affinity: compact, scatter, and so on. As on the Intel Xeon Phi Coprocessor, the default is scatter affinity.

On an Intel Xeon Phi Processor system, each physical core has 4 hardware threads, so the Linux* OS assigns 4 logical cores to each physical core. The intended numbering scheme is as follows: the first logical core on each physical core is numbered from 0 to n-1 (where n is the number of physical cores), the second logical core on each physical core from n to 2n-1, the third from 2n to 3n-1, and the fourth from 3n to 4n-1. However, a bug in the Linux versions running on the Intel Xeon Phi Processor assigns the OS processors in a seemingly random fashion, so any scheme that assumes the intended numbering will fail. This is particularly important if you are using the OpenMP explicit affinity type.

To determine the OS processor to physical core mapping, you can look at the file /proc/cpuinfo:

$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 87
model name      : 06/57
stepping        : 0
microcode       : 0xffff002d
cpu MHz         : 1200.000
cache size      : 1024 KB
physical id     : 0
siblings        : 288
core id         : 0
cpu cores       : 72
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
<output truncated>

Each entry in the file has a processor number and a corresponding core number: the "processor" field shows the logical processor (OS processor) number, and the "core id" field shows the physical core number. Another way to extract all logical processors and their corresponding cores is shown below:

$ grep 'processor\|core id' /proc/cpuinfo

Grouping that output by core id gives each physical core and its OS processors; a small helper that produces this grouping is sketched after the listing. For example, physical core 0 has the four logical processors 0, 109, 169, and 229:

core:  0 procs: 0 109 169 229
core:  1 procs: 50 110 170 230
core:  2 procs: 51 111 171 231
core:  3 procs: 52 112 172 232
core:  4 procs: 53 113 173 233
core:  5 procs: 54 114 174 234
core:  6 procs: 55 115 175 235
core:  7 procs: 56 116 176 236
core:  8 procs: 57 117 177 237
core:  9 procs: 58 118 178 238

………………………………………….
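The grep output itself alternates "processor" and "core id" lines, so some post-processing is needed to arrive at the grouping above. A small helper along these lines (a sketch, not part of any Intel tool) prints the same mapping:

/* core_map.c: sketch that groups OS processors by physical core using
   /proc/cpuinfo and prints "core: <id> procs: <OS processor list>". */
#include <stdio.h>
#include <string.h>

#define MAX_CPUS 1024

int main(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    char line[512];
    int core_of[MAX_CPUS];
    int proc = -1, nproc = 0, max_core = -1, c, p;

    if (f == NULL) {
        perror("/proc/cpuinfo");
        return 1;
    }
    memset(core_of, -1, sizeof(core_of));

    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "processor : %d", &proc) == 1) {
            /* remember the OS processor number; its core id follows below */
        } else if (sscanf(line, "core id : %d", &c) == 1 && proc >= 0 && proc < MAX_CPUS) {
            core_of[proc] = c;
            if (proc + 1 > nproc) nproc = proc + 1;
            if (c > max_core) max_core = c;
        }
    }
    fclose(f);

    for (c = 0; c <= max_core; c++) {
        int found = 0;
        for (p = 0; p < nproc; p++) {
            if (core_of[p] == c) {
                if (!found) printf("core: %2d procs:", c);
                printf(" %d", p);
                found = 1;
            }
        }
        if (found) printf("\n");
    }
    return 0;
}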

 

You can also set KMP_AFFINITY to verbose to see how threads are mapped to OS processors.
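You can also confirm placement from inside the program. The a.out used in the examples below can be any OpenMP code; a small checker along these lines (a sketch; sched_getcpu() is Linux-specific) prints the OS processor each thread runs on:

/* check_omp.c: sketch of an OpenMP affinity checker; each thread reports
   the OS processor (CPU) it is currently running on.  Linux-specific. */
#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        /* sched_getcpu() returns the OS processor executing the calling thread. */
        printf("OpenMP thread %3d of %3d runs on CPU %3d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}

$ icc -qopenmp -xMIC-AVX512 check_omp.c -o a.out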

One thread per core: on an Intel Xeon Phi Processor with 72 cores, the following commands run the OpenMP program “a.out” with one thread per core for all 72 cores, for the first 36 cores and for the first 18 cores, respectively:

$ KMP_PLACE_THREADS=72C,1T KMP_AFFINITY=compact,verbose ./a.out
$ KMP_PLACE_THREADS=36C,1T KMP_AFFINITY=compact,verbose ./a.out
$ KMP_PLACE_THREADS=18C,1T KMP_AFFINITY=compact,verbose ./a.out

Two threads per core: the following commands run the OpenMP program “a.out” with two threads per core for 72 cores, for the first 36 cores and for the first 18 cores, respectively:

$ KMP_PLACE_THREADS=72C,2T KMP_AFFINITY=compact,verbose ./a.out
$ KMP_PLACE_THREADS=36C,2T KMP_AFFINITY=compact,verbose ./a.out
$ KMP_PLACE_THREADS=18C,2T KMP_AFFINITY=compact,verbose ./a.out

Four threads per core: similarly, the following commands run the OpenMP program “a.out” with four threads per core for 72 cores, for the first 36 cores and for the first 18 cores, respectively:

$ KMP_PLACE_THREADS=72C,4T KMP_AFFINITY=compact,verbose ./a.out
$ KMP_PLACE_THREADS=36C,4T KMP_AFFINITY=compact,verbose ./a.out
$ KMP_PLACE_THREADS=18C,4T KMP_AFFINITY=compact,verbose ./a.out

Let n be the number of cores on an Intel Xeon Phi Processor. The following table summarizes how to use the KMP_PLACE_THREADS and KMP_AFFINITY environment variables to create one thread per core, two threads per core, or four threads per core for all cores, for half of the cores, or for a quarter of the cores:

 

                   | Use all n cores            | Use only n/2 cores           | Use only n/4 cores
1 thread per core  | KMP_PLACE_THREADS="n"C,1T  | KMP_PLACE_THREADS="n/2"C,1T  | KMP_PLACE_THREADS="n/4"C,1T
2 threads per core | KMP_PLACE_THREADS="n"C,2T  | KMP_PLACE_THREADS="n/2"C,2T  | KMP_PLACE_THREADS="n/4"C,2T
4 threads per core | KMP_PLACE_THREADS="n"C,4T  | KMP_PLACE_THREADS="n/2"C,4T  | KMP_PLACE_THREADS="n/4"C,4T

In every case, also set KMP_AFFINITY=compact and substitute the actual values for "n", "n/2", and "n/4".

The second way to set affinity is to use the OMP_PLACES environment variable, defined in OpenMP 4.0. This environment variable allows you to pin OpenMP threads to hardware threads, cores, or sockets. Combined with the OMP_NUM_THREADS environment variable, it lets you create threads and pin them to OS processors.

One thread per tile: On an Intel Xeon Phi Processor with 72 cores, the following commands run the OpenMP program “a.out” with one thread per tile.

Note that instead of specifying the number of cores to use, you specify the number of processor threads across which to distribute the OpenMP threads. The Intel Xeon Phi Processor is organized into tiles, each containing 2 cores, and each core has 4 hardware threads. Therefore, the commands to run a.out with one OpenMP thread per tile are, respectively: 36 threads distributed across 288 processor threads (all 72 cores), 18 threads across 144 processor threads (36 cores), and 9 threads across 72 processor threads (18 cores):

$ OMP_PLACES="threads(288)" OMP_NUM_THREADS=36 ./a.out
$ OMP_PLACES="threads(144)" OMP_NUM_THREADS=18 ./a.out
$ OMP_PLACES="threads(72)"  OMP_NUM_THREADS=9  ./a.out

One thread per core: the following commands run the OpenMP program “a.out” with one thread per core for 288 OS processors (72 cores), 144 OS processors (36 cores), and 72 OS processors (18 cores), respectively:

$ OMP_PLACES="threads(288)" OMP_NUM_THREADS=72 ./a.out
$ OMP_PLACES="threads(144)" OMP_NUM_THREADS=36 ./a.out
$ OMP_PLACES="threads(72)"  OMP_NUM_THREADS=18 ./a.out

Two threads per core: the following commands run the OpenMP program “a.out” with two threads per core for 288 OS processors (72 cores), 144 OS processors (36 cores), and 72 OS processors (18 cores), respectively:

$ OMP_PLACES="threads(288)" OMP_NUM_THREADS=144 ./a.out
$ OMP_PLACES="threads(144)" OMP_NUM_THREADS=72  ./a.out
$ OMP_PLACES="threads(72)"  OMP_NUM_THREADS=36  ./a.out

Four threads per core: similarly, the following commands run the OpenMP program “a.out” with four threads per core for 288 OS processors (72 cores), 144 OS processors (36 cores), and 72 OS processors (18 cores), respectively:

$ OMP_PLACES="threads(288)" OMP_NUM_THREADS=288 ./a.out
$ OMP_PLACES="threads(144)" OMP_NUM_THREADS=144 ./a.out
$ OMP_PLACES="threads(72)"  OMP_NUM_THREADS=72  ./a.out

Alternatively, the OMP_PROC_BIND environment variable can be used to specify whether threads may be moved between processors and to choose the thread affinity policy. When this environment variable is set to TRUE, the threads are not moved; by default, OMP_PROC_BIND is set to TRUE. The policy is specified with the values MASTER, CLOSE, and SPREAD (threads are placed in the same place partition as the master thread, in contiguous place partitions, or spread among the place partitions, respectively).

$ OMP_PROC_BIND=close OMP_NUM_THREADS=288 ./a.out
$ OMP_PROC_BIND=close OMP_NUM_THREADS=144 ./a.out
$ OMP_PROC_BIND=close OMP_NUM_THREADS=72  ./a.out

Let n be the number of cores on an Intel Xeon Phi processor. The following table summarizes how to use the OMP_PLACES and OMP_NUM_THREADS environment variables to create one thread per tile, one thread per core, two threads per core, or four threads per core for all cores, for half of the cores, or for a quarter of the cores:

 

                   | Use all n cores           | Use only n/2 cores        | Use only n/4 cores
1 thread per tile  | OMP_PLACES="threads(4n)"  | OMP_PLACES="threads(2n)"  | OMP_PLACES="threads(n)"
                   | OMP_NUM_THREADS=n/2       | OMP_NUM_THREADS=n/4       | OMP_NUM_THREADS=n/8
1 thread per core  | OMP_PLACES="threads(4n)"  | OMP_PLACES="threads(2n)"  | OMP_PLACES="threads(n)"
                   | OMP_NUM_THREADS=n         | OMP_NUM_THREADS=n/2       | OMP_NUM_THREADS=n/4
2 threads per core | OMP_PLACES="threads(4n)"  | OMP_PLACES="threads(2n)"  | OMP_PLACES="threads(n)"
                   | OMP_NUM_THREADS=2n        | OMP_NUM_THREADS=n         | OMP_NUM_THREADS=n/2
4 threads per core | OMP_PLACES="threads(4n)"  | OMP_PLACES="threads(2n)"  | OMP_PLACES="threads(n)"
                   | OMP_NUM_THREADS=4n        | OMP_NUM_THREADS=2n        | OMP_NUM_THREADS=n

The threads() count is the number of hardware threads covered, so "threads(4n)" means, for example, threads(288) on a 72-core part.

3.3 Intel® MPI Affinity on the Intel Xeon Phi Processor

For illustration purposes, use the sample program from <installdir>/test/test.c. Compile and run it as follows:

$ mpiicc -xMIC-AVX512 test.c
$ mpirun -n 4 -env I_MPI_DEBUG 4 ./a.out

Set the I_MPI_DEBUG environment variable to 4 or higher to display how MPI ranks are mapped to CPUs (OS processors). On an Intel Xeon Phi Processor machine with 72 cores and 4 hardware threads per core, the total number of CPUs is 72 x 4 = 288. The command above launches 4 MPI ranks; therefore each rank is pinned to 288/4 = 72 CPUs. The MPI rank mapping is shown in the following run:

[0] MPI startup(): Multi-threaded optimized library
[0] MPI startup(): Rank    Pid      Node name   Pin cpu
[0] MPI startup(): 0       244756   knl4 {0,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,109,110,111,112,113,114,115               
,116,117,118,119,120,121,122,123,124,125,126,169,170,171,172,173,174,175,176,177
,178,179,180,181,182,183,184,185,186,229,230,231,232,233,234,235,236,237,238,239
,240,241,242,243,244,245,246}
[0] MPI startup(): 1       244757   knl4  {67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,127,128,129,130,131,132,133                      
,134,135,136,137,138,139,140,141,142,143,144,187,188,189,190,191,192,193,194,195                                
,196,197,198,199,200,201,202,203,204,247,248,249,250,251,252,253,254,255,256,257
,258,259,260,261,262,263,264}
[0] MPI startup(): 2       244758   knl4  {85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,145,146,147,148,149,150
,151,152,153,154,155,156,157,158,159,160,161,162,205,206,207,208,209,210,211,212                                  
,213,214,215,216,217,218,219,220,221,222,265,266,267,268,269,270,271,272,273,274
,275,276,277,278,279,280,281,282}
[0] MPI startup(): 3       244759   knl4  {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30
,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,103,104,105,106,107,108                                  
,163,164,165,166,167,168,223,224,225,226,227,228,283,284,285,286,287}

One MPI rank per tile: when starting 36 MPI ranks on an Intel Xeon Phi Processor system with 72 cores, each MPI rank is mapped to 72 x 4 / 36 = 8 OS processors. The output below shows that the first rank maps to 8 OS processors (2 cores) {0,50,109,110,169,170,229,230}, the second rank maps to the next 8 OS processors {51,52,111,112,171,172,231,232}, and so on. Note that each MPI rank can move between the two cores of its tile.

$ mpirun -n 36 -env I_MPI_DEBUG 4 ./a.out
[0] MPI startup(): Rank    Pid      Node name   Pin cpu
[0] MPI startup(): 0       245851   knl4        {0,50,109,110,169,170,229,230}
[0] MPI startup(): 1       245852   knl4        {51,52,111,112,171,172,231,232}
[0] MPI startup(): 2       245853   knl4        {53,54,113,114,173,174,233,234}
[0] MPI startup(): 3       245854   knl4        {55,56,115,116,175,176,235,236}
[0] MPI startup(): 4       245855   knl4        {57,58,117,118,177,178,237,238}
[0] MPI startup(): 5       245856   knl4        {59,60,119,120,179,180,239,240}
[0] MPI startup(): 6       245857   knl4        {61,62,121,122,181,182,241,242}
[0] MPI startup(): 7       245858   knl4        {63,64,123,124,183,184,243,244}
< output truncated >

One rank per core (two ranks per tile): when launching 72 MPI ranks on the same socket with 72 cores, each MPI rank is mapped to 72 x 4 / 72 = 4 OS processors. The output below shows that the first rank maps to the 4 OS processors {0,109,169,229}, the second rank maps to the next 4 OS processors {50,110,170,230}, and so on.

$ mpirun -n 72 -env I_MPI_DEBUG 4 ./a.out
[0] MPI startup(): Rank    Pid      Node name   Pin cpu
[0] MPI startup(): 0       245953   knl4        {0,109,169,229}
[0] MPI startup(): 1       245954   knl4        {50,110,170,230}
[0] MPI startup(): 2       245955   knl4        {51,111,171,231}
[0] MPI startup(): 3       245956   knl4        {52,112,172,232}
[0] MPI startup(): 4       245957   knl4        {53,113,173,233}
[0] MPI startup(): 5       245958   knl4        {54,114,174,234}
[0] MPI startup(): 6       245959   knl4        {55,115,175,235}
[0] MPI startup(): 7       245960   knl4        {56,116,176,236}
< output truncated >

Two ranks per core (four ranks per tile): similarly, when launching 144 MPI ranks, each rank maps to 2 (72 x 4 / 144) OS processors.

$ mpirun -n 144 -env I_MPI_DEBUG 4 ./a.out
[0] MPI startup(): Rank    Pid      Node name   Pin cpu
[0] MPI startup(): 0       247281   knl4        {0,109}
[0] MPI startup(): 1       247282   knl4        {169,229}
[0] MPI startup(): 2       247283   knl4        {50,110}
[0] MPI startup(): 3       247284   knl4        {170,230}
[0] MPI startup(): 4       247285   knl4        {51,111}
[0] MPI startup(): 5       247286   knl4        {171,231}
[0] MPI startup(): 6       247287   knl4        {52,112}
[0] MPI startup(): 7       247288   knl4        {172,232}
< output truncated >

Four ranks per core (eight ranks per tile): with 288 MPI ranks, each rank is mapped to 1 (72 x 4 /288) OS processor.

$ mpirun -n 288 -env I_MPI_DEBUG 4 ./a.out
[0] MPI startup(): Rank    Pid      Node name   Pin cpu
[0] MPI startup(): 0       247918   knl4        0
[0] MPI startup(): 1       247919   knl4        109
[0] MPI startup(): 2       247920   knl4        169
[0] MPI startup(): 3       247921   knl4        229
[0] MPI startup(): 4       247922   knl4        50
[0] MPI startup(): 5       247923   knl4        110
[0] MPI startup(): 6       247924   knl4        170
[0] MPI startup(): 7       247925   knl4        230
< truncate here >

The following table summarizes these cases. Let k be the number of cores on an Intel Xeon Phi Processor system and n be the number of MPI ranks; each rank floats among a set of OS processors:

 

                 | Floating
1 rank per tile  | n = k / 2
1 rank per core  | n = k
2 ranks per core | n = k * 2
4 ranks per core | n = k * 4

To bind each MPI rank to a particular OS processor, use the I_MPI_PIN_PROCESSOR_LIST environment variable. Setting I_MPI_PIN_PROCESSOR_LIST may be appropriate when the number of ranks is less than the number of cores.

There are two ways to control pinning: I_MPI_PIN_DOMAIN and I_MPI_PIN_PROCESSOR_LIST:

  • If the I_MPI_PIN_DOMAIN environment variable is defined, then the I_MPI_PIN_PROCESSOR_LIST environment variable is ignored.

  • If the I_MPI_PIN_DOMAIN environment variable is not defined, then MPI ranks are pinned according to the I_MPI_PIN_PROCESSOR_LIST environment variable.

The Intel MPI Library defines the I_MPI_PIN_DOMAIN environment variable to control process pinning: it defines a number of non-overlapping subsets (domains) of logical processors on a node. Each MPI rank is pinned to one domain and can create a number of child threads that run within that domain.

By default, if you do not set any process pinning environment variable, I_MPI_PIN_DOMAIN=auto:compact is used. The auto value sets the domain size to #cpu/#rank, where #cpu is the number of logical processors and #rank is the number of MPI ranks. The compact value places the domain members as close to each other as possible in terms of common resources (cores, caches, sockets, and so on).

One rank per tile and pin it to an OS processor: set the I_MPI_PIN_PROCESSOR_LIST environment variable to all:map=scatter

$ mpirun -n 36 -env I_MPI_PIN_PROCESSOR_LIST all:map=scatter -env \
I_MPI_DEBUG 4 ./a.out
[0] MPI startup(): Rank    Pid      Node name   Pin cpu
[0] MPI startup(): 0       42345    knl4        0
[0] MPI startup(): 1       42346    knl4        51
[0] MPI startup(): 2       42347    knl4        53
[0] MPI startup(): 3       42348    knl4        55
<output truncated>

One rank per core (two ranks per tile) and pin each to an OS processor: set I_MPI_PIN_PROCESSOR_LIST to all:shift=4

$ mpirun -n 72 -env I_MPI_PIN_PROCESSOR_LIST all:shift=4 -env I_MPI_DEBUG 4 ./a.out
[0] MPI startup(): Rank    Pid      Node name   Pin cpu
[0] MPI startup(): 0       42622    knl4        0
[0] MPI startup(): 1       42623    knl4        50
[0] MPI startup(): 2       42624    knl4        51
[0] MPI startup(): 3       42625    knl4        52
< output truncated >

Two ranks per core (four ranks per tile) and pin each rank to an OS processor: set I_MPI_PIN_PROCESSOR_LIST to all:grain=2,shift=2

$ mpirun -n 144 -env I_MPI_PIN_PROCESSOR_LIST all:grain=2,shift=2 -env I_MPI_DEBUG 4 ./a.out
[0] MPI startup(): Rank    Pid      Node name   Pin cpu
[0] MPI startup(): 0       43082    knl4        0
[0] MPI startup(): 1       43083    knl4        109
[0] MPI startup(): 2       43084    knl4        50
[0] MPI startup(): 3       43085    knl4        110
< output truncated >

Four ranks per core (eight ranks per tile) and pin each rank to an OS processor: set I_MPI_PIN_PROCESSOR_LIST to all:map=bunch

$ mpirun -n 288 -env I_MPI_PIN_PROCESSOR_LIST all:map=bunch -env I_MPI_DEBUG 4 \
./a.out
[0] MPI startup(): Rank    Pid      Node name   Pin cpu
[0] MPI startup(): 0       276413   knl4        0
[0] MPI startup(): 1       276414   knl4        109
[0] MPI startup(): 2       276415   knl4        169
[0] MPI startup(): 3       276416   knl4        229
[0] MPI startup(): 4       276417   knl4        50
[0] MPI startup(): 5       276418   knl4        110
[0] MPI startup(): 6       276419   knl4        170
[0] MPI startup(): 7       276420   knl4        230
< output truncated >

The following table summarizes the settings used to launch MPI ranks on an Intel Xeon Phi Processor. Let k be the number of cores on the system and n be the number of MPI ranks; each rank is pinned to an OS processor:

 

                 | Number of ranks | Pinning to an OS processor
1 rank per tile  | n = k / 2       | I_MPI_PIN_PROCESSOR_LIST=all:map=scatter
1 rank per core  | n = k           | I_MPI_PIN_PROCESSOR_LIST=all:shift=4
2 ranks per core | n = k * 2       | I_MPI_PIN_PROCESSOR_LIST=all:grain=2,shift=2
4 ranks per core | n = k * 4       | I_MPI_PIN_PROCESSOR_LIST=all:map=bunch

3.4 Hybrid MPI and OpenMP Affinity on the Intel Xeon Phi Processor

You can combine the above affinity methods to handle hybrid MPI/OpenMP affinity on the Intel Xeon Phi Processor. Set KMP_AFFINITY to verbose to see how threads are mapped to OS processors.

For illustration purposes, the following examples launch 4 MPI ranks on an Intel Xeon Phi Processor with 72 cores. Each rank has a team of OpenMP threads, and each thread is pinned to an OS processor.
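The a.out in these examples can be any hybrid MPI/OpenMP code; a small checker in the same spirit as the sketch in section 3.2 (a hypothetical example, Linux-specific) makes the resulting placement visible:

/* hybrid_check.c: sketch of a hybrid MPI/OpenMP affinity checker; every
   OpenMP thread of every MPI rank reports the OS processor it runs on. */
#define _GNU_SOURCE
#include <mpi.h>
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    printf("rank %3d thread %3d runs on CPU %3d\n",
           rank, omp_get_thread_num(), sched_getcpu());

    MPI_Finalize();
    return 0;
}

$ mpiicc -qopenmp -xMIC-AVX512 hybrid_check.c -o a.out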

Again, there are two ways to do this: the first is to use KMP_AFFINITY and KMP_PLACE_THREADS, and the second is to use the OpenMP 4.0 OMP_PLACES environment variable:

One thread per core (72 cores, 4 ranks, each rank maps to 72/4 = 18 cores; KMP_PLACE_THREADS=18C,1T specifies one thread on each of the 18 cores):

$ mpirun -n 4 -env KMP_PLACE_THREADS 18C,1T ./a.out

Two threads per core

$ mpirun -n 4 -env KMP_PLACE_THREADS 18C,2T ./a.out

Four threads per core

$ mpirun -n 4 -env KMP_PLACE_THREADS 18C,4T ./a.out

Let n be the number of cores on the Intel Xeon Phi Processor. The following table summarizes how to launch four MPI ranks where each rank has n/4 threads (one thread per core), n/2 threads (two threads per core), or n threads (four threads per core), respectively:

 

 

                   | For 4 MPI ranks
1 thread per core  | KMP_PLACE_THREADS="n/4"C,1T
2 threads per core | KMP_PLACE_THREADS="n/4"C,2T
4 threads per core | KMP_PLACE_THREADS="n/4"C,4T

Alternatively, you can use the OpenMP 4.0 OMP_PLACES environment variable:

One thread per tile (72 cores, 4 ranks, each rank maps to 72/4 = 18 cores): the OMP_NUM_THREADS environment variable specifies that each rank has 9 threads, and OMP_PLACES="threads(72)" places those threads on the rank's 72 OS processors (18 x 4):

$ mpirun -n 4 -env OMP_PLACES "threads(72)" -env OMP_NUM_THREADS 9 ./a.out

One thread per core (72 cores, 4 ranks, each rank maps to 72/4 = 18 cores): OMP_NUM_THREADS specifies that each rank has 18 threads, and OMP_PLACES="threads(72)" places those threads on the rank's 72 OS processors (hardware threads):

$ mpirun -n 4 -env OMP_PLACES "threads(72)" -env OMP_NUM_THREADS 18 ./a.out

Two threads per core

$ mpirun -n 4 -env OMP_PLACES "threads(72)" -env OMP_NUM_THREADS 36 ./a.out

 

Four threads per core

$ mpirun -n 4 -env OMP_PLACES "threads(72)" -env OMP_NUM_THREADS 72 ./a.out

 

Let n be the number of cores on an Intel Xeon Phi Processor. The following table summarizes how to launch four MPI ranks where each rank has n/4 threads (one thread per core), n/2 threads (two threads per core), or n threads (four threads per core), respectively; here the threads() count refers to the hardware threads available to each rank:

 

 

                   | For 4 MPI ranks
1 thread per core  | OMP_PLACES="threads(n)"  OMP_NUM_THREADS=n/4
2 threads per core | OMP_PLACES="threads(n)"  OMP_NUM_THREADS=n/2
4 threads per core | OMP_PLACES="threads(n)"  OMP_NUM_THREADS=n

You can also use the I_MPI_PIN_DOMAIN environment variable to define the number of logical processors in each domain; each MPI rank is then pinned to its own domain. The following example shows two MPI ranks running in two quadrants of a system with 72 cores. Each MPI rank creates a team of 18 OpenMP threads (one thread per core).

 

One thread per core, two quadrants: each domain consists of 72 OS processors or 72/4 = 18 cores

$ mpirun -n 2 -env I_MPI_PIN_DOMAIN 72 -env KMP_PLACE_THREADS 18C,1T ./a.out

Or

$ mpirun -n 2 -env I_MPI_PIN_DOMAIN 72 -env OMP_PLACES "threads(72)" \
-env OMP_NUM_THREADS 18 ./a.out

Let k be the number of cores on an Intel Xeon Phi Processor system. When starting 4 MPI ranks this way, each rank is pinned to a domain of k logical processors (k/4 cores), that is, I_MPI_PIN_DOMAIN=k; within each domain, one, two, or four OpenMP threads per core then correspond to OMP_NUM_THREADS=k/4, k/2, or k, respectively.

 

4. Conclusion

The first part of this article highlighted some similarities and differences when using Intel MPI Library on the Intel Xeon Processor, the Intel Xeon Phi Coprocessor and the Intel Xeon Phi Processor. In the second part, this article discussed in detail some useful best practices when working with the Intel Xeon Phi Processor: the use of high-bandwidth memory, setting OpenMP thread affinity, setting MPI process affinity, and setting hybrid MPI+OpenMP affinity.

 


Notices

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

This sample source code is released under the Intel Sample Source Code License Agreement.

Cilk, Intel, the Intel logo, Intel Xeon Phi, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2015 Intel Corporation.

For more complete information about compiler optimizations, see our Optimization Notice.