GPU Pinning
Use this feature to distribute Intel GPU devices between MPI ranks.
To enable this feature, set I_MPI_OFFLOAD_TOPOLIB=l0.
This feature requires that the Level-Zero* library be installed on the nodes. The
device pinning information is printed in the Intel MPI debug output at
I_MPI_DEBUG=3.
Default settings:
I_MPI_OFFLOAD_CELL=tile
I_MPI_OFFLOAD_DOMAIN_SIZE=-1
I_MPI_OFFLOAD_DEVICES=all
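As a rough illustration of how these settings fit together, the following Python sketch builds a process environment with GPU pinning enabled and the defaults listed above. The helper name `gpu_pinning_env` and its parameters are hypothetical and are not part of Intel MPI; only the variable names and values come from this section.

```python
import os

def gpu_pinning_env(cell="tile", domain_size=-1, devices="all", debug=3):
    """Return a copy of the current environment with GPU pinning variables set.

    Hypothetical helper: the validation mirrors the values documented in
    this section, nothing more.
    """
    if cell not in ("tile", "device"):
        raise ValueError("cell must be 'tile' or 'device'")
    if domain_size != -1 and domain_size <= 0:
        raise ValueError("domain_size must be -1 (auto) or a positive integer")

    env = dict(os.environ)
    env.update({
        "I_MPI_OFFLOAD_TOPOLIB": "l0",        # enables the GPU pinning feature
        "I_MPI_DEBUG": str(debug),            # 3 prints the pinning table
        "I_MPI_OFFLOAD_CELL": cell,
        "I_MPI_OFFLOAD_DOMAIN_SIZE": str(domain_size),
        "I_MPI_OFFLOAD_DEVICES": devices,
    })
    return env

env = gpu_pinning_env()
print(env["I_MPI_OFFLOAD_TOPOLIB"], env["I_MPI_OFFLOAD_CELL"])
```

The resulting dictionary could then be passed as the `env` argument of `subprocess.run` when launching `mpirun`; the application name and rank count are left to the caller.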
By default, all available resources are distributed between MPI ranks as equally as
possible given the position of the ranks; that is, the distribution of resources
takes into account on which NUMA node the rank and the resource are located.
Ideally, the rank will have resources only on the same NUMA node on which the rank
is located.
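The NUMA-aware policy described above can be pictured with a small Python sketch. This is not the algorithm Intel MPI implements; it is one simple split that matches the description, assuming the NUMA node of every rank and every base unit is known in advance. The function name `distribute` and both input lists are hypothetical.

```python
from collections import defaultdict

def distribute(rank_numa, unit_numa):
    """Split base units among ranks located on the same NUMA node.

    rank_numa[r] is the NUMA node of rank r (assumed input).
    unit_numa[u] is the NUMA node of base unit u (assumed input).
    Units on a NUMA node that hosts no rank stay unassigned in this sketch.
    """
    ranks_on = defaultdict(list)
    units_on = defaultdict(list)
    for rank, node in enumerate(rank_numa):
        ranks_on[node].append(rank)
    for unit, node in enumerate(unit_numa):
        units_on[node].append(unit)

    pinning = {rank: [] for rank in range(len(rank_numa))}
    for node, ranks in ranks_on.items():
        units = units_on.get(node, [])
        # Share the node's units as equally as possible among its ranks.
        base, extra = divmod(len(units), len(ranks))
        start = 0
        for i, rank in enumerate(ranks):
            count = base + (1 if i < extra else 0)
            pinning[rank] = units[start:start + count]
            start += count
    return pinning

# Three ranks (two on NUMA node 0, one on node 1), four tiles (two per node).
print(distribute([0, 0, 1], [0, 0, 1, 1]))  # {0: [0], 1: [1], 2: [2, 3]}
```

In this sketch the single rank on the second NUMA node receives both tiles located there, so no rank is given a tile from a remote node.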
Examples:
All examples below represent a machine configuration with two NUMA nodes and two
GPUs, each with two tiles.
Figure 1. Four MPI Ranks

Debug output I_MPI_DEBUG=3:
[0] MPI startup(): ===== GPU pinning on host1 =====
[0] MPI startup(): Rank Pin tile
[0] MPI startup(): 0 {0}
[0] MPI startup(): 1 {1}
[0] MPI startup(): 2 {2}
[0] MPI startup(): 3 {3}
Figure 2. Three MPI Ranks

Debug output I_MPI_DEBUG=3:
[0] MPI startup(): ===== GPU pinning on host1 =====
[0] MPI startup(): Rank Pin tile
[0] MPI startup(): 0 {0}
[0] MPI startup(): 1 {1}
[0] MPI startup(): 2 {2,3}
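When checking pinning in scripts, the table above can be read back into a mapping from rank to base units. The following Python sketch assumes the debug lines have exactly the layout shown in these examples; the function name `parse_gpu_pinning` is hypothetical, and other Intel MPI versions may format the output differently.

```python
import re

# Matches pinning rows such as "[0] MPI startup(): 2 {2,3}".
ROW = re.compile(r"MPI startup\(\):\s+(\d+)\s+\{([\d,\s]*)\}")

def parse_gpu_pinning(debug_text):
    """Return {rank: [base units]} extracted from I_MPI_DEBUG=3 output."""
    pinning = {}
    for line in debug_text.splitlines():
        match = ROW.search(line)
        if match:
            rank = int(match.group(1))
            units = [int(u) for u in match.group(2).split(",") if u.strip()]
            pinning[rank] = units
    return pinning

sample = """[0] MPI startup(): ===== GPU pinning on host1 =====
[0] MPI startup(): Rank Pin tile
[0] MPI startup(): 0 {0}
[0] MPI startup(): 1 {1}
[0] MPI startup(): 2 {2,3}"""
print(parse_gpu_pinning(sample))  # {0: [0], 1: [1], 2: [2, 3]}
```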
I_MPI_OFFLOAD_TOPOLIB
Set the interface for GPU topology recognition.
Syntax
I_MPI_OFFLOAD_TOPOLIB=<arg>
Arguments
<arg> String parameter.
l0 Use Level-Zero library for GPU topology recognition.
Description
Set this environment variable to define the interface for GPU topology
recognition. Setting this variable enables the GPU Pinning feature.
I_MPI_OFFLOAD_LEVEL_ZERO_LIBRARY
Specify the name and full path to the Level-Zero library.
Syntax
I_MPI_OFFLOAD_LEVEL_ZERO_LIBRARY="<path>/<name>"
Arguments
<path> Full path to the Level-Zero library.
<name> Name of the Level-Zero library.
Description
Set this environment variable to specify the name and full path to the Level-Zero
library. Set this variable if Level-Zero is not located in the default path.
Default value: libze_loader.so.
I_MPI_OFFLOAD_CELL
Set this variable to define the base unit: tile (subdevice) or device (gpu).
Syntax
I_MPI_OFFLOAD_CELL=<cell>
Arguments
<cell> Specify the base unit.
tile One tile (subdevice). Default value.
device Whole device (gpu) with all subdevices.
Description
Set this variable to define the base unit. This variable may affect other GPU
pinning variables.
Example
Figure 3. Four MPI ranks, I_MPI_OFFLOAD_CELL=device

I_MPI_OFFLOAD_DOMAIN_SIZE
Control the number of base units per MPI rank.
Syntax
I_MPI_OFFLOAD_DOMAIN_SIZE=<value>
Arguments
<value> Integer number.
-1 Auto. Default value. Each MPI rank may have a different domain size to use
all available resources.
> 0 Custom domain size.
Description
Set this variable to define how many base units are pinned to each MPI rank.
The I_MPI_OFFLOAD_CELL variable defines the base unit: tile or device.
Example
Figure 4. Three MPI ranks, I_MPI_OFFLOAD_DOMAIN_SIZE=1

I_MPI_OFFLOAD_DEVICES
Define a list of available devices.
Syntax
I_MPI_OFFLOAD_DEVICES=<devicelist>
Arguments
<devicelist> A comma-separated list of available devices.
all All devices are available. Default value.
<l> Device with logical number <l>.
<l>-<m> Range of devices with logical numbers from <l> to <m>.
<k>,<l>-<m> Device <k> and devices from <l> to <m>.
Description
Set this variable to define the available devices. This variable also gives you
the ability to exclude devices.
Example
Figure 5. Four MPI ranks, I_MPI_OFFLOAD_DEVICES=0
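To make the list syntax concrete, the following Python sketch expands a `<devicelist>` value into explicit logical device numbers. It only illustrates the grammar described in the Arguments above and is not the parser used by Intel MPI; the function name `parse_device_list` is hypothetical, and `None` is an arbitrary choice here to stand for `all`.

```python
def parse_device_list(spec):
    """Expand a value such as "0,2-3" into [0, 2, 3].

    Returns None for "all", meaning no device is excluded.
    """
    spec = spec.strip()
    if spec == "all":
        return None
    devices = []
    for item in spec.split(","):
        item = item.strip()
        if "-" in item:
            # "<l>-<m>" is an inclusive range of logical numbers.
            low, high = (int(part) for part in item.split("-", 1))
            devices.extend(range(low, high + 1))
        else:
            devices.append(int(item))
    return devices

print(parse_device_list("0"))      # [0]
print(parse_device_list("0,2-3"))  # [0, 2, 3]
```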

I_MPI_OFFLOAD_DEVICE_LIST
Define a list of base units to pin for each MPI rank.
Syntax
I_MPI_OFFLOAD_DEVICE_LIST=<base_units_list>
Arguments
<base_units_list> A comma-separated list of base units. The process with
the i-th rank is pinned to the i-th base unit in the list.
<l> Base unit with logical number <l>.
<l>-<m> Range of base units with logical numbers from <l> to
<m>.
<k>,<l>-<m> Base unit <k> and base units from <l>
to <m>.
Description
Set this variable to define the list of base units to pin for each MPI rank. The
process with the i-th rank is pinned to the i-th base unit in the list.
The I_MPI_OFFLOAD_CELL variable defines the base unit: tile or device.
The I_MPI_OFFLOAD_DEVICE_LIST variable has lower priority than the
I_MPI_OFFLOAD_DOMAIN variable.
Example
Figure 6. Four MPI ranks, I_MPI_OFFLOAD_DEVICE_LIST=3,2,0,1
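The rule that the i-th rank is pinned to the i-th base unit in the list can be sketched in Python as follows. This is only an illustration of the documented mapping, assuming ranks are numbered from 0; the function name `pin_by_device_list` is hypothetical.

```python
def pin_by_device_list(spec):
    """Map each rank to the base unit at the same position in the list."""
    units = []
    for item in spec.split(","):
        item = item.strip()
        if "-" in item:
            # "<l>-<m>" expands to every base unit from l to m inclusive.
            low, high = (int(part) for part in item.split("-", 1))
            units.extend(range(low, high + 1))
        else:
            units.append(int(item))
    return {rank: unit for rank, unit in enumerate(units)}

print(pin_by_device_list("3,2,0,1"))  # {0: 3, 1: 2, 2: 0, 3: 1}
```

With the value from the example, rank 0 is pinned to base unit 3, rank 1 to base unit 2, rank 2 to base unit 0, and rank 3 to base unit 1.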

I_MPI_OFFLOAD_DOMAIN
Define domains through a comma-separated list of hexadecimal bit masks, one for
each MPI rank.
Syntax
I_MPI_OFFLOAD_DOMAIN=<masklist>
Arguments
<masklist> A comma-separated list of hexadecimal numbers.
[m1,...,mn] For <masklist>, each mi is a hexadecimal bit mask defining an
individual domain. The following rule is used: the i-th base unit is included in
the domain if the corresponding bit in the mi value is set to 1.
Description
Set this variable to define the list of hexadecimal bit masks. For the i-th bit
mask, if the j-th bit is set to 1, then the j-th base unit is pinned to the i-th
MPI rank.
The I_MPI_OFFLOAD_CELL variable defines the base unit: tile or device.
The I_MPI_OFFLOAD_DOMAIN variable has higher priority than the
I_MPI_OFFLOAD_DEVICE_LIST variable.
Example
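Figure 7. Four MPI ranks, I_MPI_OFFLOAD_DOMAIN=[B,2,5,C].
Parsed bit masks: [1101,0100,1010,0011]
The parsed masks are written starting from base unit 0, that is, least
significant bit first: B is 1011 in conventional binary notation, so bits 0, 1,
and 3 are set.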
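The mask rule can be checked with a short Python sketch that decodes each hexadecimal mask into the base units it selects. The function names are hypothetical and the code only illustrates the documented rule (bit j of mask i pins base unit j to rank i); it is not Intel MPI's parser.

```python
def parse_domain_masks(spec):
    """Return {rank: [base units]} for a value such as "[B,2,5,C]"."""
    masks = [int(m, 16) for m in spec.strip("[] ").split(",")]
    pinning = {}
    for rank, mask in enumerate(masks):
        # Base unit j belongs to this rank when bit j of the mask is 1.
        pinning[rank] = [j for j in range(mask.bit_length()) if (mask >> j) & 1]
    return pinning

def bits_by_unit(mask, width):
    """Write a mask with base unit 0 first (least significant bit first)."""
    return "".join(str((mask >> j) & 1) for j in range(width))

print(parse_domain_masks("[B,2,5,C]"))
# {0: [0, 1, 3], 1: [1], 2: [0, 2], 3: [2, 3]}
print([bits_by_unit(m, 4) for m in (0xB, 0x2, 0x5, 0xC)])
# ['1101', '0100', '1010', '0011']
```

The second call reproduces the parsed bit masks listed for Figure 7, which confirms that they are written with base unit 0 on the left.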
