Controlling Process Placement with the Intel® MPI Library

When running an MPI program, process placement is critical to achieving maximum performance.  Many applications can be sufficiently controlled with a simple process placement scheme, while some will require a more complex approach.  The Intel® MPI Library offers multiple options for controlling process placement within the Hydra process manager.

Host List

The host list is a list of hosts that can be used for a job.  There are several ways to specify a host list.  If none are used, only the current host is included.  If you are using a supported job scheduler, Hydra will get the host list from the scheduler via environment variables.  You can specify hosts on the command line individually, using -host, or in groups, using -hosts. 

mpirun -host node1 ...

Will only use node1.

mpirun -hosts node1,node2,node3 ... 

Will use a host list of node1, node2, and node3.  You can also create a hostfile listing the hosts to be used, passed to mpirun with -f <file> or -hostfile <file>.  This is a text file with one host per line.  Commenting out a host with # excludes it from the host list.  For example:

node1
#node2
node3

will use node1 and node3, but skip node2 (useful if you normally use node2, but there's a temporary reason not to do so).
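As a sketch of the workflow above (the filename hostfile.txt is just an example), you can build the hostfile from the shell and confirm which hosts remain active; the mpirun line is shown commented out since it requires a running cluster:

```shell
# Create a hostfile with node2 temporarily commented out
cat > hostfile.txt <<'EOF'
node1
#node2
node3
EOF

# mpirun -f hostfile.txt -n 4 ./hello    # illustrative; needs an MPI installation

# Show the hosts that will actually be used
grep -v '^#' hostfile.txt
```

The grep prints node1 and node3, matching the host list Hydra would use.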

Default Placement

By default, Hydra will place one rank per physical core on a node, then move to the next node in the host list.  If the host list is exhausted before all ranks are assigned, Hydra will wrap around to the start of the host list.  For example, assuming 8 cores per node:

mpirun -n 10 ./hello

Will give the following output:

Hello world: rank 0 of 10 running on node0
Hello world: rank 1 of 10 running on node0
Hello world: rank 2 of 10 running on node0
Hello world: rank 3 of 10 running on node0
Hello world: rank 4 of 10 running on node0
Hello world: rank 5 of 10 running on node0
Hello world: rank 6 of 10 running on node0
Hello world: rank 7 of 10 running on node0
Hello world: rank 8 of 10 running on node1
Hello world: rank 9 of 10 running on node1

The default settings assume you are not using a job scheduler.  If you are using a job scheduler, Hydra will attempt to use information provided by the scheduler to determine default process placement.  You can manually override this using any of the following methods.

Ranks Per Node

Setting the number of ranks per node works similarly to the default, except that instead of using the number of physical cores available, Hydra will assign the specified number of ranks to each node.  There are several options to set the number of ranks per node.  The environment variable I_MPI_PERHOST will set this value.  On the mpirun command line, you can use "-perhost <#>", "-ppn <#>", or "-grr <#>" to set the number of ranks per node.  Using the command line will override the environment variable.  Also, the option "-rr" is equivalent to "-perhost 1".  For example:

mpirun -n 8 -ppn 3 ./hello
Hello world: rank 0 of 8 running on node0
Hello world: rank 1 of 8 running on node0
Hello world: rank 2 of 8 running on node0
Hello world: rank 3 of 8 running on node1
Hello world: rank 4 of 8 running on node1
Hello world: rank 5 of 8 running on node1
Hello world: rank 6 of 8 running on node0
Hello world: rank 7 of 8 running on node0
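The environment-variable form is equivalent to the -ppn example above; a minimal sketch (the mpirun line is commented out since it needs a cluster, and note a command-line -ppn would override the variable):

```shell
# Set 3 ranks per node via the environment instead of -ppn
export I_MPI_PERHOST=3

# mpirun -n 8 ./hello    # illustrative; equivalent to "mpirun -n 8 -ppn 3 ./hello"

echo "$I_MPI_PERHOST"
```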

Machine File Specification

Using a host file defines a list of hosts for execution.  However, a host file does not specify process placement among the hosts.  Using a machine file (with "-machinefile <file>" or "-machine <file>") controls exactly how processes are placed on those hosts.  The format of each line of the machine file is:

<host[:nranks]>

Each host will have the specified number of ranks (1 if not specified) assigned to it, then the next host will be used.  Repeated instances of a host are treated individually.  So this machine file:

node0:2
node1:2
node0
node2

Would result in the following:

mpirun -machinefile hosts.txt -n 8 ./hello
Hello world: rank 0 of 8 running on node0
Hello world: rank 1 of 8 running on node0
Hello world: rank 2 of 8 running on node1
Hello world: rank 3 of 8 running on node1
Hello world: rank 4 of 8 running on node0
Hello world: rank 5 of 8 running on node2
Hello world: rank 6 of 8 running on node0
Hello world: rank 7 of 8 running on node0
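The machine file above can be created directly from the shell; this sketch writes hosts.txt (the name used in the example) and displays it, with the mpirun launch commented out since it requires the cluster:

```shell
# Recreate the example machine file: host[:nranks], one entry per line
cat > hosts.txt <<'EOF'
node0:2
node1:2
node0
node2
EOF

# mpirun -machinefile hosts.txt -n 8 ./hello    # illustrative; needs a cluster

cat hosts.txt
```

The file defines 6 slots; with -n 8, Hydra wraps back to the first entry for the remaining two ranks, which is why ranks 6 and 7 land on node0.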

Argument Sets/Configuration Files

Using argument sets provides another option to control process placement (along with many other aspects of your job).  Each argument set is a unique group of MPI options, and all of the argument sets will be combined into a single MPI job.  Argument sets can be provided either on the command line or in a configuration file.  On the command line, global options (applied to all argument sets) appear first.  Local options (applied only to the current argument set) are specified in groups separated by a colon (:).  For example:

mpirun -genv MESSAGE hello -host node0 -n 2 ./hello : -host node0 -n 2 ./hello_message : -host node1 -n 2 ./hello
Hello world: rank 0 of 6 running on node0
Hello world: rank 1 of 6 running on node0
Hello world: rank 2 of 6 running says hello
Hello world: rank 3 of 6 running says hello
Hello world: rank 4 of 6 running on node1
Hello world: rank 5 of 6 running on node1

Argument sets can be as complex as necessary, up to the limit of the command line length.  Argument sets can also be specified in a configuration file.  In this case, if there are global arguments, they should appear on the first line of the file.  Each line represents a unique argument set.  Once the configuration file is built, run:

mpirun -configfile <configuration file>

No other arguments should be specified on the command line.
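Following the rules above, a configuration file equivalent to the earlier argument-set command line could look like this (the filename mpi.cfg is just an example): global options on the first line, then one argument set per line.

```shell
# Write a configuration file: global options first, one argument set per line
cat > mpi.cfg <<'EOF'
-genv MESSAGE hello
-host node0 -n 2 ./hello
-host node0 -n 2 ./hello_message
-host node1 -n 2 ./hello
EOF

# mpirun -configfile mpi.cfg    # illustrative; no other command-line arguments

cat mpi.cfg
```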

Comments

Good article.  For more complex MPMD/heterogeneous adapter use cases, it would be great to have references with detailed examples and explanations on how to use -hostfile/-machinefile and -configfile in a single cluster launch command.  Such references are very much appreciated!