Developer Guide

Job Schedulers Support

The Intel® MPI Library supports the majority of commonly used job schedulers in the HPC field.
The following job schedulers are supported on Linux* OS:
  • Altair* PBS Pro*
  • Torque*
  • OpenPBS*
  • IBM* Platform LSF*
  • Parallelnavi* NQS*
  • SLURM*
  • Univa* Grid Engine*
The Hydra process manager detects job schedulers automatically by checking specific environment variables. These variables are used to determine how many nodes were allocated, which nodes they are, and the number of processes per task.

Altair PBS Pro*, Torque*, and OpenPBS*

If you use one of these job schedulers and the $PBS_ENVIRONMENT variable exists with the value PBS_BATCH or PBS_INTERACTIVE, mpirun uses $PBS_NODEFILE as the machine file. You do not need to specify the -machinefile option explicitly.
The following is an example of a batch job script:
#PBS -l nodes=4:ppn=4
#PBS -q queue_name
cd $PBS_O_WORKDIR
mpirun -n 16 ./myprog
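If this script is saved as, for example, job.sh (a hypothetical file name), it can be submitted in the usual way:
$ qsub ./job.sh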

IBM Platform LSF*

The IBM Platform LSF* job scheduler is detected automatically if the $LSB_MCPU_HOSTS and $LSF_BINDIR environment variables are set.
The Hydra process manager uses these variables to determine how many nodes were allocated, which nodes they are, and the number of processes per task. To run processes on the remote nodes, the Hydra process manager uses the blaunch utility by default. This utility is provided by IBM Platform LSF.
The number of processes, the number of processes per node, and the node names may be overridden by the usual Hydra options (-n, -ppn, -hosts).
Examples:
bsub -n 16 mpirun ./myprog
bsub -n 16 mpirun -n 2 -ppn 1 ./myprog
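For instance, a hedged sketch that overrides the allocated placement and host list with these options (the host names are placeholders):
bsub -n 16 mpirun -n 4 -ppn 2 -hosts host1,host2 ./myprog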

Parallelnavi NQS*

If you use the Parallelnavi NQS job scheduler and the $ENVIRONMENT, $QSUB_REQID, and $QSUB_NODEINF environment variables are set, the file specified by $QSUB_NODEINF is used as the machine file for mpirun. Also, /usr/bin/plesh is used as the remote shell by the process manager during startup.
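As an illustration, a minimal sketch of a job step that relies on this detection (the process count and program name are placeholders):
# $QSUB_NODEINF is picked up automatically as the machine file; no -machinefile option is needed.
mpirun -n 8 ./myprog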

Slurm*

The Slurm job scheduler can be detected automatically by mpirun and mpiexec. Job scheduler detection is enabled in mpirun by default and enabled in mpiexec if host names are not specified. The only prerequisite is setting I_MPI_PIN_RESPECT_CPUSET=0.
For autodetection, the Hydra process manager uses these environment variables:
  • SLURM_JOBID
  • SLURM_NODELIST
  • SLURM_NNODES
  • SLURM_NTASKS_PER_NODE or SLURM_NTASKS
  • SLURM_CPUS_PER_TASK
Using these variables, Hydra can determine which nodes are available, how many nodes were allocated, the number of MPI processes per node, and the domain size per MPI process.
SLURM_NTASKS_PER_NODE, or alternatively SLURM_NTASKS/SLURM_NNODES, is used for the implicit specification of I_MPI_PERHOST. The value of SLURM_CPUS_PER_TASK implicitly defines I_MPI_PIN_DOMAIN and overrides the "auto" default. If some of the Slurm variables are not defined, the corresponding Intel MPI Library defaults are used. Based on this environment detection, it is sufficient to execute the following simple command line under Slurm:
export I_MPI_PIN_RESPECT_CPUSET=0; mpirun ./myprog
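For illustration, here is a hedged sketch of a complete batch script that relies on this autodetection (the resource values are placeholders):
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=2
# Hydra reads SLURM_NNODES, SLURM_NTASKS_PER_NODE, and SLURM_CPUS_PER_TASK,
# so neither -n nor -ppn has to be given explicitly.
export I_MPI_PIN_RESPECT_CPUSET=0
mpirun ./myprog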
This approach works in standard situations with simple Slurm pinning (for example, only using the Slurm flag --cpus-per-task). If a Slurm job requires a more complicated pinning setup (using the Slurm flag --cpu-bind), the process pinning may be incorrect. In this case, gain full pinning control either by launching the MPI run with srun or by enabling Intel MPI Library pinning with the I_MPI_PIN_RESPECT_CPUSET=0 environment variable (see the Developer Reference, “Process Pinning” and “Environment Variables for Process Pinning”). When using mpirun, the required pinning has to be replicated explicitly using I_MPI_PIN_DOMAIN.
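For example, a hedged sketch that replicates a four-logical-cores-per-rank layout when launching with mpirun (the domain size 4 is a placeholder for the value used in the job's Slurm pinning):
export I_MPI_PIN_RESPECT_CPUSET=0
export I_MPI_PIN_DOMAIN=4
mpirun ./myprog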
If the Slurm job scheduler was not detected automatically, you can set the I_MPI_HYDRA_RMK=slurm or I_MPI_HYDRA_BOOTSTRAP=slurm variable (see the Developer Reference, “Hydra Environment Variables”).
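For example, a minimal sketch that forces the detection explicitly:
$ export I_MPI_HYDRA_RMK=slurm
$ mpirun ./myprog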
To run processes on the remote nodes, Hydra uses the srun utility. The following environment variables control which utility is used in this case (see the Developer Reference, “Hydra Environment Variables”); a brief example follows the list:
  • I_MPI_HYDRA_BOOTSTRAP
  • I_MPI_HYDRA_BOOTSTRAP_EXEC
  • I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS
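For instance, a hedged sketch that keeps Hydra but switches the remote launcher from srun to ssh (assuming password-less ssh access between the allocated nodes):
$ export I_MPI_HYDRA_BOOTSTRAP=ssh
$ mpirun ./myprog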
You can also launch applications with the srun utility without Hydra by setting the I_MPI_PMI_LIBRARY environment variable (see the Developer Reference, “Other Environment Variables”). The currently supported PMI versions are PMI-1 and PMI-2.
By default, the Intel MPI Library uses the per-host process placement provided by the scheduler, which means that the -ppn option has no effect. To change this behavior and control process placement through -ppn (and related options and variables), set I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.
Examples:
# Allocate nodes.
salloc --nodes=<number-of-nodes> --partition=<partition> --ntasks-per-node=<number-of-processes-per-node>

# Run your application using Hydra.
mpiexec ./myprog
# or
mpirun ./myprog

# Run your application using srun with the PMI-1 interface.
I_MPI_PMI_LIBRARY=<path-to-libpmi.so>/libpmi.so srun ./myprog

# Run your application using srun with the PMI-2 interface.
I_MPI_PMI_LIBRARY=<path-to-libpmi2.so>/libpmi2.so srun --mpi=pmi2 ./myprog

# Change per-host process placement.
I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off mpiexec -n 2 -ppn 1 ./myprog

# Change per-host process placement and hostnames, and use the srun utility for remote launch.
I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off mpiexec -n 2 -ppn 1 -hosts host3,host1 -bootstrap=slurm ./myprog

# Use Intel MPI Library pinning.
I_MPI_PIN_RESPECT_CPUSET=off mpiexec ./myprog

# Use the --cpus-per-task Slurm option in Intel MPI Library pinning.
salloc --cpus-per-task=<cpus-per-task> --nodes=<number-of-nodes> --partition=<partition> --ntasks-per-node=<number-of-processes-per-node>
I_MPI_PIN_RESPECT_CPUSET=off mpiexec ./myprog
# or
I_MPI_PIN_DOMAIN=${SLURM_CPUS_PER_TASK} I_MPI_PIN_RESPECT_CPUSET=off mpiexec ./myprog

Univa Grid Engine*

If you use the Univa Grid Engine job scheduler and the $PE_HOSTFILE environment variable is set, two files are generated: /tmp/sge_hostfile_${username}_$$ and /tmp/sge_machifile_${username}_$$. The latter is used as the machine file for mpirun. These files are removed when the job is completed.
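As an illustration, a hypothetical Grid Engine job script (the parallel environment name impi and the slot count are placeholders for your site's configuration):
#$ -pe impi 16
#$ -cwd
# $PE_HOSTFILE is set by Grid Engine; mpirun uses the generated machine file automatically.
mpirun -n 16 ./myprog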

Intercepting SIGINT and SIGTERM Signals

If the resources allocated to a job exceed the limit, most job schedulers terminate the job by sending a signal to all its processes.
For example, Torque* sends SIGTERM to a job three times, and if the job is still alive, SIGKILL is sent to terminate it.
For Univa Grid Engine, the default signal to terminate a job is SIGKILL. The Intel MPI Library cannot process or catch that signal, which causes mpirun to kill the entire job. You can change the value of the termination signal through the following queue configuration:
  1. Use the following command to see the available queues:
    $ qconf -sql
  2. Execute the following command to modify the queue settings:
    $ qconf -mq <queue_name>
  3. Find terminate_method and change the signal to SIGTERM (see the sketch after this list).
  4. Save the queue configuration.
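The relevant attribute in the queue configuration opened by qconf -mq looks similar to the following (a sketch; the remaining attributes are omitted):
terminate_method          SIGTERM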

Controlling Per-Host Process Placement

When using a job scheduler, by default the Intel MPI Library uses the per-host process placement provided by the scheduler. This means that the -ppn option has no effect. To change this behavior and control process placement through -ppn (and related options and variables), use the I_MPI_JOB_RESPECT_PROCESS_PLACEMENT environment variable:
$ export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off
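For example, a minimal sketch that places two processes per node regardless of the scheduler-provided placement (the process counts and program name are placeholders):
$ export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off
$ mpirun -n 4 -ppn 2 ./myprog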

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.