Developer Guide

  • 2021.2
  • 03/31/2021
  • Public Content

Job Schedulers Support

The Intel® MPI Library supports the majority of commonly used job schedulers in the HPC field.
The following job schedulers are supported on Linux* OS:
  • Altair* PBS Pro*
  • Torque*
  • OpenPBS*
  • IBM* Platform LSF*
  • Parallelnavi* NQS*
  • Slurm*
  • Univa* Grid Engine*
The Hydra process manager detects job schedulers automatically by checking specific environment variables. These variables are used to determine how many nodes were allocated, which nodes, and the number of processes per task.
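The detection idea can be illustrated with a short shell sketch. It only mirrors the environment-variable checks described in the scheduler sections of this guide; it is not Hydra's actual implementation, and the function name detect_scheduler and the labels it prints are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: guess the active job scheduler from the environment
# variables named in this guide. This is NOT Hydra's real detection logic.
detect_scheduler() {
    if [ "${PBS_ENVIRONMENT:-}" = "PBS_BATCH" ] ||
       [ "${PBS_ENVIRONMENT:-}" = "PBS_INTERACTIVE" ]; then
        echo "pbs"      # PBS Pro, Torque, or OpenPBS
    elif [ -n "${LSB_MCPU_HOSTS:-}" ] && [ -n "${LSF_BINDIR:-}" ]; then
        echo "lsf"      # IBM Platform LSF
    elif [ -n "${SLURM_JOBID:-}" ]; then
        echo "slurm"
    elif [ -n "${ENVIRONMENT:-}" ] && [ -n "${QSUB_REQID:-}" ] &&
         [ -n "${QSUB_NODEINF:-}" ]; then
        echo "nqs"      # Parallelnavi NQS
    elif [ -n "${PE_HOSTFILE:-}" ]; then
        echo "sge"      # Univa Grid Engine
    else
        echo "none"
    fi
}
```

Calling detect_scheduler at the top of a job script is a quick way to confirm that the variables a scheduler is expected to export are actually present before mpirun starts.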

Altair PBS Pro*, Torque*, and OpenPBS*

If you use one of these job schedulers and $PBS_ENVIRONMENT exists with the value PBS_BATCH or PBS_INTERACTIVE, mpirun uses $PBS_NODEFILE as a machine file for mpirun. You do not need to specify the -machinefile option explicitly.
The following is an example of a batch job script:
    #PBS -l nodes=4:ppn=4
    #PBS -q queue_name
    cd $PBS_O_WORKDIR
    mpirun -n 16 ./myprog
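The script above hard-codes -n 16, which has to match nodes × ppn. As a sketch, the process count can instead be derived from the node file. This assumes $PBS_NODEFILE lists one line per process slot (a host name repeated once per slot), which is typical for Torque but depends on the resource request in other PBS variants. The function names count_slots and count_hosts are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a PBS-style node file.
# Assumption: one line per process slot, host names repeated per slot.

# Print the number of non-empty lines (process slots) in the file "$1".
count_slots() {
    grep -c . "$1"
}

# Print the number of distinct host names in the file "$1".
count_hosts() {
    sort -u "$1" | grep -c .
}

# Possible use inside a batch job script:
#   mpirun -n "$(count_slots "$PBS_NODEFILE")" ./myprog
```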

IBM Platform LSF*

The IBM Platform LSF* job scheduler is detected automatically if the $LSB_MCPU_HOSTS and $LSF_BINDIR environment variables are set.
The Hydra process manager uses these variables to determine how many nodes were allocated, which nodes, and the number of processes per task. To run processes on the remote nodes, the Hydra process manager uses the blaunch utility by default. This utility is provided by IBM Platform LSF.
The number of processes, the number of processes per node, and node names may be overridden by the usual Hydra options (-n, -ppn, -hosts).
Examples:
    bsub -n 16 mpirun ./myprog
    bsub -n 16 mpirun -n 2 -ppn 1 ./myprog
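To see what an allocation looks like from inside a job, the following sketch sums the slots in a $LSB_MCPU_HOSTS value. It assumes the usual LSF layout of alternating host names and slot counts (for example "hostA 8 hostB 8"); the function name lsf_total_slots is hypothetical and is not part of LSF or the Intel MPI Library.

```shell
#!/bin/sh
# Hypothetical helper: sum the slot counts in an LSB_MCPU_HOSTS-style string.
# Assumption: the value alternates "<host> <slots> <host> <slots> ...".
# The body runs in a subshell so that no variables leak into the caller.
lsf_total_slots() (
    total=0
    # Intentionally unquoted: split the string into words.
    set -- $1
    while [ "$#" -ge 2 ]; do
        total=$((total + $2))
        shift 2
    done
    echo "$total"
)

# Possible use inside a job: lsf_total_slots "$LSB_MCPU_HOSTS"
```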

Parallelnavi NQS*

If you use the Parallelnavi NQS job scheduler and the $ENVIRONMENT, $QSUB_REQID, and $QSUB_NODEINF environment variables are set, the $QSUB_NODEINF file is used as a machine file for mpirun. Also, /usr/bin/plesh is used as the remote shell by the process manager during startup.

Slurm*

The Slurm job scheduler can be detected automatically by mpirun and mpiexec. Job scheduler detection is enabled in mpirun by default and enabled in mpiexec if hostnames are not specified. The only prerequisite is setting I_MPI_PIN_RESPECT_CPUSET=0.
For autodetection, the Hydra process manager uses these environment variables:
  • SLURM_JOBID
  • SLURM_NODELIST
  • SLURM_NNODES
  • SLURM_NTASKS_PER_NODE or SLURM_NTASKS
  • SLURM_CPUS_PER_TASK
Using these variables, Hydra can determine how many nodes were allocated, which nodes, and the number of processes per task. If the Slurm job scheduler was not detected automatically, you can set the I_MPI_HYDRA_RMK=slurm or I_MPI_HYDRA_BOOTSTRAP=slurm variables (see the Developer Reference, “Hydra Environment Variables”).
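When autodetection does not seem to work, it can help to check which of the variables listed above are missing from the job environment. The sketch below does only that. The function name missing_slurm_vars is hypothetical, and SLURM_CPUS_PER_TASK is deliberately left out of the check on the assumption that Slurm sets it only when --cpus-per-task is requested.

```shell
#!/bin/sh
# Hypothetical diagnostic: print the Slurm variables used for autodetection
# that are unset or empty. Prints an empty line when nothing is missing.
# SLURM_CPUS_PER_TASK is skipped (assumed optional, see the text above).
# The body runs in a subshell so that no variables leak into the caller.
missing_slurm_vars() (
    missing=""
    for name in SLURM_JOBID SLURM_NODELIST SLURM_NNODES; do
        # Indirect lookup of the variable whose name is in $name.
        eval "value=\${$name:-}"
        [ -n "$value" ] || missing="$missing $name"
    done
    # Either one of these two variables is sufficient.
    if [ -z "${SLURM_NTASKS_PER_NODE:-}" ] && [ -z "${SLURM_NTASKS:-}" ]; then
        missing="$missing SLURM_NTASKS_PER_NODE|SLURM_NTASKS"
    fi
    # Drop the leading space left over from concatenation.
    echo "${missing# }"
)
```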
To run processes on the remote nodes, Hydra uses the srun utility. These environment variables control which utility is used in this case (see the Developer Reference, “Hydra Environment Variables”):
  • I_MPI_HYDRA_BOOTSTRAP
  • I_MPI_HYDRA_BOOTSTRAP_EXEC
  • I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS
You can also launch applications with the srun utility without Hydra by setting the I_MPI_PMI_LIBRARY environment variable (see the Developer Reference, “Other Environment Variables”).
PMI versions currently supported are PMI-1 and PMI-2.
By default, the Intel MPI Library uses per-host process placement provided by the scheduler. This means that the -ppn option has no effect. To change this behavior and control process placement through -ppn (and related options and variables), set I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off.
By default, the Intel MPI Library uses the process pinning provided by Slurm. If the job was launched using mpirun or mpiexec and some Slurm options for pinning were set, then process pinning may be incorrect. In this case, launch your job with srun or enable Intel MPI Library pinning by setting the I_MPI_PIN_RESPECT_CPUSET=0 environment variable (see the Developer Reference, “Process Pinning” and “Environment Variables for Process Pinning”).
Intel MPI Library process pinning supports some of Slurm’s pinning options. Currently, the only supported option is --cpus-per-task.
Examples:
    # Allocate nodes.
    salloc --nodes=<number-of-nodes> --partition=<partition> --ntasks-per-node=<number-of-processes-per-node>
    # Run your application using Hydra.
    mpiexec ./myprog
    # or
    mpirun ./myprog
    # Run your application using srun with the PMI-1 interface.
    I_MPI_PMI_LIBRARY=<path-to-libpmi.so>/libpmi.so srun ./myprog
    # Run your application using srun with the PMI-2 interface.
    I_MPI_PMI_LIBRARY=<path-to-libpmi2.so>/libpmi2.so srun --mpi=pmi2 ./myprog
    # Change per-host process placement.
    I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off mpiexec -n 2 -ppn 1 ./myprog
    # Change per-host process placement and hostnames and use srun utility for remote launch.
    I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off mpiexec -n 2 -ppn 1 -hosts host3,host1 -bootstrap=slurm ./myprog
    # Use Intel MPI Library pinning.
    I_MPI_PIN_RESPECT_CPUSET=off mpiexec ./myprog
    # Use the --cpus-per-task Slurm option in Intel MPI Library pinning.
    salloc --cpus-per-task=<cpus-per-task> --nodes=<number-of-nodes> --partition=<partition> --ntasks-per-node=<number-of-processes-per-node>
    I_MPI_PIN_RESPECT_CPUSET=off mpiexec ./myprog
    # or
    I_MPI_PIN_DOMAIN=${SLURM_CPUS_PER_TASK} I_MPI_PIN_RESPECT_CPUSET=off mpiexec ./myprog

Univa Grid Engine*

If you use the Univa Grid Engine job scheduler and $PE_HOSTFILE is set, then two files will be generated: /tmp/sge_hostfile_${username}_$$ and /tmp/sge_machifile_${username}_$$. The latter is used as the machine file for mpirun. These files are removed when the job is completed.
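How mpirun builds its machine file is not described here. The sketch below only illustrates the kind of conversion involved, assuming the usual Grid Engine $PE_HOSTFILE layout (host name, slot count, queue, and processor range on each line) and the host:slots machine file syntax. The function name pe_hostfile_to_machinefile is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert a Grid Engine PE host file into
# "host:slots" lines. Illustration only; not what mpirun runs internally.
# Assumption: each input line is "<host> <slots> <queue> <processor-range>".
pe_hostfile_to_machinefile() {
    # Skip blank or malformed lines that have fewer than two fields.
    awk 'NF >= 2 { print $1 ":" $2 }' "$1"
}

# Possible use inside a job: pe_hostfile_to_machinefile "$PE_HOSTFILE"
```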

Intercepting SIGINT and SIGTERM Signals

If resources allocated to a job exceed the limit, most job schedulers terminate the job by sending a signal to all processes.
For example, Torque* sends SIGTERM three times to a job and, if the job is still alive, sends SIGKILL to terminate it.
For Univa Grid Engine, the default signal to terminate a job is SIGKILL. The Intel MPI Library is unable to process or catch that signal, causing mpirun to kill the entire job. You can change the value of the termination signal through the following queue configuration:
  1. Use the following command to see available queues:
    $ qconf -sql
  2. Execute the following command to modify the queue settings:
    $ qconf -mq <queue_name>
  3. Find terminate_method and change the signal to SIGTERM.
  4. Save the queue configuration.
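After the steps above, the setting can be verified without reopening the editor. The sketch below filters queue configuration text for the terminate_method entry. It assumes the configuration is printed as one "name value" pair per line, as qconf -sq <queue_name> does; the function name terminate_method_of is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read a queue configuration on standard input and
# print the value of its terminate_method entry.
# Assumption: one "<name> <value>" pair per line.
terminate_method_of() {
    awk '$1 == "terminate_method" { print $2 }'
}

# Possible use: qconf -sq <queue_name> | terminate_method_of
```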

Controlling Per-Host Process Placement

When using a job scheduler, by default the Intel MPI Library uses per-host process placement provided by the scheduler. This means that the -ppn option has no effect. To change this behavior and control process placement through -ppn (and related options and variables), use the I_MPI_JOB_RESPECT_PROCESS_PLACEMENT environment variable:
$ export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off
