Intel® MPI Library

Fault Tolerance

Intel® MPI Library provides extra functionality to enable fault tolerance support in MPI applications. The MPI standard does not define the behavior of an MPI implementation if one or more processes of an MPI application are abnormally aborted. By default, Intel® MPI Library aborts the whole application if any process stops.

Set the environment variable I_MPI_FAULT_CONTINUE to on to change this behavior. For example,
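A minimal sketch of setting the variable in a POSIX shell before launching a job. The launch line is left as a comment because the process count and the ./test executable name are placeholders, not values taken from this page.

```shell
# Ask the library to keep the job running when a process fails,
# for every MPI job started from this shell session.
export I_MPI_FAULT_CONTINUE=on

# Hypothetical launch line (./test and the process count are placeholders);
# commented out so the snippet runs on a host without an MPI installation:
# mpiexec.hydra -n 2 ./test

echo "I_MPI_FAULT_CONTINUE=$I_MPI_FAULT_CONTINUE"
```

Exporting the variable affects all later launches from the same shell; unset it to return to the default behavior of aborting the whole application.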

Environment Variables

Description of the environment variables used by the Hydra process manager (mpiexec.hydra or mpirun) on Linux: I_MPI_HYDRA, I_MPI_HYDRA_DEBUG, I_MPI_HYDRA_ENV, I_MPI_JOB_TIMEOUT (I_MPI_MPIEXEC_TIMEOUT), I_MPI_JOB_TIMEOUT_SIGNAL, I_MPI_JOB_ABORT_SIGNAL, I_MPI_JOB_SIGNAL_PROPAGATION, I_MPI_HYDRA_BOOTSTRAP, I_MPI_HYDRA_BOOTSTRAP_EXEC, I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS, I_MPI_HYDRA_BOOTSTRAP_AUTOFORK, I_MPI_HYDRA_RMK, I_MPI_HYDRA_PMI_CONNECT, I_MPI_PERHOST, I_MPI_JOB_TRACE_LIBS, I_MPI_JOB_CHECK_LIBS, I_MPI_HYDRA_BRANCH_COUNT, I_MPI_HYDRA_PMI_AGGREGATE, I_MPI_HYDRA_GDB_REMOTE_SHELL, I_MPI_HYDRA_JMI_LIBRARY, I_MPI_HYDRA_IFACE, I_MPI_HYDRA_DEMUX, I_MPI_HYDRA_CLEANUP, I_MPI_TMPDIR, I_MPI_JOB_RESPECT_PROCESS_PLACEMENT, I_MPI_GTOOL, I_MPI_HYDRA_USE_APP_TOPOLOGY.

Installing Intel® MPI Library

If you have a previous version of the Intel® MPI Library for Linux* OS installed, you do not need to uninstall it before installing the latest version.

Extract the l_mpi[-rt]_p_<version>.<package_num>.tar.gz package using the following command:

tar -xvzf l_mpi[-rt]_p_<version>.<package_num>.tar.gz

This command creates the subdirectory l_mpi[-rt]_p_<version>.<package_num>.

Checking Correctness

Use the -check_mpi option to link the resulting executable file against the Intel® Trace Collector correctness checking library. This has the same effect as passing -profile=vtmc as an argument to mpiicc or another compiler script.

$ mpiicc -profile=vtmc test.c -o testc

Or

$ mpiicc -check_mpi test.c -o testc

To use this option, you need to:

Environment Problems

Environmental errors may happen when there are problems with the system environment, such as mandatory system services not running or shared resources being unavailable.

When you encounter environmental errors, check the environment. For example, verify the current state of important services.
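One way to start such a check on a Linux host is to confirm that the kernel exposes any RDMA devices at all. The sketch below assumes devices appear as entries under /sys/class/infiniband; the function name rdma_env_check and its optional root argument are illustrative and not part of Intel® MPI Library.

```shell
# Report whether any RDMA devices are visible in sysfs.
# The optional first argument overrides the sysfs root (default: /sys),
# which makes the check easy to try against a scratch directory.
rdma_env_check() {
    root="${1:-/sys}"
    dir="$root/class/infiniband"
    # A device shows up as a non-hidden entry inside the class directory.
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
        echo "rdma devices: present"
        return 0
    fi
    echo "rdma devices: none found"
    return 1
}
```

Calling rdma_env_check with no argument inspects the live system. A result of "none found" points at the environment (drivers or services not loaded) rather than at the application.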

Example 1

Symptom/Error Message

librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list

or:
