When running large-scale Intel® MPI applications on Omni-Path or InfiniBand* clusters, one might notice an increasing amount of time spent within the MPI_Init() routine. The reason for this behavior is a set of MPI runtime infrastructure management operations that are necessary to ensure that all MPI ranks share a common environment. For large MPI runs with many thousands of ranks, these operations can consume a substantial part of the MPI initialization phase.
Several factors contribute to the increased startup time. These include extra communication over PMI (the Process Management Interface) before the fabric is available, as well as initial global collective operations that can put a high load on the fabric during the startup phase. As the MPI rank count grows, the number of messages passed across the fabric increases rapidly, which may cause long startup times, especially at scale.
To address the problem, one can reduce the startup time by switching off certain startup checks, provided that a consistent environment across the individual ranks is guaranteed.
Before doing so, however, one should make sure to use the latest Intel MPI library as well as the latest fabric library. The latest Intel MPI library can be found here: https://software.intel.com/en-us/intel-mpi-library.
With the latest libraries in place, one can start switching off individual environmental checks, making sure that all ranks share the same environment with respect to the corresponding check.
- Provided that all ranks are running on the same Intel® microprocessor architecture generation, the platform check can be switched off.
- Since this check is also used to tune the collective operations for the specific architecture, the processor architecture then needs to be specified manually.
- I_MPI_PLATFORM=bdw (bdw = Broadwell; see the Intel MPI Reference Manual for other architectures)
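The two settings above could be combined as in the following sketch, assuming a homogeneous Broadwell partition (substitute the architecture code that matches your hardware, per the Intel MPI Reference Manual):

```shell
# Skip the run-time platform check (assumes every rank runs on the
# same microarchitecture generation) ...
export I_MPI_PLATFORM_CHECK=0
# ... and tell the library which architecture to tune its
# collective operations for instead.
export I_MPI_PLATFORM=bdw   # bdw = Broadwell
```

These variables are typically exported in the job script before the mpirun/mpiexec.hydra invocation so that the Hydra process manager propagates them to all ranks.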
- Switching to an alternative PMI data exchange algorithm can also help to speed up the startup phase.
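One such alternative is the all-to-all exchange mode of the Hydra process manager. A minimal sketch, assuming the I_MPI_HYDRA_PMI_CONNECT control documented in the Intel MPI Reference Manual:

```shell
# Exchange the PMI data in a single all-to-all step instead of the
# default cache-based (lazy) exchange between the PMI proxies.
export I_MPI_HYDRA_PMI_CONNECT=alltoall
```

Which mode performs best depends on the rank count and fabric, so it is worth comparing against the default on a representative job size.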
- The Hydra branch count mechanism (enabled by default) helps to reduce the pressure on the Linux file descriptors. However, this mechanism sometimes introduces additional overhead, so it can be helpful to reduce the number of branches per PMI proxy to, e.g., 4 (the default is 32). If you still observe issues during startup, it might be worth switching the mechanism off completely by setting the value to -1.
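The branch count values mentioned above would be set as follows (a sketch using the I_MPI_HYDRA_BRANCH_COUNT control; pick one of the two settings, not both):

```shell
# Reduce the number of branches per PMI proxy from the default of 32
# to lower the overhead of the branching mechanism ...
export I_MPI_HYDRA_BRANCH_COUNT=4

# ... or, if startup issues persist, switch the mechanism off entirely:
# export I_MPI_HYDRA_BRANCH_COUNT=-1
```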
- For InfiniBand* fabrics, where all ranks have the same DAPL provider (I_MPI_DAPL_UD_PROVIDER) set and the same DAPL library version working underneath the Intel MPI ranks, the DAPL provider check can be switched off.
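A sketch of this setup is shown below. The provider name is only an illustration (take the actual name from /etc/dat.conf on your cluster), and the name of the check-disabling variable is an assumption on my part; verify both against the Intel MPI Reference Manual for your library version.

```shell
# All ranks must use the identical DAPL UD provider and DAPL library
# version. The provider name here is a placeholder example:
export I_MPI_DAPL_UD_PROVIDER=ofa-v2-mlx5_0-1u

# Assumed name of the control that disables the DAPL provider
# compatibility check -- confirm in the Intel MPI Reference Manual:
export I_MPI_CHECK_DAPL_PROVIDER_COMPATIBILITY=0
```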
These settings, among other fine-tuning measures, have enabled a simple init-finalize Intel MPI application to run through successfully on 80k Haswell cores (3k nodes) in less than 2 minutes.