When running large-scale Intel® MPI applications on Omni-Path or InfiniBand* clusters, one might notice an increasing amount of time spent within the MPI_Init() routine. The reason for this behavior is a set of MPI runtime infrastructure management operations that are necessary to ensure that all MPI ranks share a common environment. In large MPI runs with many thousands of ranks, these operations can consume a substantial part of the MPI initialization phase.
Several factors contribute to the increased startup time. These include extra communication over PMI (the Process Management Interface) before the fabric is available, as well as initial global collective operations that can place a high load on the fabric during the startup phase. As the MPI rank count grows, the number of messages passed across the fabric increases rapidly, which may cause long startup times, especially at scale.
To address this problem, one can reduce startup times by switching off certain startup checks, provided that a consistent environment across the individual ranks is guaranteed.
Before doing so, however, one should make sure to use the latest Intel MPI library as well as the latest fabric library. The latest Intel MPI library can be found here: https://software.intel.com/en-us/intel-mpi-library.
With the latest libraries in place, one can start switching off individual environmental checks, as long as all ranks are known to share the same environment with respect to the corresponding system check.
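As an illustration, the following sketch shows how such checks and startup behaviors might be tuned via Intel MPI environment variables. The exact set of variables worth changing depends on the Intel MPI version and the cluster setup, and the values below (platform code, branch count, rank counts, and the benchmark name) are assumptions for illustration only; consult the Intel MPI Library reference documentation before applying any of them on a production system:

```shell
#!/bin/sh
# Sketch: possible Intel MPI startup tuning, assuming a homogeneous cluster
# where every node has the same CPU type and software environment.

# Skip the per-rank platform consistency check (safe only on homogeneous nodes)
export I_MPI_PLATFORM_CHECK=0

# Pin the platform explicitly instead of letting each rank detect it at startup
# ("hsw" = Haswell here; an assumption - pick the code matching your CPUs)
export I_MPI_PLATFORM=hsw

# Reduce per-rank PMI exchanges during wire-up by aggregating the data exchange
export I_MPI_HYDRA_PMI_CONNECT=alltoall

# Widen the Hydra launcher tree to spread the launch load across more nodes
# (the value 32 is an illustrative choice, not a recommendation)
export I_MPI_HYDRA_BRANCH_COUNT=32

# Hypothetical launch of a simple init-finalize benchmark at scale
mpirun -n 81920 -ppn 28 ./init_finalize_benchmark
```

Note that disabling a consistency check trades safety for speed: if the ranks' environments do in fact differ, failures that the check would have caught at MPI_Init() may instead surface later and be harder to diagnose.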
These settings, among other fine-tuning efforts, have enabled a simple init-finalize Intel MPI application to complete successfully on 80k Haswell cores (3k nodes) in less than 2 minutes.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.