Reducing Initialization Times of the Intel® MPI Library

By Michael Steyer

Published: 08/05/2015   Last updated: 08/05/2015

When running large-scale Intel® MPI applications on Omni-Path or InfiniBand* clusters, one might notice an increasing amount of time spent in the MPI_Init() routine. The reason for this behavior is a set of MPI runtime infrastructure management operations that are necessary to make sure that all MPI ranks share a common environment. For large MPI runs with many thousands of ranks, these operations can consume a large part of the MPI initialization phase.

Several factors contribute to the increased startup time. These include extra communication over PMI (the Process Management Interface) before the fabric is available, as well as initial global collective operations that can put a high load on the fabric during the startup phase. Therefore, with growing MPI rank counts, the number of messages passed across the fabric will increase exponentially, which may cause long startup times, especially at scale.

To address this, one can reduce the startup times by switching off certain startup checks, provided that a consistent environment across the individual ranks is guaranteed.

Before doing so, however, one should make sure to use the latest Intel MPI library as well as the latest fabric library. The latest Intel MPI library can be found here: https://software.intel.com/en-us/intel-mpi-library.
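
A quick way to double-check which library versions a run actually picks up is sketched below. This is only a hedged example: the application name is a placeholder, and the fabric-side commands (ofed_info, /etc/dat.conf) are typical locations on InfiniBand*/DAPL systems that may differ on a given cluster.

    # I_MPI_DEBUG >= 1 makes Intel MPI report its library version at startup.
    I_MPI_DEBUG=1 mpirun -n 2 ./my_mpi_app

    # Typical, cluster-dependent checks for the fabric stack:
    ofed_info -s          # short OFED / InfiniBand* stack version, if installed
    cat /etc/dat.conf     # DAPL providers configured on the node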

With the latest libraries in place, one can start switching off individual environment checks, as long as all ranks share the same environment with respect to the check being disabled (a combined launch sketch follows the list).

  1. If all ranks run on the same Intel microprocessor architecture generation, the platform check can be switched off:
    • I_MPI_PLATFORM_CHECK=0
    • Since this check is also used to tune the collective operations for specific architectures, the processor architecture in use then needs to be specified manually:
    • I_MPI_PLATFORM=bdw (bdw = Broadwell; see the Intel MPI Reference Manual for other architectures)
  2. Switching to an alternative PMI data exchange algorithm can also help to speed up the startup phase:
    • I_MPI_HYDRA_PMI_CONNECT=alltoall
  3. The Hydra branch count mechanism (enabled by default) helps to reduce the pressure on Linux file descriptors. However, this mechanism sometimes introduces additional overhead, so it can help to reduce the number of branches per PMI proxy, e.g. to 4 instead of the default of 32. If you still observe issues during startup, it may be worth switching the mechanism off completely by setting the value to -1.
    • I_MPI_HYDRA_BRANCH_COUNT=4
  4. For InfiniBand* fabrics where all ranks have the same DAPL provider set (I_MPI_DAPL_UD_PROVIDER) and run the same DAPL library version underneath the Intel MPI ranks, the DAPL provider compatibility check can be switched off:
    • I_MPI_CHECK_DAPL_PROVIDER_COMPATIBILITY=0
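
Taken together, these settings can simply be exported in the job script before the mpirun call. The fragment below is a minimal sketch: the rank count, processes per node, hostfile, and application binary are placeholders, and the DAPL line only applies to InfiniBand*/DAPL runs.

    export I_MPI_PLATFORM_CHECK=0
    export I_MPI_PLATFORM=bdw                         # adjust to the actual architecture
    export I_MPI_HYDRA_PMI_CONNECT=alltoall
    export I_MPI_HYDRA_BRANCH_COUNT=4                 # or -1 to switch the mechanism off
    export I_MPI_CHECK_DAPL_PROVIDER_COMPATIBILITY=0  # InfiniBand*/DAPL fabrics only

    mpirun -n 4096 -ppn 32 -f ./hostfile ./my_mpi_app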

These settings, among other fine-tuning work, have enabled a simple init/finalize Intel MPI application to run through successfully on 80k Haswell cores (3k nodes) in less than 2 minutes.
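
To judge the effect on a given cluster, one can simply time such an init/finalize run before and after applying the settings. The sketch below assumes a trivial MPI program, here called init_finalize, that only calls MPI_Init() and MPI_Finalize(); rank and node counts are placeholders.

    # Baseline: default startup checks
    time mpirun -n 4096 -ppn 32 -f ./hostfile ./init_finalize

    # After exporting the settings shown above, run again and compare
    time mpirun -n 4096 -ppn 32 -f ./hostfile ./init_finalize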
