Tuning the Intel® MPI Library: Advanced Techniques

By Carlos Rosales-Fernandez and Michael Steyer

Published: 01/29/2018   Last Updated: 01/29/2018

This article describes advanced techniques to tune the performance of codes built and executed using the Intel® MPI Library. The tuning methodologies presented are intended for experienced users, and their impact on overall performance depends strongly on the particular application and workload used.

Apply wait mode to oversubscribed jobs

This option is particularly relevant for oversubscribed MPI jobs. The goal is to enable the wait mode of the progress engine so that it waits for messages without polling the fabric(s). This can save CPU cycles, but it lowers the message-response rate (increases latency), so it should be used with caution. To enable wait mode simply use:

I_MPI_WAIT_MODE=1
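
For example, an oversubscribed run placing more ranks than cores on a node could enable wait mode directly on the launch line; the rank count and executable name below are placeholders:

# Hypothetical oversubscribed run: 128 ranks on a 64-core node
mpirun -genv I_MPI_WAIT_MODE 1 -n 128 ./my_app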

Adjust the eager / rendezvous protocol threshold

Two protocols control the handshake mechanism for message exchange in MPI:

  • Eager: Sends data and metadata immediately regardless of receiver status 
  • Rendezvous:  Sends metadata first. Full message sent after receiver acknowledges readiness (RTS/CTS)

The Eager protocol has lower latency, but requires more runtime memory in order to maintain the required intermediate buffers. It is commonly used for small messages. The Rendezvous protocol requires less memory, and has higher latency. It is typically used for large messages where the increased overhead is negligible.

For communication across nodes the Eager / Rendezvous threshold is defined by the variable I_MPI_EAGER_THRESHOLD (in bytes, default is 256 kB).

For communication within a node (except over the shm fabric), the variable I_MPI_INTRANODE_EAGER_THRESHOLD controls the switch point between the two protocols (in bytes); its default is also 256 kB.

For the shm fabric the threshold control is provided by the variable I_MPI_SHM_CELL_SIZE (in bytes), which has a platform-dependent default that typically ranges between 8 kB and 64 kB.

Note that for the DAPL and DAPL-UD fabrics the switch control is provided instead by the variables I_MPI_DAPL_DIRECT_COPY_THRESHOLD and I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD, which set the switch point (in bytes) between the Eager and direct-copy mechanisms.
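
As a sketch, an application dominated by medium-sized messages might raise both thresholds so that those messages stay on the Eager path; the 512 kB values below are illustrative placeholders, not recommendations:

# Hypothetical thresholds: keep messages up to 512 kB on the Eager path
export I_MPI_EAGER_THRESHOLD=524288
export I_MPI_INTRANODE_EAGER_THRESHOLD=524288
mpirun -n 64 ./my_app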

Enforce asynchronous message transfer for non-blocking operations

It is possible to overlap computation and communication by spawning a helper thread. This can cause oversubscription, so it is disabled by default. To enable asynchronous progress:

I_MPI_ASYNC_PROGRESS=1

It is also possible to explicitly pin the helper thread, taking one rank out of the pinning mask:

I_MPI_ASYNC_PROGRESS_PIN=1
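
A minimal launch sketch, assuming one core per node is left free for the helper thread; the pinning value, rank count, and executable name are placeholders:

# Hypothetical run with asynchronous progress enabled and the helper thread pinned
export I_MPI_ASYNC_PROGRESS=1
export I_MPI_ASYNC_PROGRESS_PIN=1   # logical CPU reserved for the helper thread (placeholder)
mpirun -n 63 ./my_app               # one rank fewer to leave room for the helper thread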

Tuning shared memory

There are three possible types of message exchange when using shared memory:

  • User space: double copy through cache/DRAM; fast for small messages (Eager) [I_MPI_SHM_CACHE_BYPASS_THRESHOLDS / I_MPI_INTRANODE_EAGER_THRESHOLD]
  • Kernel assisted: single copy (CMA, Linux* kernel 3.2 and later); fast for medium and large messages (Rendezvous) [I_MPI_SHM_LMT / I_MPI_INTRANODE_EAGER_THRESHOLD]
  • Loopback fabric: DMA can be performed by the hardware; may be faster for large messages in certain situations [I_MPI_SHM_BYPASS / I_MPI_INTRANODE_EAGER_THRESHOLD]
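
As an illustration, the loopback-fabric path could be tried for large intranode messages by enabling the bypass and lowering the intranode Eager threshold; the 128 kB value below is an assumption to experiment with, not a recommended default:

# Hypothetical experiment: route large intranode messages (>128 kB) through the fabric
export I_MPI_SHM_BYPASS=1
export I_MPI_INTRANODE_EAGER_THRESHOLD=131072
mpirun -n 64 ./my_app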

Tuning the progress engine

A significant portion of CPU cycles is sometimes spent in the MPI progress engine (CH3/CH4). This can be addressed by reducing the spin count, which is the number of times the progress engine spins waiting for a message or connection request before yielding to the OS:

I_MPI_SPIN_COUNT=<scount>

The default setting is 250 unless more than one process runs per processor, in which case it is set to 1.

Multi-core platforms whose executions are dominated by intranode communication may benefit from a customized value of the intranode spin count:

I_MPI_SHM_SPIN_COUNT=<scount>
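
For instance, a latency-sensitive run with one rank per core might experiment with larger spin counts; the values below are placeholders to be tuned per workload:

# Hypothetical spin-count experiment
export I_MPI_SPIN_COUNT=500
export I_MPI_SHM_SPIN_COUNT=1000
mpirun -n 64 ./my_app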

Reduce initialization times at scale

Start by making sure you use the latest Intel® MPI Library as well as the latest PSM2 version.

If all ranks work on the same Intel Architecture generation, switch off the platform check:

I_MPI_PLATFORM_CHECK=0

Specify the processor architecture being used to tune the collective operations:

I_MPI_PLATFORM=uniform

An alternative PMI data exchange algorithm can help speed up the startup phase:

I_MPI_HYDRA_PMI_CONNECT=alltoall

Customizing the branch count may also help startup times (the default is 32 when running on more than 127 nodes):

I_MPI_HYDRA_BRANCH_COUNT=<n>
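
A combined startup-tuning sketch for a large, homogeneous cluster; the branch count and job size are placeholders to be adjusted per machine:

# Hypothetical large-scale startup tuning (all nodes of the same architecture)
export I_MPI_PLATFORM_CHECK=0
export I_MPI_PLATFORM=uniform
export I_MPI_HYDRA_PMI_CONNECT=alltoall
export I_MPI_HYDRA_BRANCH_COUNT=64   # placeholder value
mpirun -n 8192 ./my_app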

Additional settings for InfiniBand*

Reduce initialization times at scale

Traditional InfiniBand* support uses the Reliable Connection (RC) protocol to exchange MPI messages, but the User Datagram (UD) protocol has emerged as a more scalable alternative with lower memory consumption. This is the recommended DAPL mode for large scale runs.

Use DAPL UD UCM:

I_MPI_FABRICS=shm:dapl
I_MPI_DAPL_UD=1
I_MPI_DAPL_UD_PROVIDER=ofa-v2-mlx5_0-1u

DAPL library UCM settings are automatically adjusted for MPI jobs of more than 1024 ranks, resulting in a more stable job start-up. The table below summarizes the available DAPL providers:

Provider   /etc/dat.conf entry   Description
CMA        ofa-v2-ib0            Uses OFA rdma_cm to set up the QP; IPoIB, ARP, and SA queries required
SCM        ofa-v2-mlx5_0-1       Uses sockets to exchange QP information; IPoIB, ARP, and SA queries NOT required
UCM        ofa-v2-mlx5_0-1u      Uses a UD QP to exchange QP information; sockets, IPoIB, ARP, and SA queries NOT required

If all ranks have the same DAPL provider (I_MPI_DAPL_UD_PROVIDER) and the same DAPL library version, switch off the DAPL provider check using the following environment variable setting:

I_MPI_CHECK_DAPL_PROVIDER_COMPATIBILITY=0

Setting IPoIB as the PMI low-level transport layer may also improve startup performance:

I_MPI_HYDRA_IFACE=ib0
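
Putting the InfiniBand*-specific startup settings together, a sketch might look as follows; the provider name and job size depend on the cluster and are placeholders:

# Hypothetical DAPL UD startup configuration for a large InfiniBand* job
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_UD=1
export I_MPI_DAPL_UD_PROVIDER=ofa-v2-mlx5_0-1u   # check /etc/dat.conf for the local provider
export I_MPI_CHECK_DAPL_PROVIDER_COMPATIBILITY=0
export I_MPI_HYDRA_IFACE=ib0
mpirun -n 4096 ./my_app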

Tune DAPL-UD for memory efficiency at scale

DAPL-UD is fairly efficient in its memory consumption, but it can be further optimized for large scale runs. The following table contains a suggested set of optimized settings to use as a starting point.

Variable                            Default Value       Tuned Value
I_MPI_DAPL_UD_SEND_BUFFER_NUM       Runtime dependent   8208
I_MPI_DAPL_UD_RECV_BUFFER_NUM       Runtime dependent   8208
I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE    256                 8704
I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE    Runtime dependent   8704
I_MPI_DAPL_UD_RNDV_EP_NUM           4                   2
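
Exported as environment variables, the tuned values above would look like this; treat them as a starting point to adjust per job:

# Suggested starting point taken from the table above
export I_MPI_DAPL_UD_SEND_BUFFER_NUM=8208
export I_MPI_DAPL_UD_RECV_BUFFER_NUM=8208
export I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE=8704
export I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE=8704
export I_MPI_DAPL_UD_RNDV_EP_NUM=2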

Tune DAPL for large runs

Besides selecting the DAPL UCM provider, the following settings may be used for large scale runs (over 512 tasks).

Variable            Value   Description
DAPL_UCM_REP_TIME   2000    REQUEST timer, waiting for a REPLY (milliseconds)
DAPL_UCM_RTU_TIME   2000    REPLY timer, waiting for an RTU (milliseconds)
DAPL_UCM_CQ_SIZE    2000    CM completion queue size
DAPL_UCM_QP_SIZE    2000    CM message queue size
DAPL_UCM_RETRY      7       REQUEST and REPLY retries
DAPL_ACK_RETRY      7       RC acknowledgement retry count
DAPL_ACK_TIMER      20      RC acknowledgement retry timer
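
The same settings expressed as environment variables for a job script:

# DAPL settings for large scale runs (values from the table above)
export DAPL_UCM_REP_TIME=2000
export DAPL_UCM_RTU_TIME=2000
export DAPL_UCM_CQ_SIZE=2000
export DAPL_UCM_QP_SIZE=2000
export DAPL_UCM_RETRY=7
export DAPL_ACK_RETRY=7
export DAPL_ACK_TIMER=20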

Use DAPL UD / RDMA mixed communication

Mixed UD / RDMA mode allows small messages to pass through UD, while large messages are passed through RDMA. The switch-over point can be set using the following variable:

I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=<nbytes>

Note that RDMA is not supported for connectionless communication, but large MPI messages can benefit from RDMA. This mixed communication mode is disabled by default, but it can be enabled with:

I_MPI_DAPL_UD_RDMA_MIXED=1
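
A short sketch enabling mixed mode; the 32 kB switch-over point below is a placeholder, not a recommended value:

# Hypothetical mixed UD / RDMA configuration: messages above 32 kB use RDMA
export I_MPI_DAPL_UD_RDMA_MIXED=1
export I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=32768
mpirun -n 2048 ./my_app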

Useful Resources
