Tuning the Intel® MPI Library: Advanced Techniques
By Carlos Rosales-Fernandez, Michael Steyer
Published: 01/29/2018   Last Updated: 01/29/2018
This article describes advanced techniques to tune the performance of codes built and executed using the Intel® MPI Library. The tuning methodologies presented are intended for experienced users, and their impact on overall performance depends strongly on the particular application and workload used.
Apply wait mode to oversubscribed jobs
This option is particularly relevant for oversubscribed MPI jobs. The goal is to enable the wait mode of the progress engine in order to wait for messages without polling the fabric(s). This can save CPU cycles at the cost of a lower message-response rate (higher latency), so it should be used with caution. To enable wait mode simply use:
I_MPI_WAIT_MODE=1
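For example, wait mode could be enabled either through the environment or on the mpirun command line; the rank counts and the ./app binary below are placeholders for an oversubscribed run:
export I_MPI_WAIT_MODE=1
mpirun -n 72 -ppn 72 ./app
or, equivalently:
mpirun -genv I_MPI_WAIT_MODE 1 -n 72 -ppn 72 ./app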
Adjust the eager / rendezvous protocol threshold
Two protocols control the handshake mechanism for message exchange in MPI:
- Eager: Sends data and metadata immediately regardless of receiver status
- Rendezvous: Sends metadata first. Full message sent after receiver acknowledges readiness (RTS/CTS)
The Eager protocol has lower latency, but requires more runtime memory in order to maintain the required intermediate buffers. It is commonly used for small messages. The Rendezvous protocol requires less memory, and has higher latency. It is typically used for large messages where the increased overhead is negligible.
For communication across nodes the Eager / Rendezvous threshold is defined by the variable I_MPI_EAGER_THRESHOLD (in bytes, default is 256 kB).
For communication within a node (except over the shm fabric) the variable I_MPI_INTRANODE_EAGER_THRESHOLD controls the switch point between the two protocols (in bytes); its default is also 256 kB.
For the shm fabric the threshold control is provided by the variable I_MPI_SHM_CELL_SIZE (in bytes), which has a platform-dependent default that typically ranges between 8 kB and 64 kB.
Note that for the DAPL and DAPL-UD fabrics the switch is instead controlled by I_MPI_DAPL_DIRECT_COPY_THRESHOLD and I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD, which set the switch point between the Eager and direct-copy mechanisms in bytes.
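For illustration, the thresholds can be adjusted per run; the 128 kB and 512 kB values below are arbitrary examples, not recommendations:
export I_MPI_EAGER_THRESHOLD=131072               # switch to Rendezvous above 128 kB between nodes
export I_MPI_INTRANODE_EAGER_THRESHOLD=524288     # stay Eager up to 512 kB within a node (non-shm)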
Enforce asynchronous message transfer for non-blocking operations
It is possible to overlap computation and communication by spawning a helper thread. This can cause oversubscription, so it is disabled by default. To enable asynchronous progress:
I_MPI_ASYNC_PROGRESS=1
It is also possible to explicitly pin the helper thread, reserving one core from the rank pinning mask for it:
I_MPI_ASYNC_PROGRESS_PIN=1
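A minimal launch sketch with asynchronous progress enabled; the rank counts assume one core per node is left free for the helper thread, and ./app is a placeholder:
export I_MPI_ASYNC_PROGRESS=1
export I_MPI_ASYNC_PROGRESS_PIN=1
mpirun -n 280 -ppn 35 ./app     # e.g. 35 ranks per 36-core node, leaving one core for the helper thread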
Tuning shared memory
There are three possible types of message exchange when using shared memory (a configuration sketch follows this list):
- User space: double copy through cache / DRAM; fast for small messages (Eager) [I_MPI_SHM_CACHE_BYPASS_THRESHOLDS / I_MPI_INTRANODE_EAGER_THRESHOLD]
- Kernel assisted: single copy (CMA, Linux* 3.2 and later); fast for medium / large messages (Rendezvous) [I_MPI_SHM_LMT / I_MPI_INTRANODE_EAGER_THRESHOLD]
- Loopback fabric: DMA can be performed by hardware; might be faster for large messages in certain situations [I_MPI_SHM_BYPASS / I_MPI_INTRANODE_EAGER_THRESHOLD]
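As a sketch, assuming large intranode messages should go through the loopback fabric rather than the double-copy path, the relevant variables could be combined as follows (the 256 kB threshold and the dapl fabric choice are illustrative):
export I_MPI_FABRICS=shm:dapl
export I_MPI_INTRANODE_EAGER_THRESHOLD=262144   # messages up to 256 kB stay in the Eager shared memory path
export I_MPI_SHM_BYPASS=1                       # larger intranode messages go through the loopback fabric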
Tuning the progress engine
A significant portion of CPU cycles is sometimes spent in the MPI progress engine (CH3/CH4). This can be addressed by reducing the spin count, which is the number of times the progress engine spins waiting for a message or connection request before yielding to the OS:
I_MPI_SPIN_COUNT=<scount>
The default setting is 250 unless more than one process runs per processor, in which case it is set to 1.
Multi-core platforms running executions dominated by intranode communication may benefit from a customized value of the shared memory (intranode) spin count:
I_MPI_SHM_SPIN_COUNT=<scount>
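For example (both values are illustrative starting points, not recommendations):
export I_MPI_SPIN_COUNT=1          # yield to the OS quickly on an oversubscribed node
export I_MPI_SHM_SPIN_COUNT=1000   # spin longer on shared memory for intranode-heavy workloads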
Reduce initialization times at scale
Start by making sure you use the latest Intel® MPI Library as well as the latest PSM2 version.
If all ranks work on the same Intel Architecture generation, switch off the platform check:
I_MPI_PLATFORM_CHECK=0
Specify the processor architecture in use so the collective operations can be tuned for it; the uniform value optimizes for a homogeneous cluster:
I_MPI_PLATFORM=uniform
An alternative PMI data exchange algorithm can help speed up the startup phase:
I_MPI_HYDRA_PMI_CONNECT=alltoall
Customizing the branch count may also improve startup times (the default is 32 when more than 127 nodes are used):
I_MPI_HYDRA_BRANCH_COUNT=<n>
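Putting the settings of this section together, a startup-oriented launch on a homogeneous cluster might look like the sketch below; the branch count of 64, the rank counts, and the ./app binary are placeholders:
export I_MPI_PLATFORM_CHECK=0
export I_MPI_PLATFORM=uniform
export I_MPI_HYDRA_PMI_CONNECT=alltoall
export I_MPI_HYDRA_BRANCH_COUNT=64
mpirun -n 4096 -ppn 32 ./app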
Additional settings for InfiniBand*
Reduce initialization times at scale
Traditional InfiniBand* support uses the Reliable Connection (RC) protocol to exchange MPI messages, but the User Datagram (UD) protocol has emerged as a more scalable alternative with lower memory consumption. This is the recommended DAPL mode for large scale runs.
Use DAPL UD UCM:
I_MPI_FABRICS=shm:dapl
I_MPI_DAPL_UD=1
I_MPI_DAPL_UD_PROVIDER=ofa-v2-mlx5_0-1u
DAPL library UCM settings are automatically adjusted for MPI jobs of more than 1024 ranks, resulting in more stable job start-up. The available DAPL providers are summarized in the following table:
Provider | /etc/dat.conf entry | Description |
---|---|---|
CMA | ofa-v2-ib0 | Uses OFA rdma_cm to set up the QP. IPoIB, ARP, and SA queries required |
SCM | ofa-v2-mlx5_0-1 | Uses sockets to exchange QP information. IPoIB, ARP, and SA queries NOT required |
UCM | ofa-v2-mlx5_0-1u | Uses a UD QP to exchange QP information. Sockets, IPoIB, ARP, and SA queries NOT required |
If all ranks use the same DAPL provider (I_MPI_DAPL_UD_PROVIDER) and the same DAPL library version, switch off the DAPL provider check with the following environment variable setting:
I_MPI_CHECK_DAPL_PROVIDER_COMPATIBILITY=0
Setting IPoIB as the PMI low level transport layer may also improve the startup performance:
I_MPI_HYDRA_IFACE=ib0
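Combining the settings of this subsection, a large InfiniBand* job could be launched roughly as in the sketch below; the provider name must match an entry in the local /etc/dat.conf, and the rank counts and ./app binary are placeholders:
export I_MPI_FABRICS=shm:dapl
export I_MPI_DAPL_UD=1
export I_MPI_DAPL_UD_PROVIDER=ofa-v2-mlx5_0-1u
export I_MPI_CHECK_DAPL_PROVIDER_COMPATIBILITY=0
export I_MPI_HYDRA_IFACE=ib0
mpirun -n 8192 -ppn 32 ./app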
Tune DAPL-UD for memory efficiency at scale
DAPL-UD is fairly efficient in its memory consumption, but it can be further optimized for large scale runs. The following table contains a suggested set of optimized settings to use as a starting point.
Variable | Default Value | Tuned Value |
---|---|---|
I_MPI_DAPL_UD_SEND_BUFFER_NUM | Runtime dependent | 8208 |
I_MPI_DAPL_UD_RECV_BUFFER_NUM | Runtime dependent | 8208 |
I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE | 256 | 8704 |
I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE | Runtime dependent | 8704 |
I_MPI_DAPL_UD_RNDV_EP_NUM | 4 | 2 |
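Expressed as environment settings for a job script, the tuned values from the table read as follows (treat them as a starting point and adjust per workload):
export I_MPI_DAPL_UD_SEND_BUFFER_NUM=8208
export I_MPI_DAPL_UD_RECV_BUFFER_NUM=8208
export I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE=8704
export I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE=8704
export I_MPI_DAPL_UD_RNDV_EP_NUM=2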
Tune DAPL for large runs
Besides selecting the DAPL UCM provider, the following settings may be used for large scale runs (over 512 tasks).
Variable | Value | Description |
---|---|---|
DAPL_UCM_REP_TIME | 2000 | REQUEST timer, waiting for REPLY in milliseconds |
DAPL_UCM_RTU_TIME | 2000 | REPLY timer, waiting for RTU in milliseconds |
DAPL_UCM_CQ_SIZE | 2000 | CM completion queue |
DAPL_UCM_QP_SIZE | 2000 | CM message queue |
DAPL_UCM_RETRY | 7 | REQUEST and REPLY retries |
DAPL_ACK_RETRY | 7 | RC acknowledgement retry count |
DAPL_ACK_TIMER | 20 | RC acknowledgement retry timer |
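The same DAPL settings written out as exports, for convenience in a job script (values taken directly from the table):
export DAPL_UCM_REP_TIME=2000
export DAPL_UCM_RTU_TIME=2000
export DAPL_UCM_CQ_SIZE=2000
export DAPL_UCM_QP_SIZE=2000
export DAPL_UCM_RETRY=7
export DAPL_ACK_RETRY=7
export DAPL_ACK_TIMER=20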
Use DAPL UD / RDMA mixed communication
Mixed UD / RDMA mode allows small messages to pass through UD, while large messages are passed through RDMA. The switch-over point can be set using the following variable:
I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=<nbytes>
Note that RDMA is not supported for connectionless communication, but large MPI messages can still benefit from RDMA transfers. This mixed communication mode is disabled by default, but it can be enabled with:
I_MPI_DAPL_UD_RDMA_MIXED=1
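As a sketch, mixed mode can be combined with an explicit switch-over point; the 64 kB threshold below is only an example value:
export I_MPI_DAPL_UD_RDMA_MIXED=1                  # enable mixed UD / RDMA communication
export I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=65536   # messages above 64 kB use RDMA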