In High Performance Computing (HPC), parallel programming techniques such as MPI, OpenMP*, one-sided communication, SHMEM, and Fortran coarrays are widely used. This blog is part of a series that introduces these techniques, with an emphasis on how to use them on the Intel® Xeon Phi™ coprocessor. This first blog discusses the main usage of the hybrid MPI/OpenMP model.
Message Passing Interface (MPI) is a message-passing library specification which is designed for high performance computing in distributed memory systems. In MPI, data is moved from one process (or MPI rank) to another process through cooperative operations. Each MPI rank owns a separate address space. MPI involves both communication and synchronization, and it has become the de facto standard for communication among processes running on distributed memory systems.
Open Multi-Processing (OpenMP) is a specification for a set of compiler directives (pragmas), run-time library routines, and environment variables that are used for thread parallelism. All OpenMP threads in a process access the same shared memory. OpenMP is used to program shared memory systems, such as a node built around a multicore processor.
Combining MPI and OpenMP is very common in HPC because the approach takes advantage of both techniques: it uses a well-defined communication model among multiple ranks running on heterogeneous nodes, and it uses groups of lightweight threads on each computing node to exploit multicore/manycore processor architectures. In this hybrid model, MPI handles communication among nodes and OpenMP handles shared memory programming within each node.
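As a rough illustration of how the hybrid model divides work, the following Python sketch splits a range of loop iterations first among MPI ranks and then among the OpenMP threads of each rank. It is plain Python, not MPI or OpenMP code; the function names block_range and hybrid_ranges and the sizes used are assumptions made for illustration only.

```python
def block_range(n, parts, index):
    """Return the half-open range [lo, hi) of block `index` when n items
    are split into `parts` nearly equal contiguous blocks."""
    base, extra = divmod(n, parts)
    # The first `extra` blocks receive one additional item.
    lo = index * base + min(index, extra)
    hi = lo + base + (1 if index < extra else 0)
    return lo, hi


def hybrid_ranges(n, num_ranks, threads_per_rank):
    """Map (rank, thread) -> [lo, hi) for a two-level block decomposition.

    The outer split mimics distributing iterations among MPI ranks; the
    inner split mimics dividing a rank's share among its OpenMP threads.
    """
    ranges = {}
    for rank in range(num_ranks):
        rank_lo, rank_hi = block_range(n, num_ranks, rank)
        for thread in range(threads_per_rank):
            lo, hi = block_range(rank_hi - rank_lo, threads_per_rank, thread)
            ranges[(rank, thread)] = (rank_lo + lo, rank_lo + hi)
    return ranges


if __name__ == "__main__":
    # Hypothetical sizes: 10 iterations, 2 ranks, 2 threads per rank.
    for key, value in sorted(hybrid_ranges(10, 2, 2).items()):
        print(key, value)
```

With these made-up sizes, rank 0 owns iterations [0, 5) and its two threads work on [0, 3) and [3, 5), while rank 1 owns [5, 10) and its threads work on [5, 8) and [8, 10). In a real program the outer split would correspond to data distributed with MPI and the inner split to an OpenMP parallel loop.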
In pure MPI, if the number of MPI ranks on each node keeps increasing, the memory consumed by all MPI ranks eventually exceeds the node memory, which degrades performance. Scalability is also limited by the bandwidth of the interconnect (InfiniBand, Gigabit Ethernet). Load balancing is another challenge: because the architectures of the Xeon processor and the Xeon Phi coprocessor are different, workload imbalance is common. You can analyze and improve workload balance using the Intel® Trace Analyzer and Collector.
In the hybrid model, where MPI is combined with OpenMP, the memory in each node is shared by the OpenMP threads, so the total memory consumed is much less than when using MPI ranks exclusively. Scalability is also better because there is less MPI communication among the nodes. In addition, within a node, communication among lightweight threads is much faster than MPI sends and receives. However, to maximize the use of the hardware, you need to map the threads to the available cores efficiently. The number of OpenMP threads per MPI rank must also be chosen carefully, and blocking MPI calls must be applied carefully because they can lead to deadlock.
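The memory argument can be made concrete with a back-of-the-envelope model. The Python sketch below assumes that every MPI rank keeps a private copy of some replicated data, while the distributed data on a node has a fixed total size; the function name node_memory_mb and all sizes are hypothetical, and thread stacks and MPI library buffers are ignored.

```python
def node_memory_mb(ranks_per_node, replicated_mb, distributed_mb):
    """Estimate the memory used on one node, in MB.

    replicated_mb:  data each MPI rank keeps as a private copy
                    (hypothetical, e.g. lookup tables or buffers).
    distributed_mb: data divided among the ranks on the node, so its
                    total does not depend on the number of ranks.
    """
    return ranks_per_node * replicated_mb + distributed_mb


cores = 16  # assumed number of cores per node

# Pure MPI: one rank per core.
pure_mpi = node_memory_mb(cores, replicated_mb=500, distributed_mb=4000)

# Hybrid: 2 ranks per node, each running 8 OpenMP threads that share memory.
hybrid = node_memory_mb(2, replicated_mb=500, distributed_mb=4000)

print(pure_mpi, hybrid)  # 12000 5000
```

Under these assumed numbers the same 16 cores are kept busy in both cases, but the hybrid layout replicates the per-rank data only twice instead of sixteen times.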
Using the Hybrid Model on Coprocessors:
The Intel® MPI Library supports three programming models for use with Intel® Xeon Phi™ coprocessors: offload, coprocessor-only, and symmetric.
In the offload programming model, MPI ranks run only on the Xeon host and offload highly parallel sections of code to the coprocessors, which act as accelerators. The offloaded code is typically executed by OpenMP threads on the coprocessors. Note that to use the offload programming model, applications written for the Xeon host need to be restructured so that the parallel portions can be offloaded to the coprocessors.
The coprocessor-only programming model is used when the MPI program is built specifically to execute natively on one or more coprocessors connected through their Xeon hosts. MPI ranks on the coprocessors launch OpenMP threads to execute the workload. This programming model is very useful when the workload can be shared equally among a group of homogeneous nodes, and it is particularly attractive for highly parallel applications. Note that in the coprocessor-only programming model, an application that runs on the Xeon host can be ported to the coprocessor simply by recompiling it for native execution (using the Intel compiler and Intel MPI Library); the original code does not need to be modified at all. Even so, achieving exceptional performance will require performance tuning.
In the symmetric programming model, the MPI program is built for both the host and the coprocessor, so the program runs in a heterogeneous environment. This programming model can be used to maximize the computing power of both Xeon processors and Xeon Phi coprocessors. MPI ranks run on both the Xeon hosts and the coprocessors; in each node, OpenMP threads are launched by the MPI ranks to execute the assigned workload.
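Because host ranks and coprocessor ranks usually run at different speeds, a symmetric run often needs an uneven work split. The Python sketch below shows one possible approach, a proportional split using the largest-remainder method; it is not a feature of the Intel MPI Library. The function name proportional_split and the relative speeds are assumptions, and in practice the speeds would have to be measured for the actual workload.

```python
def proportional_split(total_items, speeds):
    """Split total_items among ranks in proportion to their relative speeds.

    Uses the largest-remainder method so that the counts are integers
    and sum exactly to total_items.
    """
    total_speed = sum(speeds)
    exact = [total_items * s / total_speed for s in speeds]
    counts = [int(x) for x in exact]  # round down first
    leftover = total_items - sum(counts)
    # Hand the remaining items to the ranks with the largest fractional parts.
    order = sorted(range(len(speeds)),
                   key=lambda i: exact[i] - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Hypothetical relative speeds: two host ranks and two coprocessor ranks.
speeds = [1.0, 1.0, 1.5, 1.5]
print(proportional_split(1000, speeds))  # [200, 200, 300, 300]
```

Each rank would then process its own count of items with its OpenMP threads, so that all ranks finish at roughly the same time if the assumed speeds are accurate.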
When designing a hybrid MPI/OpenMP program, the designer should consider the communication latency since MPI is used to send/receive messages among computing nodes. Carefully choosing the appropriate fabrics can maximize the workload performance. PCIe communications should be taken into account in the case of the coprocessor. Always try to maximize the use of CPUs available, maximize the use of available cores on each coprocessor, and maximize the use of available bandwidth among the computing nodes.
In the hybrid approach, MPI ranks normally do the communication part and OpenMP threads do the computation part. Communication and computation are often arranged serially: OpenMP threads run the workload while the MPI ranks wait, and when the MPI ranks get the results, they exchange them while the OpenMP threads take a short break! One can improve the code, for example by overlapping communication with computation using nonblocking MPI calls, so that all OpenMP threads and MPI ranks are kept busy.
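The potential gain from keeping everything busy can be estimated with a simple timing model. The Python sketch below compares the serial arrangement with an idealized overlap in which the exchange of one step proceeds while the next step is computed (for example with nonblocking MPI calls). The function names and the timings are hypothetical, and the model assumes there is always enough independent work to overlap.

```python
def serial_time(compute, comm, steps):
    """Total time when every step computes first and then communicates."""
    return steps * (compute + comm)


def overlapped_time(compute, comm, steps):
    """Idealized total time when the exchange of step i overlaps the
    computation of step i + 1.

    The first computation and the last exchange cannot be hidden.
    """
    if steps == 0:
        return 0
    return compute + (steps - 1) * max(compute, comm) + comm


# Hypothetical timings in milliseconds: 8 ms compute, 2 ms exchange, 10 steps.
print(serial_time(8, 2, 10))      # 100
print(overlapped_time(8, 2, 10))  # 82
```

In this model the benefit is largest when computation and communication take similar amounts of time, and it shrinks as one of them dominates.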
For more details on using OpenMP on the coprocessor, readers are encouraged to read the following articles: Best Known Methods for Using OpenMP* on Intel® Many Integrated Core (Intel® MIC) Architecture, OpenMP* Thread Affinity Control, and New KMP_PLACE_THREADS OpenMP* affinity variable in Update 2 compiler.
In summary, the hybrid MPI/OpenMP model is a common method for computing in HPC. Both the Xeon host and Xeon Phi coprocessors can be used to run this hybrid model. The Intel MPI Library and Intel compiler support three programming models: offload, coprocessor-only, and symmetric. Depending on the nature of the workload, one model may work better than the others. Remember to tune the application to get the best performance.