Using MPI and Xeon Phi™ Offload Together

Overview

Using the Intel® MPI Library together with the offload capabilities of the Intel® Xeon Phi™ coprocessor lets a user access the coprocessor without needing direct access to its filesystem. This can simplify administration, because no additional security is required on the coprocessor's filesystem.

Offloading of MPI functions

Calling MPI functions within an offload region is not supported.

Offloading within MPI applications

The offload programming model is supported by the Intel® MPI Library. However, no attempt is made to coordinate coprocessor resource usage among the MPI ranks. For example, if you run 12 ranks on a node and each rank offloads 16 threads to the coprocessor, by default every rank offloads to the first 16 threads of the first coprocessor, so all 12 ranks compete for the same 16 hardware threads. There are two approaches to avoiding these conflicts.

Limit Offloading
Running only one offloading rank per host guarantees that multiple ranks never offload to the same coprocessor. This requires the code either to run with only one rank per host or to be heterogeneous, with processes arranged so that only one rank on each host offloads. This method is more restrictive, but it is easy to implement.
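As a minimal sketch of the one-rank-per-host case, the helper below prints a launch line that uses the -ppn (processes per node) option of mpiexec.hydra. The function name, the host count of 4, and ./a.out are illustrative assumptions, and the hosts themselves are assumed to come from a host file or the job scheduler.

```shell
#!/bin/sh
# Hypothetical helper: print a launch line that places one rank on each host.
#   $1 = number of hosts (assumed to be supplied by a host file or scheduler)
#   $2 = executable
one_rank_per_host_cmd() {
    hosts=$1
    exe=$2
    # -ppn 1 requests one process per node; with -n equal to the host count,
    # each host's coprocessor is used by exactly one offloading rank.
    printf 'mpiexec.hydra -ppn 1 -n %s %s\n' "$hosts" "$exe"
}

# Print the command for a hypothetical 4-host run.
one_rank_per_host_cmd 4 ./a.out
```

The helper only prints the command so that it can be reviewed before it is run.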

Explicit Pinning
Setting the thread affinity on a per-process basis controls where each rank's threads are offloaded. This method can completely prevent core oversubscription, but it requires significant manual setup specific to the run configuration. For example,

mpiexec.hydra -env MIC_OMP_NUM_THREADS 16 -env MIC_KMP_AFFINITY \
     granularity=fine,proclist=[1-16],explicit -n 1 ./a.out : \
     -env MIC_OMP_NUM_THREADS 16 -env MIC_KMP_AFFINITY \
     granularity=fine,proclist=[17-32],explicit -n 1 ./a.out

will run two copies of ./a.out, with rank 0's 16 threads restricted to logical processors 1-16 of the coprocessor and rank 1's 16 threads restricted to logical processors 17-32. For any significantly sized job, this quickly becomes difficult to manage. However, this method allows full utilization of the coprocessor when using MPI and offload together.
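Because writing these argument sets by hand scales poorly, one option is to generate them. The following is a rough sketch, not a definitive tool: build_pinned_cmd is a hypothetical helper that assigns each rank a contiguous block of logical processors, starting at 1 as in the example above, and uses the proclist=[first-last] form of KMP_AFFINITY. It assumes that all ranks share one coprocessor and that ranks times threads does not exceed the number of logical processors available on it.

```shell
#!/bin/sh
# Hypothetical helper: print an mpiexec.hydra command line in which every
# rank gets its own contiguous block of coprocessor logical processors.
#   $1 = number of ranks, $2 = offloaded threads per rank, $3 = executable
build_pinned_cmd() {
    ranks=$1
    threads=$2
    exe=$3
    cmd="mpiexec.hydra"
    rank=0
    while [ "$rank" -lt "$ranks" ]; do
        first=$(( rank * threads + 1 ))   # blocks start at logical processor 1
        last=$(( first + threads - 1 ))
        # Argument sets for different ranks are separated by ':'
        if [ "$rank" -gt 0 ]; then
            cmd="$cmd :"
        fi
        cmd="$cmd -env MIC_OMP_NUM_THREADS $threads"
        cmd="$cmd -env MIC_KMP_AFFINITY granularity=fine,proclist=[$first-$last],explicit"
        cmd="$cmd -n 1 $exe"
        rank=$(( rank + 1 ))
    done
    printf '%s\n' "$cmd"
}

# Print the command for two ranks with 16 offloaded threads each.
build_pinned_cmd 2 16 ./a.out
```

The helper prints the command rather than running it, so the generated affinity ranges can be checked against the coprocessor's actual processor count first.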

Additional Information

For more information about how to use offloading, see the Intel® Compiler Documentation.

For more complete information about compiler optimizations, see the Optimization Notice.