Performance Tips for Using Intel® MKL on the Intel® Xeon Phi™ Coprocessor

This page documents specific tips and best known methods for using the Intel® Math Kernel Library (Intel® MKL) on the Intel® Xeon Phi™ coprocessor. For general recommendations on improving performance with Intel® MKL, see the related topics in the Intel® Math Kernel Library User's Guide.

Related documentation: Using Intel® Math Kernel Library on the Intel® Xeon Phi™ Coprocessors.

All examples below assume a 61-core coprocessor. The techniques apply to other coprocessor models as well, provided the thread counts are adjusted to the actual core count.

Environment settings for Native Execution

Native execution means the whole program runs on the coprocessor as if it were an independent compute node. For maximum performance, use all available cores and all 4 hardware threads on each core, that is, 244 threads on a 61-core coprocessor. FFT functions are an exception; see the notes at the end of this article.
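
A minimal sketch of native-execution settings for a 61-core coprocessor follows. OMP_NUM_THREADS and KMP_AFFINITY are standard Intel OpenMP runtime variables. The affinity type chosen here (balanced) is an assumption, not something this article specifies, so verify it against your workload.

```shell
# Assumed hardware: 61 cores with 4 hardware threads each.
CORES=61
THREADS_PER_CORE=4

# Use every hardware thread on every core: 61 * 4 = 244.
export OMP_NUM_THREADS=$((CORES * THREADS_PER_CORE))

# Assumption: "balanced" thread placement; other affinity types may suit
# a given workload better.
export KMP_AFFINITY=balanced
```

For a coprocessor with a different core count, change CORES and the thread count follows.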


In addition, using 2MB pages for memory allocation is often necessary to get good performance for native execution. See this article (PDF) to learn more.

Environment settings for Compiler-Assisted Offload

In compiler-assisted offload, an application runs on the host system but may offload MKL functions to the coprocessor for execution. The environment variable settings discussed above for native execution still apply. However, there are now two sets of environment variables, one for the host environment and one for the coprocessor environment. Variables intended for the coprocessor need a prefix to distinguish them from host variables; the prefix is defined by setting MIC_ENV_PREFIX. Another particularity of compiler-assisted offload is that one core (with all 4 of its threads) should be reserved for data transfer tasks and exempted from compute tasks. The number of OpenMP threads is set accordingly: 60 cores × 4 threads = 240 on a 61-core coprocessor. Finally, 2MB pages are also needed. Unlike in native execution, however, 2MB pages in compiler-assisted offload serve to improve data transfer performance, and they are enabled with the MIC_USE_2MB_BUFFERS variable.

Putting it all together, with offloading BLAS functions as the example:
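
The settings described above might look like the following sketch, run in the host shell before launching the application. The affinity type and the 64K buffer-size threshold are illustrative assumptions, not values taken from this article; tune both for your application.

```shell
# Assumed hardware: 61 cores with 4 hardware threads each.
CORES=61
THREADS_PER_CORE=4

# Variables prefixed with MIC_ are forwarded to the coprocessor environment.
export MIC_ENV_PREFIX=MIC

# Reserve one core for data transfer: (61 - 1) * 4 = 240 compute threads.
export MIC_OMP_NUM_THREADS=$(( (CORES - 1) * THREADS_PER_CORE ))

# Assumption: "balanced" placement on the coprocessor.
export MIC_KMP_AFFINITY=balanced

# Assumption: offload buffers of at least 64 KB are placed in 2MB pages.
export MIC_USE_2MB_BUFFERS=64K
```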


Environment settings for Automatic Offload

The automatic offload feature in Intel MKL automatically and transparently splits the workload between the host and the coprocessor. Relevant environment variables must be set on both sides. On the coprocessor side, one core needs to be reserved for data transfer tasks. The 2MB buffers, however, are handled by the MKL runtime, so no explicit setting is necessary. Reserving memory on the coprocessor and keeping it allocated during automatic offload also improves performance. Putting it all together:
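
A hedged sketch of automatic offload settings follows. The variable names MKL_MIC_ENABLE and MKL_MIC_MAX_MEMORY are Intel MKL automatic offload controls as best recalled here, not names given in this article, and the 2G reservation and the affinity type are illustrative assumptions. Check all of them against the Intel MKL documentation for your version.

```shell
# Host side: turn on automatic offload in Intel MKL.
export MKL_MIC_ENABLE=1

# Coprocessor side: forward MIC_-prefixed variables to the coprocessor.
export MIC_ENV_PREFIX=MIC

# Reserve one of the 61 cores for data transfer: 60 * 4 = 240 threads.
export MIC_OMP_NUM_THREADS=240

# Assumption: "balanced" placement on the coprocessor.
export MIC_KMP_AFFINITY=balanced

# Assumption: reserve 2 GB of coprocessor memory for automatic offload
# and keep it allocated between calls.
export MKL_MIC_MAX_MEMORY=2G

# Only if the application also uses compiler-assisted offload:
# export OFFLOAD_ENABLE_ORSL=1
```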




Automatic offload is currently available only for some BLAS Level 3 functions and a small set of LAPACK functions: the LU, QR, and Cholesky factorizations. For complete documentation on environment variables and controls for automatic offload, see here.

Other Considerations

  • Large problem sizes are typically needed to fully exploit the highly parallel capability of the hardware. However, also keep in mind the memory limit on the coprocessor. For example, the Intel Xeon Phi 7120 coprocessor has up to 16 GB of memory.
  • It is critical to align data on 64-byte boundaries to get the full benefit of vectorization (the vector units on the coprocessor are 512 bits wide).
  • If your application uses both compiler-assisted offload and automatic offload, it is strongly recommended to set OFFLOAD_ENABLE_ORSL=1. This environment variable enables the two offload modes to synchronize their accesses to coprocessors.
  • To improve performance of Intel MKL FFT functions, follow these guidelines:
    • Align the first element of the input data on 64-byte boundaries.
    • For two- or higher-dimensional single-precision transforms, use a leading dimension (stride) divisible by 8 but not by 16.
    • For two- or higher-dimensional double-precision transforms, use a leading dimension (stride) divisible by 4 but not by 8.
    • For small transform sizes (less than num-of-phi-cores/2MB), set MIC_OMP_NUM_THREADS to a power-of-two value, for example, 128.
    • For larger transform sizes (no less than num-of-phi-cores/2MB), set MIC_OMP_NUM_THREADS=244 (for native execution), or MIC_OMP_NUM_THREADS=240 (for offload execution).
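
The stride and thread-count guidelines for FFTs can be sketched as two small shell helpers. Both function names (pad_stride and fft_threads) are hypothetical and exist only for illustration. Whether a transform counts as "small" or "large" is left to the caller, following the size criterion stated above.

```shell
# pad_stride N single|double
# Prints the smallest stride >= N that is divisible by 8 but not by 16
# (single precision) or divisible by 4 but not by 8 (double precision).
pad_stride() {
    n=$1
    case "$2" in
        single) m=16; r=8 ;;   # stride must equal 8 modulo 16
        double) m=8;  r=4 ;;   # stride must equal 4 modulo 8
        *) echo "usage: pad_stride N single|double" >&2; return 1 ;;
    esac
    echo $(( n + (r - n % m + m) % m ))
}

# fft_threads small|large native|offload
# Prints the thread count suggested above for a 61-core coprocessor.
fft_threads() {
    case "$1:$2" in
        small:native|small:offload) echo 128 ;;  # a power of two
        large:native)  echo 244 ;;               # all 61 cores
        large:offload) echo 240 ;;               # one core reserved
        *) echo "usage: fft_threads small|large native|offload" >&2; return 1 ;;
    esac
}
```

For example, a row length of 1024 single-precision elements would be padded with pad_stride 1024 single, and the thread count for a large offloaded transform could be set with export MIC_OMP_NUM_THREADS=$(fft_threads large offload).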

For other articles related to Intel MKL on Intel Xeon Phi, please refer to Intel® Math Kernel Library on the Intel® Xeon Phi™ Coprocessor.

