This page documents specific tips and the best known methods for using the Intel® Math Kernel Library on the Intel® Xeon Phi™ coprocessor. For general performance recommendations for using Intel® Math Kernel Library, please see the related topics in the Intel® Math Kernel Library for Linux* OS User's Guide.
Related documentation: Using Intel® Math Kernel Library on the Intel® Xeon Phi™ Coprocessors.
All examples below assume a 61-core coprocessor, but the techniques should apply to other coprocessor models with adjustments for the actual core count.
Environment settings for Native Execution
Native execution means the whole program runs on the coprocessor as if it were an independent compute node. We can use all available cores and all threads on each core to get the maximum performance. FFT functions are an exception because they tend to perform better when the number of threads is a power of 2. (Ideally, the problem size should be a power of 2 as well.)
- For BLAS, LAPACK, and Sparse BLAS functions:
- For FFT functions:
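The settings for the two cases above might look like the following. This is a sketch: the thread counts and affinity values are illustrative assumptions for a 61-core coprocessor (61 × 4 = 244 hardware threads), not values taken from this article.

```shell
# For BLAS, LAPACK, and Sparse BLAS functions: use all 244 hardware threads
# and bind them compactly so the 4 threads of each core stay together.
# (Values are illustrative assumptions for a 61-core part.)
export OMP_NUM_THREADS=244
export KMP_AFFINITY=granularity=fine,compact

# For FFT functions: a power-of-2 thread count tends to perform better;
# 128 is the largest power of 2 not exceeding 244 (assumed choice).
# export OMP_NUM_THREADS=128
# export KMP_AFFINITY=granularity=fine,compact
```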
In addition, using large pages (2MB) for memory allocation is often necessary to get good performance for native execution. See this article (PDF) to learn more.
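One common way to get 2MB pages for a native run, sketched below, is to preload libhugetlbfs so that heap allocations are backed by huge pages. This assumes libhugetlbfs is available on the coprocessor and that huge pages have been configured there; the binary name is hypothetical.

```shell
# Sketch: back malloc() with 2MB pages via libhugetlbfs.
# Assumes libhugetlbfs is installed and huge pages are configured
# on the coprocessor; these are assumptions, not from the article.
export LD_PRELOAD=libhugetlbfs.so
export HUGETLB_MORECORE=yes    # satisfy heap growth from huge pages
# ./my_native_mkl_app          # hypothetical native MKL binary
```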
Environment settings for Compiler-Assisted Offload
In the case of compiler-assisted offload, an application runs on the host system but may offload MKL functions to be executed on the coprocessor. The environment variable settings discussed above for native execution still apply. However, there are now two sets of environment variables: one for the host environment and the other for the coprocessor environment. The environment variables intended for the coprocessor need a prefix to distinguish them from the host environment variables. This is done by setting MIC_ENV_PREFIX. Another particularity of compiler-assisted offload is that one core (with all 4 threads on it) should be reserved for data transfer tasks and exempted from computing tasks. This consideration is reflected in the setting of the number of OpenMP threads to be used. Finally, 2MB pages are also needed. But unlike in native execution, the 2MB pages in compiler-assisted offload are used to improve data transfer performance, and they should be enabled using the MIC_USE_2MB_BUFFERS variable.
Putting it all together, taking the offloading of BLAS functions as an example:
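A sketch of such settings is shown below. The thread count assumes a 61-core coprocessor with one core reserved for data transfer (60 × 4 = 240 compute threads); the affinity value and the 64K buffer-size threshold are illustrative assumptions.

```shell
# Illustrative compiler-assisted offload settings (values are assumptions).
export MIC_ENV_PREFIX=MIC          # variables prefixed with MIC_ apply on the coprocessor
export MIC_OMP_NUM_THREADS=240     # 60 compute cores x 4 threads; 1 core reserved for transfers
export MIC_KMP_AFFINITY=granularity=fine,compact
export MIC_USE_2MB_BUFFERS=64K     # use 2MB pages for buffers above this size (threshold assumed)
```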
Environment settings for Automatic Offload
The automatic offload feature in Intel MKL automatically and transparently splits the workload between the host and the coprocessor. Relevant environment variables must be set on both sides. On the coprocessor side, one core needs to be reserved for data transfer tasks. But the 2MB buffers are taken care of by the MKL runtime, so explicit setting is unnecessary. Putting it all together:
MKL_MIC_MAX_MEMORY=4096 for BLAS functions, or 7600M for LAPACK functions
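A combined sketch of automatic offload settings might look as follows. MKL_MIC_ENABLE and the MKL_MIC_MAX_MEMORY values come from the text; the thread count and affinity value are illustrative assumptions for a 61-core coprocessor with one core reserved for data transfer.

```shell
# Illustrative automatic offload settings.
export MKL_MIC_ENABLE=1            # turn on automatic offload
export MIC_ENV_PREFIX=MIC          # prefix for coprocessor-side variables
export MIC_OMP_NUM_THREADS=240     # 60 compute cores x 4 threads (assumed; 1 core reserved)
export MIC_KMP_AFFINITY=granularity=fine,compact
export MKL_MIC_MAX_MEMORY=4096     # per the text: 4096 for BLAS, 7600M for LAPACK
```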
Automatic offload is currently available only for BLAS Level 3 functions and a small set of LAPACK functions: LU, QR, and Cholesky factorizations. For complete documentation on environment variables and controls for automatic offload, see here.
- Large problem sizes are typically needed to fully exploit the highly parallel capability of the hardware. However, note the memory limit on the coprocessor: the current generation of Intel Xeon Phi products has 6 to 8 GB of memory on each coprocessor.
- It is critical to align data on 64-byte boundaries to realize the full potential of vectorization (the vector units on the coprocessor are 512 bits wide).
Please refer to other articles related to Intel MKL on Intel Xeon Phi at Intel® Math Kernel Library on the Intel® Xeon Phi™ Coprocessor.