Recommendations to choose the right MKL usage model for Xeon Phi

Intel(R) Math Kernel Library (Intel(R) MKL) fully supports the Intel(R) Xeon Phi(TM) coprocessor and offers the following three execution models, one of which can use the multicore host and the many-core coprocessors at the same time. This article summarizes the models and describes which data sizes and which parallel programming / offload techniques benefit most from each:

-          Automatic Offload – offers transparent heterogeneous computing

-          Compiler Assisted Offload – allows fine offloading control

-          Native Execution – use the coprocessors as independent nodes

Automatic Offload – runs on both host and co-processor (target) by default

Sample Build Script: icc -O3 -mkl sgemm.c -o sgemm.exe

Automatic Offload is enabled by calling the function mkl_mic_enable() or by setting the environment variable MKL_MIC_ENABLE=1.

Here are the MKL Functions that are automatic offload enabled:

  •       Only a select set of MKL functions is AO enabled.

–         Only functions with sufficient computation to offset data transfer overhead are subject to AO

  •       In Intel MKL 11.0, AO-enabled functions include:

–         Level-3 BLAS: ?GEMM, ?TRSM, ?TRMM

–         LAPACK 3 amigos: LU, QR, Cholesky

In this model, offloading is automatic and transparent.

By default, Intel MKL decides when to offload, along with the work division between host and targets. Users can still control the work division to fine tune performance.

Automatic Offload works only when matrix sizes are sufficiently large.

  •       ?GEMM: Offloading only when M, N > 2048
  •       ?TRSM/TRMM: Offloading only when M, N > 3072
  •       Square matrices may give better performance
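
The size thresholds above can be written as a simple pre-check. The sketch below is a hypothetical helper, not part of Intel MKL: the name ao_size_ok and both constants are assumptions taken only from the numbers quoted in this article for MKL 11.0. The library makes the real offload decision internally.

```c
#include <assert.h>
#include <stdbool.h>

/* Thresholds quoted in this article for Intel MKL 11.0.
   They are assumed here, not queried from the library. */
enum { AO_GEMM_MIN_DIM = 2048, AO_TRXM_MIN_DIM = 3072 };

/* Hypothetical helper: true when both M and N exceed min_dim,
   i.e. when the article says Automatic Offload may take place. */
bool ao_size_ok(long m, long n, long min_dim)
{
    assert(m >= 0 && n >= 0 && min_dim >= 0);
    return m > min_dim && n > min_dim;
}
```

For example, ao_size_ok(4096, 4096, AO_GEMM_MIN_DIM) is true, while a 2048 x 4096 ?GEMM does not qualify because M is not strictly greater than 2048.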

How do you disable Automatic Offload after it is enabled?

  •       mkl_mic_disable( ), or
  •       mkl_mic_set_workdivision(MIC_TARGET_HOST, 0, 1.0), or
  •       MKL_HOST_WORKDIVISION=100

Compiler Assisted Offload

Sample Build Script: Using -offload-option

icc -O3 -openmp -mkl \
  -offload-option,mic,ld,"-L$MKLROOT/lib/mic -Wl,--start-group \
  -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group" \
  sgemm.c -o sgemm.exe

  •       Offloading is explicitly controlled by compiler pragmas or directives.
  •       All MKL functions can be offloaded in CAO.
    •       In comparison, only a subset of MKL is subject to AO.
  •       Can leverage the full potential of the compiler's offloading facility.
  •       More flexibility in data transfer and remote execution management.
    •       A big advantage is data persistence: reusing transferred data for multiple operations.

Compiler Assisted Offload works best when you have pipelined operations that can take advantage of data persistence, as in the following example:

__declspec(target(mic)) static float *A, *B, *C, *C1;

// Transfer matrices A, B, and C to the coprocessor; keep A and B allocated there
#pragma offload target(mic) \
    in(transa, transb, M, N, K, alpha, beta, LDA, LDB, LDC) \
    in(A:length(NCOLA * LDA) free_if(0)) \
    in(B:length(NCOLB * LDB) free_if(0)) \
    inout(C:length(N * LDC))
{
    sgemm(&transa, &transb, &M, &N, &K, &alpha, A, &LDA, B, &LDB, &beta, C, &LDC);
}

// Transfer matrix C1 to the coprocessor and reuse matrices A and B (no new copy, no new allocation)
#pragma offload target(mic) \
    in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
    nocopy(A:length(NCOLA * LDA) alloc_if(0) free_if(0)) \
    nocopy(B:length(NCOLB * LDB) alloc_if(0) free_if(0)) \
    inout(C1:length(N * LDC1))
{
    sgemm(&transa1, &transb1, &M, &N, &K, &alpha1, A, &LDA, B, &LDB, &beta1, C1, &LDC1);
}

// Deallocate A and B on the coprocessor
#pragma offload target(mic) \
    nocopy(A:length(NCOLA * LDA) free_if(1)) \
    nocopy(B:length(NCOLB * LDB) free_if(1))
{ }
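
To see what data persistence saves, the transfer volume of the two offloaded SGEMM calls can be estimated by hand. The sketch below is an illustrative cost model only. The function names are hypothetical, scalar arguments are ignored, and it assumes that in(...) data travels once to the coprocessor, inout(...) data travels both ways, and nocopy(...) data does not travel at all.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes moved for one offloaded SGEMM (scalars ignored).
   a_elems, b_elems, c_elems are element counts, e.g. NCOLA * LDA.
   send_ab is 1 when A and B are transferred, 0 when they are reused. */
size_t sgemm_transfer_bytes(size_t a_elems, size_t b_elems,
                            size_t c_elems, int send_ab)
{
    assert(send_ab == 0 || send_ab == 1);
    size_t elems = 2 * c_elems;        /* C goes in and comes back out */
    if (send_ab)
        elems += a_elems + b_elems;    /* A and B only go host -> coprocessor */
    return elems * sizeof(float);
}

/* Total for two back-to-back calls that share A and B. */
size_t two_call_transfer_bytes(size_t a, size_t b, size_t c, size_t c1,
                               int reuse_ab)
{
    return sgemm_transfer_bytes(a, b, c, 1)
         + sgemm_transfer_bytes(a, b, c1, reuse_ab ? 0 : 1);
}
```

With four matrices of 1,000,000 elements each, this model gives 8,000,000 floats transferred without reuse and 6,000,000 with reuse, a 25% reduction. The benefit grows with every additional operation that reuses A and B.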

 

Tips for Compiler Assisted Offload

  •       Use data persistence to avoid unnecessary data copying and memory allocation/deallocation.
  •       Thread affinity: avoid using the OS core. Example for a 60-core coprocessor:
    MIC_KMP_AFFINITY=explicit,granularity=fine,proclist=[1-236:1]
  •       Use huge (2MB) pages for memory allocation in user code:
    MIC_USE_2MB_BUFFERS=64K
    The value of MIC_USE_2MB_BUFFERS is a threshold: with the setting above, allocations of 64K bytes or larger use huge pages.
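
The proclist in the affinity example follows from the core count. The sketch below builds the same string for other core counts. It is a hypothetical helper, not an Intel API, and it assumes what the example above implies: a fixed number of hardware threads per core (4 in the 60-core example), one core left to the OS, and application threads numbered from 1.

```c
#include <assert.h>
#include <stdio.h>

/* Highest logical thread number available to the application when
   one core is left to the OS (assumption drawn from the example above). */
int mic_last_offload_thread(int cores, int threads_per_core)
{
    assert(cores > 1 && threads_per_core > 0);
    return (cores - 1) * threads_per_core;
}

/* Writes a value suitable for MIC_KMP_AFFINITY into buf.
   Returns the snprintf result (length of the full string). */
int mic_offload_affinity(int cores, int threads_per_core,
                         char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len,
                    "explicit,granularity=fine,proclist=[1-%d:1]",
                    mic_last_offload_thread(cores, threads_per_core));
}
```

Calling mic_offload_affinity(60, 4, buf, sizeof buf) reproduces the value shown above, with threads 1 through 236.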

Native

Sample Build Script

Using -mmic

icc -O3 -mmic -mkl sgemm.c -o sgemm.exe

  •       Programs can be built to run only on the coprocessor by using the -mmic build option.
  •       Tips for using MKL in native execution:
    •       Use all threads to get the best performance (for a 60-core coprocessor):
      MIC_OMP_NUM_THREADS=240
    •       Set thread affinity:
      KMP_AFFINITY=explicit,proclist=[1-240:1,0,241,242,243],granularity=fine
    •       Use huge pages for memory allocation.

Which Model to Choose

Choose native execution when

-          The code is highly parallel

-          The coprocessors are used as independent compute nodes

Choose AO when

-          A sufficient FLOP/byte ratio (enough computation per byte transferred) makes offload beneficial

-          Using Level-3 BLAS functions: ?GEMM, ?TRMM, ?TRSM

-          Using LU, QR, or Cholesky factorization

Choose CAO when either

-          There is enough computation to offset data transfer overhead

-          Transferred data can be reused by multiple operations

You can always run on the host if offloading does not achieve better performance.
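
Whether there is "enough computation to offset data transfer overhead" can be estimated before choosing a model. The sketch below computes floating-point operations per transferred byte for a single-precision GEMM. It is a back-of-the-envelope model under stated assumptions: the function name is hypothetical, the operation count is the usual 2*M*N*K, A and B are sent once, and C is sent and returned. The break-even ratio depends on the hardware and is not given in this article.

```c
#include <assert.h>

/* Estimated FLOPs per byte transferred for C = alpha*A*B + beta*C
   in single precision, with A (M x K), B (K x N), C (M x N). */
double sgemm_flops_per_byte(double m, double n, double k)
{
    assert(m > 0 && n > 0 && k > 0);
    double flops = 2.0 * m * n * k;                 /* one multiply + one add per term */
    double bytes = sizeof(float) * (m * k           /* A in  */
                                    + k * n         /* B in  */
                                    + 2.0 * m * n); /* C in and out */
    return flops / bytes;
}
```

For square matrices of order n the ratio works out to n/8 with 4-byte floats, so a 4096 x 4096 problem performs 512 operations per transferred byte while a 64 x 64 problem performs only 8. This is why offload pays off only for large matrices.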

MKL’s Most Optimized Functions for Xeon Phi

The following components are well optimized for Intel Xeon Phi coprocessors:

  •       BLAS Level 3, and much of Level 1 & 2
  •       Sparse BLAS: ?CSRMV, ?CSRMM
  •       LU, Cholesky, and QR factorization
  •       FFTs: 1D/2D/3D, SP and DP, r2c, c2c
  •       VML (real floating point functions)
  •       Random number generators:
    •       MT19937, MT2203, MRG32k3a
    •       Discrete Uniform and Geometric

Code Samples

$MKLROOT/examples/mic_samples

–         ao_sgemm   AO example

–         dexp       VML example (vdExp)

–         dgaussian  double precision Gaussian RNG

–         fft        complex-to-complex 1D FFT

–         sexp       VML example (vsExp)

–         sgaussian  single precision Gaussian RNG

–         sgemm      SGEMM example

–         sgemm_f    SGEMM example(Fortran 90)

–         sgemm_reuse     SGEMM with data persistence

–         sgeqrf          QR factorization

–         sgetrf          LU factorization

–         spotrf          Cholesky

–         solverc    PARDISO examples

For other articles related to Intel MKL on Intel Xeon Phi, please refer to Intel® Math Kernel Library on the Intel® Xeon Phi™ Coprocessor.

 
