Improve Intel MKL Performance for Small Problems: The Use of MKL_DIRECT_CALL

One of the major new features in Intel MKL 11.2 is greatly improved performance for small problem sizes. In 11.2, this improvement focuses on the xGEMM functions (matrix multiplication). Out of the box, there is already a version-to-version improvement from Intel MKL 11.1 to Intel MKL 11.2. On top of that, Intel MKL 11.2 introduces a new control that can give a further significant performance boost for small matrices. Users enable this control by compiling the code that calls Intel MKL with "-DMKL_DIRECT_CALL" or "-DMKL_DIRECT_CALL_SEQ". At run time, execution is dispatched to a fast path for small input matrices. The fast path skips error checking and multiple layers of function calls, and so improves performance by reducing the associated overhead. The matrices have to be small, for example, only a few dozen rows and columns. For larger matrices the regular execution path is taken; there, MKL_DIRECT_CALL and MKL_DIRECT_CALL_SEQ do not help, but do no harm either.

The chart below compares four scenarios of computing a double-precision matrix-matrix multiplication for small matrices:

  • A naive implementation using triple-nested loops, compiled with flags "-O3 -xCORE-AVX2" using Intel C++ Compiler 15.0.
  • Using DGEMM from Intel MKL 11.1.1.
  • Using DGEMM from Intel MKL 11.2.
  • Using DGEMM from Intel MKL 11.2 and with "-DMKL_DIRECT_CALL" enabled.

The matrices used in this chart are all square. The version-to-version improvement of Intel MKL 11.2 over 11.1.1, as well as the additional benefit brought by MKL_DIRECT_CALL, are evident.
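For reference, the first scenario (the naive triple-nested loop) might look like the following C sketch. The function name `naive_dgemm` and the row-major layout are assumptions made for illustration; the exact benchmark code behind the chart is not shown in this article.

```c
#include <stddef.h>

/* Naive C = A * B for row-major double-precision matrices.
 * A is m x k, B is k x n, C is m x n.
 * Hypothetical helper for illustration, not an Intel MKL function. */
void naive_dgemm(int m, int n, int k,
                 const double *a, const double *b, double *c)
{
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p) {
                sum += a[(size_t)i * k + p] * b[(size_t)p * n + j];
            }
            c[(size_t)i * n + j] = sum;
        }
    }
}
```

A loop nest like this has no call overhead at all, which is why it is a meaningful baseline for very small matrices.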

How to use MKL_DIRECT_CALL and MKL_DIRECT_CALL_SEQ

These are pre-processor macros that you define to instruct Intel MKL to pick the fast path for small matrices. Use the first macro, MKL_DIRECT_CALL, when you plan to link to the parallel Intel MKL library. Use the second, MKL_DIRECT_CALL_SEQ, when you plan to link to the sequential Intel MKL library. These macros have no effect on larger matrices.

For a C program on a Linux system, simply add -DMKL_DIRECT_CALL or -DMKL_DIRECT_CALL_SEQ to the compiler options. On Windows, the syntax is /DMKL_DIRECT_CALL or /DMKL_DIRECT_CALL_SEQ. Usually, the flag -std=c99 (/Qstd=c99 on Windows) is also needed. This has been tested on mainstream C and C++ compilers such as Intel C++ Compiler, GCC, and Microsoft Visual Studio. Note that this also works for the CBLAS interface (Intel MKL 11.2 Update 2 and later).
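As a rough sketch of the C usage described above, the function below multiplies two small square matrices through the CBLAS interface. `USE_MKL` is a hypothetical switch introduced here (it is not an Intel MKL macro) so that the file also builds without Intel MKL installed, using a plain-loop fallback; `small_dgemm` and `demo.c` are likewise illustrative names. The build lines in the comment are assumptions about a typical Intel compiler setup and may need adjusting.

```c
/* Possible build lines (assumed; adjust to your installation):
 *   icc -std=c99 -DUSE_MKL -DMKL_DIRECT_CALL     -c demo.c   (link with -mkl=parallel)
 *   icc -std=c99 -DUSE_MKL -DMKL_DIRECT_CALL_SEQ -c demo.c   (link with -mkl=sequential)
 */
#ifdef USE_MKL
#include <mkl.h>   /* Intel MKL declarations, including the CBLAS interface */
#endif

/* C = A * B for n x n row-major matrices.
 * small_dgemm is a hypothetical wrapper, not part of Intel MKL. */
void small_dgemm(int n, const double *a, const double *b, double *c)
{
#ifdef USE_MKL
    /* alpha = 1.0, beta = 0.0, all leading dimensions equal to n */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
#else
    /* Fallback without Intel MKL: plain triple loop. */
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int p = 0; p < n; ++p) {
                sum += a[i * n + p] * b[p * n + j];
            }
            c[i * n + j] = sum;
        }
    }
#endif
}
```

The source code is identical with and without the direct-call macro; only the compiler options change, so the feature can be switched on for deployment builds without touching the call sites.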

For a program in Fortran, first include "mkl_direct_call.fi". See below for an example from the "Intel MKL User's Guide". Then, add -DMKL_DIRECT_CALL (/DMKL_DIRECT_CALL on Windows) or -DMKL_DIRECT_CALL_SEQ (/DMKL_DIRECT_CALL_SEQ on Windows). If you are using the Intel Fortran Compiler, pass -fpp (/fpp on Windows) to enable Fortran pre-processing. If you are using the PGI Fortran compiler, pass -Mpreprocess instead. This feature does not work with the GNU Fortran compiler.

#     include "mkl_direct_call.fi"
      program   DGEMM_MAIN
....
*      Call Intel MKL DGEMM
....
      call sub1()
      stop 1
      end

*     A subroutine that calls DGEMM 
      subroutine sub1
*      Call Intel MKL DGEMM

      end

Limitations

  • The performance gain comes from function inlining and from skipping error checking. No error is reported if incorrect parameters are passed to the function call. For this reason, users should not use this feature during code development and debugging, and should enable it only when the code is ready for deployment.
  • The "verbose mode" (another new feature introduced in Intel MKL 11.2) does not work for functions that take the fast path enabled by this feature.
  • BLAS95 interfaces are not supported.
  • CNR (Conditional Numerical Reproducibility) is not supported.
  • For Fortran programs, the GNU Fortran compiler is not supported.
  • For C applications, you must enable mixing of declarations and user code by providing the -std=c99 option (/Qstd=c99 on Windows) for Intel(R) compilers.
  • In fixed-format Fortran source code compiled with PGI compilers, the lines containing Intel MKL functions must end at least seven columns before the line-ending column. With the usual limit of 72 columns, that means no later than column 72 - 7 = 65.
  • The feature is only available for the following functions:
    • BLAS: ?gemm, ?gemm3m, ?syrk, ?trsm, ?axpy, and ?dot
    • LAPACK: ?getrf, ?getrs, ?getri, ?potrf, and ?geqrf (available for C applications only)
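Because the fast path reports no errors for incorrect parameters, one option during development is to keep a separate argument check in debug builds. The sketch below mirrors the argument checks that the reference (Netlib) DGEMM performs for column-major storage; `dgemm_check_args` is a hypothetical helper, not part of Intel MKL. It returns the 1-based position of the first invalid DGEMM argument, or 0 if all checks pass.

```c
#include <ctype.h>

/* Returns nonzero if t is one of the accepted transpose flags N, T, or C. */
static int is_trans_flag(char t)
{
    int u = toupper((unsigned char)t);
    return u == 'N' || u == 'T' || u == 'C';
}

static int max_int(int x, int y) { return x > y ? x : y; }

/* Hypothetical debug-time check for DGEMM arguments (column-major).
 * Argument positions follow the DGEMM parameter list:
 * 1 transa, 2 transb, 3 m, 4 n, 5 k, 8 lda, 10 ldb, 13 ldc. */
int dgemm_check_args(char transa, char transb, int m, int n, int k,
                     int lda, int ldb, int ldc)
{
    int nota = toupper((unsigned char)transa) == 'N';
    int notb = toupper((unsigned char)transb) == 'N';
    int nrowa = nota ? m : k;   /* rows of A as stored */
    int nrowb = notb ? k : n;   /* rows of B as stored */

    if (!is_trans_flag(transa)) return 1;
    if (!is_trans_flag(transb)) return 2;
    if (m < 0) return 3;
    if (n < 0) return 4;
    if (k < 0) return 5;
    if (lda < max_int(1, nrowa)) return 8;
    if (ldb < max_int(1, nrowb)) return 10;
    if (ldc < max_int(1, m)) return 13;
    return 0;
}
```

In a debug build one might call `assert(dgemm_check_args(...) == 0)` before each DGEMM call and compile the check out once the code is ready for deployment.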

 


1 comment

dkokron wrote:

I have built an application that uses dgemm, ddot and daxpy via the PETSc library which was itself configured to use MKL (see below).  I also used the MKL_VERBOSE option to confirm that the DGEMM calls use very small matrices (9x9), so I figured disabling error checking would improve performance.

I built PETSc with and without the -DMKL_DIRECT_CALL_SEQ flag.

icc -fPIC -wd1572 -g -O3 -axCORE-AVX2,AVX -xSSE4.2 -diag-disable=cpu-dispatch -shared -Wl,-soname,libpetsc.so

icc -fPIC -wd1572 -g -O3 -axCORE-AVX2,AVX -xSSE4.2 -diag-disable=cpu-dispatch -DMKL_DIRECT_CALL_SEQ -shared -Wl,-soname,libpetsc.so

Yet a performance profile shows no change in any of the dgemm, ddot, and daxpy calls.

How can I prove that the direct path is actually being taken?

 

icc version 15.0.3.187

MKL version=11.2.3

PETSc configure command

./configure --prefix=${PETSC_DIR}/${PETSC_ARCH}/install --with-debugging=0 --with-shared-libraries=1 --with-cc=icc --with-fc=ifort --with-cxx=icpc --with-blas-lapack-dir=/nasa/intel/Compiler/2015.3.187/mkl/lib/intel64 --with-scalapack-include=/nasa/intel/Compiler/2015.3.187/mkl/include --with-scalapack-lib="/nasa/intel/Compiler/2015.3.187/mkl/lib/intel64/libmkl_scalapack_lp64.so /nasa/intel/Compiler/2015.3.187/mkl/lib/intel64/libmkl_blacs_sgimpt_lp64.so" --with-cpp=/usr/bin/cpp --with-gnu-compilers=0 --with-vendor-compilers=intel -COPTFLAGS="-g -O3 -axCORE-AVX2,AVX -xSSE4.2 -diag-disable=cpu-dispatch -DMKL_DIRECT_CALL_SEQ" -CXXOPTFLAGS="-g -O3 -axCORE-AVX2,AVX -xSSE4.2 -diag-disable=cpu-dispatch -DMKL_DIRECT_CALL_SEQ" -FOPTFLAGS="-g -O3 -axCORE-AVX2,AVX -xSSE4.2 -diag-disable=cpu-dispatch -fpp -DMKL_DIRECT_CALL_SEQ" --with-mpi-exec=mpiexec --with-mpi-compilers=0 --with-precision=double --with-sclar-type=real --with-dynamic-loading --with-x=0 --with-x11=0 --download-mumps --download-ptscotch --download-hypre
