Improve Intel® MKL Performance for Small Problems: The Use of MKL_DIRECT_CALL

By Zhang Zhang, Gennady Fedorov, Published: 08/28/2014, Last Updated: 11/19/2017

One of the big new features introduced in the Intel® Math Kernel Library (Intel® MKL) 11.2 is greatly improved performance for small problem sizes. In 11.2, this improvement focuses on the xGEMM functions (matrix multiplication). Out of the box, there is already a version-to-version improvement from Intel MKL 11.1 to Intel MKL 11.2. On top of that, Intel MKL introduces a new control that can give a further significant performance boost for small matrices. Users enable this control when compiling code that calls Intel MKL by specifying "-DMKL_DIRECT_CALL" or "-DMKL_DIRECT_CALL_SEQ". At run time, execution is dispatched to a fast path for small input matrices. The fast path skips error checking and multiple layers of function calls, and so improves performance by removing the associated overhead. The matrices have to be small, for example, only a few dozen rows and columns. For larger matrices the regular execution path is taken; there, MKL_DIRECT_CALL and MKL_DIRECT_CALL_SEQ do not help, but they do no harm either.
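The dispatch idea described above can be illustrated with a small, self-contained sketch. This is not Intel MKL's implementation: the names MY_DIRECT_CALL, SMALL_DIM_LIMIT, matmul_kernel, and matmul_checked are hypothetical, and the cutoff of 32 is an assumption chosen only for illustration. The sketch shows how a preprocessor macro can route small problems straight to an inlinable kernel while larger ones go through a checked, out-of-line call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff: the actual threshold used inside Intel MKL is not
   stated in this article. */
#define SMALL_DIM_LIMIT 32

/* Shared kernel: C = A * B for square n x n row-major matrices.
   It performs no run-time validation in release builds. */
static inline void matmul_kernel(int n, const double *a, const double *b,
                                 double *c)
{
    assert(n >= 0); /* debug-only check, compiled out with -DNDEBUG */
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Regular path: validates its arguments before doing any work.
   Returns 0 on success, -1 on invalid input. */
static int matmul_checked(int n, const double *a, const double *b, double *c)
{
    if (n < 0 || a == NULL || b == NULL || c == NULL)
        return -1;
    matmul_kernel(n, a, b, c);
    return 0;
}

#ifdef MY_DIRECT_CALL
/* Small sizes bypass the checked entry point and call the kernel directly. */
#define matmul(n, a, b, c)                          \
    (((n) >= 0 && (n) <= SMALL_DIM_LIMIT)           \
         ? (matmul_kernel((n), (a), (b), (c)), 0)   \
         : matmul_checked((n), (a), (b), (c)))
#else
#define matmul(n, a, b, c) matmul_checked((n), (a), (b), (c))
#endif
```

Compiling the calling code with -DMY_DIRECT_CALL switches every matmul(...) call site to the direct form without any source change, which mirrors how a -D flag can change behavior at the call site rather than inside the library.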

The chart below is a comparison between 4 scenarios of computing double-precision matrix-matrix multiplication for small matrices:

  • A naive implementation using triple-nested loops, compiled with flags "-O3 -xCORE-AVX2" using Intel C++ Compiler 15.0.
  • Using DGEMM from Intel MKL 11.1.1.
  • Using DGEMM from Intel MKL 11.2.
  • Using DGEMM from Intel MKL 11.2 and with "-DMKL_DIRECT_CALL" enabled.

The matrices used in this chart are all square. The version-to-version improvement of Intel MKL 11.2 over 11.1.1, as well as the additional benefit brought by MKL_DIRECT_CALL, are evident.
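For reference, the first scenario (the naive triple-nested loop) might look like the sketch below. The exact code used for the chart is not given in this article, so the function name naive_dgemm, the column-major layout, and the no-transpose restriction are assumptions chosen to match DGEMM's usual semantics, C = alpha * A * B + beta * C.

```c
#include <assert.h>
#include <stddef.h>

/* Naive DGEMM-like baseline: C = alpha * A * B + beta * C.
   All matrices are column-major and not transposed.
   A is m x k (leading dimension lda), B is k x n (ldb), C is m x n (ldc). */
void naive_dgemm(int m, int n, int k, double alpha,
                 const double *a, int lda,
                 const double *b, int ldb,
                 double beta, double *c, int ldc)
{
    assert(m >= 0 && n >= 0 && k >= 0); /* debug-only sanity check */
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p)
                sum += a[i + (size_t)p * lda] * b[p + (size_t)j * ldb];
            double *cij = &c[i + (size_t)j * ldc];
            /* As in DGEMM, C is not read when beta is zero. */
            *cij = (beta == 0.0) ? alpha * sum : alpha * sum + beta * (*cij);
        }
    }
}
```

A loop nest like this leaves vectorization and blocking entirely to the compiler, which is why it serves as a baseline rather than a competitive implementation.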


MKL_DIRECT_CALL and MKL_DIRECT_CALL_SEQ are the preprocessor macros you define to instruct Intel MKL to pick the fast path for small matrices. Define MKL_DIRECT_CALL when you link to the parallel Intel MKL library, and MKL_DIRECT_CALL_SEQ when you link to the sequential Intel MKL library. These macros have no effect on larger matrices.

For a program in the C language on a Linux system, simply add -DMKL_DIRECT_CALL or -DMKL_DIRECT_CALL_SEQ to the compile command. On Windows, the syntax is /DMKL_DIRECT_CALL or /DMKL_DIRECT_CALL_SEQ. Usually, the flag -std=c99 (/Qstd=c99 on Windows) is also needed. This has been tested with mainstream C and C++ compilers such as Intel C++ Compiler, GCC, and Microsoft Visual Studio. Note that this also works for the CBLAS interface (Intel MKL 11.2 Update 2 and later).
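A minimal sketch of a C call site follows. It uses the CBLAS interface mentioned above. The USE_MKL guard and the function name small_gemm are hypothetical and exist only so that the file still builds on a machine without Intel MKL, where a plain loop stands in for the library call. The build line in the comment assumes the Intel C++ Compiler and its -mkl convenience option; adjust it for your compiler and link line.

```c
/* Possible build line with the Intel C++ Compiler (an assumption, adapt as
   needed):
     icc -std=c99 -DUSE_MKL -DMKL_DIRECT_CALL small_gemm.c -mkl
   For the sequential library, use -DMKL_DIRECT_CALL_SEQ and -mkl=sequential. */
#include <assert.h>
#ifdef USE_MKL
#include <mkl.h>
#endif

/* Computes C = A * B for square n x n row-major matrices. */
void small_gemm(int n, const double *a, const double *b, double *c)
{
    assert(n > 0); /* debug-only check; the fast path reports no errors */
#ifdef USE_MKL
    /* With MKL_DIRECT_CALL defined at compile time, this call is eligible
       for the fast path when n is small. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
#else
    /* Fallback without Intel MKL: a plain reference loop. */
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
#endif
}
```

Note that the source of the call site does not change between the regular and the fast path; only the -D flag on the compile line differs.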

For a program in Fortran, first include "". See below for an example from the "Intel MKL User's Guide". Then, add -DMKL_DIRECT_CALL (/DMKL_DIRECT_CALL on Windows) or -DMKL_DIRECT_CALL_SEQ (/DMKL_DIRECT_CALL_SEQ on Windows). If you are using the Intel Fortran Compiler, pass -fpp (/fpp on Windows) to enable Fortran preprocessing. If you are using the PGI Fortran compiler, pass -Mpreprocess instead. This feature does not work with the GNU Fortran compiler.

#     include ""
      program   DGEMM_MAIN
*      Call Intel MKL DGEMM
      call sub1()
      stop 1
      end

*     A subroutine that calls DGEMM
      subroutine sub1
*      Call Intel MKL DGEMM
      end


Keep the following limitations in mind when using this feature:

  • The performance gain comes from skipping error checking and from inlining function calls. No error is reported if incorrect parameters are passed to the function call. For this reason, do not use this feature during code development and debugging. Enable it only when the code is ready for deployment.
  • The "verbose mode" (another new feature introduced in Intel MKL 11.2) does not work for functions that take the fast path enabled by this feature.
  • BLAS95 interfaces are not supported.
  • CNR (Conditional Numerical Reproducibility) is not supported.
  • For Fortran programs, the GNU Fortran compiler is not supported.
  • For C applications, you must enable mixing of declarations and code by providing the -std=c99 option (/Qstd=c99 on Windows) for Intel(R) compilers.
  • In a fixed format Fortran source code compiled with PGI compilers, the lines containing Intel MKL functions must end at least seven columns before the line ending column, usually, in a column with the index not greater than 72 – 7 = 65.
  • The feature is only available for the following functions:
    • BLAS: ?gemm, ?gemm3m, ?syrk, ?trsm, ?axpy, and ?dot
    • LAPACK: ?getrf, ?getrs, ?getri, ?potrf, and ?geqrf (available for C applications only)
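Because the fast path reports no errors, one option is to validate arguments yourself in debug builds. The sketch below is not part of Intel MKL: check_dgemm_args and DGEMM_ARGS_ASSERT are hypothetical helpers that mirror the argument checks the reference BLAS DGEMM performs for column-major inputs, returning the 1-based position of the first invalid argument or 0 if all are acceptable.

```c
#include <assert.h>

static int is_no_trans(char t) { return t == 'N' || t == 'n'; }

static int is_valid_trans(char t)
{
    return is_no_trans(t) || t == 'T' || t == 't' || t == 'C' || t == 'c';
}

static int max_int(int x, int y) { return x > y ? x : y; }

/* Returns 0 if the arguments are consistent, otherwise the 1-based position
   of the first offending argument in DGEMM's argument list
   (transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc). */
int check_dgemm_args(char transa, char transb, int m, int n, int k,
                     int lda, int ldb, int ldc)
{
    /* Rows of A and B as stored in memory, which bound lda and ldb. */
    int nrowa = is_no_trans(transa) ? m : k;
    int nrowb = is_no_trans(transb) ? k : n;

    if (!is_valid_trans(transa)) return 1;
    if (!is_valid_trans(transb)) return 2;
    if (m < 0) return 3;
    if (n < 0) return 4;
    if (k < 0) return 5;
    if (lda < max_int(1, nrowa)) return 8;
    if (ldb < max_int(1, nrowb)) return 10;
    if (ldc < max_int(1, m)) return 13;
    return 0;
}

/* Debug-only guard: disappears entirely when compiled with -DNDEBUG. */
#define DGEMM_ARGS_ASSERT(...) assert(check_dgemm_args(__VA_ARGS__) == 0)
```

Placing DGEMM_ARGS_ASSERT('N', 'N', m, n, k, lda, ldb, ldc) before a DGEMM call keeps a check during development and adds no cost once NDEBUG is defined for the deployment build.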

