Intel® Math Kernel Library Improved Small Matrix Performance Using Just-in-Time (JIT) Code Generation for Matrix Multiplication (GEMM)

The most commonly used and performance-critical Intel® Math Kernel Library (Intel® MKL) functions are the general matrix multiply (GEMM) functions. Intel® MKL 2019 extends earlier optimizations for small problem sizes (MKL Direct Call, Batch API, Compact API) by introducing Just-In-Time (JIT) code generation for the {S,D}GEMM functions on Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel® Advanced Vector Extensions 512 (Intel® AVX-512) architectures. The new JIT feature generates optimized GEMM kernels at runtime, tailored to the inputs you provide.

   Intel® MKL 2019 introduces JIT functionality for significantly accelerating small matrix multiplications. It supports JIT kernel generation for:

  • Real precisions (SGEMM and DGEMM);
  • m, n, k ≤ 16, any alpha and beta, and transposition of the A and B matrices;
  • Intel® Advanced Vector Extensions 2 (Intel® AVX2) and Intel® Advanced Vector Extensions 512 (Intel® AVX-512) architectures.

Intel® MKL 2019 Update 3 extends the above functionality, supporting JIT kernel generation for:

  • Real and complex precisions ({S,D,C,Z}GEMM);
  • Any matrix sizes, as long as one of m, n, k is less than 128; any alpha and beta; transposition of the A and B matrices;
  • Intel® Advanced Vector Extensions (Intel® AVX), Intel® Advanced Vector Extensions 2 (Intel® AVX2), and Intel® Advanced Vector Extensions 512 (Intel® AVX-512) architectures.
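The size limits in the two lists above can be restated as small predicates. This is only a sketch of the documented limits: the function names are hypothetical and not part of Intel® MKL, and the library's actual decision also depends on the architecture and on internal heuristics.

```c
#include <stdbool.h>

/* Hypothetical helpers: they only restate the size limits listed above.
   They are not part of Intel MKL and do not model its internal heuristics. */

/* Intel MKL 2019: JIT covers m, n, k <= 16. */
static bool jit_size_ok_2019(int m, int n, int k)
{
    return m <= 16 && n <= 16 && k <= 16;
}

/* Intel MKL 2019 Update 3: at least one of m, n, k must be below 128. */
static bool jit_size_ok_2019u3(int m, int n, int k)
{
    return m < 128 || n < 128 || k < 128;
}
```

For example, a 1000 × 1000 × 8 problem fails the first check but passes the second, because k is below 128.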

Note that applications using Intel® MKL 2019’s JIT functionality will run on all architectures; if JIT is not supported for the current architecture or problem size, Intel® MKL transparently falls back on standard GEMM routines.

Intel® MKL 2019 provides two interfaces for accessing JIT functionality. The first is completely transparent to the user and is activated through an extension of the existing MKL_DIRECT_CALL interface (see the MKL_DIRECT_CALL documentation for more details). The second consists of new dedicated Application Programming Interfaces (APIs) for advanced users willing to change their code; it can deliver even more performance by removing most of the associated call overhead.

The following sections describe how to use these features in more detail in C/C++. The new JIT functionality may also be accessed from Fortran 90, using a very similar interface. See the Fortran documentation or the examples bundled with Intel® MKL for more information on using JIT GEMM in Fortran 90.

Using JIT Without Changing Your Code: MKL_DIRECT_CALL_JIT

    The simplest way to take advantage of Intel® MKL 2019’s new JIT GEMM capabilities is to define the preprocessor macro MKL_DIRECT_CALL_JIT. No other changes are required. (If you are using MKL in sequential mode, define MKL_DIRECT_CALL_SEQ_JIT instead.)

When MKL_DIRECT_CALL_JIT is active and the user calls GEMM, MKL will decide whether JIT code generation could be beneficial for the GEMM problem given. If so, it will generate size and architecture-specific kernels, tailored to the given parameters (layout, transa, transb, m, n, k, alpha, lda, ldb, beta, ldc). These kernels are then cached and reused every time GEMM is called with the same set of parameters. If MKL decides JIT is not beneficial, the standard GEMM routine will be called as usual.
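The caching behavior described above can be pictured as a lookup table keyed on the GEMM parameters. The sketch below is a toy model in plain C, assuming a small fixed-size table and a linear search. All names (`gemm_key`, `kernel_cache`, `kernel_cache_lookup`) are hypothetical, and this is not how Intel® MKL implements its cache.

```c
#include <stddef.h>

/* Toy model only: all names are hypothetical, not Intel MKL internals. */
typedef struct {
    int layout, transa, transb;
    int m, n, k;
    int lda, ldb, ldc;
    double alpha, beta;
} gemm_key;

typedef struct {
    gemm_key key;
    int kernel_id;          /* stands in for a generated kernel */
} cache_entry;

#define CACHE_CAPACITY 16

typedef struct {
    cache_entry entries[CACHE_CAPACITY];
    size_t count;
    int generated;          /* number of "kernels" generated so far */
} kernel_cache;

/* Two keys match only if every parameter is identical. */
static int key_equal(const gemm_key *a, const gemm_key *b)
{
    return a->layout == b->layout && a->transa == b->transa &&
           a->transb == b->transb && a->m == b->m && a->n == b->n &&
           a->k == b->k && a->lda == b->lda && a->ldb == b->ldb &&
           a->ldc == b->ldc && a->alpha == b->alpha && a->beta == b->beta;
}

/* Return the cached kernel id for this parameter set, "generating" it on a
   miss. Returns -1 when the table is full; a caller could then fall back
   to standard GEMM. */
static int kernel_cache_lookup(kernel_cache *c, const gemm_key *key)
{
    for (size_t i = 0; i < c->count; i++)
        if (key_equal(&c->entries[i].key, key))
            return c->entries[i].kernel_id;     /* hit: reuse */

    if (c->count == CACHE_CAPACITY)
        return -1;

    cache_entry *e = &c->entries[c->count++];   /* miss: generate and store */
    e->key = *key;
    e->kernel_id = c->generated++;
    return e->kernel_id;
}
```

In this toy model, every call pays for a comparison against the stored parameter sets before the kernel can run, which gives a feel for why a lookup on each GEMM call is not free.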

The MKL_DIRECT_CALL_JIT and MKL_DIRECT_CALL_SEQ_JIT preprocessor macros allow you to quickly evaluate whether the JIT feature provides performance benefits to your applications, particularly those that call GEMM many times for small problem sizes.

Example

The following loop calls DGEMM repeatedly with different A, B and C matrices, but the same parameters layout, transa, transb, m, n, k, alpha, lda, ldb, beta, and ldc.

for (it = 0; it < nb; it++) {

    …

    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, 5, 3, 12, 1.0,
                a[it], 8, b[it], 8, 0.0, c[it], 8);

    …

}

Let’s assume that this code is extracted from a file named bench.c which is compiled using the following command line (Linux):

$ icc bench.c -o bench -DMKL_ILP64 -I${MKLROOT}/include -L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl -DMKL_DIRECT_CALL_SEQ_JIT

When running the generated executable bench on an Intel® AVX2 or Intel® AVX-512 system, the first iteration of the loop will generate a dedicated size and architecture-specific GEMM kernel and store it. Subsequent loop iterations retrieve the stored kernel and reuse it.
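For reference, each call in the loop above computes C := alpha * A * B^T + beta * C, where A is 5 × 12, B is 3 × 12, and C is 5 × 3, all stored column-major with leading dimension 8. The plain C routine below is a minimal, unoptimized sketch of that computation. The name `ref_dgemm_nt` is hypothetical; such a routine can be handy for checking the results of an optimized or JIT-generated kernel.

```c
#include <stddef.h>

/* Hypothetical reference routine, column-major storage:
   C(m x n) := alpha * A(m x k) * B(n x k)^T + beta * C.
   Element (i, j) of a matrix X with leading dimension ldx is X[i + j*ldx]. */
static void ref_dgemm_nt(int m, int n, int k, double alpha,
                         const double *a, int lda,
                         const double *b, int ldb,
                         double beta, double *c, int ldc)
{
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            double acc = 0.0;
            /* Row i of A times row j of B (that is, column j of B^T). */
            for (int p = 0; p < k; p++)
                acc += a[i + (size_t)p * lda] * b[j + (size_t)p * ldb];
            /* BLAS convention: when beta is zero, C is not read. */
            c[i + (size_t)j * ldc] =
                (beta == 0.0) ? alpha * acc
                              : alpha * acc + beta * c[i + (size_t)j * ldc];
        }
    }
}
```

With the parameters from the loop, the equivalent call would be `ref_dgemm_nt(5, 3, 12, 1.0, a[it], 8, b[it], 8, 0.0, c[it], 8)`.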

Performance 

The chart below illustrates the performance benefits of the MKL_DIRECT_CALL_JIT feature on an Intel® Xeon® Platinum 8180 processor for some small square matrix multiplications. The performance shown in this chart does not include the kernel generation time; it assumes that the kernel generation cost is amortized over a large number of SGEMM calls using the same input parameters but different matrices. The performance of conventional SGEMM and of SGEMM with MKL_DIRECT_CALL is provided for comparison. Data on other precisions can be found in the next section of this article.

These tests were conducted with alpha = beta = 1, lda = m, ldb = k, ldc = 16, and both A and B non-transpose.

To conclude, MKL_DIRECT_CALL_JIT allows the user to speed up small matrix multiplies with no code modifications. However, caching kernels and looking them up at runtime does incur some overhead. For best performance, the user should turn to Intel® MKL’s dedicated JIT API, described in the next section.

Explicitly Using the Just-In-Time (JIT) GEMM API for Maximum Performance

For best performance, the user can explicitly create optimized GEMM kernels generated for specific problem sizes and call them using function pointers. The new API consists of three groups of functions:

  • mkl_jit_create_{s,d,c,z}gemm (creates a JIT kernel)
  • mkl_jit_get_{s,d,c,z}gemm_ptr (gets a pointer to the kernel function)
  • mkl_jit_destroy (destroys a JIT kernel)

The real versions were introduced in Intel® MKL 2019; the complex versions were introduced in Intel® MKL 2019 Update 3. These functions and their associated data types are defined in mkl.h.

The GEMM JIT kernel and the required runtime code generator are generated and stored by calling mkl_jit_create_{s,d,c,z}gemm, which takes as inputs the standard GEMM input parameters (except the pointers to matrices A, B, and C) and a pointer where a handle to the code generator (an opaque pointer) will be stored. The mkl_jit_create_{s,d,c,z}gemm function returns a status code of type mkl_jit_status_t, whose value may be one of the following:

  • MKL_JIT_SUCCESS – indicates that a GEMM kernel has been generated;
  • MKL_NO_JIT – a GEMM kernel was not generated and standard GEMM will be used instead;
  • MKL_JIT_ERROR – an error occurred due to lack of memory.

There are several reasons MKL_NO_JIT may be returned:

  • JIT is not available for the current instruction set architecture;
  • Prior to Intel® MKL 2019 Update 3, the matrices were larger than the maximum supported size;
  • For Intel® MKL 2019 Update 3, the matrices are large enough that JIT may not be beneficial.

After creating the code generator, call mkl_jit_get_{s,d,c,z}gemm_ptr to retrieve a function pointer to the generated GEMM kernel. The function it points to performs the requested GEMM operation and takes four parameters: a handle to the code generator, and pointers to the A, B, and C matrices. Note that a valid pointer is returned even when mkl_jit_create_{s,d,c,z}gemm returns MKL_NO_JIT; in this case, standard GEMM will be used rather than a JIT-generated kernel.

Finally, when the kernel is no longer needed, the mkl_jit_destroy function frees the memory associated with the code generator and GEMM kernel.

Example

The code sample below illustrates how the loop in the previous example can be rewritten to use calls to the mkl_jit_create_dgemm, mkl_jit_get_dgemm_ptr, and mkl_jit_destroy functions.

// declare a handle on the code generator
void* jitter;

// create the code generator and generate the tailored GEMM kernel
// the first parameter is the address of the code generator handle

// the JIT API takes MKL_LAYOUT / MKL_TRANSPOSE values rather than the CBLAS enums
mkl_jit_status_t status = mkl_jit_create_dgemm(&jitter, MKL_COL_MAJOR, MKL_NOTRANS, MKL_TRANS, 5, 3, 12, 1.0, 8, 8, 0.0, 8);

// check if the code generator has been successfully created
if (MKL_JIT_ERROR == status) {

    fprintf(stderr, "Error: insufficient memory to JIT and store the DGEMM kernel\n");

    return 1;

}

// retrieve the function pointer to the DGEMM kernel
// void my_dgemm(void*, double*, double*, double*)
// it is safe to call mkl_jit_get_dgemm_ptr only if status != MKL_JIT_ERROR

dgemm_jit_kernel_t my_dgemm = mkl_jit_get_dgemm_ptr(jitter);

for (it = 0; it < nb; it++) {

    …

    // replace cblas_dgemm calls by calls to the generated DGEMM kernel
    // the first parameter is the handle on the code generator
    // followed by the three matrices

    my_dgemm(jitter, a[it], b[it], c[it]);

    …

}

// when the DGEMM kernel is no longer needed, free the memory:
// both the DGEMM kernel and the code generator are deleted

mkl_jit_destroy(jitter);

Performance

The charts below illustrate the performance benefits of the new JIT APIs on an Intel® Xeon® Platinum 8180 processor. The performance shown in the charts assumes that the kernel generation cost is completely amortized over a large number of {S,D}GEMM calls using the same input parameters (except the matrices). For comparison, the charts also show the performance of conventional {S,D}GEMM calls, the MKL_DIRECT_CALL feature, and the MKL_DIRECT_CALL_JIT feature described earlier, for various values of m, n, and k.

All tests were conducted with alpha = beta = 1, lda = m, ldb = k, ldc = 16, and both A and B non-transpose.

Deciding When To Use JIT

Here are some guidelines on when to use JIT, and which API to use:

  • If m, n, and k are all small (≤ 32), JIT is likely to be beneficial if your code will reuse the generated kernel at least 100-1000 times. The JIT API is recommended for best performance, due to overheads in MKL_DIRECT_CALL_JIT.
  • If one or two of m, n, and k are small (≤ 32) and the others are larger, MKL_DIRECT_CALL_JIT will not introduce much overhead and can be used to quickly determine whether JIT is useful. If so, you may also consider refactoring your code to use the JIT API for extra performance gains.
  • If m, n, and k are all larger (> 32), JIT may provide little to moderate speedup, depending on the exact problem. Try using MKL_DIRECT_CALL_JIT to gauge whether JIT is appropriate for your application.
  • MKL_DIRECT_CALL_JIT and the JIT API both use heuristics to determine whether or not to generate a JIT GEMM kernel. MKL_DIRECT_CALL_JIT is more conservative: it will generate kernels only when JIT is predicted to increase performance. The JIT API, in contrast, will generate kernels unless it is certain JIT will not increase performance. In Intel® MKL 2019 Update 3, the JIT API will not generate kernels if m, n, and k are all at least 128.

Currently, all JIT GEMM kernels are single-threaded. However, it is safe to create, call, and destroy kernels from multi-threaded code. If you have a GEMM problem where one of m, n, and k is very large and the other two are small, it may be worthwhile to divide the problem between multiple threads and use JIT on the subproblems.
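One way to sketch such a split: when m is very large and n and k are small, the rows of A and C can be divided into contiguous blocks, one per thread, and each block becomes an independent small GEMM. The helper below only computes the block boundaries. The name `row_block` and the even-split policy are assumptions made for illustration.

```c
/* Hypothetical helper: split `total` rows into `nparts` contiguous blocks
   whose sizes differ by at most one. Block `idx` (0-based) covers rows
   [*start, *start + *len). */
static void row_block(int total, int nparts, int idx, int *start, int *len)
{
    int base = total / nparts;      /* minimum rows per block */
    int extra = total % nparts;     /* first `extra` blocks get one more row */

    *start = idx * base + (idx < extra ? idx : extra);
    *len = base + (idx < extra ? 1 : 0);
}
```

With column-major storage and non-transposed A, the thread handling block idx would work on the pointers a + start and c + start, keeping lda and ldc unchanged, for a subproblem of size len × n × k. Because the blocks come in at most two sizes (base and base + 1 rows), at most two distinct kernels are needed under this scheme.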

Amortizing Code Generation Time

Because generating JIT kernels requires some time, it is typically advisable to use JIT only when you can reuse the generated kernels many times; as a rule of thumb, hundreds of times or more. (Recall that the performance charts in this article assume that the cost of JIT kernel generation is amortized across a large number of calls to the same GEMM kernel.)

Code generation time is only a significant issue for smaller problems, where a single GEMM call is hundreds of times faster than generating a JIT kernel. As the input matrices become larger, a single GEMM call requires more time, and code generation time becomes less and less important.
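The break-even point can be estimated from three timings: the kernel generation time t_gen, the per-call time of standard GEMM t_std, and the per-call time of the JIT kernel t_jit. JIT pays off once t_gen + N * t_jit < N * t_std, that is, once N > t_gen / (t_std - t_jit). The function below is a minimal sketch of that arithmetic; its name is hypothetical, and the timings are values you would measure on your own system.

```c
/* Hypothetical helper: smallest number of calls N for which
   t_gen + N * t_jit < N * t_std. All times must use the same unit.
   Returns -1 if the JIT kernel is not faster per call. */
static long breakeven_calls(double t_gen, double t_std, double t_jit)
{
    if (t_jit >= t_std)
        return -1;
    /* The ratio is non-negative, so the cast truncates like floor();
       adding 1 gives the first N strictly above the ratio. */
    return (long)(t_gen / (t_std - t_jit)) + 1;
}
```

For instance, if generation takes 1000 time units and each call drops from 3 units to 1, the kernel must be reused at least 501 times to come out ahead.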

The tables below provide estimates for the number of function calls required to justify the up-front kernel generation costs of JIT for some small problems, rather than using conventional Intel® MKL GEMM. For each problem, a range is given. The upper number in the range is representative of a program generating only one GEMM kernel. The lower number is representative of a program generating multiple GEMM kernels. In the tests, all leading dimensions were aligned to multiples of 64 bytes.

The data above should be taken as rough guidance only. The exact number of calls required to amortize kernel generation cost for your application will depend on many factors: the processor used, input parameters, kernel, and cache usage (data and instruction), among others.

Tips for Best Performance

  • As with other BLAS routines, align your data to a multiple of 64 bytes (the cache line width) for best performance, if possible. Aligning leading dimensions, especially ldc, to a multiple of 64 bytes may further improve performance.
  • If m is very small (approx. m ≤ 16, depending on precision and ISA) and k is moderate to large, storing matrix A transposed can improve performance due to increased vectorization efficiency.
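The alignment advice in the first tip can be sketched as follows, assuming a C11 compiler that provides `aligned_alloc`. The helper names `padded_ld` and `alloc_matrix` are hypothetical. The code rounds the leading dimension up so that each column spans a whole number of 64-byte cache lines, then allocates the matrix on a 64-byte boundary.

```c
#include <stdlib.h>
#include <stddef.h>

#define ALIGN_BYTES 64   /* cache line width mentioned above */

/* Hypothetical helper: round a leading dimension (in elements) up so that
   one column occupies a multiple of 64 bytes. elem_size must divide 64
   (true for float, double, and 8- or 16-byte complex types). */
static size_t padded_ld(size_t rows, size_t elem_size)
{
    size_t per_line = ALIGN_BYTES / elem_size;   /* elements per cache line */
    return (rows + per_line - 1) / per_line * per_line;
}

/* Allocate a column-major matrix of doubles with a 64-byte-aligned base and
   a padded leading dimension, returned through *ld. Uses C11 aligned_alloc;
   release the memory with free(). */
static double *alloc_matrix(size_t rows, size_t cols, size_t *ld)
{
    *ld = padded_ld(rows, sizeof(double));
    /* The size is a multiple of 64 because *ld * sizeof(double) is. */
    return aligned_alloc(ALIGN_BYTES, *ld * cols * sizeof(double));
}
```

For a matrix of doubles with 5 rows this yields a leading dimension of 8, which matches the lda, ldb, and ldc values used in the examples earlier in this article. When linking against Intel® MKL, mkl_malloc and mkl_free can be used instead of aligned_alloc and free.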

Summary

The new JIT capabilities in Intel® MKL 2019 significantly accelerate small matrix multiplies by generating GEMM kernels tailored to the inputs you provide. Two interfaces allow you to leverage JIT in your own applications. The MKL_DIRECT_CALL_JIT extension transparently enables JIT when it may be beneficial, allowing you to quickly gauge whether JIT can accelerate your application. Or, for best performance, a new dedicated JIT API gives you direct access to generated kernels. Refer to the Intel MKL Developer Reference for further details.
