Tips to Measure the Performance of Matrix Multiplication Using Intel® MKL

Introduction

Intel® Math Kernel Library (Intel® MKL) is a highly optimized and extensively threaded math library suitable for use with computationally intensive applications. Intel® MKL has been optimized for a broad range of Intel processors enabling users to evaluate the performance of their applications, and Intel® MKL on those processors.

The General Matrix-Matrix Multiplication (GEMM) function is a commonly used function in scientific, engineering, numerical computing, data analytics and machine learning workloads. Intel® Math Kernel Library (Intel® MKL) provides single and double precision optimized matrix multiplication functions, [S,D]GEMM, which have been vectorized and parallelized to take advantage of the latest Intel architecture instruction sets and their numerous computational cores. Interest in the performance of SGEMM, for example, has grown significantly due to its ability to accelerate Deep Learning Neural Network (DNN) computations.

Factors such as the problem size, cache size, number for cores, and the underlying computer architecture influence the performance of almost all Intel® MKL functions. In this article, we provide some tips on how to measure the single-precision general matrix multiply (SGEMM) function performance on Intel® Xeon processors. These same tips can be applied to measure the performance of many other Intel® MKL functions as well.

To begin, review the Intel® MKL Reference Manual for a detailed description of SGEMM function and its parameters:

cblas_sgemm is the C interface of SGEMM, it's syntax

Computes a matrix-matrix product with general matrices
void cblas_sgemm (const CBLAS_LAYOUT Layout, const CBLAS_TRANSPOSE transa, const
CBLAS_TRANSPOSE transb, const MKL_INT m, const MKL_INT n, const MKL_INT k, const float
alpha, const float *a, const MKL_INT lda, const float *b, const MKL_INT ldb, const
float beta, float *c, const MKL_INT ldc);

the ?gemm routines compute a scalar-matrix-matrix product and add the result to a scalar-matrix product, with general matrices. The operation is defined as
C := alpha*op(A)*op(B) + beta*C,
where:
op(X) is one of op(X) = X, or op(X) = XT, or op(X) = XH,

Measuring the Performance of SGEMM

Tips 1: ignore the time required for the first call
The Intel® Math Kernel Library (Intel® MKL) is multi-threaded and employs internal buffers for fast memory allocation. Typically the first subroutine call initializes the threads and internal buffers. Therefore, the first function call may take more time compared to the subsequent calls with the same arguments. Although the initialization time usually insignificant compared to the execution time of SGEMM for large matrices, it can be substantial when timing SGEMM for small matrices. To remove the initialization time from the performance measurement, we recommend making a call to SGEMM with sufficiently bigger parameters (for example, M=N=K=100) and ignoring the time required for the first call. Using a small matrix for the first call won’t initialize the threads since Intel MKL executes multi-threaded code only for sufficiently large matrices.

Tips 2: Select appropriate time function and measure the average performance
Intel MKL provides the timing function, dsecnd(), which measures the execution time in seconds. The execution time of a subroutine may vary from one call to another, and small problems are especially susceptible to the time variations due to system artifacts. Therefore, for functions with small execution times, it is a common practice to measure the average performance by placing the function within a loop. The total elapsed time divided by the loop count gives the average time required for a single function call. The loop count should be large enough to get a representative average for small problems. On the other hand, if a large loop count is chosen, then the execution time of the benchmark may be prohibitive for large problems. By practice, we expect the loop will take at least >1 second.

Developer may often use the system time measure function gettimeofday(&end_time,NULL), which give the number of seconds and microseconds(us). We use the time function in the article. 

Tips 3: Measure Performance in giga floating point operations per second (Gflops)

Floating point operations per second (FLOPS) is a useful metric to compare the performance of compute-bound subroutines like SGEMM with the theoretical peak performance of a machine. For the multiplication of an M×K A matrix and a K×N B matrix, 2K - 1 operations (K-1 additions and K multiplications) are required to compute each element of the result matrix. Since there are MN entries in the C matrix, MN(2K-1) operations are required for the multiplication of the two matrices. 2MN additional operations are required for adding scaled C to AB. Therefore, the total number of floating point operations for a typical SGEMM call is approximately 2MNK. Dividing the number of operations by the average time gives the average Flops rate for SGEMM. Typically, the performance is reported as Gigaflops(GFlops), which is 10Flops. An example code that determines the time and GFlops for SGEMM is provided below.

Theoretical Peak GFLOPS of test machine

The Peak GFLOPS numbers for Intel processor can be calculated in theory

The total number of cores on the machine (cores) * CPU frequency in GHz per core(hz) * float operations per CPU cycle (fpc)

Where float operations per CPU cycle (fpc) = the floating point numbers in one SIMD register x  number of SIMD operands * SIMD operands per CPU cycle

Some examples are below:

- Intel® Core™ i5-6300U  CPU@2.4G 2 core, which is 6th Generation Intel® Core™ i5 Processors and support AVX2, 2 FMA (1 mutiply and 1 add). So for the single-precision floating point (32bit), the fpc = 256/32 x2x2 =32. And Peak GFLOPS=2x2.4x32=153.6Gflops

- Intel® Xeon® Scalable Processors : Intel(R) Xeon(R) Platinum 8180M CPU @ 2.50GHz, 56 cores, AVX512 , 2 FMA support

The fpc = 512/32 x2 x2 =64. The Peak GLOPS=56x2.5x64=8.96TFLOPS.

- Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz, AVX512 supported 32 cores, the Peak GLOPS=4.3TFLOPS, all of tests result are based on the machines. 

The GFLOPS numbers for some Intel processors reported here.1

Example Code

Below code measures the performance of SGEMM using dsecnd() function in Intel MKL. The return value of the first dsecnd() may be slightly off, therefore we recommend discarding the return value of the first dsecnd() call.   

/* mkl.h is required for dsecnd and SGEMM */
#include <mkl.h>
/* initialization code is skipped for brevity (do a dummy dsecnd() call to improve accuracy of timing) */
float alpha = 1.0, beta = 1.0;
/* first call which does the thread/buffer initialization */
SGEMM(“N”, “N”, &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);
/* start timing after the first GEMM call */
double time_st = dsecnd();
for (i=0; i<LOOP_COUNT; ++i)
{
     SGEMM("N", "N", &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);
}
double time_end = dsecnd();
double time_avg = (time_end - time_st)/LOOP_COUNT;
double gflop = (2.0*m*n*k)*1E-9;
printf("Average time: %e secs n", time_avg);
printf("GFlop       : %.5f  n", gflop);
printf("GFlop/sec   : %.5f  n," gflop/time_avg); 

Configuration Information of Test machine 

Hardware: Intel(R) Xeon(R) Silver 4110 CPU Processor, 2.10GHz, 2 sockets X 8 core/socket x 2 hyper threading threads, total 32 logical CPUS, AVX512 supported, 11264K  Level3 Cache, 32G Memory

Operation System: Ubuntu 16.04.3 LTS, x86_64 GNU/Linux

Versions: Compilers_and_libraries_2018.1.163, Intel® MKL 2018 update 1,  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)

Building the example code

Intel® Parallel Composer XE include both Intel® Compiler and Intel® MKL. Try Intel® C/C++ Compiler first. 

  • Build the sample code on the Intel® Xeon coprocessor with Intel® C/C++ compiler

         >source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64

         > icc sgemm_cblas.cpp -fopenmp -lmkl_rt -osgemm_icc

    run the application:

     > ./sgemm_icc

The test is using matrix A with 2000x2048  * matrix B 2048x2000 and get the Gflops : 467.39181

  • $ ./sgemm_icc 

    LOOP COUNT           : 100

    m=2000,n=2048,k=2048 cores=32

    Average time:0.035895 secs

    GFlop       :16.77722

    GFlop/sec   :467.39181 GFlops

  • Build the sample code natively on the Intel® Xeon coprocessor with GNU C++ compiler

    > source /opt/intel/compilers_and_libraries/linux/mkl/bin/mklvars.sh intel64

    > gcc sgemm_cblas.cpp -fopenmp -lmkl_rt -osgemm_gcc

    >./sgemm_gcc

    Average time:0.037357 secs

    GFlop       :16.77722

    GFlop/sec   :449.10237 GFlops

Note:  Both Intel® C/C++ compiler and GNU GCC compiler have similar performance

With GNU GCC build , we get the Gflops : 449.10.  It seems less performance than the binary build from intel® C++ compiler

Actually the result of sgemm_icc and sgemm_gcc varied from time to time on test machine, about from  377.73991 GFlops  ~ 498.21456 GFlops . As Intel MKL are linked as binary, both Intel® C/C++ compiler and GNU GCC compiler will have same binary linked, so they have similar performance within the sgemm function loop.

In order to align with general usage scenario, we use GNU C++ compiler and system time measure function gettimeofday in following test. 

Tips 4. Set KMP_AFFINITY to avoid thread migration

In order to get more stable or better performance test result, we need consider hardware architecture and OS feature. To best performance on systems with multi-core processors by requiring that threads do not migrate from core to core. To do this, bind threads to the CPU cores by setting an affinity mask to threads. For the gemm performance test, KMP_AFFINITY environment variable are useful:

Intel Hyper-Threading Technology Enabled:

Linux*/macOS*: export KMP_AFFINITY=compact,1,0,granularity=fine

Windows*: set KMP_AFFINITY=compact,1,0,granularity=fine

Intel Hyper-Threading Technology Disabled:

Linux*/macOS*: export KMP_AFFINITY=compact

Windows*: set KMP_AFFINITY=compact

Test the performance:  achieve better and stable result ~559.9 GFlops

> export KMP_AFFINITY=compact,1,0,granularity=fine

> Average time:0.029963 secs

GFlop       :16.77722

GFlop/sec   :559.93467 GFlops

Tip 5: Data Alignment

In present Intel Processor, the cache line is usually 64 byte and the data read/write is aligned for cache line. To improve performance of one application that calls Intel® MKL, align test arrays on 64-byte boundaries, use mkl_malloc and mkl_free for allocating and freeing aligned memory

A = (float*) mkl_malloc(sizeof(float)*lda*m, MKL_MEM_ALIGNMENT);

>gcc sgemm_cblas.cpp -fopenmp -lmkl_rt -osgemm_align -DMKL_ALIGN

Average time:0.029767 secs

GFlop       :16.77722

GFlop/sec   :563.61550 GFlops

Tips 6: the last but least important :  Avoid leading dimensions that are multiples of 256

A leading dimension of matrix mxn, lda, is the distance between successive columns in memory, or stride between two rows.

As we mentioned, the present process have cascade caches. And read/write cache are 64byte (multiple of 16 for single precision) related, in order to avoid cache conflict, we expected the Leading dimensions should be multiple of cache line (multiple of 16 for single precision), but not the power of 2, like not multiple of 256, not multiple of 128 etc..  

To improve performance of your application that calls Intel® MKL, ensure that the leading dimensions of the arrays are divisible by 64/element_size, where element_size is the number of bytes for the matrix elements (4 for single-precision real, 8 for double-precision real and single precision complex, and 16 for double-precision complex)

But as present processor use cache-cascading structure: set->cache line.  In order to avoid the cache stall issue, we suggest to avoid leading dimension are multiples of 128, If ldim % 128 = 0, then add 16 to the leading dimension

Generally, set the leading dimension to the following integer expression:
(((n * element_size + 511) / 512) * 512 + 64) /element_size,

Or

#define FIX_LD(x)   (((x + 255) & ~255) + 16)
where n is the matrix dimension along the leading dimension.

gcc sgemm_cblas.cpp -fopenmp -lmkl_rt -osgemm_align -DMKL_ALIGN –DMKL_LD

Average time:0.029493 secs

GFlop       :16.77722

GFlop/sec   :568.84858 GFlops

Other Tips : GEMM variations

Intel® MKL typically offers parallel high-performing GEMM implementations to leverage the concurrent threads supported by modern multi-core architectures. This strategy works well when multiplying large matrices because all cores are used efficiently. When multiplying small matrices or special usage scenery, however, classical GEMM calls may not optimally use all the cores. Therefore all GEMM variation are researched and  in the latest version, Intel® MKL provide  MKL_DIRECT_CALL, Compact GEMM, Batch APIs and Packed API etc. for try them, please see Intel® MKL Developer Reference Manual

Summary

One challenging―but also very important―measure the performance of general matrix-matrix multiplication (GEMM) for computationally intensive applications.  Intel® MKL provides a highly optimized and extensively threaded GEMM functions. In this article, we explain how to design and measure of the performance using Intel® MKL SGEMM,  and outline  about 7 tips to help developers to perform the performance test and quickly evaluate the floating pointing computing capability (FLOPS) on specified processor.  

Tips 1: ignore the time required for the first call

Tips 2: Select appropriate time function and measure the average performance

Tips 3: Measure Performance in giga floating point operations per second (Gflops)

Tips 4. Set KMP_AFFINITY to avoid thread migration

Tips 5: Data Alignment

Tips 6: Avoid leading dimensions that are multiples of 256

Tips 7: GEMM variations

Reference

https://software.intel.com/en-us/mkl-windows-developer-guide-using-mkl-direct-call-in-c-applications

https://software.intel.com/en-us/articles/intelr-math-kernel-library-introducing-vectorized-compact-routines

https://software.intel.com/en-us/articles/introducing-batch-gemm-operations

https://software.intel.com/en-us/articles/introducing-the-new-packed-apis-for-gemm

Notices

All GFLOPS, CTP, and APP calculations contained were based on specifications taken from Intel datasheets and are subject to change without notice. Intel makes no representation or warranty as to the accuracy or reliability of such specifications. THESE CALCULATIONS ARE PROVIDED "AS IS" WITH NO WARRANTIES WHATSOEVER, INCLUDING ANY WARRANTY OF MERCHANTABILITY, NONINFRINGEMENT, FITNESS FOR ANY PARTICULAR PURPOSE OR ANY WARRANTY OTHERWISE ARISING OUT OF ANY PROPOSAL, SPECIFICATION OR SAMPLE. Intel disclaims all liability, including liability for infringement of any proprietary rights, relating to use of information in these calculations. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted.

 

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

This sample source code is released under the Intel Sample Source Code License Agreement.

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2017 Intel Corporation

There are downloads available under the Intel Sample Source Code License Agreement license. Download Now
For more complete information about compiler optimizations, see our Optimization Notice.

Comments



Hello Ying,

Hello Ying,

Thanks for the info on being able to issue 1 multiply and 1 add in one cycle. Few more questions: (1) Is it also possible to issue two FMA instructions in the same clock? If so, wouldn't the FMA based implementation be 4 times faster than the SSE based implementation? (2) You mention

.. as Max and Tim said, ...

where exactly is their statement?


Hello Sergey,

Hello Sergey,

The problem is not about FMA (officially, it was supported in AVX2), but as Max and Tim said, current processor can issue 1 mulply and 1 add at one cycle (or you can take it as two SSE units). W5580 is from Nehalem (core i7) family, it can perform 4 double floating point operations.

Best Regards,
Ying


Hello Ying,

Hello Ying,

Thanks for the response, but please clarify: SSE3 does _not_ have fused multiply-add, so how come it can perform 4 double floating point operations?:

one sse3 (128bit) have 4 double floating point operation


Hi Sergey,

Hi Sergey,

Thanks you comments, the sample was attached in the paper (please click the Download button at the end of pape). You may test it with 2048x2048.

@SG.
Thanks you for asking I check them in details.
first, Right, the GLOPS in the article is based on double floating point, .
Actually, i7-2600k can run two avx (265bit) in one cycle, one MUL, one ADD, so totally, 2x4=8 double floating point operation, so peak Performance of 3.4 GHz avx chip using 4 threads is 3.4x8x4 =108.8 GFlops.
for that paper, the processor: w5580, one sse3 (128bit) have 4 double floating point operation, the peak Performance of 3.2 sse3 chip using 4 threads is 3.2x4x4=51.2 GFlops.
But the figure on have 8 threads, which should mean that the test is based on 2 packages ( we will ask the author to confirm it).

So you can expect chip with avx to perform ~2x on FLOPS than chip without AVX.

Best Regards,
Ying
Please see formal doc : peak FLOP about i7-2600K: http://www.intel.com/support/processors/sb/CS-032814.htm?wapkw=peak+flops

and peak FLOP for Xeon 5580 : http://www.intel.com/support/processors/xeon/sb/CS-020863.htm?wapkw=peak+flops

and forum dicussion in http://software.intel.com/en-us/forums/topic/291765

.


The performance reported here

The performance reported here seems to be half as good as it should be. Here's what I mean: graph on page http://software.intel.com/en-us/articles/parallelism-in-the-intel-math-kernel-library is for comparable CPU without AVX. Graph on this page is with AVX. But both graphs report similar performance -- but one would expect chip with AVX to perform twice as better than chip without AVX.

Here's my guess about the discrepancy: For the Graph on this page, the "F" in "GFlops" referrs to double-precision floating point, but for the Graph on the other page, the "F" is single-precision floating point. One can confirm the interpretation of "F" for this page based on the equation provided in this page. For the intrepretation of "F" on the other page, note that the other page says Peak Performance of 3.2 GHz SSE3 chip using 8 threads is 102.4 GFlops; dividing 102.4 by 3.2 GHz gives 32 floating point operations per clock per 8 threads -- or 4 floating point operations per clock per core; 4 floating point operations per clock per core on SSE3 chip means single precision floating point operations. But this para is just my guess -- would be nice if someone from Intel confirmed it or clarified the apparent missing performance on using AVX.