Introduction
Intel® Math Kernel Library (Intel® MKL) is a highly optimized and extensively threaded math library suitable for computationally intensive applications. Intel® MKL is optimized for a broad range of Intel processors, enabling users to evaluate both the performance of their applications and of Intel® MKL itself on those processors.
The General Matrix-Matrix Multiplication (GEMM) function is commonly used in scientific, engineering, numerical computing, data analytics, and machine learning workloads. Intel® Math Kernel Library (Intel® MKL) provides single- and double-precision optimized matrix multiplication functions, [S,D]GEMM, which are vectorized and parallelized to take advantage of the latest Intel architecture instruction sets and their numerous computational cores. Interest in the performance of SGEMM, for example, has grown significantly due to its ability to accelerate Deep Neural Network (DNN) computations.
Factors such as problem size, cache size, number of cores, and the underlying computer architecture influence the performance of almost all Intel® MKL functions. In this article, we provide some tips on how to measure the performance of the single-precision general matrix multiply (SGEMM) function on Intel® Xeon® processors. The same tips can be applied to measure the performance of many other Intel® MKL functions as well.
To begin, review the Intel® MKL Reference Manual for a detailed description of the SGEMM function and its parameters:
cblas_sgemm is the C interface to SGEMM. It computes a matrix-matrix product with general matrices; its syntax is:
void cblas_sgemm (const CBLAS_LAYOUT Layout, const CBLAS_TRANSPOSE transa,
                  const CBLAS_TRANSPOSE transb, const MKL_INT m, const MKL_INT n,
                  const MKL_INT k, const float alpha, const float *a,
                  const MKL_INT lda, const float *b, const MKL_INT ldb,
                  const float beta, float *c, const MKL_INT ldc);
The ?gemm routines compute a scalar-matrix-matrix product and add the result to a scalar-matrix product, with general matrices. The operation is defined as
C := alpha*op(A)*op(B) + beta*C,
where:
op(X) is one of op(X) = X, op(X) = X^T, or op(X) = X^H,
Measuring the Performance of SGEMM
Tip 1: Ignore the time required for the first call
The Intel® Math Kernel Library (Intel® MKL) is multithreaded and employs internal buffers for fast memory allocation. Typically, the first subroutine call initializes the threads and internal buffers, so the first function call may take more time than subsequent calls with the same arguments. Although the initialization time is usually insignificant compared to the execution time of SGEMM for large matrices, it can be substantial when timing SGEMM for small matrices. To remove the initialization time from the performance measurement, we recommend making a call to SGEMM with sufficiently large parameters (for example, M = N = K = 100) and ignoring the time required for this first call. Using a small matrix for the first call won't initialize the threads, since Intel MKL executes multithreaded code only for sufficiently large matrices.
Tip 2: Select an appropriate timing function and measure the average performance
Intel MKL provides a timing function, dsecnd(), which measures execution time in seconds. The execution time of a subroutine may vary from one call to another, and small problems are especially susceptible to time variations due to system artifacts. Therefore, for functions with small execution times, it is common practice to measure the average performance by placing the function within a loop. The total elapsed time divided by the loop count gives the average time required for a single function call. The loop count should be large enough to obtain a representative average for small problems; on the other hand, if too large a loop count is chosen, the execution time of the benchmark may become prohibitive for large problems. As a rule of thumb, the timed loop should run for at least one second in total.
Developers often use the system timing function gettimeofday(&end_time, NULL), which returns seconds and microseconds (µs). We use this timing function in this article.
Tip 3: Measure performance in giga floating point operations per second (GFlops)
Floating point operations per second (FLOPS) is a useful metric for comparing the performance of compute-bound subroutines like SGEMM with the theoretical peak performance of a machine. For the multiplication of an M×K matrix A and a K×N matrix B, 2K − 1 operations (K − 1 additions and K multiplications) are required to compute each element of the result matrix. Since there are M·N entries in the C matrix, M·N·(2K − 1) operations are required for the multiplication of the two matrices. 2·M·N additional operations are required for adding scaled C to A·B. Therefore, the total number of floating point operations for a typical SGEMM call is approximately 2·M·N·K. Dividing the number of operations by the average time gives the average FLOPS rate for SGEMM. Typically, performance is reported in GigaFLOPS (GFlops), which is 10^9 FLOPS. An example code that determines the time and GFlops for SGEMM is provided below.
Theoretical peak GFLOPS of the test machine
The theoretical peak GFLOPS of an Intel processor can be calculated as:
(total number of cores) x (CPU frequency in GHz per core) x (floating point operations per CPU cycle, fpc)
where fpc = (number of floating point elements in one SIMD register) x 2 (one multiply and one add per FMA) x (number of FMA units per CPU cycle).
Some examples are below:
- Intel® Core™ i5-6300U CPU @ 2.40 GHz, 2 cores: a 6th Generation Intel® Core™ i5 processor that supports AVX2 with 2 FMA units (each performing 1 multiply and 1 add). For single-precision floating point (32-bit), fpc = 256/32 x 2 x 2 = 32, and peak GFLOPS = 2 x 2.4 x 32 = 153.6 GFlops.
- Intel® Xeon® Scalable processors: Intel(R) Xeon(R) Platinum 8180M CPU @ 2.50 GHz, 56 cores, AVX-512, 2 FMA units: fpc = 512/32 x 2 x 2 = 64, and peak GFLOPS = 56 x 2.5 x 64 = 8960, i.e. 8.96 TFLOPS.
- Intel(R) Xeon(R) Silver 4110 CPU @ 2.10 GHz, AVX-512 supported, 32 logical CPUs: peak ≈ 4.3 TFLOPS. All test results below are based on this machine.
The peak GFLOPS numbers for some Intel processors are reported here.^1
Example Code
The code below measures the performance of SGEMM using the dsecnd() function in Intel MKL. The return value of the first dsecnd() call may be slightly off, so we recommend discarding it.
/* mkl.h is required for dsecnd and SGEMM */
#include <mkl.h>
/* initialization code is skipped for brevity
   (do a dummy dsecnd() call to improve the accuracy of timing) */
float alpha = 1.0, beta = 1.0;
/* first call, which does the thread/buffer initialization */
SGEMM("N", "N", &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);
/* start timing after the first GEMM call */
double time_st = dsecnd();
for (i = 0; i < LOOP_COUNT; ++i) {
    SGEMM("N", "N", &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);
}
double time_end = dsecnd();
double time_avg = (time_end - time_st) / LOOP_COUNT;
double gflop = (2.0 * m * n * k) * 1E-9;
printf("Average time: %e secs\n", time_avg);
printf("GFlop       : %.5f\n", gflop);
printf("GFlop/sec   : %.5f\n", gflop / time_avg);
Configuration Information of Test machine
Hardware: Intel(R) Xeon(R) Silver 4110 CPU, 2.10 GHz, 2 sockets x 8 cores/socket x 2 hyper-threads, 32 logical CPUs in total, AVX-512 supported, 11264K L3 cache, 32 GB memory
Operating System: Ubuntu 16.04.3 LTS, x86_64 GNU/Linux
Versions: compilers_and_libraries_2018.1.163, Intel® MKL 2018 Update 1, gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5)
Building the example code
Intel® Parallel Studio XE Composer Edition includes both the Intel® C/C++ Compiler and Intel® MKL. Try the Intel® C/C++ Compiler first.
- Build the sample code on the Intel® Xeon® processor with the Intel® C/C++ compiler:
>source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64
> icc sgemm_cblas.cpp -fopenmp -lmkl_rt -o sgemm_icc
Run the application:
> ./sgemm_icc
The test multiplies a 2000x2048 matrix A by a 2048x2048 matrix B and achieves 467.39 GFlops:

$ ./sgemm_icc
LOOP COUNT : 100
m=2000,n=2048,k=2048 cores=32
Average time:0.035895 secs
GFlop :16.77722
GFlop/sec :467.39181 GFlops
- Build the sample code on the Intel® Xeon® processor with the GNU C++ compiler:
> source /opt/intel/compilers_and_libraries/linux/mkl/bin/mklvars.sh intel64
> gcc sgemm_cblas.cpp -fopenmp -lmkl_rt -o sgemm_gcc
>./sgemm_gcc
Average time:0.037357 secs
GFlop :16.77722
GFlop/sec :449.10237 GFlops
Note: the Intel® C/C++ compiler and GNU GCC builds have similar performance.
With the GNU GCC build we get 449.10 GFlops, seemingly less than the binary built with the Intel® C++ compiler.
In fact, the results of sgemm_icc and sgemm_gcc vary from run to run on the test machine, roughly between 377.74 GFlops and 498.21 GFlops. Since Intel MKL is linked as a prebuilt binary, both the Intel® C/C++ compiler and GNU GCC link exactly the same code, so they deliver similar performance within the SGEMM loop.
To align with the most common usage scenario, we use the GNU C++ compiler and the system timing function gettimeofday in the following tests.
Tip 4: Set KMP_AFFINITY to avoid thread migration
To obtain a more stable, and often better, performance result, we need to consider the hardware architecture and OS features. For the best performance on systems with multicore processors, ensure that threads do not migrate from core to core. To do this, bind threads to the CPU cores by setting an affinity mask. For the GEMM performance test, the KMP_AFFINITY environment variable is useful:
Intel HyperThreading Technology Enabled:
Linux*/macOS*: export KMP_AFFINITY=compact,1,0,granularity=fine
Windows*: set KMP_AFFINITY=compact,1,0,granularity=fine
Intel HyperThreading Technology Disabled:
Linux*/macOS*: export KMP_AFFINITY=compact
Windows*: set KMP_AFFINITY=compact
Testing the performance again, we achieve a better and more stable result of ~559.9 GFlops:
> export KMP_AFFINITY=compact,1,0,granularity=fine
Average time:0.029963 secs
GFlop :16.77722
GFlop/sec :559.93467 GFlops
Tip 5: Data Alignment
On current Intel processors, the cache line is usually 64 bytes, and data reads/writes are aligned to cache lines. To improve the performance of an application that calls Intel® MKL, align the test arrays on 64-byte boundaries, using mkl_malloc and mkl_free to allocate and free the aligned memory:
A = (float*) mkl_malloc(sizeof(float)*lda*m, MKL_MEM_ALIGNMENT);
> gcc sgemm_cblas.cpp -fopenmp -lmkl_rt -o sgemm_align -DMKL_ALIGN
Average time:0.029767 secs
GFlop :16.77722
GFlop/sec :563.61550 GFlops
Tip 6: Last but not least: avoid leading dimensions that are multiples of 256
The leading dimension of an m x n matrix, lda, is the distance in memory between successive columns (or, in row-major layout, the stride between two rows).
As mentioned above, current processors have cascaded caches, and reads/writes operate on 64-byte cache lines (16 single-precision elements). To avoid cache conflicts, the leading dimension should be a multiple of the cache line (a multiple of 16 for single precision), but not a power of 2, i.e., not a multiple of 256, not a multiple of 128, etc.
To improve the performance of your application that calls Intel® MKL, ensure that the leading dimensions of the arrays are divisible by 64/element_size, where element_size is the number of bytes per matrix element (4 for single-precision real, 8 for double-precision real and single-precision complex, and 16 for double-precision complex).
Because present processors use a cascaded cache structure (set > cache line), also avoid leading dimensions that are multiples of 128 to prevent cache stalls: if ldim % 128 == 0, add 16 to the leading dimension.
Generally, set the leading dimension to the following integer expression:
(((n * element_size + 511) / 512) * 512 + 64) / element_size,
Or
#define FIX_LD(x) (((x + 255) & ~255) + 16)
where n is the matrix dimension along the leading dimension.
> gcc sgemm_cblas.cpp -fopenmp -lmkl_rt -o sgemm_align -DMKL_ALIGN -DMKL_LD
Average time:0.029493 secs
GFlop :16.77722
GFlop/sec :568.84858 GFlops
Tip 7: GEMM variations
Intel® MKL typically offers parallel, high-performing GEMM implementations to leverage the concurrent threads supported by modern multicore architectures. This strategy works well when multiplying large matrices, because all cores are used efficiently. When multiplying small matrices, or in special usage scenarios, however, classical GEMM calls may not use all the cores optimally. Therefore, GEMM variations have been developed, and in the latest versions Intel® MKL provides MKL_DIRECT_CALL, compact GEMM, batch APIs, packed APIs, and more. To try them, please see the Intel® MKL Developer Reference.
Summary
Measuring the performance of general matrix-matrix multiplication (GEMM) is challenging but very important for computationally intensive applications. Intel® MKL provides highly optimized and extensively threaded GEMM functions. In this article, we explained how to design and measure SGEMM performance using Intel® MKL, and outlined seven tips to help developers run performance tests and quickly evaluate the floating point computing capability (FLOPS) of a given processor:
Tip 1: Ignore the time required for the first call
Tip 2: Select an appropriate timing function and measure the average performance
Tip 3: Measure performance in giga floating point operations per second (GFlops)
Tip 4: Set KMP_AFFINITY to avoid thread migration
Tip 5: Data Alignment
Tip 6: Avoid leading dimensions that are multiples of 256
Tip 7: GEMM variations
References
https://software.intel.com/en-us/mkl-windows-developer-guide-using-mkl-direct-call-in-c-applications
https://software.intel.com/en-us/articles/introducing-batch-gemm-operations
https://software.intel.com/en-us/articles/introducing-the-new-packed-apis-for-gemm
Notices
All GFLOPS, CTP, and APP calculations contained were based on specifications taken from Intel datasheets and are subject to change without notice. Intel makes no representation or warranty as to the accuracy or reliability of such specifications. THESE CALCULATIONS ARE PROVIDED "AS IS" WITH NO WARRANTIES WHATSOEVER, INCLUDING ANY WARRANTY OF MERCHANTABILITY, NONINFRINGEMENT, FITNESS FOR ANY PARTICULAR PURPOSE OR ANY WARRANTY OTHERWISE ARISING OUT OF ANY PROPOSAL, SPECIFICATION OR SAMPLE. Intel disclaims all liability, including liability for infringement of any proprietary rights, relating to use of information in these calculations. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and noninfringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.
Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.
This sample source code is released under the Intel Sample Source Code License Agreement.
Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
© 2017 Intel Corporation
Comments
Hello Ying
Could you re-upload your result plot?
Thanks
Patrick
Hello SG
As this place is not for discussion, I have moved our discussion to the MKL Forum: http://software.intel.com/en-us/forums/topic/484720
so that more developers can get involved.
Best Regards,
Ying
Hello Ying,
Thanks for the info on being able to issue 1 multiply and 1 add in one cycle. A few more questions: (1) Is it also possible to issue two FMA instructions in the same clock? If so, wouldn't the FMA-based implementation be 4 times faster than the SSE-based implementation? (2) You mention
where exactly is their statement?
Hello Sergey,
The problem is not about FMA (officially, it was first supported in AVX2); as Max and Tim said, the current processor can issue 1 multiply and 1 add in one cycle (or you can take it as two SSE units). The W5580 is from the Nehalem (Core i7) family; it can perform 4 double-precision floating point operations per cycle.
Best Regards,
Ying
Hello Ying,
Thanks for the response, but please clarify: SSE3 does _not_ have fused multiply-add, so how can it perform 4 double-precision floating point operations?
Hi Sergey,
Thanks for your comments. The sample is attached to the article (please click the Download button at the end of the page). You may test it with 2048x2048.
@SG.
Thank you for asking; I checked them in detail.
First: right, the GFLOPS in the article are based on double-precision floating point.
Actually, the i7-2600K can run two AVX (256-bit) operations in one cycle, one MUL and one ADD, so 2 x 4 = 8 double-precision floating point operations per cycle in total; the peak performance of the 3.4 GHz AVX chip using 4 threads is 3.4 x 8 x 4 = 108.8 GFlops.
For that paper, the processor is the W5580: one SSE3 (128-bit) core performs 4 double-precision floating point operations per cycle, so the peak performance of the 3.2 GHz SSE3 chip using 4 threads is 3.2 x 4 x 4 = 51.2 GFlops.
But the figure shows 8 threads, which should mean the test is based on 2 packages (we will ask the author to confirm it).
So you can expect a chip with AVX to deliver ~2x the FLOPS of a chip without AVX.
Best Regards,
Ying
Please see the formal doc on peak FLOPS for the i7-2600K: http://www.intel.com/support/processors/sb/CS-032814.htm?wapkw=peak+flops
and peak FLOPS for the Xeon 5580: http://www.intel.com/support/processors/xeon/sb/CS-020863.htm?wapkw=peak+flops
and the forum discussion at http://software.intel.com/en-us/forums/topic/291765
The performance reported here seems to be half as good as it should be. Here's what I mean: the graph on http://software.intel.com/en-us/articles/parallelism-in-the-intel-math-kernel-library is for a comparable CPU without AVX. The graph on this page is with AVX. Both graphs report similar performance, but one would expect the chip with AVX to perform about twice as well as the chip without AVX.
Here's my guess about the discrepancy: for the graph on this page, the "F" in "GFlops" refers to double-precision floating point, but for the graph on the other page, the "F" is single-precision floating point. One can confirm the interpretation of "F" for this page from the equation provided here. For the interpretation of "F" on the other page, note that it says the peak performance of a 3.2 GHz SSE3 chip using 8 threads is 102.4 GFlops; dividing 102.4 by 3.2 GHz gives 32 floating point operations per clock per 8 threads, or 4 floating point operations per clock per core; 4 floating point operations per clock per core on an SSE3 chip means single-precision floating point operations. But this paragraph is just my guess; it would be nice if someone from Intel confirmed it or clarified the apparently missing performance when using AVX.
I wish for a better C/C++ example (with the full source code!) and test results for matrices 2048x2048 and larger.