Tips to measure the performance of Intel® MKL with small matrix sizes

Introduction

Intel® Math Kernel Library (Intel® MKL) is a highly optimized and extensively threaded math library especially suitable for computationally intensive applications. Developers often want to evaluate the performance of Intel MKL. Many factors contribute to the performance of an Intel MKL subroutine, such as problem size, memory size, parallelism, states of caches, branch prediction logic, and so on. In this article, we provide a simple recommendation for improving the accuracy of performance measurements: ignore the time required by the first Intel MKL call. We use performance measurement of DGEMM, double-precision general matrix multiplication, as an example. Please refer to the BLAS section of the Intel MKL Reference Manual for a detailed description of DGEMM.

Measuring the Performance of DGEMM

Intel MKL is multi-threaded and employs internal buffers for fast memory allocation. Typically, the first subroutine call initializes the threads and the internal buffers. Therefore, the first function call may take more time than subsequent calls with the same arguments. Although the initialization time is usually insignificant compared to the execution time of DGEMM for large matrices, it can be substantial when timing DGEMM for small matrices. To remove the initialization time from the performance measurement, we recommend making a call to DGEMM with sufficiently large parameters (for example, M=N=K=100) and ignoring the time required for that first call. Using a small matrix for the first call won't initialize the threads, since Intel MKL executes multi-threaded code only for sufficiently large matrices.
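As an illustration, a minimal warm-up sketch might look like the code below. The helper name warm_up_dgemm, the use of mkl_malloc, and the initialization values are illustrative assumptions, not part of the sample attached to this article:

#include <mkl.h>  /* DGEMM, mkl_malloc, mkl_free, MKL_INT */

/* Call DGEMM once with M = N = K = 100 so that the threads and internal
   buffers are initialized; the time of this call is simply discarded. */
static void warm_up_dgemm(void)
{
    MKL_INT m = 100, n = 100, k = 100, i;
    double alpha = 1.0, beta = 0.0;
    double *A = (double *)mkl_malloc(m * k * sizeof(double), 64);
    double *B = (double *)mkl_malloc(k * n * sizeof(double), 64);
    double *C = (double *)mkl_malloc(m * n * sizeof(double), 64);

    for (i = 0; i < m * k; ++i) A[i] = 1.0;
    for (i = 0; i < k * n; ++i) B[i] = 1.0;

    DGEMM("N", "N", &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);

    mkl_free(A); mkl_free(B); mkl_free(C);
}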

Intel MKL provides the timing function, dsecnd(), which measures the execution time in seconds. The execution time of a subroutine may vary from one call to another, and small problems are especially susceptible to the time variations due to system artifacts. Therefore, for functions with small execution times, it is a common practice to measure the average performance by placing the function within a loop. The total elapsed time divided by the loop count gives the average time required for a single function call. The loop count should be large enough to get a representative average for small problems. On the other hand, if a large loop count is chosen, then the execution time of the benchmark may be prohibitive for large problems.

Performance measured in Gflops

One may wish to calculate the number of floating point operations required for DGEMM as well as the performance in terms of floating point operations per second (Flops). Flops is a useful metric for comparing the performance of compute-bound subroutines like DGEMM with the theoretical peak performance of a machine. For the multiplication of an M×K matrix A and a K×N matrix B, 2K-1 operations (K multiplications and K-1 additions) are required to compute each element of the result matrix. Since there are MN entries in the C matrix, MN(2K-1) operations are required for the multiplication of the two matrices. An additional 2MN operations are required for adding the scaled C to AB. Therefore, the total number of floating point operations for a typical DGEMM call is approximately 2MNK. Dividing the number of operations by the average time gives the average Flops rate for DGEMM. Typically, the performance is reported in GFlops, where 1 GFlops is 10^9 Flops. An example code that determines the time and GFlops for DGEMM is provided below.
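As a worked check of this count (the numbers below are illustrative, not measured results):

\[
\mathrm{flops}(M,N,K) = MN(2K-1) + 2MN = 2MNK + MN \approx 2MNK
\]

For example, with M = N = K = 1000 this gives about 2×10^9 floating point operations; if the averaged DGEMM time were 0.05 seconds, the rate would be 2×10^9 / 0.05 = 4×10^10 Flops, i.e. 40 GFlops.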

Example Code

The code below measures the performance of DGEMM using the dsecnd() function in Intel MKL. The return value of the first dsecnd() call may be slightly off, so we recommend discarding it.

/* mkl.h is required for dsecnd and DGEMM; stdio.h for printf */
#include <stdio.h>
#include <mkl.h>

/* initialization code is skipped for brevity (do a dummy dsecnd() call to improve accuracy of timing) */

double alpha = 1.0, beta = 1.0;
/* first call which does the thread/buffer initialization */
DGEMM("N", "N", &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);
/* start timing after the first GEMM call */
double time_st = dsecnd();
for (i=0; i<LOOP_COUNT; ++i)
{
     DGEMM("N", "N", &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);
}
double time_end = dsecnd();
double time_avg = (time_end - time_st)/LOOP_COUNT;
double gflop = (2.0*m*n*k)*1E-9;
printf("Average time: %e secs n", time_avg);
printf("GFlop       : %.5f  n", gflop);
printf("GFlop/sec   : %.5f  n," gflop/time_avg); 

Performance Result

The following plot was generated using the results of the example DGEMM performance measurement code:

[Figure: DGEMM performance results (Performance_test2.png)]

The sample source code is available for download under the Intel Sample Source Code License Agreement.

Comments



Hello Ying,

Thanks for the info on being able to issue 1 multiply and 1 add in one cycle. A few more questions: (1) Is it also possible to issue two FMA instructions in the same clock? If so, wouldn't the FMA-based implementation be 4 times faster than the SSE-based implementation? (2) You mention

.. as Max and Tim said, ...

where exactly is their statement?


Hello Sergey,

The problem is not about FMA (officially, it was introduced with AVX2); as Max and Tim said, the current processor can issue 1 multiply and 1 add in one cycle (or you can think of it as two SSE units). The W5580 is from the Nehalem (Core i7) family, so it can perform 4 double-precision floating point operations per cycle.

Best Regards,
Ying


Hello Ying,

Thanks for the response, but please clarify: SSE3 does _not_ have fused multiply-add, so how can it perform 4 double floating point operations? You wrote:

one sse3 (128bit) have 4 double floating point operation


Hi Sergey,

Thanks for your comments. The sample is attached to the paper (please click the Download button at the end of the page). You may test it with 2048x2048.

@SG.
Thank you for asking; I checked them in detail.
First, right, the GFlops in the article are based on double-precision floating point.
Actually, the i7-2600K can execute two AVX (256-bit) instructions in one cycle, one MUL and one ADD, so in total 2x4 = 8 double-precision floating point operations per cycle; the peak performance of a 3.4 GHz AVX chip using 4 threads is 3.4x8x4 = 108.8 GFlops.
For that paper, the processor is the W5580: one SSE3 (128-bit) core can perform 4 double-precision floating point operations per cycle, so the peak performance of a 3.2 GHz SSE3 chip using 4 threads is 3.2x4x4 = 51.2 GFlops.
But the figure shows 8 threads, which should mean that the test used 2 packages (we will ask the author to confirm it).

So you can expect a chip with AVX to deliver roughly 2x the FLOPS of a chip without AVX.

Best Regards,
Ying
Please see the formal doc on peak FLOPS for the i7-2600K: http://www.intel.com/support/processors/sb/CS-032814.htm?wapkw=peak+flops

and peak FLOPS for the Xeon 5580: http://www.intel.com/support/processors/xeon/sb/CS-020863.htm?wapkw=peak+flops

and the forum discussion at http://software.intel.com/en-us/forums/topic/291765



The performance reported here seems to be half as good as it should be. Here's what I mean: the graph on http://software.intel.com/en-us/articles/parallelism-in-the-intel-math-kernel-library is for a comparable CPU without AVX, while the graph on this page is for a CPU with AVX. But both graphs report similar performance, whereas one would expect the chip with AVX to perform twice as well as the chip without AVX.

Here's my guess about the discrepancy: For the graph on this page, the "F" in "GFlops" refers to double-precision floating point, but for the graph on the other page, the "F" is single-precision floating point. One can confirm the interpretation of "F" for this page from the equation provided here. For the interpretation of "F" on the other page, note that the other page says the peak performance of a 3.2 GHz SSE3 chip using 8 threads is 102.4 GFlops; dividing 102.4 by 3.2 GHz gives 32 floating point operations per clock across 8 threads, or 4 floating point operations per clock per core; 4 floating point operations per clock per core on an SSE3 chip means single-precision floating point operations. But this paragraph is just my guess -- it would be nice if someone from Intel confirmed it or clarified the apparently missing performance when using AVX.