Hello,

I'm running some benchmarks using DGEMM from MKL and OpenBLAS (the GotoBLAS successor). I'm using a piece of code similar to the following (for some reason I can't put links in the post, but the code comes from this MKL forum):

/* mkl.h is required for dsecnd and DGEMM */
#include <mkl.h>
/* stdio.h is required for printf */
#include <stdio.h>

/* initialization code is skipped for brevity
   (do a dummy dsecnd() call to improve accuracy of timing) */

double alpha = 1.0, beta = 1.0;

/* first call which does the thread/buffer initialization */
DGEMM("N", "N", &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);

/* start timing after the first GEMM call */
double time_st = dsecnd();
for (i = 0; i < LOOP_COUNT; ++i)
{
    DGEMM("N", "N", &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);
}
double time_end = dsecnd();

/* average time per call and the operation count of one call */
double time_avg = (time_end - time_st) / LOOP_COUNT;
double gflop = (2.0 * m * n * k) * 1E-9;

printf("Average time: %e secs\n", time_avg);
printf("GFlop       : %.5f\n", gflop);
printf("GFlop/sec   : %.5f\n", gflop / time_avg);

I change only the timing function when OpenBLAS is used, and I run the program using square matrices (and several repetitions) of size from 1000 to 5000.

Also, I take as reference the theoretical peak performance for my processor from *intel com/support/processors/sb/CS-032819 htm* (sorry for the ugly link format). For the Core2 Duo P9600 (P9000 series) at 2.66GHz, the theoretical peak using 2 cores is 21.328 GFLOPS. Running my program I obtain a relative performance (R/Rmax) of about 95.2% for sizes between 3000 and 5000. This is very good performance, so I congratulate Intel. Using OpenBLAS, the performance is very similar.

Then I also tested the performance using only one thread. The document about theoretical peak performance gives no figure for one thread, so I use as Rmax the value 21.328/2 = 10.664 GFLOPS. Running the benchmark program I obtain (for sizes 3000 to 5000) results of about 10.68 to 10.76 GFLOPS, i.e. R/Rmax = 100.15% to 100.9% (!!!!). Similar results are obtained for OpenBLAS too.

How is this possible? How can the theoretical peak performance be reached? Is it correct to calculate the theoretical peak for 1 thread as R2thread/2? How can the strange value R/Rmax > 100% for 1 thread be explained? Has anyone tested DGEMM using a similar processor?

The FLOP count for DGEMM is 2*M*N*K, which is divided between M*N*K products and M*N*K additions. Does a product take the same time as an addition, or is it slower?

Thanks