I am running CBLAS routines on an older Ubuntu 12.04 (64-bit) machine with an Intel Core 2 Duo (E6600 @ 2.4 GHz), using the latest MKL 11.0.
For data of size > 10MB, the performance of
saxpy is 0.9 Gflops, e.g. n = 16777216, t = 0.039717s, where the opcount = 2 * n.
sdot is 1.4 Gflops, e.g. n = 16777216, t = 0.024379s, where the opcount = 2 * n - 1.
sgemv is 2.5 Gflops, e.g. m,n = 4096, t = 0.021503s where the opcount = (2 * n - 1) * m.
However, in the case of
sgemm, the performance exceeds 35 Gflops, e.g. m,n,k = 4096, t = 4.114639s, where the opcount = (2*k-1)*m*n.
Yet this should be impossible as the peak performance of the E6600 is 19.2 Gflops for single precision.
The call uses lda, ldb, ldc = 4096, alpha = 1, beta = 0:
cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, 4096, 4096, 4096, 1.0, A, 4096, B, 4096, 0.0, C, 4096);
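For reference, a minimal sketch of how a Gflops figure can be derived from the opcount and the measured time, assuming the opcount formula given above for sgemm. The helper names sgemm_opcount and gflops are illustrative only and are not part of MKL or CBLAS.

```c
/* Illustrative helpers (hypothetical names, not part of MKL or CBLAS):
   turn an operation count and a measured wall time into Gflops. */

/* Opcount for C = A * B with A of size m x k and B of size k x n,
   counted as in the post: each of the m*n outputs takes
   k multiplies and k-1 adds. Doubles avoid integer overflow. */
double sgemm_opcount(double m, double n, double k)
{
    return (2.0 * k - 1.0) * m * n;
}

/* Gflops = operations / seconds / 1e9. */
double gflops(double opcount, double seconds)
{
    return opcount / seconds / 1e9;
}
```

Calling gflops(sgemm_opcount(4096, 4096, 4096), t) with the measured time t in seconds gives the figure for the sgemm case.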
I have verified the results for smaller sizes.
Could someone please tell me how this is possible?
Thanks a lot,