I am observing strange performance behavior when using MKL 11.0 on an AMD Opteron 6220 processor. I wrote a small benchmark that measures the Gflop/s rate of the _GEMM operation in S, D, C and Z arithmetic, and here are the results I get:
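The benchmark itself is not shown, so the following is only a minimal sketch of how a GEMM Gflop/s figure is usually obtained, assuming the common 2·m·n·k operation count (one multiply and one add per inner-product term). The names `gemm_flops`, `naive_gemm` and `measure_gflops` are hypothetical. The pure-Python triple loop merely stands in for the MKL or ACML routine a real benchmark would call, so its absolute rate is meaningless; only the timing and rate formula carry over.

```python
import time

def gemm_flops(m, n, k):
    # Operation count for C = A*B with A m-by-k and B k-by-n:
    # one multiply and one add per inner-product term -> 2*m*n*k.
    return 2 * m * n * k

def naive_gemm(a, b):
    # Hypothetical stand-in for the library call being benchmarked
    # (in the real benchmark this would be MKL's or ACML's ?gemm).
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        row_c = c[i]
        for p in range(k):
            a_ip = a[i][p]
            row_b = b[p]
            for j in range(n):
                row_c[j] += a_ip * row_b[j]
    return c

def measure_gflops(m, n, k, repeats=3):
    # Time the kernel several times and keep the fastest run, so that
    # one-off start-up effects do not dominate the reported rate.
    a = [[1.0] * k for _ in range(m)]
    b = [[1.0] * n for _ in range(k)]
    best = float("inf")
    c = None
    for _ in range(repeats):
        start = time.perf_counter()
        c = naive_gemm(a, b)
        best = min(best, time.perf_counter() - start)
    return gemm_flops(m, n, k) / best / 1e9, c

if __name__ == "__main__":
    rate, _ = measure_gflops(64, 64, 64)
    print(f"naive GEMM: {rate:.3f} Gflop/s")
```

Taking the best of several runs is a design choice here; averaging is also common, but it folds warm-up costs into the figure.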
SGEMM Gflop/s: 18.10
DGEMM Gflop/s: 17.68
CGEMM Gflop/s: 4.78
ZGEMM Gflop/s: 2.31
The theoretical peak performance is 48, 24, 12 and 6 Gflop/s, respectively. For real double precision the measured speed seems reasonable, but for all the others (especially S) it is far too low. With ACML, on the other hand, I get:
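The quoted C and Z peaks are exactly one quarter of the S and D ones. That is consistent with the benchmark counting 2·m·n·k operations for complex GEMM as well, since each complex multiply-add costs four real multiplies and four real adds (8·m·n·k real flops in total); this is an inference, as the flop-count convention is not stated. Under that reading, the sketch below just divides the MKL figures by the stated peaks; `efficiency` is a hypothetical helper name.

```python
# Numbers copied from the post: Gflop/s measured with MKL 11.0 and the
# theoretical peaks stated by the author, per precision (S, D, C, Z).
PEAK = {"S": 48.0, "D": 24.0, "C": 12.0, "Z": 6.0}
MKL = {"S": 18.10, "D": 17.68, "C": 4.78, "Z": 2.31}

def efficiency(measured, peak):
    # Fraction of the theoretical peak reached for each precision
    return {p: measured[p] / peak[p] for p in peak}

if __name__ == "__main__":
    for p, e in efficiency(MKL, PEAK).items():
        print(f"{p}GEMM: {100 * e:.1f}% of peak")
```

This prints 37.7%, 73.7%, 39.8% and 38.5% for S, D, C and Z: DGEMM reaches roughly three quarters of peak, while the other three all sit near 40%.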
SGEMM Gflop/s: 37.58
DGEMM Gflop/s: 18.77
CGEMM Gflop/s: 5.37
ZGEMM Gflop/s: 3.70
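Putting the two sets of figures side by side makes the gap more concrete. This small sketch only recomputes, from the numbers quoted above and the stated peaks, the fraction of peak each library reaches and the ACML/MKL speed-up per precision; `compare` is a hypothetical helper name.

```python
# Gflop/s figures copied from the post (same benchmark, same machine).
PEAK = {"S": 48.0, "D": 24.0, "C": 12.0, "Z": 6.0}
MKL = {"S": 18.10, "D": 17.68, "C": 4.78, "Z": 2.31}
ACML = {"S": 37.58, "D": 18.77, "C": 5.37, "Z": 3.70}

def compare(a, b, peak):
    # Per precision: (fraction of peak for a, fraction of peak for b,
    # speed-up of a over b)
    return {p: (a[p] / peak[p], b[p] / peak[p], a[p] / b[p]) for p in peak}

if __name__ == "__main__":
    for p, (ea, eb, s) in compare(ACML, MKL, PEAK).items():
        print(f"{p}GEMM: ACML {100 * ea:.1f}%, MKL {100 * eb:.1f}%, "
              f"ACML/MKL {s:.2f}x")
```

By these numbers ACML reaches about 78% of peak in both S and D and is roughly 2.1x faster than MKL in SGEMM and 1.6x in ZGEMM, but only about 6% and 12% faster in DGEMM and CGEMM.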
Has anybody else observed anything like this? Any ideas on how to solve this problem?