I have found that zgemm from the MKL 6.1 BLAS appears to run significantly slower than the netlib reference BLAS source for small (6x6) matrices. According to Rational Quantify, one source of the slowdown is a memcpy operation occurring inside MKL zgemm.
I understand that MKL is much faster once the matrices get larger (and in other parts of my code it does help quite a lot), but perhaps the MKL engineers could look at some way to avoid the time-consuming set-up and skip straight to simple serial code when the matrices are 'small'.
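To make the suggestion concrete, here is a rough sketch in C of the kind of 'simple serial code' I have in mind. This is only an illustration, not how MKL or netlib actually implement zgemm: the names `small_zgemm`, `use_small_path` and `SMALL_DIM_CUTOFF` are hypothetical, the cutoff value is a guess that would need tuning by measurement, and only the column-major, no-transpose case is handled.

```c
#include <complex.h>
#include <stddef.h>

/* Hypothetical cutoff: dimensions at or below this take the serial path.
 * The value 8 is a guess and would have to be tuned by measurement. */
#define SMALL_DIM_CUTOFF 8

/* Returns nonzero when all three dimensions are small enough that
 * set-up cost is likely to dominate the arithmetic. */
static int use_small_path(int m, int n, int k)
{
    return m <= SMALL_DIM_CUTOFF && n <= SMALL_DIM_CUTOFF &&
           k <= SMALL_DIM_CUTOFF;
}

/* C = alpha*A*B + beta*C for column-major, non-transposed operands.
 * A is m x k, B is k x n, C is m x n; lda, ldb, ldc are the leading
 * dimensions. No workspace is allocated and no data is copied. */
static void small_zgemm(int m, int n, int k,
                        double complex alpha,
                        const double complex *a, int lda,
                        const double complex *b, int ldb,
                        double complex beta,
                        double complex *c, int ldc)
{
    for (int j = 0; j < n; ++j) {
        double complex *cj = c + (size_t)j * ldc;

        /* Scale column j of C; beta == 0 overwrites C so that
         * uninitialised or NaN values in C are not propagated. */
        for (int i = 0; i < m; ++i)
            cj[i] = (beta == 0.0) ? 0.0 : beta * cj[i];

        /* Accumulate alpha * A(:,l) * B(l,j) into column j of C. */
        for (int l = 0; l < k; ++l) {
            double complex t = alpha * b[l + (size_t)j * ldb];
            const double complex *al = a + (size_t)l * lda;
            for (int i = 0; i < m; ++i)
                cj[i] += t * al[i];
        }
    }
}
```

A caller would check `use_small_path(m, n, k)` first and fall through to the normal library zgemm otherwise, so large matrices would still get the optimized path.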
MKL 6.1 - slow BLAS (zgemm) performance with small matrices


