Is there a function in MKL that does just simple matrix multiplication, Y = A*B? I could only find Y = alpha*A*B + beta*C.
It seems like a waste of cycles to do the extra computation.
No, the extra effort is insignificant when the matrix is large enough for use of MKL to pay off. The tradeoff was made by the BLAS people long before MKL came along. In Fortran, the MATMUL function allows for some optimizations for situations where MKL would be inefficient.
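To make the point concrete, here is a minimal sketch of the general form in plain C. It is not MKL code: `ref_gemm` and `plain_matmul` are hypothetical names, and the matrices are assumed square and row-major. The plain product is just the general form with alpha = 1 and beta = 0. The scaling touches n*n elements while the product itself costs on the order of n^3 operations, which is why the extra work fades as n grows.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine (not MKL): C := alpha*A*B + beta*C
   for row-major n x n matrices. */
void ref_gemm(size_t n, double alpha, const double *a, const double *b,
              double beta, double *c)
{
    assert(n > 0);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            /* With beta == 0 the old contents of C are never read. */
            c[i * n + j] = (beta == 0.0)
                               ? alpha * sum
                               : alpha * sum + beta * c[i * n + j];
        }
    }
}

/* Plain product Y = A*B, written as the general form with alpha = 1, beta = 0. */
void plain_matmul(size_t n, const double *a, const double *b, double *y)
{
    ref_gemm(n, 1.0, a, b, 0.0, y);
}
```

Because beta is zero, `y` does not need to be initialized before the call.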
In my code I have the following types of vector and matrix operations:
1) V2 = M V1
2) M3 = I - M1 M2
3) M3 = M1 M2
4) M5 = M1 + M2 M3 M4
The sizes can vary. The vector lengths typically range from 4 to several thousand, so the corresponding matrix can be fairly large. For operations 2-4, the matrices can range from 4x4 to 64x64. In some cases, the matrix for type 1 operations is sparse. The matrices are not, in general, symmetric.
When the vectors are small, the performance of the type 1 operations is not critical since small sizes are only used for validating code. For operations 2-4, the performance of the code is important over the entire range of sizes.
I have been doing all the operations using MATMUL with the intent of using BLAS (specifically MKL) where it makes sense.
1) At what size does MKL become a better deal?
2) In what situations is MATMUL (as implemented by IVF) more efficient than MKL?
I was intending to write a benchmark, but any answers you can provide will save me time. Thanks.
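For what it is worth, operations 2 and 4 in the list above fit the alpha/beta form without any separate add or subtract pass. The sketch below shows the mapping in plain C. `gemm_sq` is a hypothetical stand-in for a BLAS GEMM call, the function names are made up for illustration, and the matrices are assumed square, dense and row-major.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a BLAS GEMM call: C := alpha*A*B + beta*C,
   row-major n x n. */
void gemm_sq(size_t n, double alpha, const double *a, const double *b,
             double beta, double *c)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = (beta == 0.0)
                               ? alpha * sum
                               : alpha * sum + beta * c[i * n + j];
        }
}

/* Operation 2: M3 = I - M1*M2.  Load I into M3, then alpha = -1, beta = 1. */
void identity_minus_product(size_t n, const double *m1, const double *m2,
                            double *m3)
{
    assert(n > 0);
    for (size_t i = 0; i < n * n; ++i)
        m3[i] = 0.0;
    for (size_t i = 0; i < n; ++i)
        m3[i * n + i] = 1.0;
    gemm_sq(n, -1.0, m1, m2, 1.0, m3);
}

/* Operation 4: M5 = M1 + M2*M3*M4.  `scratch` holds n*n doubles. */
void sum_plus_triple_product(size_t n, const double *m1, const double *m2,
                             const double *m3, const double *m4,
                             double *scratch, double *m5)
{
    assert(n > 0);
    gemm_sq(n, 1.0, m2, m3, 0.0, scratch);   /* scratch = M2*M3 */
    for (size_t i = 0; i < n * n; ++i)       /* M5 = M1 */
        m5[i] = m1[i];
    gemm_sq(n, 1.0, scratch, m4, 1.0, m5);   /* M5 = scratch*M4 + M5 */
}
```

Operation 3 is the alpha = 1, beta = 0 case, and operation 1 is the matrix-vector analogue of the same form.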
I implemented 2 & 4 using mkl_ddiamm and the MKL routines were at least an order of magnitude faster for 128x128 matrices. It was difficult to get reliable timing information since the microsecond resolution of CPU_TIME() was not fine enough. For 64x64 matrices there was no difference in timing.
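A common way around a coarse timer is to run the operation many times back to back and divide the elapsed time by the repetition count. The sketch below shows the idea in plain C, using the standard `clock()` function as a rough counterpart of CPU_TIME(). The names `seconds_per_call`, `mm8_job` and `mm8_work` are hypothetical, and the 8x8 workload is only a placeholder for the real operation.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper: average CPU seconds per call of work(ctx),
   measured over `reps` back-to-back calls. */
double seconds_per_call(void (*work)(void *), void *ctx, long reps)
{
    assert(reps > 0);
    clock_t start = clock();
    for (long r = 0; r < reps; ++r)
        work(ctx);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC / (double)reps;
}

/* Placeholder workload: one 8x8 row-major product, which on its own
   finishes well inside a single timer tick. */
typedef struct {
    double a[64], b[64], c[64];
    long calls;   /* number of times the workload has run */
} mm8_job;

void mm8_work(void *ctx)
{
    mm8_job *job = ctx;
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j) {
            double sum = 0.0;
            for (int k = 0; k < 8; ++k)
                sum += job->a[i * 8 + k] * job->b[k * 8 + j];
            job->c[i * 8 + j] = sum;
        }
    job->calls++;
}
```

The repetition count should be raised until the total elapsed time is large compared to the timer resolution. Repeated runs on the same data keep everything in cache, so the figure is a best case.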
OK, I'll start with some quotations:
http://gcc.gnu.org/ml/fortran/2004-11/msg00124.html
http://gcc.gnu.org/ml/gcc-patches/2006-04/msg00096.html
So it appears that MKL or similar is likely to out-perform matmul() when the smaller matrix dimension exceeds 10 or so. Evidently, the exact cutover point will depend on various factors: which platform, which compilers are used, which BLAS library version, etc.
gfortran's own matmul runs quite well on Intel platforms if compiled (with minimum required changes) by icc, and that would increase the size where it first becomes useful to throw the option -fexternal-blas.
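A size-based switch of this kind can be sketched as below. This is only an illustration: `choose_matmul_path` and the enum are hypothetical names, the cutover is a tunable parameter to be set from a benchmark on the target machine, and "smaller dimension" is taken here to mean the smallest of m, n and k for an (m x k) times (k x n) product.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the two code paths. */
typedef enum { PATH_INLINE_LOOPS, PATH_LIBRARY_GEMM } matmul_path;

/* Pick a path for an (m x k) * (k x n) product.  The library path is
   chosen only when the smallest dimension exceeds `cutover`. */
matmul_path choose_matmul_path(size_t m, size_t n, size_t k, size_t cutover)
{
    assert(m > 0 && n > 0 && k > 0);
    size_t smallest = m < n ? m : n;
    if (k < smallest)
        smallest = k;
    return smallest > cutover ? PATH_LIBRARY_GEMM : PATH_INLINE_LOOPS;
}
```

With a cutover of 10, a 4x4 case stays on the inline path and a 64x64 case goes to the library.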
ifort, and predecessors back to the last CVF, expand matmul in-line when the compiler can see that the dimensions of the matrix are small enough to make it pay off. This is clearly worthwhile up to dimension 6. If a compiler does not optimize matmul this way, better performance might be gained by avoiding matmul.
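As an illustration of why visible dimensions matter, here is a fixed-size product in plain C. It is not what ifort generates; `matmul3` and `DIM3` are hypothetical names. The point is only that with compile-time-constant bounds there is no call overhead and no run-time size checking, and a compiler is free to unroll the loops completely.

```c
#include <assert.h>

enum { DIM3 = 3 };   /* dimension known at compile time */

/* 3x3 row-major product C = A*B.  All trip counts are constants, so
   the loops can be fully unrolled by the compiler. */
void matmul3(const double a[DIM3 * DIM3], const double b[DIM3 * DIM3],
             double c[DIM3 * DIM3])
{
    assert(a != c && b != c);   /* output must not alias an input */
    for (int i = 0; i < DIM3; ++i)
        for (int j = 0; j < DIM3; ++j) {
            double sum = 0.0;
            for (int k = 0; k < DIM3; ++k)
                sum += a[i * DIM3 + k] * b[k * DIM3 + j];
            c[i * DIM3 + j] = sum;
        }
}
```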
At the other extreme, MKL includes optimizations which are useful only for large cases. These include adding threads, which I have seen pay off only for dimension 50 or more, and cache blocking, which is useful only when the matrix is large compared to L1 capacity. The additional time required to check for these cases, to organize so as to allow the possibility of these optimizations, and extra code size, all probably contribute to slowdown for those matrices below dimension 10.
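For readers unfamiliar with cache blocking, a minimal sketch in plain C follows. It is not how MKL implements it; `blocked_matmul` is a hypothetical name, the matrices are assumed square and row-major, and the block size `bs` is a tuning parameter. The product is computed tile by tile so that the pieces of A, B and C in use can stay in cache. The extra loop levels and edge handling are exactly the kind of overhead that does not pay off for small matrices.

```c
#include <assert.h>
#include <stddef.h>

/* C = A*B for row-major n x n matrices, processed in bs x bs tiles. */
void blocked_matmul(size_t n, size_t bs, const double *a, const double *b,
                    double *c)
{
    assert(n > 0 && bs > 0);
    for (size_t i = 0; i < n * n; ++i)
        c[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += bs) {
        size_t i_end = (ii + bs < n) ? ii + bs : n;
        for (size_t kk = 0; kk < n; kk += bs) {
            size_t k_end = (kk + bs < n) ? kk + bs : n;
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t j_end = (jj + bs < n) ? jj + bs : n;
                /* Accumulate the contribution of one tile pair. */
                for (size_t i = ii; i < i_end; ++i)
                    for (size_t k = kk; k < k_end; ++k) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
        }
    }
}
```

When `bs` is at least `n` this collapses to a single tile, so the blocking buys nothing and only the bookkeeping remains.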