LAPACK DGBTRF MKL implementation slower than Fortran compiled version?

I have been testing the speed of the netlib LAPACK version of DGBTRF (LAPACK square dense linear solver), compiled from Fortran with the Intel 10 Fortran compiler at -O2, against the MKL version.

I was very surprised to discover that MKL is up to 4 or 5 times slower than the Fortran-compiled version when the matrix becomes sparse.

I am guessing that the MKL implementation of this routine uses hand-tuned machine code that exploits the cache and pipeline, but at the cost of no longer skipping arithmetic operations on zero entries in the matrix. This works fine for purely dense ("full") problems, but there will be a crossover point where the Fortran-compiled LAPACK version beats the MKL implementation as the matrix becomes sparse. My questions are: is my theory correct, and what is the crossover point?

Of course, for sparse matrices I shouldn't be using LAPACK at all and should use a sparse solver instead. However, in our application the matrix size and sparsity are model and application dependent: we don't know until run time what the size and sparsity are going to be. Ideally we would like to know at what size N and what number of non-zeros NZ we should switch from MKL LAPACK to the Fortran-compiled LAPACK, and then to a sparse solver outright.
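
A runtime switch of this kind could be structured as in the minimal Python sketch below. Everything in it is an assumption for illustration: the names `Thresholds` and `choose_solver` are hypothetical, and the numeric cut-offs are placeholders, not measured crossover points. They would have to be calibrated by timing the three solvers on the target machine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Placeholder values only (assumptions); calibrate them by benchmarking.
    sparse_density: float = 0.05     # below this density, prefer a sparse solver
    reference_density: float = 0.50  # below this density, prefer reference LAPACK
    min_sparse_n: int = 500          # a sparse solver is not considered below this N


def density(n: int, nz: int) -> float:
    """Fraction of non-zero entries in an n-by-n matrix."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= nz <= n * n:
        raise ValueError("nz must lie between 0 and n*n")
    return nz / (n * n)


def choose_solver(n: int, nz: int, t: Thresholds = Thresholds()) -> str:
    """Pick a solver label from the matrix size N and non-zero count NZ."""
    d = density(n, nz)
    if n >= t.min_sparse_n and d < t.sparse_density:
        return "sparse"
    if d < t.reference_density:
        return "reference_lapack"
    return "mkl_lapack"
```

The returned label would then select which library call the application makes. Keeping the thresholds in one structure makes it straightforward to replace the placeholders with values measured per machine.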

Any advice or comments on my theory are most appreciated.

Thank you!



Your guess is likely to be correct. One of the typical ways to optimize the BLAS level 3 functions is to remove the tests that skip an inner loop when its multiplier is zero. This enables multi-level loop optimizations and often doubles the performance with full matrices. When the matrix is sparse enough that the original code could skip more than 50% of the inner loops, the MKL optimization might not pay off.
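
To make the trade-off concrete, here is a small Python sketch of a matrix update C += A*B written with the column-oriented loop order typical of reference Fortran BLAS code. It is an illustration only, not MKL's or netlib's actual source; the function name `matmul_update` and the `skip_zeros` flag are hypothetical. With the test enabled, an inner loop is skipped whenever its multiplier is zero; with it disabled, every multiply-add is performed.

```python
def matmul_update(a, b, c, skip_zeros):
    """Compute C += A*B in place and return the number of multiply-adds done.

    a is m-by-k, b is k-by-n, c is m-by-n, all given as lists of rows.
    """
    m = len(a)
    k = len(b)
    n = len(b[0])
    flops = 0
    for j in range(n):
        for l in range(k):
            # The zero test: cheap for sparse data, but it puts a branch
            # in the middle of the loop nest.
            if skip_zeros and b[l][j] == 0.0:
                continue
            temp = b[l][j]
            for i in range(m):
                c[i][j] += temp * a[i][l]
                flops += 1
    return flops
```

Both variants produce the same C for finite inputs. The difference is the operation count: the skipping variant does work proportional to the number of non-zeros in B, while the other always does m*k*n multiply-adds but has a branch-free loop nest that is easier to block and vectorize.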

Suggestion: if you are speed-testing on a machine which has hyperthreading technology enabled, don't rely on MKL to choose the number of threads, but specify it yourself equal to the number of processors you have.

If you don't specify a number of threads, then (beginning with v10) MKL will try to choose a number for you, but it gets it wrong when hyperthreading technology is enabled. It will choose twice as many threads as you should have, and this may be at least a partial cause of your slowdown.
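
One way to pin the thread count for a timing run is to set the `MKL_NUM_THREADS` environment variable before launching the benchmark. The Python sketch below only builds such an environment; the helper names are hypothetical, and the default of two hardware threads per core is an assumption that should be checked against the actual machine.

```python
import os


def physical_cores_from_logical(logical_cpus, threads_per_core=2):
    """Estimate physical cores from the logical CPU count.

    threads_per_core=2 is an assumption for a hyperthreaded machine.
    """
    if logical_cpus < 1 or threads_per_core < 1:
        raise ValueError("counts must be positive")
    return max(1, logical_cpus // threads_per_core)


def mkl_thread_env(physical_cores, base_env=None):
    """Return a copy of the environment with MKL_NUM_THREADS set explicitly."""
    if physical_cores < 1:
        raise ValueError("physical_cores must be positive")
    env = dict(os.environ if base_env is None else base_env)
    env["MKL_NUM_THREADS"] = str(physical_cores)
    return env
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when starting the benchmark executable, so the setting applies only to that run.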

With version 10.0 update 3, Intel MKL will make a better determination of the number of threads to use. That includes systems with Hyperthreading Technology enabled.


Do you mean that this has been done in the current MKL update 3 that is already available for download, or do we have to wait for a subsequent one?


Yes, this is done in the already available update 3. There was some notice in the release notes, but I notice this thread was not updated. Sorry for the late notice.
