Threaded SVD using MKL/LAPACK

I am trying to optimize software that uses the Intel MKL to compute the SVD of a large complex-valued matrix through calls to the zgesvd() LAPACK driver routine. The MKL documentation states that the ?gesvd routines "make effective use of parallelism", and two of the three computational routines used by zgesvd are listed as threaded (?gebrd and ?bdsqr, but not ?ungbr).
I profiled my program with VTune Amplifier XE. Here is a screenshot of the VTune timeline, with the 2 calls to zgesvd() marked by user events set through the User Event API. Most of each call to the SVD routine is spent with only a single active thread; for the rest of the time, all 8 threads are engaged.
I have several questions regarding this:
1) Does this behaviour (sub-optimal utilization of threads) seem normal? Is there anything I can do to improve it?
2) Can I get any improvement by directly using the computational routines to compute the SVD (i.e. calling ?gebrd, ?bdsqr and ?ungbr), instead of using the driver routine?
Thanks in advance for your time.
P.S. If you need more specific information about my code, please let me know.
P.P.S. There are 7 threads that are suspended during the entire call to the SVD routine (rows 2-8 in the screenshot). These are used by another section of the algorithm and (ideally) should be combined with the threads used by the SVD. However, as far as I know, having these extra threads suspended should not affect the SVD computation.

The SVD is roughly a combination of BLAS Level 2 and BLAS Level 3 operations.
BLAS Level 2: a single active thread.
BLAS Level 3: 8 active threads.
The part of the computation that works on the bidiagonal matrix: a single active thread.
On newer processors, though, BLAS Level 2 is partially parallelized.
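One way to see why a mostly single-threaded timeline is plausible for such a mix is Amdahl's law. The following is a minimal Python sketch, not a measurement of the original program: it assumes the Level 2 part does not speed up at all, the Level 3 part scales perfectly across threads, and the 50% serial fraction is a hypothetical value chosen only for illustration. The function names are illustrative as well.

```python
def amdahl_speedup(serial_fraction: float, threads: int) -> float:
    """Upper bound on speedup when only the non-serial part scales with threads."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if threads < 1:
        raise ValueError("threads must be >= 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)


def serial_share_of_wall_time(serial_fraction: float, threads: int) -> float:
    """Fraction of the threaded run's wall time spent in the single-threaded part."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if threads < 1:
        raise ValueError("threads must be >= 1")
    parallel_time = (1.0 - serial_fraction) / threads
    return serial_fraction / (serial_fraction + parallel_time)


if __name__ == "__main__":
    # Hypothetical split: half of the single-threaded run time is Level 2 work.
    s, p = 0.5, 8
    print(f"speedup bound: {amdahl_speedup(s, p):.2f}x")
    print(f"wall time with one active thread: {serial_share_of_wall_time(s, p):.0%}")
```

Under these assumptions, 8 threads give a speedup bound of about 1.78x, and roughly 89% of the threaded run's wall time has a single active thread, even though that part accounted for only half of the single-threaded run time.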

Thanks for your reply.
>>>Though on new processors BLAS Level 2 partially parallelized.
Do you know if that includes the Xeon E5450?

>>Do you know if that includes XeonE5450?

Yes. But the parallel algorithms have to be implemented; that is a question for the developers. For example, in diagonalization algorithms, parallel algorithms for BLAS Level 2 have been implemented. My interests are in the area of diagonalization:


Have you had a chance to try the MKL 11.2 release? The SVD function has been significantly improved:

Feel free to let us know if you have any feedback on this.

