I am trying to optimize software that uses the Intel MKL to compute the SVD of a large complex-valued matrix through calls to the zgesvd() LAPACK driver routine. The MKL documentation states that the ?gesvd routines "make effective use of parallelism", and two of the three computational routines used by zgesvd are listed as threaded (?gebrd and ?bdsqr, but not ?ungbr).
I profiled my program with VTune Amplifier XE. Here is a screenshot of the VTune timeline, with the 2 calls to zgesvd() marked by user events set through the User Event API. For most of each call to the SVD routine, only a single thread is active. For the rest of the call, all 8 threads are engaged.
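To put a rough number on what the timeline suggests, here is a minimal back-of-the-envelope sketch in Python. It is not a measurement: the 0.6 single-threaded fraction is a made-up placeholder (I have not extracted the exact value from the profile), and the function names are my own, not part of MKL or VTune. It assumes the single-threaded portion could, in the ideal case, be spread perfectly over all threads.

```python
def average_active_threads(single_thread_wall_fraction, num_threads):
    """Average number of busy threads over one call, given the fraction of
    wall time during which only one thread is active (the rest of the time
    all num_threads threads are assumed busy)."""
    f = single_thread_wall_fraction
    if not 0.0 <= f <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    return f * 1 + (1.0 - f) * num_threads


def ideal_speedup_if_fully_threaded(single_thread_wall_fraction, num_threads):
    """Upper bound on the wall-time speedup of the call if the currently
    single-threaded portion were perfectly parallelized over num_threads."""
    f = single_thread_wall_fraction
    if not 0.0 <= f <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    # Normalize the current wall time to 1: f is single-threaded, (1 - f) is
    # already running on all threads and is left unchanged.
    new_wall_time = f / num_threads + (1.0 - f)
    return 1.0 / new_wall_time


if __name__ == "__main__":
    f = 0.6  # hypothetical placeholder, NOT read from my VTune profile
    n = 8    # thread count seen in the timeline
    print("average active threads:", average_active_threads(f, n))
    print("ideal speedup bound:   ", ideal_speedup_if_fully_threaded(f, n))
```

The second number is only an upper bound. Part of the single-threaded time may be inherently sequential, so the real gain would likely be smaller.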
I have several questions regarding this:
1) Is this behaviour (sub-optimal utilization of threads) normal? Is there anything I can do to improve it?
2) Can I get any improvement by directly using the computational routines to compute the SVD (i.e. calling ?gebrd, ?bdsqr and ?ungbr), instead of using the driver routine?
Thanks in advance for your time.
P.S. If you need more specific information about my code, please let me know.
P.P.S. There are 7 threads that stay suspended during the entire call to the SVD routine (rows 2-8 in the screenshot). These are used by another section of the algorithm and (ideally) should be combined with the threads used by the SVD. However, as far as I know, having these extra threads suspended should not affect the SVD computation.
Threaded SVD using MKL/LAPACK