Help with understanding vtune results

Hello,
I ran a lightweight hotspot analysis on my code and got the result
attached as a CSV file. Can you please give me some pointers on what I
can do now to improve the speed of the program? The major portion of the
time is spent in zgemm3m for matrix multiplications and in matrix
inversion using zgesv (or getrf and getri). I am not able to understand
the timing information obtained.

My computer has a dual quad-core (E5240) at 2.493 GHz.

Attachment: LightweightHotspot_RunG.csv (44.66 KB)

The first line of the .CSV file contains column names. If the contents of the columns are still not clear to you, you may look up those column names in the documentation for the tool that produced the .CSV file.

Hello,
I did look at the documentation, and I understand the columns mentioned in the first line. I am just not sure what I should do with this profile information. The timing information for the functions in rows 2 to 13 of the first column does not directly refer to any functions in my code. My understanding is that they refer to operations in the MKL subroutine zgemm3m [complex double precision], which is called many times in the code. For example,
Function                     CPU Time     Instructions Retired  CPI Rate  Function (Full)
mkl_blas_mc_zgemm3m_copyan   416.3016446  1.43E11               7.779492  mkl_blas_mc_zgemm3m_copyan

This operation is about 20% of the total CPU time. I do not know what to do to reduce the total computation time. VTune seems to indicate that the CPI rate of 7.7 for this operation is bad.
I looked at the assembly code for the above function; most of the time is spent in the instructions movhpdq and movsdq, which seem to take about 1 to 4 nanoseconds per retired instruction.
I posted my initial post on the VTune, MKL and ifort forums. I really appreciate any insights or help.

My aim is to optimize the code as much as I can. However, if the profile information shows that the most time-consuming part is due to instructions in MKL-related functions, what should one do?

Thanks
Reddy

> My aim is to optimize the code as much as I can. However, if the profile
> information shows that the most time-consuming part is due to
> instructions in MKL-related functions, what should one do?

A half-facetious answer is "Do not call the MKL function, or call it less often".

Indeed, with the limited information you have given, that is probably a good answer.

You do not have the source code of the MKL routine, which you have identified as the "bottleneck". Therefore, you can do little to optimize the MKL routine (other than to use a faster machine, create more favorable memory access patterns or employ multiple threads).

Often, profiling data helps one to remove inefficient sections of code. If the code has already been optimized, can there be scope for further improvement?

As mecej4 implied, once you have determined where the bulk of the time is spent, you might consider concentrating your efforts there. Assuming you're not planning to write your own version of the MKL functions, that leaves you with run-time settings. In particular, you didn't say whether you investigated the effect of setting KMP_AFFINITY and (possibly) number of threads.
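For example, one might experiment with settings along these lines before launching the program. The values below are illustrative only (eight threads matches the dual quad-core machine mentioned above), and `./myprogram` stands in for the poster's actual executable:

```shell
# Pin OpenMP threads to cores so they do not migrate between sockets
# (KMP_AFFINITY is specific to the Intel OpenMP runtime).
export KMP_AFFINITY=granularity=fine,compact

# Let MKL use all eight cores; try smaller values too, since zgemm3m
# on small blocks may not scale to all cores.
export MKL_NUM_THREADS=8
export OMP_NUM_THREADS=8

./myprogram
```

It is worth timing the program at several thread counts, since the best setting depends on the block sizes passed to zgemm3m.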

Hello,
Thanks for your replies, and sorry about the posts in multiple forums. I had not thought about the runtime options for MKL; I will look into them. Also, as mecej4 suggested, I should try to call MKL fewer times, i.e., take another look at the algorithm. I was trying to optimize what is often referred to as the recursive Green's function procedure in the device-physics community. Basically, the problem is to invert a block tridiagonal matrix to calculate only the block-diagonal, upper-diagonal and lower-diagonal parts of the inverted matrix.
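For readers unfamiliar with the procedure, the diagonal blocks of the inverse can be obtained with one forward and one backward sweep over the blocks. Below is a minimal NumPy sketch of that recursion (my own illustration, not the poster's code); a production version would use LU factor-and-solve calls (zgetrf/zgetrs) rather than explicit inverses, and would also accumulate the off-diagonal blocks the poster mentions:

```python
import numpy as np

def rgf_diagonal_blocks(D, U, L):
    """Diagonal blocks of the inverse of a block tridiagonal matrix.

    D[i] = A[i, i], U[i] = A[i, i+1], L[i] = A[i+1, i] (lists of square
    complex blocks).  Returns G with G[i] = (A^-1)[i, i].
    """
    n = len(D)
    # Forward sweep: "left-connected" inverses g[i].
    g = [None] * n
    g[0] = np.linalg.inv(D[0])
    for i in range(1, n):
        g[i] = np.linalg.inv(D[i] - L[i - 1] @ g[i - 1] @ U[i - 1])
    # Backward sweep: fold in the blocks to the right of i.
    G = [None] * n
    G[n - 1] = g[n - 1]
    for i in range(n - 2, -1, -1):
        G[i] = g[i] + g[i] @ U[i] @ G[i + 1] @ L[i] @ g[i]
    return G
```

The payoff is that each sweep works on single blocks, so the LAPACK calls operate on small matrices and the full N-by-N inverse is never formed.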

A majority of the operations are of the form
A = B + C*D*E
where A, B, C, D and E are blocks of the block matrix.
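One small saving for updates of this shape: gemm's beta parameter can fold the addition of B into the second multiply, so no separate matrix-add pass over memory is needed. A SciPy sketch of the idea (in Fortran this would be a zgemm call with beta = (1,0); two gemm calls total per update):

```python
import numpy as np
from scipy.linalg.blas import zgemm

def block_update(B, C, D, E):
    """Compute A = B + C*D*E with two zgemm calls and no explicit add.

    zgemm(alpha, a, b, beta, c) returns alpha*a*b + beta*c; passing
    beta=1.0 with c=B accumulates the product onto (a copy of) B.
    """
    T = zgemm(1.0, D, E)                              # T = D*E
    return zgemm(1.0, C, T, beta=1.0, c=B, overwrite_c=0)  # C*T + B
```

With overwrite_c=1 and a Fortran-ordered B that is no longer needed, the copy could be avoided as well.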
Thanks
Reddy

> ... the problem is to invert a block tridiagonal matrix...

It is rare in computational work to actually compute a matrix inverse and store the inverted matrix, for at least two reasons: the number of operations is doubled, and the inverse matrix is more dense than the original matrix or its LU factors.

Therefore, while it is the usual mathematical convention to write the solution of A x = b as x = A⁻¹ b, no good software will compute A⁻¹ and then multiply the inverse into the vector b.
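In NumPy terms (which dispatch to the same LAPACK routines the poster names), the two approaches look like this:

```python
import numpy as np

rng = np.random.default_rng(1)
# A small, well-conditioned complex system as an example.
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4)) + 4 * np.eye(4)
b = rng.standard_normal(4) + 1j * rng.standard_normal(4)

# Discouraged: form the inverse, then multiply (getrf + getri + a gemv).
x_inv = np.linalg.inv(A) @ b

# Preferred: one factor-and-solve call (zgesv = LU + triangular solves).
x_solve = np.linalg.solve(A, b)
```

Both give the same x here, but the solve path does roughly half the work and is more accurate on ill-conditioned systems.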
