Measure Speedup with VTune not possible?

I am using VTune Amplifier XE for Windows to support the parallelization of a given program. VTune is a good help in showing me the hotspots of the program; however, I am curious how it can help me measure the improvement after parallelization.

For example: I have a function A which is identified as a hotspot. After parallelization it is executed concurrently on multiple processors, which speeds everything up. What the VTune analysis then shows me is the CPU time summed over all busy processors, which is more or less the same as in the sequential case. This is not a surprise, as the actual work was not reduced by parallelization.

I guess measuring the (inclusive/exclusive) time of a given function is just not possible with sampling... am I right here?

One more thing: in your VTune tutorial (https://wiki.engr.illinois.edu/download/attachments/114688007/amplifier_xe_linux.pdf?version=1&modificationDate=1296056455000), on page 27, the author mentions two options for improving the code:

* sequential tuning
* parallelization

and in the tutorial they choose the first option. This leaves the impression that you could also have used VTune to support the second option, which seems not to be true, as I have described above.

Or did I miss something, and you can use VTune to measure the speedup of a function after parallelization?

Constantin

VTune Amplifier XE can help to identify the hotspots, and there are usually two kinds of improvement:
1. The workload of the hot function can be parallelized (which you have done) so that it makes the best use of the multi-core system; as a result, the execution time is reduced. You can use Concurrency Analysis to see whether the concurrency level gets better. You are right: the total workload is not reduced but parallelized, so it is the execution time of the program, shown in the Summary report, that goes down. You might review the bottom-up report using the grouping "Thread / Function / Call stack" to see the parallel workload in each thread. If the threads are imbalanced, adjust the algorithm again.

2. After completing the parallelization work, we can move on to microarchitecture-level tuning, such as branch misprediction issues, cache misses, etc. You adjust your code or use the Intel C++ compiler's advanced optimization options. As a result, the execution time of the hot functions will be reduced, which is quite different from parallelism optimization.
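
To make the first point concrete, here is a minimal Python sketch. The per-thread CPU times are invented for illustration (they are not VTune output), and the helper names `total_cpu_time` and `imbalance` are hypothetical. It shows why the summed CPU time of a hot function looks the same before and after parallelization, and how the per-thread breakdown exposes imbalance.

```python
# Hypothetical per-thread CPU times in seconds for one hot function,
# such as one might read off a view grouped by thread. Not real VTune data.
serial_cpu = [8.0]                   # one thread does all the work
parallel_cpu = [2.5, 2.0, 2.0, 1.5]  # same work spread over four threads


def total_cpu_time(per_thread):
    """CPU time summed over all threads (what a hotspot total reports)."""
    return sum(per_thread)


def imbalance(per_thread):
    """Busiest thread's time divided by the mean; 1.0 means perfectly balanced."""
    mean = sum(per_thread) / len(per_thread)
    return max(per_thread) / mean


print("serial total CPU:  ", total_cpu_time(serial_cpu))
print("parallel total CPU:", total_cpu_time(parallel_cpu))
print("imbalance ratio:   ", imbalance(parallel_cpu))
```

In this made-up example the totals are identical, so the summed CPU time says nothing about speedup, while the imbalance ratio above 1.0 suggests the work could be distributed more evenly.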

Regards, Peter

Hi Peter, thanks for your reply! To sum it up: there is no way to figure out the speedup of a hotspot after parallelization with VTune. Is this correct?

Best Reply

Yes. The workload does not change, so there is no direct indicator to compare them; that is why I suggested using the concurrency level (CL) and the execution time.

Sometimes you can use critical path data to compare with the serial result. I assume that you have reassigned the work to different threads and started them at almost the same timestamp, so the work in each thread terminates at a different time:
T1 T2 T3 T4 T5
w1
w2
w3

So CP = T4; compare this with the serial result.
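
The comparison can be sketched in a few lines of Python. This assumes, as above, that all threads start at roughly the same time, so the critical path is simply the latest finish time. The numbers and the function names `critical_path` and `speedup` are hypothetical and chosen only for illustration.

```python
def critical_path(finish_times):
    """Latest finish time among threads that all start at time 0."""
    return max(finish_times)


def speedup(serial_elapsed, finish_times):
    """Serial elapsed time divided by the parallel critical path."""
    return serial_elapsed / critical_path(finish_times)


# Hypothetical measurements in seconds (not taken from the thread above).
serial_elapsed = 8.0             # elapsed time of the serial version
finish_times = [2.0, 4.0, 3.0]   # when each thread's work terminates

cp = critical_path(finish_times)
print("critical path:", cp)
print("speedup:      ", speedup(serial_elapsed, finish_times))
```

With these made-up values the slowest thread finishes at 4.0 s, so the speedup over the 8.0 s serial run is 2.0, even though three threads were used. The gap to the ideal value of 3 reflects the imbalance between the threads.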

Thanks, Peter

Thanks for your help, Peter!
