You verified that removing the critical section reduced the application execution time by 203 ms. To understand the impact of your changes and how the CPU utilization has changed, re-run the Locks and Waits analysis on the optimized code and compare the results:
Compare Results Before and After Optimization
Run the Locks and Waits analysis on the modified code.
Click the Compare Results button on the VTune Amplifier toolbar. If you are using VTune Amplifier within Microsoft* Visual Studio, select Tools > Intel VTune Amplifier XE <version> > Compare.
The Compare Results window opens.
Specify the Locks and Waits analysis results you want to compare:
The Summary window opens providing the statistics for the difference between collected results.
Identify the Performance Gain by Metrics
The Result Summary section of the Summary window shows that after optimization all critical metric values have decreased significantly. The Elapsed Time data shows an improvement of 6 seconds for the whole application. Wait Time decreased by 25.171 seconds, and Wait Count decreased by 4,756.
The Locks and Waits analysis adds overhead to the application execution. The overhead often depends on the number of threads and synchronization objects used in the application. This is why the Elapsed Time data provided in the Summary window may differ from the time reported when the application is launched outside of the VTune Amplifier.
According to the Thread Concurrency histogram, after optimization (the orange bar) 15 threads ran in parallel, effectively utilizing CPU cores for almost 5 seconds, which the VTune Amplifier categorizes as Ideal processor utilization. The previous version of the application ran on 15 threads for 7 seconds.
In the Bottom-up pane, locate the OpenMP* critical section you identified as a bottleneck in your code. Since you removed it during optimization, the optimized result does not show any performance data for this synchronization object. If you collapse the Wait Time: Difference by Thread Concurrency column by clicking the button, you see that the optimized result reduced Wait time by 0.979 seconds. Using dynamic scheduling for the threads barrier reduced Wait time by almost 40 seconds.
Compare Timeline Data
Open the optimized result of the Locks and Waits analysis, open the Bottom-up tab, and analyze the Timeline pane.
The optimized result no longer shows multiple transitions. Although the threads are still not fully balanced, the wait regions have shrunk.
Compare analysis results regularly to look for regressions and to track how incremental changes to the code affect its performance. You may also want to use the VTune Amplifier command-line interface and run the amplxe-cl command to test your code for regressions. For more details, see the Command-line Interface Support section in the VTune Amplifier online help.
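A command-line regression check along these lines could be scripted. The Python sketch below only builds the amplxe-cl command lines: one that collects a Locks and Waits result and one that requests a summary report for two results so they can be compared. The result directory names (r_baseline, r_optimized) and the application name (./my_app) are hypothetical, and the helper functions are illustrative, not part of the VTune Amplifier. The options used (-collect locksandwaits, -result-dir, -report summary, and two -r options for a comparison) should be checked against the help for your VTune Amplifier version.

```python
import shutil
import subprocess

# amplxe-cl name of the Locks and Waits analysis type (verify for your version).
ANALYSIS_TYPE = "locksandwaits"


def collect_command(result_dir, app_argv):
    """Build the command line that collects a Locks and Waits result."""
    return ["amplxe-cl", "-collect", ANALYSIS_TYPE,
            "-result-dir", result_dir, "--"] + list(app_argv)


def compare_command(baseline_dir, candidate_dir):
    """Build the command line for a summary report over two results."""
    return ["amplxe-cl", "-report", "summary",
            "-r", baseline_dir, "-r", candidate_dir]


def run_if_available(argv):
    """Run argv if its executable is on PATH; return None otherwise."""
    if shutil.which(argv[0]) is None:
        return None
    # check=False: the caller inspects returncode and stdout itself.
    return subprocess.run(argv, capture_output=True, text=True, check=False)


if __name__ == "__main__":
    # Hypothetical names: replace with your own result directories and binary.
    print(" ".join(collect_command("r_optimized", ["./my_app"])))
    print(" ".join(compare_command("r_baseline", "r_optimized")))
```

To run the commands rather than print them, pass each list to run_if_available and inspect the returned returncode and stdout. Keeping the baseline result directory from the unoptimized build makes it possible to repeat the comparison after every incremental change.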