You verified that removing the critical section reduced the application execution time by 268 ms. To understand the impact of your changes and see how CPU utilization has changed, re-run the Locks and Waits analysis on the optimized code and compare the results:
Compare Results Before and After Optimization
Run the Locks and Waits analysis on the modified code.
Click the Compare Results button on the VTune Amplifier toolbar.
The Compare Results window opens.
Specify the Locks and Waits analysis results you want to compare.
The Summary window opens providing the statistics for the difference between collected results.
Identify the Performance Gain by Metrics
The Result Summary section of the Summary window shows that all critical metric values decreased significantly after optimization. The Elapsed Time data shows an improvement of 4 seconds for the whole application. Wait Time decreased by 20.5 seconds and Wait Count decreased by 24,570.
The Locks and Waits analysis adds overhead to the application execution. The overhead often depends on the number of threads and synchronization objects used in the application. This is why the Elapsed Time data provided in the Summary window may differ from the time reported when you run the application outside of the VTune Amplifier.
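The differences quoted above can be reproduced by hand as "baseline minus optimized", so a positive value means the metric went down after optimization. The following is a minimal sketch of that arithmetic, not the way the VTune Amplifier computes its Summary. The function name `metric_deltas` and the metric keys are hypothetical, and the absolute values are invented placeholders chosen only so that the differences match the 4 seconds, 20.5 seconds, and 24,570 reported above.

```python
def metric_deltas(baseline, optimized):
    """Return baseline - optimized for every metric present in both results."""
    return {
        name: baseline[name] - optimized[name]
        for name in baseline
        if name in optimized
    }


# Placeholder absolute values (NOT taken from the tutorial results);
# only the resulting differences correspond to the numbers in the text.
baseline = {"elapsed_s": 22.0, "wait_time_s": 30.0, "wait_count": 30000}
optimized = {"elapsed_s": 18.0, "wait_time_s": 9.5, "wait_count": 5430}

deltas = metric_deltas(baseline, optimized)
for name, delta in deltas.items():
    # A positive difference means the optimized run spent less on this metric.
    trend = "improved" if delta > 0 else "not improved"
    print(f"{name}: {delta:+} ({trend})")
```

Because both runs are collected under the same analysis, the collection overhead affects both sides of the subtraction, which is one reason to compare two analysis results rather than an analysis result against a plain run.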
According to the Thread Concurrency histogram, after optimization (orange bar) 4 threads ran in parallel for 14 seconds, effectively utilizing the CPU cores. The VTune Amplifier categorizes this as Ideal processor utilization. The previous version of the application ran on 4 threads for 11.5 seconds.
In the Bottom-up pane, locate the OpenMP* critical section you identified as a bottleneck in your code. Since you removed it during optimization, the optimized result r003lw does not show any performance data for this synchronization object. If you collapse the Wait Time:Difference by Utilization column by clicking the button, you see that the optimized result reduced Wait Time by almost 12 seconds. Using dynamic scheduling for the thread barrier reduced Wait Time by 4.5 seconds.
Compare Timeline Data
Open the optimized Locks and Waits analysis result r003lw, click the Bottom-up tab, and analyze the Timeline pane.
The optimized result no longer shows transitions. Although the threads are not fully balanced, the wait regions have shrunk.
Compare analysis results regularly to look for regressions and to track how incremental changes to the code affect its performance. You may also want to use the VTune Amplifier command-line interface and run the amplxe-cl command to test your code for regressions. For more details, see the Command-line Interface Support section in the VTune Amplifier online help.
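As a rough illustration of command-line regression testing, the sketch below builds two amplxe-cl command lines: one that collects a Locks and Waits result and one that reports a summary comparison of two results. It only prints the commands and does not run them. The helper names, the baseline result name `baseline_lw`, and the application path `./my_app` are hypothetical. The option names (`-collect`, `-result-dir`, `-report`, `-r`) and the analysis type `locksandwaits` are assumptions that may differ between product versions, so check them against `amplxe-cl -help` for your installation.

```python
import shlex


def collect_command(result_dir, app_args, analysis="locksandwaits"):
    """Build an amplxe-cl command that collects one analysis result.

    The analysis type name is an assumption; verify it for your version.
    """
    return ["amplxe-cl", "-collect", analysis,
            "-result-dir", result_dir, "--", *app_args]


def compare_command(baseline_dir, optimized_dir, report="summary"):
    """Build an amplxe-cl command that reports on two results side by side."""
    return ["amplxe-cl", "-report", report,
            "-r", baseline_dir, "-r", optimized_dir]


# "baseline_lw" and "./my_app" are placeholders; r003lw is the optimized
# result named in this tutorial.
collect = collect_command("r003lw", ["./my_app"])
compare = compare_command("baseline_lw", "r003lw")

print(shlex.join(collect))
print(shlex.join(compare))
```

In a nightly build, a script could pass these argument lists to `subprocess.run` and fail the job when the reported Elapsed Time or Wait Time exceeds a threshold you choose.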