You verified that removing the mutex reduced the application execution time by 14.196 seconds. To understand the impact of your changes and how the CPU utilization has changed, re-run the Locks and Waits analysis on the optimized code and compare the results:
Compare Results Before and After Optimization
Identify the Performance Gain at the Application Level
The Elapsed Time data in the Summary window shows that the whole application now runs 13 seconds faster and that Wait Time has decreased by 143.5 seconds. The Spin Time value has also decreased significantly, though it is still above the threshold.
According to the Thread Concurrency histogram, before optimization (blue bar) the application ran serially for 17.680 seconds, poorly utilizing the available processor cores, but after optimization (orange bar) it ran serially for only 1.5 seconds.
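The histogram figures can be turned into a single gain number with simple arithmetic. The snippet below is only an illustrative sketch, not part of VTune Amplifier: the variable names are made up for this example, and the two values are the serial-execution times quoted above.

```python
# Serial-execution times read from the Thread Concurrency histogram.
serial_before_s = 17.680  # result before optimization (blue bar)
serial_after_s = 1.5      # result after optimization (orange bar)

# Absolute and relative reduction of the serial-execution time.
saved_s = serial_before_s - serial_after_s
saved_pct = 100.0 * saved_s / serial_before_s

print(f"Serial time removed: {saved_s:.2f} s ({saved_pct:.1f}% of the baseline)")
# prints "Serial time removed: 16.18 s (91.5% of the baseline)"
```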
Identify the Performance Gain Per Program Unit
Click the Bottom-up tab to see the list of synchronization objects used in the code, the Wait time for each of the two results, and the differences side by side:
In the Bottom-up pane, locate the Critical Section you identified as a bottleneck in your code. Since you removed it during optimization, the optimized result r001lw does not show any performance data for this synchronization object. The comparison shows that the optimized result has almost 121 seconds less Wait time.
Compare analysis results regularly to look for regressions and to track how incremental changes to the code affect its performance. You may also want to use the VTune Amplifier command-line interface and run the amplxe-cl command to test your code for regressions. For more details, see the Command-line Interface Support section in the VTune Amplifier online help.
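Once the metrics from two results are available as numbers, a regression check reduces to a thresholded comparison. The following is a minimal sketch under stated assumptions: `find_regressions`, the metric names, the 5% tolerance, and the sample values are all hypothetical and are not produced by amplxe-cl; in practice you would fill the dictionaries from your own reports.

```python
def find_regressions(baseline, current, tolerance=0.05):
    """Return the metrics whose value grew by more than `tolerance` (a fraction)."""
    regressions = {}
    for name, old in baseline.items():
        new = current.get(name)
        if new is None:
            continue  # metric missing from the new result, nothing to compare
        if new > old * (1.0 + tolerance):
            regressions[name] = (old, new)
    return regressions


# Hypothetical metric values in seconds, for illustration only.
baseline = {"elapsed_s": 20.0, "wait_s": 30.0}
current = {"elapsed_s": 20.5, "wait_s": 36.0}

print(find_regressions(baseline, current))
# prints "{'wait_s': (30.0, 36.0)}"
```

Lower values are treated as better here, which suits time-based metrics such as Elapsed Time and Wait Time; a metric where higher is better would need the comparison inverted.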