When the sample application exits, the Intel® VTune™ Amplifier finalizes the results and opens the Hotspots by CPU Usage viewpoint where each window or pane is configured to display data on application parallelism and usage of processor cores. To interpret the data on the sample code performance, do the following:
The screenshots and execution time data provided in this tutorial were created on a system with 4 CPU cores. Your data may vary depending on the number and type of CPU cores on your system.
Explore Application-level Concurrency
Start analysis with the Summary window that opens by default when data collection completes. To interpret the data, hover over the question mark icons to read the pop-up help and better understand what each performance metric means.
You see that after optimization, the Elapsed time has increased from 9.684 seconds to 15.261 seconds. Most of the CPU time was spent in Spin Time rather than Effective Time.
The Concurrency analysis adds an overhead to the application execution. The overhead often depends on the number of threads and synchronization objects used in the application. This is the reason why Elapsed time data provided in the Summary window may differ from the data reported after the application launch outside of the VTune Amplifier.
The CPU Usage Histogram shows that the average concurrency level of the sample application is about 3 while the target concurrency level for this application on the 4-core system is 4. If you hover over the highest bar, you see that this application has run 0 threads for almost 6 seconds, which is categorized by the VTune Amplifier as Idle processor utilization on this system.
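To make the "average level" reading of the histogram concrete, the following minimal Python sketch computes a time-weighted average from a histogram of elapsed seconds per utilization level. The histogram values, the `average_level` helper, and the target of 4 (one thread per core on a 4-core system) are assumptions made for illustration; they are not taken from the tutorial's result and this is not how the VTune Amplifier necessarily computes its metric.

```python
# Hypothetical histogram: elapsed seconds spent at each level of
# simultaneously utilized cores. Values are invented for illustration only.
seconds_at_level = {0: 0.5, 1: 1.0, 2: 1.5, 3: 3.0, 4: 4.0}

# Assumed target: one running thread per core on a 4-core system.
TARGET_LEVEL = 4


def average_level(histogram):
    """Return the time-weighted mean of the utilization level."""
    total_time = sum(histogram.values())
    if total_time == 0:
        raise ValueError("histogram contains no elapsed time")
    weighted = sum(level * secs for level, secs in histogram.items())
    return weighted / total_time


avg = average_level(seconds_at_level)
print(f"average level: {avg:.2f} (target: {TARGET_LEVEL})")
```

With these invented values the average falls below the target, which is the same kind of gap the histogram in the Summary window is meant to expose.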
Identify the Most Time-consuming Function
Click the Bottom-up tab to switch to the Bottom-up window and analyze application performance by function. By default, the grid is sorted by the CPU Time metric in descending order. Select the Process/Function/Thread/Call Stack grouping level from the Grouping menu. This granularity lets you see the threads in which the hotspot functions were executed.
After initial optimization, the nqueens_IP_setqueen function is still a bottleneck. Click the arrow sign at the nqueens_IP_setqueen function. You see that this function's execution was parallelized among fifteen threads.
Select these threads in the grid, right-click and choose the Filter In by Selection context menu option. The Timeline pane below is updated to display data for the selected threads only.
The Timeline pane for the sample application shows that workload balance is poor since many of the threads were waiting for OMP Worker Thread #7 to complete execution.
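The imbalance visible in the Timeline pane can be expressed numerically. The sketch below, a rough illustration only, takes per-thread busy times and reports how long each thread would wait for the slowest one, plus the ratio of the longest busy time to the mean. The thread names, the timings, and the `imbalance_report` helper are hypothetical and do not come from the collected result.

```python
# Hypothetical per-thread busy times in seconds (invented for illustration).
busy_seconds = {
    "thread A": 2.0,
    "thread B": 3.0,
    "thread C": 3.0,
    "thread D": 8.0,  # the straggler that the other threads wait for
}


def imbalance_report(busy):
    """Return (longest / mean busy time, per-thread wait for the slowest thread)."""
    if not busy:
        raise ValueError("no threads given")
    longest = max(busy.values())
    mean = sum(busy.values()) / len(busy)
    waits = {name: longest - secs for name, secs in busy.items()}
    return longest / mean, waits


ratio, waits = imbalance_report(busy_seconds)
print(f"imbalance ratio (longest / mean): {ratio:.2f}")
for name, wait in waits.items():
    print(f"{name}: waits {wait:.1f} s for the slowest thread")
```

A ratio close to 1 would indicate evenly distributed work; the larger the ratio, the more time the remaining threads spend waiting at the end of the parallel region.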
Run the Locks and Waits analysis to understand what prevents the sample code from effective thread concurrency and processor utilization.