When the sample application exits, the Intel® VTune™ Amplifier finalizes the results and opens the Hotspots by CPU Usage viewpoint where each window or pane is configured to display data on application parallelism and usage of processor cores. To interpret the data on the sample code performance, do the following:
The screenshots and execution time data provided in this tutorial were collected on a system with 4 CPU cores. Your data may vary depending on the number and type of CPU cores on your system.
Explore Application-level Concurrency
Start analysis with the Summary window that opens by default when data collection completes. To interpret the data, hover over the question mark icons to read the pop-up help and better understand what each performance metric means.
You see that after optimization the Elapsed time has decreased from 32.440 seconds to 18.351 seconds.
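To put the two Elapsed time values in perspective, the arithmetic can be sketched as follows. This is a minimal illustration, not part of the VTune Amplifier: the function names `speedup` and `reduction_percent` are hypothetical helpers, and only the two timings come from the Summary window described above.

```python
def speedup(before_s: float, after_s: float) -> float:
    """Return how many times faster the run became (before / after)."""
    if before_s <= 0 or after_s <= 0:
        raise ValueError("elapsed times must be positive")
    return before_s / after_s


def reduction_percent(before_s: float, after_s: float) -> float:
    """Return the drop in elapsed time as a percentage of the original time."""
    if before_s <= 0 or after_s <= 0:
        raise ValueError("elapsed times must be positive")
    return (before_s - after_s) / before_s * 100.0


if __name__ == "__main__":
    # Elapsed times quoted in this tutorial (seconds).
    before, after = 32.440, 18.351
    print(f"speedup:   {speedup(before, after):.2f}x")
    print(f"reduction: {reduction_percent(before, after):.1f}%")
```

With the tutorial's numbers this works out to a speedup of roughly 1.77x, or about a 43% shorter run. On a 4-core system that leaves room for further improvement, which is what the rest of this section investigates.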
The Concurrency analysis adds overhead to the application execution. The overhead often depends on the number of threads and synchronization objects used in the application. This is why the Elapsed time reported in the Summary window may differ from the time you observe when you run the application outside of the VTune Amplifier.
The CPU Usage Histogram shows that the average concurrency level of the sample application is about 3.2, while the target concurrency level for this application on the 4-core system is 4. If you hover over the highest bar, you see that this application ran 4 threads simultaneously for almost 11 seconds, which the VTune Amplifier categorizes as Ideal processor utilization. The application ran only one or two threads simultaneously for more than 4 seconds, which is classified as Poor parallelization.
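An average level such as the 3.2 reported above can be read as a time-weighted mean over the histogram bars. The sketch below shows that calculation under stated assumptions: the function `average_usage` is a hypothetical helper, and the per-bar durations are illustrative values chosen only to resemble the description above. They are not data exported from the VTune Amplifier, and the tool's own calculation may differ.

```python
from typing import Mapping


def average_usage(histogram: Mapping[int, float]) -> float:
    """Time-weighted mean of simultaneously running threads.

    `histogram` maps a thread count (the bar position) to the number of
    seconds the application spent at that level (the bar height).
    """
    total_time = sum(histogram.values())
    if total_time <= 0:
        raise ValueError("histogram must contain a positive total time")
    weighted = sum(level * seconds for level, seconds in histogram.items())
    return weighted / total_time


if __name__ == "__main__":
    # Hypothetical bar heights in seconds, NOT measured data.
    example = {1: 2.1, 2: 2.1, 3: 3.3, 4: 10.8}
    print(f"average level: {average_usage(example):.2f} (target: 4)")
```

The further the result falls below the target level, the more time the application spent with idle cores.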
Identify the Most Time-consuming Function
Click the Bottom-up tab to switch to the Bottom-up window and analyze application performance by function. By default, the grid is sorted by the CPU Time by Utilization metric in descending order. Select the Process/Function/Thread/Call Stack grouping level from the Grouping menu. This granularity lets you see the threads in which the hotspot functions were executed.
After initial optimization, the NQUEENS_ip_SETQUEEN function is still a bottleneck. Click the plus sign next to the NQUEENS_ip_SETQUEEN function to expand it. You see that this function's execution was parallelized among four threads.
Select these threads in the grid, right-click and choose the Filter In by Selection context menu option. The Timeline pane below is updated to display data for the selected threads only.
The Timeline pane for the sample application shows a large number of transitions between threads, which means that the threads spent noticeable time transferring execution to each other. If you uncheck the Transitions display option on the right, you see that the workload balance is also poor, since three of the four threads were waiting for OMP Worker Thread #1 to complete execution.
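The imbalance described above can be quantified with a simple ratio. The following is a minimal sketch, assuming you have a per-thread busy time for the parallel region. The helper names `idle_times` and `imbalance_ratio` and the sample timings are hypothetical and do not come from the VTune Amplifier.

```python
from typing import Dict, Mapping


def idle_times(busy_seconds: Mapping[str, float]) -> Dict[str, float]:
    """Seconds each thread waits for the slowest thread to finish."""
    if not busy_seconds:
        raise ValueError("at least one thread is required")
    longest = max(busy_seconds.values())
    return {name: longest - busy for name, busy in busy_seconds.items()}


def imbalance_ratio(busy_seconds: Mapping[str, float]) -> float:
    """Longest busy time divided by the mean; 1.0 means perfect balance."""
    if not busy_seconds:
        raise ValueError("at least one thread is required")
    values = list(busy_seconds.values())
    mean = sum(values) / len(values)
    if mean <= 0:
        raise ValueError("busy times must have a positive mean")
    return max(values) / mean


if __name__ == "__main__":
    # Hypothetical busy times in seconds, NOT measured data.
    busy = {
        "OMP Worker Thread #1": 4.0,
        "OMP Worker Thread #2": 2.0,
        "OMP Worker Thread #3": 2.0,
        "OMP Master Thread": 2.0,
    }
    for name, idle in idle_times(busy).items():
        print(f"{name}: waits {idle:.1f} s")
    print(f"imbalance ratio: {imbalance_ratio(busy):.2f}")
```

A ratio well above 1.0 corresponds to the Timeline picture in which most threads sit idle while one thread finishes its share of the work.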
Run the Locks and Waits analysis to understand what prevents the sample code from achieving effective thread concurrency and processor utilization.