Tutorial

Interpret Hotspots Result Data

When the sample application exits, the Intel® VTune™ Profiler finalizes the results and opens the Hotspots by CPU Utilization viewpoint, where each window or pane is configured to display code regions that consumed a lot of CPU time. To interpret the performance data for the sample code, work through the sections below.
Note: The screenshots and execution time data provided in this tutorial were created on a system with 4 cores and 8 threads. Your data may vary depending on the number and type of CPU cores on your system.

Understand the Hotspots Metrics

Start your investigation with the Summary window. To learn more about a particular metric, hover over the question mark icons to read the pop-up help and better understand what each performance metric means.
Summary window showing elapsed time metrics, top hotspots list, and CPU utilization histogram
Note that CPU Time for the sample application is equal to about 395 seconds. It is the sum of CPU time for all application threads. Total Thread Count is 9, so the sample application is multi-threaded.
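To make the relationship between these two metrics concrete, here is a minimal Python sketch of a total CPU Time computed as the sum of per-thread CPU times. The thread names and per-thread values are invented for illustration and are not taken from the tutorial's result; only the thread count of 9 and the approximate total of 395 seconds come from the text above.

```python
# Hypothetical per-thread CPU times in seconds (illustrative values only).
per_thread_cpu_time = {
    "main": 0.5,
    **{f"worker-{i}": 49.3125 for i in range(8)},
}

# CPU Time for the application is the sum over all of its threads.
total_cpu_time = sum(per_thread_cpu_time.values())
thread_count = len(per_thread_cpu_time)

print(f"Total Thread Count: {thread_count}")
print(f"CPU Time: {total_cpu_time:.3f} s")
```

Because the per-thread times add up, CPU Time can be much larger than Elapsed Time when several threads run in parallel.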
The Top Hotspots section of the Summary window provides data on the most time-consuming functions (hotspot functions) sorted by CPU time spent on their execution. For the sample application, the multiply1 function, which took 394.478 seconds to execute, shows up at the top of the list as the hottest function.
The Effective CPU Utilization Histogram, lower in the Summary window, represents the Elapsed Time and usage level for the available logical processors and provides a graphical look at how many logical processors were used during the application execution. Ideally, the highest bar of your chart should match the Target Utilization level. The matrix application used all logical CPUs for most of its run.
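The idea behind such a histogram can be sketched in a few lines of Python. This is an illustrative model only, not a description of how VTune Profiler computes the chart: the function name, the sample format (interval duration, number of busy logical CPUs), and the sample values are all assumptions.

```python
def utilization_histogram(samples, logical_cpus):
    """Accumulate elapsed seconds per count of simultaneously busy logical CPUs."""
    histogram = {n: 0.0 for n in range(logical_cpus + 1)}
    for duration_s, busy_cpus in samples:
        # Clamp so that oversubscription does not create out-of-range buckets.
        histogram[min(busy_cpus, logical_cpus)] += duration_s
    return histogram


# Hypothetical samples: (interval duration in seconds, busy logical CPUs).
samples = [(1.0, 0), (2.0, 4), (40.0, 8), (6.0, 7)]
hist = utilization_histogram(samples, logical_cpus=8)

# The tallest bar is the utilization level where most elapsed time was spent.
peak_level = max(hist, key=hist.get)
print(f"Most elapsed time was spent with {peak_level} logical CPUs busy")
```

In this made-up data the tallest bar sits at 8 logical CPUs, which is the shape you would expect when the highest bar matches the target utilization on an 8-thread system.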
The Insights pane highlights the most critical issues with the application and provides recommendations based on the collected results. In this case, it recommends reviewing the per-function performance statistics on the Bottom-up pane for the identified hotspots, such as the multiply1 function.
As an additional insight, VTune Profiler flagged an issue with the Microarchitecture Usage. The metric value is below the threshold, which indicates low code efficiency on this hardware platform. Possible causes of low performance can include memory stalls, instruction starvation, branch misprediction, or long latency instructions. After analyzing or resolving the algorithm issues for hotspot functions, run the Microarchitecture Exploration analysis type to identify the root cause of the Microarchitecture Usage issues.

Analyze the Most Time-Consuming Functions

To view per-function hotspots analysis, switch to the Bottom-up tab. By default, the data in the grid is sorted by Function. You may change the grouping level using the Grouping drop-down menu at the top of the grid.
Analyze the CPU Time column values. Functions that took the most CPU time to execute are listed at the top.
The multiply1 function took the most time to execute, 394.468 seconds, and spent the longest time at poor CPU utilization (red bars). This means that the processor cores were underutilized during a portion of the time spent executing this function.
Hotspots analysis bottom-up tab showing multiply1 function with both ideal and poor effective time by utilization bars
To get the detailed CPU utilization information per function, use the Expand button in the Bottom-up pane to expand the Effective Time by Utilization column.
Effective time by utilization column with idle, poor, ok, ideal, and over metrics shown
Select the multiply1 function in the grid and explore the data provided in the Call Stack pane on the right. The Call Stack pane displays full stack data for each hotspot function, which enables you to navigate between function call stacks and understand the impact of each stack on the function CPU time. The stack functions in the Call Stack pane are represented in the following format: <module>!<function> - <file>:<line number>, where the line number corresponds to the line calling the next function in the stack.
Call stack information for multiply1 function
For the sample application, the hottest function multiply1 is called at line 48 of the ThreadFunction function in the thrmodel.c file.
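If you copy stack entries out of the Call Stack pane, the <module>!<function> - <file>:<line number> format is straightforward to split into its parts. The following Python sketch shows one way to do it. The helper name parse_frame is hypothetical, and the module name matrix in the example string is an assumption; the function, file, and line number come from the text above.

```python
import re

# Matches "<module>!<function> - <file>:<line number>".
FRAME_PATTERN = re.compile(
    r"^(?P<module>[^!]+)!(?P<function>.+?) - (?P<file>.+):(?P<line>\d+)$"
)


def parse_frame(text):
    """Split one call stack entry into its fields, or return None if it does not match."""
    match = FRAME_PATTERN.match(text.strip())
    if match is None:
        return None
    frame = match.groupdict()
    frame["line"] = int(frame["line"])
    return frame


# The module name "matrix" is assumed for illustration.
frame = parse_frame("matrix!ThreadFunction - thrmodel.c:48")
print(frame)
```

The file group is matched greedily so that paths which themselves contain a colon still parse, with only the final colon separating the line number.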

Analyze Performance per Thread

If you change the grouping level in the Bottom-up pane from Function/Call Stack to Thread/Function/Call Stack, you see that the multiply1 function belongs to the ThreadFunction thread.
Grid view with grouping set to Thread / Function / Call Stack
To get detailed information on the thread performance, explore the Timeline pane.
Timeline showing graph of CPU time for each thread
1. Timeline area. When you hover over the graph element, the timeline tooltip displays the time passed since the application was launched.
2. Threads area that shows the distribution of CPU time utilization per thread. Hover over a bar to see the CPU time utilization in percent for this thread at each moment of time. Green zones show the time threads are active.
3. CPU Utilization area that shows the distribution of CPU time utilization for the whole application. Hover over a bar to see the application-level CPU time utilization in percent at each moment of time.
VTune Profiler calculates the overall CPU Utilization metric as the sum of CPU time for each thread in the Threads area. The maximum CPU Utilization value is equal to [number of processor cores] x 100%.
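The arithmetic described above can be sketched as follows. This is a minimal illustration in Python, not VTune Profiler code: the function names and the per-thread percentages are hypothetical, and the core count of 4 matches the system described at the start of this tutorial.

```python
def overall_cpu_utilization(per_thread_percent):
    """Overall utilization at one moment: the sum of per-thread utilization values."""
    return sum(per_thread_percent)


def max_cpu_utilization(processor_cores):
    """Upper bound on the metric: [number of processor cores] x 100%."""
    return processor_cores * 100


# Hypothetical per-thread utilization (in percent) at a single moment in time.
per_thread = [100, 95, 90, 60]

overall = overall_cpu_utilization(per_thread)
ceiling = max_cpu_utilization(processor_cores=4)
print(f"CPU Utilization: {overall}% of a possible {ceiling}%")
```

This is why the application-level value in the CPU Utilization area can exceed 100%: it is a sum over threads rather than an average.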

View Source Code

Double-click the multiply1 function on the Bottom-up pane grid to open the Source window and analyze the source code.
Source view window showing source code on the left and CPU utilization values on the right
The table below explains some of the features available in the Source window.
1. Source window toolbar. Use the hotspot navigation buttons to switch between the most performance-critical code lines. Use the Source/Assembly buttons to toggle the Source/Assembly panes (if both of them are available) on and off.
2. Source pane displaying the source code of the application if the function symbol information is available. The hottest code line in the function is highlighted. The source code in the Source pane is not editable. If the function symbol information is not available, the Assembly pane opens displaying assembler instructions for the selected hotspot function. To enable the Source pane, make sure to build the target properly.
3. Processor time attributed to a particular code line. If the hotspot is a system function, its time, by default, is attributed to the user function that called this system function. Drag and drop the columns to organize the view for your convenience. VTune Profiler remembers your settings and restores them each time you open the viewpoint.
4. Heat map markers to quickly identify performance-critical code lines (hotspots). The bright blue markers indicate hot lines for the function you selected for analysis. Light blue markers indicate hot lines for other functions. Scroll to a marker to locate the hot code line it identifies.
By default, when you double-click the hotspot in the Bottom-up pane, VTune Profiler opens the source file positioned at the most time-consuming code line of this function. For the multiply1 function, this is line 51, which operates over three arrays: a, b, and c.
Depending on the sample code version, your source line numbers may slightly differ from the numbers provided in this tutorial.
According to the Insights data on the Summary pane, the matrix application may use microarchitecture resources inefficiently. To learn more about possible issues, run the Microarchitecture Exploration analysis and identify the affected part of the core pipeline.
