Interpret Microarchitecture Exploration Analysis Result Data
When the sample application exits, Intel® VTune™ Profiler finalizes the results and opens the Microarchitecture Exploration viewpoint, which provides a high-level performance overview of the interaction between the application and the available hardware.
To interpret the data on the sample code performance, do the following:
Understand the Event-based Metrics
Start with the
Summary
pane for an overview of application performance.

The µPipe diagram provides a graphical representation of CPU microarchitecture metrics showing inefficiencies in hardware usage. Treat the diagram as a pipe with an output flow equal to the ratio:
Actual Instructions Retired / Possible Maximum Instructions Retired
(pipe efficiency). The µPipe is based on CPU pipeline slots that represent the hardware resources needed to process one micro-operation. Usually several pipeline slots are available on each cycle (pipeline width). If a pipeline slot does not retire a micro-operation, this counts as a stall, and the µPipe diagram represents it as an obstacle that narrows the pipe.
See the
Microarchitecture Pipe page of the online User Guide for a more detailed explanation of the µPipe.
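As a rough illustration of the pipe efficiency ratio, the sketch below computes the fraction of pipeline slots that retired work. The function name, the choice of a pipeline width of 4, and the sample counts are assumptions made for illustration; they are not values or APIs provided by VTune Profiler, and the "possible maximum" is simplified to pipeline width multiplied by clock cycles.

```c
#include <stdint.h>

/* Fraction of pipeline slots that retired useful work.
 * Simplifying assumption: the possible maximum is
 * pipeline_width slots on every clock cycle. */
double pipe_efficiency(uint64_t retired_slots,
                       uint64_t clockticks,
                       unsigned pipeline_width)
{
    uint64_t total_slots = clockticks * (uint64_t)pipeline_width;
    if (total_slots == 0) {
        return 0.0;  /* no cycles counted, nothing to report */
    }
    return (double)retired_slots / (double)total_slots;
}
```

With these hypothetical inputs, `pipe_efficiency(440, 1000, 4)` corresponds to 440 retired slots out of 4000 available, or about 11%, the same order of magnitude as the result described in this tutorial.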
In this case, the
Memory Bound
metric is high, so only a small fraction (approximately
11%
) of pipeline slots are being retired. Hover over each section for a description and percentage of the total pipeline or refer to the metrics on the left.
The hierarchy of event-based metrics in the Microarchitecture Exploration viewpoint depends on your hardware architecture. Each metric is an event ratio defined by Intel architects and has its own predefined threshold.
VTune Profiler analyzes a ratio value for each aggregated program unit (for example, function). When this value exceeds the threshold, it signals a potential performance problem.
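The threshold check can be pictured with a minimal sketch like the one below. The struct layout, the field names, and any threshold values you pass in are hypothetical; the real thresholds are predefined per metric and per architecture and are not reproduced here.

```c
#include <stdbool.h>

/* One event-based metric for one program unit (for example, a function).
 * All fields are illustrative; they do not mirror a VTune data format. */
struct metric_sample {
    const char *unit;   /* program unit name */
    double value;       /* measured event ratio */
    double threshold;   /* predefined limit for this metric */
};

/* A metric signals a potential performance problem
 * when its value exceeds the threshold. */
bool is_flagged(const struct metric_sample *s)
{
    return s->value > s->threshold;
}
```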

The
Elapsed Time
section shows metrics related to hardware event ratios for your hardware. Hover over the flagged metrics to get a description of the issues, possible causes, and suggestions for resolving the issue. This result shows issues with both
CPI Rate
(Clockticks per Instructions Retired rate) and
Back-End Bound
. Both issues were identified as possible causes for slow execution by the original
Hotspots
analysis. In the expanded
Back-End Bound
section, there are issues with the application being
Memory Bound
, which matches the µPipe diagram. The
Bottom-up
pane can help identify the program units responsible for the memory issues.
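For reference, the CPI Rate named above is a simple quotient of two event counts. The sketch below shows that arithmetic; the function name and the handling of a zero instruction count are assumptions for illustration only.

```c
#include <stdint.h>

/* CPI Rate: clockticks divided by instructions retired.
 * Returns 0.0 when no instructions were counted (an arbitrary
 * convention chosen for this sketch). */
double cpi_rate(uint64_t clockticks, uint64_t instructions_retired)
{
    if (instructions_retired == 0) {
        return 0.0;
    }
    return (double)clockticks / (double)instructions_retired;
}
```

For example, 3000 clockticks spent on 1000 retired instructions gives a CPI Rate of 3.0, meaning each instruction took three cycles on average.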
Identify Hardware Usage Bottlenecks
Switch to the
Bottom-up
pane to see how each program unit performs against the event-based metrics. Each row represents a program unit and the percentage of CPU cycles that unit used. Program units that take more than 5% of the CPU time are considered hotspots.
By default, VTune Profiler sorts data in descending order by CPU Time, which places the hotspots at the top of the list. The metric values for event ratios show up as numbers and/or bars.
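The ordering and the 5% rule described above can be sketched as follows. The struct, the function names, and the sample figures in the usage note are hypothetical; this illustrates the rule only and is not how the Bottom-up pane is implemented.

```c
#include <stddef.h>
#include <stdlib.h>

/* CPU time attributed to one program unit (illustrative layout). */
struct unit_time {
    const char *name;
    double cpu_time;
};

/* qsort comparator: larger CPU time sorts first. */
static int by_cpu_time_desc(const void *a, const void *b)
{
    const struct unit_time *x = a;
    const struct unit_time *y = b;
    return (x->cpu_time < y->cpu_time) - (x->cpu_time > y->cpu_time);
}

/* Sorts units by CPU time in descending order and returns how many
 * of them take more than 5% of the total, so the hotspots occupy
 * the first entries of the array. */
size_t rank_hotspots(struct unit_time *units, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        total += units[i].cpu_time;
    }
    qsort(units, n, sizeof units[0], by_cpu_time_desc);

    size_t hot = 0;
    while (hot < n && units[hot].cpu_time > 0.05 * total) {
        hot++;
    }
    return hot;
}
```

Given three hypothetical units with CPU times of 1.0, 95.0, and 4.0 seconds, the function moves the 95.0-second unit to the front and reports one hotspot, because only that unit exceeds 5% of the 100-second total.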

As was identified when running the
Hotspots
analysis, the
multiply1
function is the most obvious hotspot in the
matrix
application. It has the highest event count (Clockticks
and
Instructions Retired
events) and most of the hardware issues were also detected during the execution of this function.
The
Back-End Bound
metric describes a portion of the pipeline where the out-of-order scheduler dispatches ready µOps into their respective execution units, and, once completed, these µOps get retired according to program order. Use it to identify slots where no µOps are delivered due to a lack of required resources for accepting more µOps in the back-end of the pipeline. Stalls due to data-cache misses or stalls due to the overloaded divider unit are examples of back-end bound issues.
Expand the
Back-End Bound
column to discover that the code is memory bound, with the largest percentage of stalls occurring on accesses to main memory (DRAM). Hover over the highlighted cells to learn more about optimization opportunities.

Analyze Code
Double-click the
multiply1
function to open the
Source
window and analyze the source code.

When you drill down from the grid to the source view, VTune Profiler automatically highlights the code line that has the highest event count. In the
Source
pane for the
multiply1
function, you see that line 51 accounted for most of the Clockticks event samples during execution and was also highlighted as the top hotspot line in the Hotspots result. This code section multiplies matrices in a loop but accesses memory inefficiently. Expand the
Back-End Bound
column to learn more. Focus on this section and try to reduce the memory issues.
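One common remedy for this kind of access pattern is loop interchange. The sketch below assumes that the hot loop resembles a naive triple-nested multiplication over row-major arrays. The function names, the fixed size `N`, and the loop bodies are illustrative assumptions and are not copied from the sample's source.

```c
#define N 64  /* hypothetical matrix dimension */

/* Naive order (i, j, k): the inner loop walks b down a column,
 * so consecutive iterations touch addresses N doubles apart.
 * The caller must zero-initialize c. */
void multiply_naive(double a[N][N], double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            for (int k = 0; k < N; k++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

/* Interchanged order (i, k, j): the inner loop walks b and c
 * along a row, so consecutive iterations touch adjacent memory.
 * The caller must zero-initialize c. */
void multiply_interchanged(double a[N][N], double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int k = 0; k < N; k++) {
            for (int j = 0; j < N; j++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}
```

Both functions perform the same multiply-add operations and differ only in traversal order. Whether the interchanged version actually reduces the Memory Bound metric for a given matrix size and machine is something to confirm by re-running the analysis.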
If you are an advanced user looking for a different way to identify and diagnose memory issues in your application, try running the Memory Access analysis type. An example of how to determine which data structure induces inefficient memory access is available in the VTune Profiler Cookbook.