Interpret Results

When the application exits, the Intel® VTune™ Amplifier finalizes the results and opens the Hardware Issues viewpoint that consists of the Summary window, Bottom-up window, and Timeline pane. To interpret the collected data and understand where you should focus your tuning efforts for the specific hardware, do the following:

  1. Understand the event-based metrics

  2. Identify the hardware issues that affect the performance of your application


Understand the Event-based Metrics

Click the Summary tab to explore the data provided in the Summary window on the performance of the whole application.

Hardware Issues Viewpoint: Summary Window

Elapsed time is the wall time from the beginning to the end of the collection. Treat this metric as your basic performance baseline against which you will compare subsequent runs of the application. The goal of your optimization is to reduce the value of this metric.

Event-based performance metrics. Each metric is an event ratio defined by Intel architects. Mouse over the yellow icon to see the metric description and the formula used to calculate it. You can also right-click the column and select the What's This Column? context menu option to read the metric description and learn about possible performance issues.

Values for each metric are calculated from the event counts. VTune Amplifier highlights the values that exceed the threshold set for the corresponding metric. A value highlighted in pink signifies an application-level hardware issue.

The text below a metric with a detected hardware issue describes the issue, its potential causes, and recommended next steps, and displays the threshold formula used for the calculation. Mouse over the truncated text to read the full description.

A quick look at the summary results shows that the matrix application has the following issues:

  • CPI (Clockticks per Instructions Retired) Rate

  • Retire Stalls

  • LLC Miss

  • LLC Load Misses Serviced by Remote DRAM

  • Execution Stalls

  • Data Sharing

Identify the Hardware Issues

Click the Bottom-up tab to open the Bottom-up window and see how each program unit performs against the event-based metrics. Each row represents a program unit and the percentage of CPU cycles used by that unit. Program units that take more than 5% of the CPU time are considered hotspots. This means that by resolving a hardware issue that accounts for, say, about 20% of the CPU cycles, you can gain up to a 20% improvement for that hotspot.

By default, the VTune Amplifier sorts the data by Clockticks in descending order, placing the hotspots at the top of the list.

Hardware Issues Viewpoint: Bottom-up Window

You see that the multiply1 function is the most obvious hotspot in the matrix application: it has the highest event counts (Clockticks and Instructions Retired), and most of the hardware issues were also detected during its execution.


Mouse over a column header with an event-based metric name to see the metric description. Mouse over a highlighted cell to read the description of the hardware issue detected for the program unit.

For the multiply1 function, the VTune Amplifier highlights the same issues (except for the Data Sharing issue) that it detected for the application as a whole:

  • CPI Rate is high (>1). Potential causes are memory stalls, instruction starvation, branch misprediction, or long-latency instructions. To identify the cause in your code, explore the other metrics in the Bottom-up window.

  • The Retire Stalls metric shows that during the execution of the multiply1 function, about 90% (0.902) of CPU cycles were spent waiting for data to arrive. This may result from branch misprediction, instruction starvation, long-latency operations, and other issues. Once you have located the stalled instructions in your code, analyze metrics such as LLC Miss, Execution Stalls, Remote Accesses, Data Sharing, and Contested Accesses. You can also look for long-latency instructions, such as divisions and string operations, to understand the cause.

  • The LLC Miss metric shows that about 60% (0.592) of CPU cycles were spent waiting for LLC load misses to be serviced. Possible optimizations are to reduce the data working set size, improve data access locality, block and consume data in chunks that fit in the LLC, or better exploit the hardware prefetchers. Consider using software prefetchers, but be aware that they can increase latency by interfering with normal loads and can increase pressure on the memory system.

  • The LLC Load Misses Serviced by Remote DRAM metric shows that 34% (0.340) of cycles were spent servicing memory requests from remote DRAM. Wherever possible, try to consistently use data on the same core, or at least the same package, where it was allocated.

  • The Execution Stalls metric shows that 54% (0.543) of cycles were spent with no micro-operations executed. Look for long-latency operations in code regions with high execution stalls and try to use alternative methods or lower-latency operations. For example, consider replacing div operations with right shifts, or try to reduce the latency of memory accesses.
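The blocking optimization recommended for the LLC Miss issue can be sketched for a square matrix multiply like the one in the matrix sample. This is a minimal illustration, not the tutorial's actual multiply1 code: the names multiply_naive and multiply_blocked, the matrix size N, and the tile size BLOCK are all assumptions, and the best tile size depends on the cache sizes of your CPU.

```c
#include <stddef.h>

#define N 256     /* matrix dimension; illustrative, the sample uses a larger size */
#define BLOCK 64  /* tile edge; chosen so a few BLOCK x BLOCK tiles fit in cache */

/* Naive triple loop: for large N, each pass down a column of b evicts
   useful lines from the LLC, so most loads of b miss.
   Both functions assume c is zero-initialized. */
void multiply_naive(const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            for (size_t k = 0; k < N; k++)
                c[i * N + j] += a[i * N + k] * b[k * N + j];
}

/* Blocked (tiled) loop: the same arithmetic, but data is consumed in
   BLOCK x BLOCK tiles that fit in the LLC, so each tile is fully reused
   before it is evicted. N must be a multiple of BLOCK in this sketch. */
void multiply_blocked(const double *a, const double *b, double *c)
{
    for (size_t ii = 0; ii < N; ii += BLOCK)
        for (size_t kk = 0; kk < N; kk += BLOCK)
            for (size_t jj = 0; jj < N; jj += BLOCK)
                for (size_t i = ii; i < ii + BLOCK; i++)
                    for (size_t k = kk; k < kk + BLOCK; k++)
                        for (size_t j = jj; j < jj + BLOCK; j++)
                            c[i * N + j] += a[i * N + k] * b[k * N + j];
}
```

Both versions perform identical arithmetic; only the traversal order changes, which is why blocking improves the LLC Miss metric without affecting the result.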
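The div-to-shift substitution suggested for the Execution Stalls issue can be sketched as follows. The function names div_pow2 and div_pow2_shift are illustrative, and __builtin_ctz assumes GCC or Clang; note that for a compile-time-constant divisor the compiler already performs this rewrite, so it pays off mainly when the power-of-two divisor is a runtime value.

```c
#include <stdint.h>

/* Unsigned division by a runtime divisor compiles to a DIV instruction,
   which is a long-latency operation (tens of cycles on many Intel cores). */
uint32_t div_pow2(uint32_t x, uint32_t d)        /* d must be a power of two */
{
    return x / d;
}

/* When d is known to be a power of two, the DIV can be replaced with a
   much cheaper right shift by log2(d). __builtin_ctz (GCC/Clang) counts
   trailing zero bits, which equals log2(d) for a power of two. */
uint32_t div_pow2_shift(uint32_t x, uint32_t d)  /* d must be a power of two */
{
    return x >> __builtin_ctz(d);
}
```

The two functions return the same result for any unsigned x and any power-of-two d; the precondition on d is what makes the shift safe.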

Next Step

Analyze Code