When the application exits, the Intel® VTune™ Amplifier finalizes the results and opens the Hardware Issues viewpoint. To interpret the collected data and understand where you should focus your tuning efforts for the specific hardware, do the following:
For a detailed tuning methodology based on the events and metrics in General Exploration, visit https://software.intel.com/en-us/articles/processor-specific-performance-analysis-papers.
The screenshots and execution time data provided in this tutorial are created on a system with 4 CPU cores, based on the Intel microarchitecture code name Nehalem. Your data may vary depending on the number and type of CPU cores on your system.
Understand the Event-based Metrics
The Elapsed time metric shows the wall time from the beginning to the end of the collection. Treat this metric as your basic performance baseline against which you will compare subsequent runs of the application. The goal of your optimization is to reduce the value of this metric. All other metrics in this section are hardware event ratios provided by Intel architects. Mouse over the icon to see the metric description and formula used for the metric calculation. VTune Amplifier highlights metrics values that exceed the threshold set for the corresponding metric. Such a value highlighted in pink signifies an application-level hardware issue. The text below a metric with the detected hardware issue describes the issue, potential cause and recommendations on the next steps, and displays a threshold formula used for calculation. Mouse over the truncated text to read a full description.
Identify the Hardware Issues
Click the Bottom-up tab to open the Bottom-up window and see how each program unit performs against the event-based metrics. Each row represents a program unit and percentage of the CPU cycles used by this unit. Program units that take more than 5% of the CPU time are considered hotspots. This means that by resolving a hardware issue that, for example, took about 20% of the CPU cycles, you can obtain 20% optimization for the hotspot.
By default, the VTune Amplifier sorts data in the descending order by Clockticks and provides the hotspots at the top of the list. The metric values for event ratios show up as numbers or bars. To change the data format, right-click a column and select Show Data As > format.
You see that the
multiply1 function is the most obvious hotspot in the
matrix application. It has the highest event count (Clockticks and Instructions Retired events) and most of the hardware issues were also detected during execution of this function.
Mouse over a column header with an event-based metric name to see the metric description. Mouse over a highlighted cell to read the description of the hardware issue detected for the program unit.
CPI Rate is high (>1). Potential causes are memory stalls, instruction starvation, branch misprediction, or long-latency instruction. To define the cause for your code, explore other metrics in the Bottom-up window.
LLC miss metric shows that about 70% (0.703) of CPU cycles were spent waiting for LLC load misses to be serviced. Possible optimizations are to reduce data working set size, improve data access locality, blocking and consuming data in chunks that fit in the LLC, or better exploit hardware prefetchers. Consider using software prefetchers but beware that they can increase latency by interfering with normal loads and can increase pressure on the memory system.