When the application exits, the Intel® VTune™ Amplifier finalizes the results and opens the Hardware Issues viewpoint, which consists of the Summary window, the Bottom-up window, and the Timeline pane. To interpret the collected data and understand where to focus your tuning efforts for the specific hardware, do the following:
The screenshots and execution time data provided in this tutorial were created on a system with four CPU cores. Your data may vary depending on the number and type of CPU cores on your system.
Understand the Event-based Metrics
Identify the Hardware Issues
Click the Bottom-up tab to open the Bottom-up window and see how each program unit performs against the event-based metrics. Each row represents a program unit and shows the percentage of CPU cycles used by this unit. Program units that take more than 5% of the CPU time are considered hotspots. This means that by resolving a hardware issue that, for example, accounts for about 20% of a hotspot's CPU cycles, you can reduce the cycles spent in that hotspot by up to about 20%.
You see that the multiply1 function is the most obvious hotspot in the matrix application. It has the highest event counts (Clockticks and Instructions Retired events), and most of the hardware issues were also detected during execution of this function.
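The 5% rule of thumb above can be sketched in a few lines of C. This is only an illustration: the `unit_sample` record and both function names are hypothetical and are not part of the VTune Amplifier or the matrix sample. It assumes you have copied per-unit Clockticks counts out of the Bottom-up grid.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one row of the Bottom-up grid. */
struct unit_sample {
    const char *name;              /* program unit, e.g. a function name */
    unsigned long long clockticks; /* Clockticks event count for the unit */
};

/* Fraction of all sampled clockticks attributed to units[i]. */
double cycle_share(const struct unit_sample *units, size_t n, size_t i)
{
    assert(units != NULL && i < n);
    unsigned long long total = 0;
    for (size_t k = 0; k < n; ++k)
        total += units[k].clockticks;
    return total ? (double)units[i].clockticks / (double)total : 0.0;
}

/* Applies the 5% threshold from the text: nonzero if the unit is a hotspot. */
int is_hotspot(const struct unit_sample *units, size_t n, size_t i)
{
    return cycle_share(units, n, i) > 0.05;
}
```

A unit that passes this check is a candidate for the per-metric analysis described in the rest of this section.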
Mouse over a column header with an event-based metric name to see the metric description. Mouse over a highlighted cell to read the description of the hardware issue detected for the program unit.
CPI Rate (cycles per instruction retired) is high (>1). Potential causes are memory stalls, instruction starvation, branch misprediction, or long-latency instructions. To determine the cause for your code, explore the other metrics in the Bottom-up window.
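As a rough illustration of how the CPI Rate relates to the two events named earlier, the sketch below divides a Clockticks count by an Instructions Retired count and applies the >1 threshold. The function names are hypothetical, and the exact event formula the tool uses on your processor may differ.

```c
#include <assert.h>

/* Cycles per instruction retired: clockticks / instructions retired. */
double cpi_rate(unsigned long long clockticks,
                unsigned long long instructions_retired)
{
    assert(instructions_retired > 0); /* avoid division by zero */
    return (double)clockticks / (double)instructions_retired;
}

/* Applies the threshold from the text: a CPI above 1 is flagged as high. */
int cpi_is_high(double cpi)
{
    return cpi > 1.0;
}
```

For example, 300 clockticks over 100 retired instructions gives a CPI of 3.0, which this check flags as high.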
The Retire Stalls metric shows that during the execution of the multiply1 function, about 90% (0.902) of CPU cycles were spent waiting for data to arrive. This may result from branch misprediction, instruction starvation, long-latency operations, and other issues. Once you have located the stalled instructions in your code, analyze metrics such as LLC Miss, Execution Stalls, Remote Accesses, Data Sharing, and Contested Accesses. You can also look for long-latency instructions, such as divisions and string operations, to understand the cause.
The LLC Miss metric shows that about 60% (0.592) of CPU cycles were spent waiting for LLC load misses to be serviced. Possible optimizations are to reduce the data working set size, improve data access locality, block the data and consume it in chunks that fit in the LLC, or better exploit hardware prefetchers. Consider using software prefetchers, but beware that they can increase latency by interfering with normal loads and can increase pressure on the memory system.
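One common way to improve data access locality in a matrix multiplication is loop interchange. The sketch below is a minimal illustration, not the code of the matrix sample: the dimension `N` and both function names are hypothetical. It assumes row-major `double` matrices and that `c` is zeroed by the caller.

```c
#include <assert.h>
#include <stddef.h>

#define N 64 /* hypothetical matrix dimension */

/* Naive order: the inner loop reads b[k][j] with a stride of N elements,
 * so consecutive iterations touch different cache lines. */
void multiply_ijk(double a[N][N], double b[N][N], double c[N][N])
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < N; ++j)
            for (size_t k = 0; k < N; ++k)
                c[i][j] += a[i][k] * b[k][j];
}

/* Interchanged order: the inner loop walks b[k][*] and c[i][*] with
 * unit stride, so each fetched cache line is fully used. */
void multiply_ikj(double a[N][N], double b[N][N], double c[N][N])
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < N; ++i)
        for (size_t k = 0; k < N; ++k) {
            const double aik = a[i][k];
            for (size_t j = 0; j < N; ++j)
                c[i][j] += aik * b[k][j];
        }
}
```

Both functions accumulate each element of `c` in the same order of `k`, so they produce the same result. Whether the interchanged version reduces LLC misses on your system is something to confirm by re-running the analysis.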
The LLC Load Misses Serviced by Remote DRAM metric shows that 34% (0.340) of cycles were spent servicing memory requests from remote DRAM. Wherever possible, try to consistently use data on the same core, or at least the same package, on which it was allocated.
The Execution Stalls metric shows that 54% (0.543) of cycles were spent with no micro-operations executed. Look for long-latency operations in code regions with high execution stalls and try to use alternative methods or lower-latency operations. For example, consider replacing div operations with right-shifts where the divisor is a power of two, or try to reduce the latency of memory accesses.
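The division-to-shift replacement mentioned above can be sketched as follows. This is an illustration under stated assumptions: the helper names are hypothetical, and the equivalence holds only for unsigned operands and divisors that are powers of two. Optimizing compilers typically perform this rewrite themselves when the divisor is a compile-time constant, so it matters mainly when the divisor is known only at run time.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if d is a power of two (exactly one bit set). */
int is_pow2(uint32_t d)
{
    return d != 0 && (d & (d - 1)) == 0;
}

/* Computes x / (1u << shift) with a right shift instead of a div.
 * Valid for unsigned x only; signed division rounds differently. */
uint32_t div_pow2(uint32_t x, unsigned shift)
{
    assert(shift < 32); /* shifting by >= the type width is undefined */
    return x >> shift;
}
```

For instance, `div_pow2(1000, 3)` yields the same value as `1000 / 8`.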