General Exploration Analysis

The General Exploration analysis type uses event-based sampling collection.

This analysis is a good starting point to triage hardware issues in your application. Once you have used Basic Hotspots or Advanced Hotspots analysis to determine hotspots in your code, you can perform General Exploration analysis to understand how efficiently your code is passing through the core pipeline. During General Exploration analysis, the VTune Amplifier collects a complete list of events for analyzing a typical client application. From these events, it calculates a set of predefined ratios that serve as metrics and help identify hardware-level performance problems. The list of events and metrics collected during the General Exploration analysis depends on your microarchitecture.

To see the list of events collected for your system:

  1. Click the New Analysis toolbar button.

    The Analysis Type window opens.

  2. From the left pane, select Microarchitecture Analysis > General Exploration.

    The General Exploration configuration pane opens on the right. The Details section provides a table with the processor events used for this analysis type.

Note

For event descriptions, see the Intel Processor Event Reference available from the Help menu.

The General Exploration analysis strategy also varies by microarchitecture:

Intel® Core™ 2 Processor Family

The General Exploration analysis for Intel Core 2 processors collects events that help locate basic performance issues and understand their impact on performance. The following issues can be identified:

  • overall number of cycles when no instructions were dispatched for execution

  • instruction starvation caused either by branch misprediction or other issues

  • long latency memory accesses due to bad locality of accesses or bandwidth saturation

Precise memory events used in this analysis type help identify the problems in data access patterns and with specific instructions.

Intel® Microarchitectures Code Name Nehalem and Westmere

During General Exploration analysis on Intel microarchitectures code name Nehalem and Westmere, the VTune Amplifier collects metrics that help identify such hardware-level performance problems as:

  • Front End stall and its causes

  • Stalls at execution and retirement, particularly those caused by high-latency loads, wasted work due to branch misprediction, or long-latency instructions

Intel® Microarchitecture Code Name Sandy Bridge

Superscalar processors can be conceptually divided into the front-end, where instructions are fetched and decoded into the operations that constitute them, and the back-end, where the required computation is performed. Each cycle, the front-end generates up to four of these operations. It places them into pipeline slots that then move through the back-end. Thus, for a given execution duration in clock cycles, it is easy to determine the maximum number of pipeline slots containing useful work that can be retired in that duration.

The actual number of retired pipeline slots containing useful work, though, rarely equals this maximum. This can be due to several factors: some pipeline slots cannot be filled with useful work, either because the front-end could not fetch or decode instructions in time (Front-end bound execution) or because the back-end was not prepared to accept more operations of a certain kind (Back-end bound execution). Moreover, even pipeline slots that do contain useful work may not retire due to bad speculation.

Front-end bound execution may be due to a large code working set, poor code layout, or microcode assists. Back-end bound execution may be due to long-latency operations or other contention for execution resources. Bad speculation is most frequently due to branch misprediction.

Each cycle, each core can fill up to four of its pipeline slots with useful operations. Therefore, for some time interval, it is possible to determine the maximum number of pipeline slots that could have been filled in and issued during that time interval. This analysis performs this estimate and breaks up all pipeline slots into four categories:

  • Pipeline slots containing useful work that issued and retired (Retired)

  • Pipeline slots containing useful work that issued and cancelled (Bad speculation)

  • Pipeline slots that could not be filled with useful work due to problems in the front-end (Front-end Bound)

  • Pipeline slots that could not be filled with useful work due to a backup in the back-end (Back-end Bound)
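The four-way breakdown can be illustrated with a small calculation. The following is a minimal sketch, not the formulas the VTune Amplifier actually uses: the SlotCounts fields, the function names, and the choice to derive Back-end Bound as the remainder are assumptions made for illustration. Only the four-slots-per-cycle width and the category names come from the text above.

```python
from dataclasses import dataclass

SLOTS_PER_CYCLE = 4  # each core can fill up to four pipeline slots per cycle


@dataclass
class SlotCounts:
    """Hypothetical per-hotspot totals; the field names are illustrative only."""
    cycles: int          # clock cycles in the measured interval
    issued: int          # slots filled with operations that issued
    retired: int         # slots whose operations issued and retired
    frontend_empty: int  # slots the front-end could not fill


def slot_breakdown(counts: SlotCounts) -> dict:
    """Return the fraction of all pipeline slots in each of the four categories."""
    total = SLOTS_PER_CYCLE * counts.cycles  # maximum slots for the interval
    if total <= 0:
        raise ValueError("cycles must be positive")
    retired = counts.retired / total
    # Slots that issued but never retired were cancelled.
    bad_speculation = (counts.issued - counts.retired) / total
    frontend_bound = counts.frontend_empty / total
    # Assumption of this sketch: whatever is left is attributed to the back-end.
    backend_bound = 1.0 - (retired + bad_speculation + frontend_bound)
    return {
        "Retired": retired,
        "Bad speculation": bad_speculation,
        "Front-end Bound": frontend_bound,
        "Back-end Bound": backend_bound,
    }


def dominant_bottleneck(breakdown: dict) -> str:
    """Name the largest non-retired category, the one to expand first."""
    stalls = {name: share for name, share in breakdown.items() if name != "Retired"}
    return max(stalls, key=stalls.get)


if __name__ == "__main__":
    example = SlotCounts(cycles=1000, issued=2000, retired=1600, frontend_empty=800)
    shares = slot_breakdown(example)
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
    print("Dominant category:", dominant_bottleneck(shares))
```

With these made-up counts, the back-end share is the largest of the three non-retired categories, so that is the column you would expand first.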

To use General Exploration analysis on Intel microarchitecture code name Sandy Bridge, first determine which top-level category dominates for hotspots of interest. You can then dive into the dominating category by expanding its column. There, you can find many issues that may contribute to that category.

Intel® Atom™ Processors

The General Exploration analysis strategy for Intel Atom processors is similar to the strategy applied to the analysis on Intel microarchitecture code name Sandy Bridge.

Viewing Analysis Data

You can choose to view General Exploration analysis results in any of the following viewpoints:

  • General Exploration: Helps identify where the application is not making the best use of available hardware resources. This viewpoint displays metrics derived from hardware events. The Summary window reports the overall metrics for the entire execution along with explanations of the metrics. From the Bottom-up and Top-down Tree windows you can locate the hardware issues in your application. Cells are highlighted when potential opportunities to improve performance are detected. Hover over the highlighted metrics in the grid to see explanations of the issues.

  • Hardware Event Counts: Displays the event count for all collected processor events. While the Hardware Event Sample Counts viewpoint provides the actual number of samples collected for an event, the Hardware Event Counts viewpoint estimates the number of times this event occurred during the collection.

  • Hardware Event Sample Counts: Displays the sample count for all collected processor events. While the Hardware Event Counts viewpoint estimates the number of times an event occurred during the collection, the Hardware Event Sample Counts viewpoint provides the actual number of samples collected for this event.

  • Hardware Issues: Helps identify where the application is not making the best use of available hardware resources. This viewpoint displays metrics derived from hardware performance counters. Hover over the highlighted metric values in the grid to read why the extreme value might represent a performance problem.

  • Hotspots: Helps identify hotspots, that is, code regions in the application that consume a lot of CPU time.

  • Bandwidth: Helps identify where the application is generating significant bandwidth to DRAM. Memory bandwidth, in GB/sec, is plotted in the timeline, while events often associated with DRAM requests are shown in the grid. In the timeline, select a region of high bandwidth, and filter that region in. Use the grid to discover where in the code DRAM accesses are being generated.

  • Task Time: Visualizes tasks, logical units of work on specific threads, based on ITT API annotations. Identify tasks with the highest execution time and analyze threads responsible for a particular task.
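The relationship between the Hardware Event Sample Counts and Hardware Event Counts viewpoints can be sketched as follows. This assumes the usual event-based sampling model, in which one sample is recorded after a fixed number of occurrences of an event. The function name and the sample_after_value parameter are hypothetical names chosen for this illustration, and the numbers in the example are made up.

```python
def estimate_event_count(sample_count: int, sample_after_value: int) -> int:
    """Estimate how often an event occurred from the number of samples taken.

    sample_count:       samples actually collected for the event
    sample_after_value: assumed number of event occurrences between two samples
    """
    if sample_count < 0:
        raise ValueError("sample_count must not be negative")
    if sample_after_value <= 0:
        raise ValueError("sample_after_value must be positive")
    # Each sample stands for one full sampling interval of the event.
    return sample_count * sample_after_value


if __name__ == "__main__":
    # Illustrative values only: 1500 samples, one sample every 2,000,003 events.
    print(estimate_event_count(1500, 2_000_003))  # prints 3000004500
```

The result is an estimate because occurrences that did not complete a full sampling interval are not represented by any sample.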

Note

Depending on your microarchitecture, the General Exploration configuration pane provides the Analyze memory bandwidth option, which lets you collect both General Exploration and Bandwidth data and view the statistics in the Bandwidth viewpoint.

These viewpoints may include the following windows:

  • Summary window displays statistics on the overall application execution.

  • Bottom-up window displays performance data per metric (event ratio/event count/sample count) for each hotspot function.

  • Top-down Tree window displays hotspot functions in the call tree, performance metrics for a function only (Self value) and for a function and its children together (Total value).

  • Caller/Callee window displays parent and child functions of the selected focus function. This window is available only if stack collection was enabled during analysis configuration.

  • PMU Events window displays a count of PMU events selected for the analysis.

  • Uncore Events window displays a count of uncore events selected for the analysis. If there are no uncore events, the upper pane of the window is empty.

  • Tasks, Tasks over Time, and Tasks by Threads windows provide details on tasks specified in your code with the Task API.
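The difference between the Self and Total values shown in the Top-down Tree window can be illustrated with a toy call tree. This is a minimal sketch under the definition given above (Total is the value for a function and its children together); the CallNode type, the function names, and the metric values are hypothetical and are not part of the VTune Amplifier.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CallNode:
    """One function in a call tree, carrying its own (Self) metric value."""
    name: str
    self_value: float
    children: List["CallNode"] = field(default_factory=list)


def total_value(node: CallNode) -> float:
    """Total value: the node's Self value plus the Total of every child."""
    return node.self_value + sum(total_value(child) for child in node.children)


if __name__ == "__main__":
    # Made-up metric values for a three-level call tree.
    tokenize = CallNode("tokenize", 3.0)
    parse = CallNode("parse", 2.0, [tokenize])
    render = CallNode("render", 4.0)
    main = CallNode("main", 1.0, [parse, render])

    for node in (main, parse, tokenize, render):
        print(f"{node.name}: self={node.self_value} total={total_value(node)}")
```

A function with a small Self value but a large Total value, such as main here, is not itself the hotspot; the cost lies in the functions it calls.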
