Use the Microarchitecture Exploration analysis (formerly known as General Exploration) to triage hardware usage issues in your application.
Once you have used Hotspots analysis to determine hotspots in your code, you can perform Microarchitecture Exploration analysis to understand how efficiently your code passes through the core pipeline. During Microarchitecture Exploration analysis, the VTune Amplifier collects a complete list of events for analyzing a typical client application. It then calculates a set of predefined ratios used for the metrics, which helps identify hardware-level performance problems.
How It Works
The Microarchitecture Exploration analysis strategy varies by microarchitecture. For modern microarchitectures starting with Intel microarchitecture code name Ivy Bridge, the Microarchitecture Exploration analysis is based on the Top-Down Microarchitecture Analysis Method using the Top-Down Characterization methodology, which is a hierarchical organization of event-based metrics that identifies the dominant performance bottlenecks in an application.
Superscalar processors can be conceptually divided into the front-end, where instructions are fetched and decoded into the operations that constitute them, and the back-end, where the required computation is performed. Each cycle, the front-end generates up to four of these operations. It places them into pipeline slots that then move through the back-end. Thus, for a given execution duration in clock cycles, it is easy to determine the maximum number of pipeline slots containing useful work that can be retired in that duration. The actual number of retired pipeline slots containing useful work, though, rarely equals this maximum. This can be due to several factors: some pipeline slots cannot be filled with useful work, either because the front-end could not fetch or decode instructions in time (Front-end bound execution) or because the back-end was not prepared to accept more operations of a certain kind (Back-end bound execution). Moreover, even pipeline slots that do contain useful work may not retire due to bad speculation. Front-end bound execution may be due to a large code working set, poor code layout, or microcode assists. Back-end bound execution may be due to long-latency operations or other contention for execution resources. Bad speculation is most frequently due to branch misprediction.
Each cycle, each core can fill up to four of its pipeline slots with useful operations. Therefore, for some time interval, it is possible to determine the maximum number of pipeline slots that could have been filled in and issued during that time interval. This analysis performs this estimate and breaks up all pipeline slots into four categories:
Pipeline slots containing useful work that issued and retired (Retired)
Pipeline slots containing useful work that issued and cancelled (Bad speculation)
Pipeline slots that could not be filled with useful work due to problems in the front-end (Front-end Bound)
Pipeline slots that could not be filled with useful work due to a backup in the back-end (Back-end Bound)
To use Microarchitecture Exploration analysis, first determine which top-level category dominates for hotspots of interest. You can then dive into the dominating category by expanding its column. There, you can find many issues that may contribute to that category.
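The four-way breakdown and the triage step described above can be sketched in a few lines of Python. This is a minimal illustration, not the VTune Amplifier implementation: the `SlotCounts` fields, the function names, and the sample numbers are hypothetical, and the sketch assumes that any slot not counted as Retired, Bad speculation, or Front-end Bound is attributed to the back-end.

```python
from dataclasses import dataclass

# Per the text: each core can fill up to four pipeline slots per cycle.
SLOTS_PER_CYCLE = 4


@dataclass
class SlotCounts:
    """Hypothetical per-hotspot counts; these are not VTune event names."""
    cycles: int
    retired: int          # slots with useful work that issued and retired
    bad_speculation: int  # slots with useful work that issued and were cancelled
    front_end_bound: int  # slots left unfilled because of front-end problems


def top_down_breakdown(counts: SlotCounts) -> dict:
    """Return each top-level category as a fraction of all available slots."""
    total_slots = SLOTS_PER_CYCLE * counts.cycles
    if total_slots <= 0:
        raise ValueError("cycles must be positive")
    # Assumption of this sketch: the remainder is charged to the back-end.
    back_end_bound = (
        total_slots
        - counts.retired
        - counts.bad_speculation
        - counts.front_end_bound
    )
    if back_end_bound < 0:
        raise ValueError("category counts exceed the total number of slots")
    return {
        "Retired": counts.retired / total_slots,
        "Bad speculation": counts.bad_speculation / total_slots,
        "Front-end Bound": counts.front_end_bound / total_slots,
        "Back-end Bound": back_end_bound / total_slots,
    }


def dominant_category(breakdown: dict) -> str:
    """Pick the category with the largest share of pipeline slots."""
    return max(breakdown, key=breakdown.get)


if __name__ == "__main__":
    # Made-up numbers for a single hotspot.
    sample = SlotCounts(cycles=1000, retired=1000,
                        bad_speculation=400, front_end_bound=600)
    shares = top_down_breakdown(sample)
    for name, share in shares.items():
        print(f"{name}: {share:.0%}")
    print("Dominant category:", dominant_category(shares))
```

With these sample numbers, half of the 4000 available slots fall into Back-end Bound, so that is the category to expand first.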
You can also run the Microarchitecture Exploration analysis on other microarchitectures that are not covered by the Top-Down Method in the VTune Amplifier:
Intel Microarchitecture Code Name Sandy Bridge: This microarchitecture is already partially based on the top-down method and the VTune Amplifier provides a hierarchical analysis of the hardware metrics based on the following categories: Filled Pipeline Slots and Unfilled Pipeline Slots (Stalls).
Intel Microarchitectures Code Name Nehalem and Westmere: During Microarchitecture Exploration analysis on these microarchitectures, the VTune Amplifier collects metrics that help identify such hardware-level performance problems as:
Front-end stalls and their causes
Stalls at execution and retirement, particularly those caused by high-latency loads, wasted work due to branch misprediction, or long-latency instructions.
For a detailed tuning methodology behind the Microarchitecture Exploration analysis and some of the complexities associated with this analysis, see Understanding How General Exploration Works in Intel® VTune™ Amplifier.
For architecture-specific Tuning Guides, visit https://software.intel.com/en-us/articles/processor-specific-performance-analysis-papers.
Configure and Run Analysis
To configure options for the Microarchitecture Exploration analysis:
Prerequisites: Create a project and specify an analysis target.
Click the Configure Analysis button on the Intel® VTune™ Amplifier toolbar (in the standalone GUI or the Visual Studio IDE).
The Configure Analysis window opens.
From the HOW pane, click the Browse button and select Microarchitecture Exploration.
Configure the following options:
CPU sampling interval, ms spin box
Specify an interval (in milliseconds) between CPU samples.
Possible values: 1-1000 ms.
The default value is 1 ms.
Extend granularity for the top-level metrics selection area
You may limit the data collection by selecting particular top-level metrics. In this case, the VTune Amplifier extends the level of granularity and collects additional sub-metrics only for the selected top-level metrics. For example, if you select the Memory Bound top-level metric, the VTune Amplifier collects additional data and provides Memory Bound sub-metrics (such as DRAM Bound, Store Bound, and so on), which helps narrow down the analysis to particular microarchitecture levels.
Limiting the amount of data collected simultaneously may also improve profiling accuracy because it reduces event multiplexing. This may be particularly helpful for short-running applications or applications with short phases.
Analyze memory bandwidth check box
Collect the data required to compute memory bandwidth.
The option is disabled by default.
Evaluate max DRAM bandwidth check box
Evaluate maximum achievable local DRAM bandwidth before the collection starts. This data is used to scale bandwidth metrics on the timeline and calculate thresholds.
The option is enabled by default.
Collection mode drop-down menu
Choose the Detailed sampling-based collection mode (default) to view a data breakdown per function and other hotspots. Use the Summary counting-based mode for an overview of the whole profiling run. Summary mode has lower collection overhead and faster post-processing time.
Expand/collapse a section listing the default non-editable settings used for this analysis type. If you want to modify or enable additional settings for the analysis, you need to create a custom configuration by copying an existing predefined configuration. VTune Amplifier creates an editable copy of this analysis type configuration.
Click the Start button to run the analysis.
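The remark about multiplexing under the Extend granularity option can be illustrated with a small simulation. When more events are requested than the hardware can count at once, each event is counted only part of the time and its total is extrapolated. The sketch below assumes simple round-robin scheduling and linear scaling of the observed count; this is a common approach in counter-based profilers and is not confirmed here as the exact method used by the VTune Amplifier. The function name and the numbers are hypothetical.

```python
def estimate_total(per_interval_counts, n_groups, group_index=0):
    """Estimate an event total when the event is counted only part of the time.

    per_interval_counts: true event counts in each time interval.
    n_groups: number of event groups sharing the counters round-robin.
    group_index: which turn in the rotation this event's group gets.
    """
    observed = [
        count
        for i, count in enumerate(per_interval_counts)
        if i % n_groups == group_index
    ]
    if not observed:
        raise ValueError("the event was never scheduled")
    # Linear extrapolation: scale by total intervals / observed intervals.
    return sum(observed) * len(per_interval_counts) / len(observed)


if __name__ == "__main__":
    # A short phase: all 100 events occur in the third of four intervals.
    true_counts = [0, 0, 100, 0]
    print("No multiplexing:", estimate_total(true_counts, n_groups=1))
    print("Two groups, first turn:", estimate_total(true_counts, n_groups=2, group_index=0))
    print("Two groups, second turn:", estimate_total(true_counts, n_groups=2, group_index=1))
```

In this toy case the unmultiplexed estimate equals the true total of 100, while the two multiplexed estimates are 200 and 0, depending on whether the event happened to be scheduled during the short phase. Selecting fewer top-level metrics reduces the number of groups and therefore this kind of error.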
To analyze the collected data, use the default Microarchitecture Exploration viewpoint, which provides a high-level performance overview based on the Top-Down Microarchitecture Analysis Method. To see more easily where to focus your optimization efforts and which part of the microarchitecture pipeline introduces inefficiencies, start with the Microarchitecture Pipe.