User Guide

Reported Metrics Reference

The Offloaded Regions, Non-Offloaded Regions, and Call Tree sections report performance metrics and important information about certain code regions. The metric tables are broken down into several column groups, which are collapsed by default and show only critical metrics for a section. Double-click a title to expand a group and see additional columns.
Column groups and their sub-columns vary depending on the configuration used and the report section you are in.
To customize a metric table or a column, use the following controls:
  • Use the tabs in the right sidebar of the metrics table:
    • Click the Column configurator tab and select columns to show or hide in the report. Some columns are hidden by default; you can enable them from this tab. Select a column group to show or hide all columns in it.
    • Click the Custom filter tab and expand a column title to filter the reported rows by custom value criteria.
  • Hover over a column header and click the menu button to open the filters pane with the following tabs:
    • Size tab - Autosize one or more columns or column groups.
    • Filter tab - Filter rows by column value criteria or by selecting/deselecting code regions from the Hierarchy column. This is the same as Custom filter, but for individual columns only.
    • Menu tab - Show or hide certain columns in the report. Some columns are hidden by default; you can enable them from this tab. This is the same as Column configurator. This tab is available only from the Hierarchy column.
  • Right-click a table cell to open the context menu, from which you can expand a loop nest, adjust a column view, locate a loop in the call tree, copy the value from the focused cell, or export the whole metrics table as a .csv, .xlsx, or .xml file.
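Once the metrics table is exported as a .csv file, it can be post-processed with a script. The following is a minimal sketch, not a documented workflow: it assumes the export has a header row whose names match the column titles shown in the report, and that speedup cells may carry a trailing "x". The sample text, the `Loop/Function` header name, and the function `rows_with_min_speedup` are made up for illustration, so check an actual export before relying on them.

```python
import csv
import io

def rows_with_min_speedup(csv_text, column="Estimated Speedup", threshold=1.5):
    """Return exported rows whose `column` value parses as a number >= threshold."""
    selected = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Tolerate empty cells and an optional trailing "x" (assumed formatting).
        raw = (row.get(column) or "").strip().rstrip("x")
        try:
            value = float(raw)
        except ValueError:
            continue  # empty or non-numeric cell, skip the row
        if value >= threshold:
            selected.append(row)
    return selected

# Made-up sample; a real export may name, order, or format columns differently.
sample = (
    "Loop/Function,Estimated Speedup\n"
    "loop_a,2.4x\n"
    "loop_b,0.9x\n"
    "loop_c,\n"
)
print([r["Loop/Function"] for r in rows_with_min_speedup(sample)])
```

To use it with a real file, read the file contents into a string first, for example with `open(path, newline="").read()`.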

Enabled Metrics

Metrics enabled by default give a general performance overview and are helpful when investigating why certain regions are profitable for offload and others are not offloaded. These metrics can be visible or collapsed. Double-click a column group title to expand it and see additional columns.
Loop/Function
The Loop/Function section reports basic information about loop and function execution and hierarchy.
Column Name | Description | Reported in
Elapsed Time (s) | Elapsed time, in seconds, for the offload head of a code region. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Total Time | Time in the loop nest hierarchy on the platform where the application binary is profiled. The time is reported in seconds and as a percentage of total time. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Dependency Type | Type of loop dependency: proved (such as Dependency: raw or Dependency: waw), assumed, or parallel. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Execution Target | Target execution platform. For offloaded regions, the execution target is ACC (a target device). | Offloaded Regions (hidden and collapsed), Non-Offloaded Regions (collapsed), Call Tree
Index | Collapsed. Unique ID assigned to loops and functions by Intel® Advisor Beta. | Offloaded Regions, Non-Offloaded Regions, Call Tree
CLI Loop ID | Collapsed. Unique loop ID used for filtering purposes in Intel® Advisor Beta. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Source Location | Collapsed. Place of a loop or function in a source file in the format <file name>:<line number>. You can use the source location in CLI commands. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Offload Index | Collapsed. Unique offload region ID assigned by Intel® Advisor Beta. | Offloaded Regions, Call Tree
Top Node in Offload | Collapsed. Location of the offload head. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Location | Collapsed. Location of a loop or function in the source file. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Node Position in Offload | Collapsed. Position of a node: offload head, child loop, or child function. | Offloaded Regions (hidden and collapsed), Non-Offloaded Regions (hidden and collapsed), Call Tree
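The Source Location format <file name>:<line number> is easy to split in a script, for example before passing the pieces to other tooling. A minimal sketch follows; the function name `parse_source_location` and the sample paths are illustrative assumptions. Splitting at the last colon keeps paths that themselves contain a colon (such as Windows drive letters) intact.

```python
def parse_source_location(location):
    """Split '<file name>:<line number>' into (file_name, line_number)."""
    # rpartition splits at the LAST colon, so "C:\\dir\\file.cpp:42" still works.
    file_name, sep, line = location.rpartition(":")
    if not sep or not file_name or not line.isdigit():
        raise ValueError(f"not in <file name>:<line number> form: {location!r}")
    return file_name, int(line)

print(parse_source_location("nbody.cpp:42"))  # ('nbody.cpp', 42)
print(parse_source_location(r"C:\src\nbody.cpp:42"))
```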
Offload Information
The Offload Information section reports performance details about code regions offloaded to a target device.
Column Name | Description | Reported in
Why Not Offloaded? | Reason why offloading a code region was unprofitable or impossible. For details and possible solutions, refer to Troubleshooting Why Not Offloaded. | Non-Offloaded Regions, Call Tree
Estimated Speedup | Estimated speedup after offloading to a target device in comparison to the original elapsed time. | Offloaded Regions, Call Tree
Estimated Time on Accelerator | Total estimated time spent on a target device. The time is reported in seconds and as a percentage of total offload plus non-offload time. | Offloaded Regions, Call Tree
Total Execution Time by Compute | Execution time, in seconds, assuming the workload is bound only by compute throughput. | Offloaded Regions, Non-Offloaded Regions (hidden and collapsed), Call Tree
Total Execution Time by Memory BW Time (s) | Execution time, in seconds, assuming the workload is bound only by memory bandwidth. | Offloaded Regions, Non-Offloaded Regions (hidden and collapsed), Call Tree
Total Execution Time by LLC BW (s) | Execution time, in seconds, assuming the workload is bound only by LLC bandwidth. | Offloaded Regions, Non-Offloaded Regions (hidden and collapsed), Call Tree
Total Execution Time by L3 BW (s) | Execution time, in seconds, assuming the workload is bound only by L3 cache bandwidth. | Offloaded Regions, Non-Offloaded Regions (hidden and collapsed), Call Tree
Parallel Threads | Number of parallel threads in an offloaded code region. | Offloaded Regions, Call Tree
Bounded by | Limitations that an offloaded region is bounded by. | Offloaded Regions, Non-Offloaded Regions (hidden), Call Tree
Max Speedup for This Node (without offload tax) | Collapsed. Maximum possible speedup of offloaded code regions without the cost of offloading them to a target device. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Estimated Execution Time on Accelerator (+ Host) (s) | Collapsed. Total estimated time, in seconds, spent on a target device and a base platform after offloading. | Non-Offloaded Regions, Call Tree
Fraction Offloaded (%) | Collapsed. Percentage of code regions offloaded to a target device. | Non-Offloaded Regions, Call Tree (hidden)
Is Offload Candidate? | Collapsed. Indicates if a code region was analyzed to check its profitability for offloading. | Non-Offloaded Regions, Call Tree
Estimated Non-Accelerable Time (s) | Collapsed. Total estimated time, in seconds, spent on serial execution on the host. | Offloaded Regions, Call Tree
Whole Loop or Function Fits on Accelerator? | Collapsed. Indicates if a whole loop/function fits on a target device or only some part(s) of it. | Call Tree
Total Time Spent in MPI Calls (s) | Collapsed. Total time, in seconds, spent in MPI calls. | Offloaded Regions, Non-Offloaded Regions (hidden and collapsed), Call Tree
Global Size | Collapsed. Total number of work items executed in a kernel. | Offloaded Regions, Call Tree
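The "Total Execution Time by ..." columns each assume a single limiting resource, so comparing them shows which resource dominates. The sketch below is one simple reading of these columns, not the tool's actual model: it assumes the largest single-resource estimate identifies the dominant bound, and that a speedup is the ratio of the original elapsed time to an estimated time after offloading. The function names and all numbers are hypothetical.

```python
def dominant_bound(times_by_resource):
    """Return (resource, seconds) for the largest single-resource time estimate."""
    resource = max(times_by_resource, key=times_by_resource.get)
    return resource, times_by_resource[resource]

def speedup(original_elapsed_s, estimated_accelerated_s):
    """Ratio of the original elapsed time to the estimated time after offloading."""
    if estimated_accelerated_s <= 0:
        raise ValueError("estimated time must be positive")
    return original_elapsed_s / estimated_accelerated_s

# Made-up per-resource estimates, in seconds, for one region.
estimates = {"Compute": 0.10, "Memory BW": 0.25, "LLC BW": 0.05, "L3 BW": 0.08}
bound, seconds = dominant_bound(estimates)
print(bound, seconds)         # Memory BW 0.25
print(speedup(1.0, seconds))  # 4.0
```

This ignores offload taxes and any overlap between resources, so treat the result as a rough cross-check against the Bounded by and Estimated Speedup columns rather than a replacement for them.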
Overhead
The Overhead section reports costs of offloading to a target device.
Column Name | Description | Reported in
Offload Taxes | Cost of offloading a code region from host to a target device. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Data Transfer Tax (s) | Cost of transferring data between host and target device, in seconds. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Invocation Tax (s) | Cost of offloading a region to a target device, in seconds, assuming the tax is paid each time a kernel is invoked. Does not include data transfer costs. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Configuration Tax (s) | Cost of offloading a region to a target device, in seconds, assuming the tax is paid only the first time a kernel is invoked. Does not include data transfer costs. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Kernel Code Transfer Tax (s) | Collapsed. Cost of transferring kernel code to a target device, in seconds. | Offloaded Regions, Non-Offloaded Regions, Call Tree
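The difference between a tax paid on each kernel invocation and a tax paid only on the first invocation grows with the call count. The sketch below illustrates only that distinction; the per-payment cost, the call count, and the function `accumulated_cost_us` are hypothetical, and the way the tool combines individual taxes into Offload Taxes is not modeled here. Costs are kept as integer microseconds to avoid floating-point rounding in the example.

```python
def accumulated_cost_us(cost_per_payment_us, call_count, paid_every_call):
    """Total cost, in microseconds, of a tax over `call_count` kernel invocations."""
    if call_count < 0:
        raise ValueError("call_count must be non-negative")
    # A one-time tax is paid once if the kernel runs at all, otherwise never.
    payments = call_count if paid_every_call else min(call_count, 1)
    return cost_per_payment_us * payments

calls = 1000  # made-up call count
print(accumulated_cost_us(20, calls, paid_every_call=True))   # 20000
print(accumulated_cost_us(20, calls, paid_every_call=False))  # 20
```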
Data Transfer
The Data Transfer section reports data transfers to and from a target device.
Column Name | Description | Reported in
Total Data Transferred (MB) | Sum of the total incoming traffic to a target device and the total outgoing traffic from it, for an offload region, in megabytes. It is calculated as (MappedTo + MappedFrom + 2*MappedToFrom). | Offloaded Regions, Non-Offloaded Regions, Call Tree
Total Data Transferred from CPU to GPU (MB) | Data transferred from a base platform to a target device, for an offload region, in megabytes. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Total Data Transferred from GPU to CPU (MB) | Data transferred from a target device to a base platform, for an offload region, in megabytes. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Memory Mapped To Device (MB) | Collapsed. Data transferred to shared memory, for a loop or function, in megabytes. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Memory Mapped From Device (MB) | Collapsed. Data transferred from shared memory, for a loop or function, in megabytes. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Memory Mapped ToFrom Device (MB) | Collapsed. Data transferred both to and from shared memory, for a loop or function, in megabytes. | Offloaded Regions, Non-Offloaded Regions, Call Tree
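The formula given for Total Data Transferred counts ToFrom memory twice because that data crosses the link in both directions. As a small worked example with made-up sizes, the function below applies the formula as stated; the per-direction split in `transfer_by_direction_mb` is an assumption inferred from the formula, not something the table states.

```python
def total_data_transferred_mb(mapped_to_mb, mapped_from_mb, mapped_tofrom_mb):
    """MappedTo + MappedFrom + 2*MappedToFrom, all values in megabytes."""
    return mapped_to_mb + mapped_from_mb + 2 * mapped_tofrom_mb

def transfer_by_direction_mb(mapped_to_mb, mapped_from_mb, mapped_tofrom_mb):
    """Assumed split: ToFrom data contributes once to each direction."""
    to_device = mapped_to_mb + mapped_tofrom_mb
    from_device = mapped_from_mb + mapped_tofrom_mb
    return to_device, from_device

print(total_data_transferred_mb(10.0, 4.0, 3.0))  # 20.0
print(transfer_by_direction_mb(10.0, 4.0, 3.0))   # (13.0, 7.0)
```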
Trip Counts
The Trip Counts section reports the number of loop iterations and calls.
Column Name | Description | Reported in
Average Trip Count | Number of times a loop iterates on average. | Offloaded Regions, Non-Offloaded Regions (hidden), Call Tree (hidden)
Call Count | Number of times a loop or function is called. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Total Trip Count | Collapsed. Total number of times a loop iterates. | Offloaded Regions, Non-Offloaded Regions, Call Tree
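The three trip count columns are usually related in a simple way. The sketch below assumes that the average trip count is the total trip count divided by the call count; the table does not state this relation explicitly, so use it only as a sanity check. The function name and numbers are illustrative.

```python
def average_trip_count(total_trip_count, call_count):
    """Average iterations per call, assuming average = total / calls."""
    if call_count <= 0:
        raise ValueError("call_count must be positive")
    return total_trip_count / call_count

# A loop entered 4 times that iterates 1200 times in total.
print(average_trip_count(1200, 4))  # 300.0
```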
L3 Cache
The L3 Cache section reports details on L3 cache utilization.
Column Name | Description | Reported in
Total L3 Traffic (GB) | Total data accessed in the L3 cache, in gigabytes. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Read L3 Traffic (Bytes) | Collapsed. Total number of bytes read from the L3 cache. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Write L3 Traffic (Bytes) | Collapsed. Total number of bytes written to the L3 cache. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Read L3 Bandwidth (Bytes/ck) | Collapsed. Number of bytes read from the L3 cache during one cycle of a target device. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Write L3 Bandwidth (Bytes/ck) | Collapsed. Number of bytes written to the L3 cache during one cycle of a target device. | Offloaded Regions, Non-Offloaded Regions, Call Tree
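Bandwidth columns reported in bytes per cycle can be converted to a per-second rate when the target device clock frequency is known. The sketch below uses the plain unit conversion (bytes/cycle × 10^9 cycles/s per GHz ÷ 10^9 bytes per GB); the frequency value and the function name are assumptions for illustration, since the report itself does not supply them in this table.

```python
def bytes_per_cycle_to_gb_per_s(bytes_per_cycle, frequency_ghz):
    """Convert bytes/cycle to GB/s (1 GB = 1e9 bytes) for a given clock in GHz."""
    if frequency_ghz <= 0:
        raise ValueError("frequency_ghz must be positive")
    # The 1e9 factors for GHz and GB cancel, leaving a simple product.
    return bytes_per_cycle * frequency_ghz

# 64 bytes per cycle on a hypothetical 1.5 GHz device.
print(bytes_per_cycle_to_gb_per_s(64, 1.5))  # 96.0
```

The same conversion applies to the LLC and memory bandwidth columns that use the same per-cycle units.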
LLC
The LLC section reports details on last level cache (LLC) utilization.
Column Name | Description | Reported in
Total LLC Access (GB) | Total data accessed in the last level cache (LLC), in gigabytes. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Read LLC Traffic (Bytes) | Collapsed. Total number of bytes read from the LLC. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Write LLC Traffic (Bytes) | Collapsed. Total number of bytes written to the LLC. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Read LLC Bandwidth (Bytes/ck) | Collapsed. Number of bytes read from the LLC during one cycle of a target device. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Write LLC Bandwidth (Bytes/ck) | Collapsed. Number of bytes written to the LLC during one cycle of a target device. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Memory
The Memory section reports details on memory utilization.
Column Name | Description | Reported in
Total Memory Traffic (GB) | Total data accessed from the memory, in gigabytes. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Write Memory Traffic per tile (Bytes) | Collapsed. Total number of bytes written to memory per tile. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Read Memory Traffic per tile (Bytes) | Collapsed. Total number of bytes read from memory per tile. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Write Memory Bandwidth per tile (Bytes/clk) | Collapsed. Total number of bytes written to memory per tile during one cycle of a target device. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Read Memory Bandwidth per tile (Bytes/clk) | Collapsed. Total number of bytes read from memory per tile during one cycle of a target device. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Instruction & Traffic Counts
The Instruction & Traffic Counts section reports details about floating-point traffic and execution of specific instructions.
This column group is hidden in the Non-Offloaded Regions tab. You need to manually enable it from the Column Configurator.
Column Name | Description | Reported in
FPU Util (GFLOP/s) | Number of billions of floating-point operations (GFLOP) executed per second. | Offloaded Regions, Non-Offloaded Regions (hidden), Call Tree
FLOP per Cycle | Number of floating-point operations (FLOP) executed per cycle of a target device. | Offloaded Regions, Non-Offloaded Regions (hidden), Call Tree
ABS / Iteration | Collapsed. Number of ABS instructions per iteration. | Offloaded Regions, Non-Offloaded Regions (hidden and collapsed), Call Tree
Add / Iteration | Collapsed. Number of Add instructions per iteration. | Offloaded Regions, Non-Offloaded Regions (hidden and collapsed), Call Tree
Div / Iteration | Collapsed. Number of Div instructions per iteration. | Offloaded Regions, Non-Offloaded Regions (hidden and collapsed), Call Tree
FMA / Iteration | Collapsed. Number of FMA instructions per iteration. | Offloaded Regions, Non-Offloaded Regions (hidden and collapsed), Call Tree
MAX / Iteration | Collapsed. Number of MAX instructions per iteration. | Offloaded Regions, Non-Offloaded Regions (hidden and collapsed), Call Tree
MIN / Iteration | Collapsed. Number of MIN instructions per iteration. | Offloaded Regions, Non-Offloaded Regions (hidden and collapsed), Call Tree
MUL / Iteration | Collapsed. Number of MUL instructions per iteration. | Offloaded Regions, Non-Offloaded Regions (hidden and collapsed), Call Tree
RECCP / Iteration | Collapsed. Number of RECCP instructions per iteration. | Offloaded Regions, Non-Offloaded Regions (hidden and collapsed), Call Tree
SAD / Iteration | Collapsed. Number of SAD instructions per iteration. | Offloaded Regions, Non-Offloaded Regions (hidden and collapsed), Call Tree
SCALE / Iteration | Collapsed. Number of SCALE instructions per iteration. | Offloaded Regions, Non-Offloaded Regions (hidden and collapsed), Call Tree
SIGN / Iteration | Collapsed. Number of SIGN instructions per iteration. | Offloaded Regions, Non-Offloaded Regions (hidden and collapsed), Call Tree
SQRT / Iteration | Collapsed. Number of SQRT instructions per iteration. | Offloaded Regions, Non-Offloaded Regions (hidden and collapsed), Call Tree
SUB / Iteration | Collapsed. Number of SUB instructions per iteration. | Offloaded Regions, Non-Offloaded Regions (hidden and collapsed), Call Tree
Diagnostics
The Diagnostics section reports detailed compiler diagnostics.
Column Name | Description | Reported in
Diagnostics | Compiler diagnostic messages about situations that can affect model accuracy. For details, refer to Troubleshooting Diagnostics. | Offloaded Regions, Non-Offloaded Regions, Call Tree
No Execution Count | Collapsed. Time spent in parts of an offload not modeled because there is no execution count. | Offloaded Regions, Non-Offloaded Regions, Call Tree
No Static Mixes | Collapsed. Time spent in parts of an offload not modeled because there are no static mixes. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Not in Bottom-Up Table | Collapsed. Time spent in parts of an offload not modeled because they are not in a bottom-up table. | Offloaded Regions, Non-Offloaded Regions, Call Tree
System Module | Collapsed. Time spent in parts of an offload not modeled because they are system modules. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Zero Execution Count | Collapsed. Time spent in parts of an offload not modeled because of zero execution count. | Offloaded Regions, Non-Offloaded Regions, Call Tree

Hidden Metrics

Hidden metrics can be useful for advanced performance analysis or for debug purposes. To see the hidden metrics:
  1. Open the Column Configurator pane on the right.
  2. Select required metrics from the list.
    Some of these metrics require that you expand the column group they are reported in first.
Loop/Function
The Loop/Function section reports basic information about loop and function execution and hierarchy.
Column Name | Description | Reported in
Parent Index | Hidden. Parent's unique ID assigned by Intel® Advisor Beta. | Offloaded Regions (hidden and collapsed), Non-Offloaded Regions (hidden and collapsed), Call Tree
Total Time (%) | Hidden. Percentage of total application time in the loop nest hierarchy. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Total Application Time (s) | Hidden. Total application time, in seconds. | Offloaded Regions, Non-Offloaded Regions, Call Tree
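Total Time (%) can be cross-checked from the time in seconds and Total Application Time (s). The sketch below assumes the percentage is simply the region's total time divided by the total application time; the function name and values are made up for illustration.

```python
def total_time_percent(total_time_s, total_application_time_s):
    """Share of total application time spent in a region, in percent."""
    if total_application_time_s <= 0:
        raise ValueError("total_application_time_s must be positive")
    return 100.0 * total_time_s / total_application_time_s

# A region taking 2.5 s of a 10 s run.
print(total_time_percent(2.5, 10.0))  # 25.0
```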
Offload Information
The Offload Information section reports performance details about code regions offloaded to a target device.
Column Name | Description | Reported in
Whole Loop or Function Offloaded? | Indicates if a whole loop/function is offloaded to a target device or only some part(s) of it. | Offloaded Regions, Call Tree
Estimated Time on Accelerator (%) | Hidden. Estimated time spent on a target device as a percentage of total offload plus non-offload time. | Offloaded Regions, Call Tree
Topmost Node of Offload | Hidden. Location of the offload head. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Head of Regions Offloaded Together | Hidden and collapsed. Head of several child loops/functions if they are offloaded to a target device together. | Call Tree
Node Position in Offload | Hidden and collapsed. Position of a node: child or head node. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Global Size Model Error (%) | Hidden and collapsed. Percentage of work items executed in a kernel with a possible error of loop compute time caused by different approaches of converting single-dimension loops to multi-dimension. | Offloaded Regions, Call Tree
Potential Offload
The Potential Offload section reports details about code regions that are not offloaded. This section is hidden by default.
This column group is available only in the Non-Offloaded Regions and Call Tree tabs.
Column Name | Description | Reported in
Non-Offloaded Weight | Hidden and collapsed. Weight of a non-offloaded code region in the code tree. | Non-Offloaded Regions
Data Transfer
The Data Transfer section reports data transfers to and from a target device.
Column Name | Description | Reported in
Data Transferred (Shared) (MB) | Hidden and collapsed. Sum of data transferred to a target device, from the target device, and in both directions (to and from the target device). | Offloaded Regions, Non-Offloaded Regions, Call Tree
Memory Objects | Hidden and collapsed. Details about memory objects used in a region in the following format: <object size in bytes>/<memory allocation type>/<allocation address>/<source location file and line number>/<transfer direction>. | Offloaded Regions, Non-Offloaded Regions, Call Tree
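A Memory Objects entry can be split into its five fields in a script. The sketch below follows the format given above, with one precaution: the source location may itself contain "/" characters in a path, so it takes three fields from the left and one from the right. The sample entry, including the values `heap` and `to`, is made up; the actual strings used for allocation type and transfer direction are not specified here. `MemoryObject` and `parse_memory_object` are illustrative names.

```python
from typing import NamedTuple

class MemoryObject(NamedTuple):
    size_bytes: int
    allocation_type: str
    address: str
    source_location: str
    direction: str

def parse_memory_object(entry):
    """Parse '<size>/<alloc type>/<address>/<source location>/<direction>'."""
    # Peel three fields off the left; the remainder holds location and direction.
    head = entry.split("/", 3)
    if len(head) != 4:
        raise ValueError(f"too few fields: {entry!r}")
    size, allocation_type, address, rest = head
    # The direction is the last field; everything before it is the location,
    # which may contain '/' when it includes a directory path.
    source_location, sep, direction = rest.rpartition("/")
    if not sep:
        raise ValueError(f"missing transfer direction: {entry!r}")
    return MemoryObject(int(size), allocation_type, address, source_location, direction)

# Hypothetical entry for illustration only.
obj = parse_memory_object("4096/heap/0x7f00/src/main.cpp:12/to")
print(obj.size_bytes, obj.source_location, obj.direction)  # 4096 src/main.cpp:12 to
```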
Instruction & Traffic Counts
The Instruction & Traffic Counts section reports details about floating-point traffic and execution of specific instructions.
This column group is hidden in the Non-Offloaded Regions tab. You need to manually enable it from the Column Configurator.
Column Name | Description | Reported in
Bytes / Iteration | Hidden and collapsed. Average number of bytes transferred per iteration. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Bytes Read / Iteration | Hidden and collapsed. Average number of bytes read per iteration. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Bytes Written / Iteration | Hidden and collapsed. Average number of bytes written per iteration. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Floating Operations / Iteration | Hidden and collapsed. Average number of floating-point operations per iteration. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Compute | Hidden and collapsed. Number of compute operations. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Compute with Memory | Hidden and collapsed. Number of compute with memory instructions. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Vector Compute | Hidden and collapsed. Number of vector compute instructions. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Vector Compute with Memory | Hidden and collapsed. Number of vector compute with memory instructions. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Compute Other / Iteration | Hidden and collapsed. Average number of compute other instructions per iteration. | Offloaded Regions, Non-Offloaded Regions, Call Tree
Non-Compute Other / Iteration | Hidden and collapsed. Average number of non-compute other instructions per iteration. | Offloaded Regions, Non-Offloaded Regions, Call Tree
