User Guide


Accelerator Metrics

This reference section describes the contents of data columns in reports of the Offload Modeling and GPU Roofline Insights perspectives.

2 FPUs Active

Description: Average percentage of time when both FPUs are used.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the EU Instructions column group in the GPU pane of the GPU Roofline Regions tab.

Active

Description: Percentage of cycles actively executing instructions on all execution units (EUs).
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the EU Array column group in the GPU pane of the GPU Roofline Regions tab.

Atomic Throughput

Description: Total execution time by atomic throughput, in milliseconds.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane.
Prerequisite for display: Expand the Estimated Bounded By column group.

Average Time

Description: Average amount of time spent executing one task instance.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the Compute Task Details column group in the GPU pane of the GPU Roofline Regions tab.


CARM Traffic, GB

Description: Total data transferred to and from execution units, in gigabytes.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.

Compute

Description: Estimated execution time assuming an offloaded loop is bound only by compute throughput.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Estimated Bounded By column group.

Compute Task

Description: Name of a compute task.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.

Compute Task Purpose

Description: Action that a compute task performs.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.

Computing Threads Started

Description: Total number of threads started across all execution units for a computing task.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.


Data Transfer Tax

Description: Estimated time cost, in milliseconds, for transferring loop data between the host and a target platform. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on the target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
  • GUI: From the Analysis Workflow pane, set the Data Transfer Simulation to Light, Medium, or Full.
  • CLI: Run the --collect=tripcounts action with the --data-transfer=[full | medium | light] action option.
Prerequisite for display: Expand the Estimated Bounded By column group.
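The collection prerequisite above can be sketched as a command sequence. This is a minimal sketch of a typical Offload Modeling run; the project directory ./advi_results and the application ./my_app are placeholders, not values from this document:

```shell
# Sketch only: Survey first, then Trip Counts with data transfer simulation,
# then Performance Modeling. Paths and app name are placeholders.
advisor --collect=survey --project-dir=./advi_results -- ./my_app
advisor --collect=tripcounts --data-transfer=medium \
        --project-dir=./advi_results -- ./my_app
advisor --collect=projection --project-dir=./advi_results
```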

Data Transfer Tax without Reuse

Description: Estimated time cost, in milliseconds, for transferring loop data between the host and a target platform assuming no data is reused. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected during the Trip Counts analysis (Characterization) and the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
  • GUI: From the Analysis Workflow pane, set the Data Transfer Simulation under Characterization to Full and enable the Data Reuse Analysis checkbox under Performance Modeling.
  • CLI: Use the --data-transfer=full action option with the --collect=tripcounts action and the --data-reuse-analysis option with the --collect=tripcounts and --collect=projection actions.
Prerequisite for display: Expand the Estimated Bounded By column group.
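For the reuse-aware metrics, the two options above combine into one pipeline. A sketch; the project directory and application name are placeholders:

```shell
# Sketch only: full data transfer simulation plus data reuse analysis.
advisor --collect=survey --project-dir=./advi_results -- ./my_app
advisor --collect=tripcounts --data-transfer=full --data-reuse-analysis \
        --project-dir=./advi_results -- ./my_app
advisor --collect=projection --data-reuse-analysis --project-dir=./advi_results
```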

Data Reuse Gain

Description: Difference, in milliseconds, between the data transfer time estimated with and without data reuse. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected during the Trip Counts analysis (Characterization) and the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
  • GUI: From the Analysis Workflow pane, set the Data Transfer Simulation under Characterization to Full and enable the Data Reuse Analysis checkbox under Performance Modeling.
  • CLI: Use the --data-transfer=full action option with the --collect=tripcounts action and the --data-reuse-analysis option with the --collect=tripcounts and --collect=projection actions.
Prerequisite for display: Expand the Estimated Bounded By column group.

Dependency Type

Description: Dependency absence or presence in a loop across iterations.
Collected during the Survey and Dependencies analyses in the Offload Modeling perspective and found in the Measured column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Measured column group.
Possible values:
  • Parallel: Explicit - The loop does not have dependencies because it is explicitly vectorized or threaded on the CPU.
  • Parallel: Proven - A compiler did not detect dependencies in the loop at compile time but did not vectorize the loop automatically for some reason.
  • Parallel: Programming Model - The loop does not have dependencies because it is parallelized for execution on a target platform using a programming model (for example, OpenMP*, oneAPI Threading Building Blocks, Intel® oneAPI Data Analytics Library, Data Parallel C++). This value is only available for an Offload Modeling HTML report.
  • Parallel: Workload - Intel Advisor did not find dependencies in the loop based on the workload analyzed during the Dependencies analysis.
  • Parallel: User - The loop is marked as not having dependencies with the --set-parallel=<string> option.
  • Parallel: Assumed - Intel Advisor does not have information about loop dependencies, but it assumes all such loops are parallel (that is, they have no dependencies).
  • Dependency: <dependency-type> - Intel Advisor found dependencies of specific types in the loop during the Dependencies analysis. Possible dependency types are RAW (read after write), WAR (write after read), WAW (write after write), and Reduction.
  • Dependency: User - The loop is marked as having dependencies with the --set-dependency=<string> option.
  • Dependency: Assumed - Intel Advisor does not have information about dependencies for this loop, but it assumes all such loops have dependencies.
Prerequisites for collection/display:
Some values in this column can appear only if you select specific options when collecting data or run the Dependencies analysis:
For Parallel: Workload and Dependency: <dependency-type>:
  • Run the Dependencies analysis.
For Parallel: User:
  • GUI: Go to Project Properties > Performance Modeling. In the Other parameters field, enter the --set-parallel=<string> option with a comma-separated list of loop IDs and/or source locations to mark them as parallel.
  • CLI: Specify a comma-separated list of loop IDs and/or source locations with the --set-parallel=<string> option when modeling performance with advisor --collect=projection.
For Dependency: User:
  • GUI: Go to Project Properties > Performance Modeling. In the Other parameters field, enter the --set-dependency=<string> option with a comma-separated list of loop IDs and/or source locations to mark them as having dependencies.
  • CLI: Specify a comma-separated list of loop IDs and/or source locations with the --set-dependency=<string> option when modeling performance with advisor --collect=projection.
For Parallel: Assumed:
  • GUI: Disable Assume Dependencies under the Performance Modeling analysis in the Analysis Workflow pane.
  • CLI: Use the --no-assume-dependencies option when modeling performance with advisor --collect=projection.
For Dependency: Assumed:
  • GUI: Enable Assume Dependencies under the Performance Modeling analysis in the Analysis Workflow pane.
  • CLI: Use the --assume-dependencies option when modeling performance with advisor --collect=projection.
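The --set-parallel and --set-dependency options above take a comma-separated list of loop IDs and/or source locations. A sketch with made-up source locations and a placeholder project directory:

```shell
# Sketch only: main.cpp:34 etc. are made-up source locations.
# Mark two loops as parallel when modeling performance:
advisor --collect=projection --set-parallel=main.cpp:34,main.cpp:78 \
        --project-dir=./advi_results
# Or mark a loop as having dependencies:
advisor --collect=projection --set-dependency=main.cpp:102 \
        --project-dir=./advi_results
```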
Interpretation:
  • Loops with no real dependencies (Parallel: Explicit, Parallel: Proven, Parallel: Programming Model, and Parallel: User if you know that the marked loops are parallel) can be safely offloaded to a target platform.
  • If many loops have the Parallel: Assumed or Dependency: Assumed value, you are recommended to run the Dependencies analysis. See Check How Assumed Dependencies Affect Modeling for details.

Device

Description: Host platform that the application is executed on.
Collected during the Survey analysis in the Offload Modeling perspective and found in the Measured column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Measured column group.

DRAM

Description: Summary of DRAM memory usage, including DRAM bandwidth (in gigabytes per second) and total DRAM traffic, which is a sum of read and write traffic.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:

DRAM BW (Estimated Bounded By)

Description: DRAM bandwidth. Estimated execution time, in seconds, assuming an offloaded loop is bound only by DRAM memory throughput.
Collected during the Trip Counts analysis (Characterization) and the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Estimated Bounded By column group.

DRAM BW (Memory Estimates)

Description: DRAM bandwidth. Rate at which data is transferred to and from the DRAM, in gigabytes per second.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.

DRAM BW Utilization

Description: DRAM bandwidth utilization, in percent.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.

DRAM Read Traffic

Description: Total data read from the DRAM memory.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.

DRAM Traffic

Description: Sum of data read from and written to the DRAM memory.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.

DRAM Write Traffic

Description: Total data written to the DRAM memory.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.

Elapsed Time

Description: Wall-clock time from the beginning to the end of computing task execution.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.

EU Threading Occupancy

Description: Percentage of cycles on all execution units (EUs) and thread slots when a slot has a thread scheduled.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.

Estimated Data Transfers with Reuse

Description: Summary of data read from and written to a target platform. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on the target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
  • GUI: From the Analysis Workflow pane, set the Data Transfer Simulation to Light, Medium, or Full.
  • CLI: Run the --collect=tripcounts action with the --data-transfer=[full | medium | light] action option.

FP AI

Description: Ratio of FLOP to the number of transferred bytes.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Compute Performance column group in the GPU pane of the GPU Roofline Regions tab.

From Target

Description: Estimated data transferred from a target platform to shared memory by a loop, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on the target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Data Transfers with Reuse column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
  • GUI: From the Analysis Workflow pane, set the Data Transfer Simulation to Light, Medium, or Full.
  • CLI: Run the --collect=tripcounts action with the --data-transfer=[full | medium | light] action option.
Prerequisite for display: Expand the Estimated Data Transfers with Reuse column group.

GFLOP

Description: Number of giga floating-point operations.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Compute Performance column group in the GPU pane of the GPU Roofline Regions tab.

GFLOPS

Description: Number of giga floating-point operations per second.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Compute Performance column group in the GPU pane of the GPU Roofline Regions tab.

GINTOP

Description: Number of giga integer operations.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Compute Performance column group in the GPU pane of the GPU Roofline Regions tab.

GINTOPS

Description: Number of giga integer operations per second.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Compute Performance column group in the GPU pane of the GPU Roofline Regions tab.

Global

Description: Total number of work items in all work groups.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the Work Size column group in the GPU pane of the GPU Roofline Regions tab.

Global Size

Description: Total estimated number of work items in a loop after it is offloaded to a target platform.
Collected during the Trip Counts analysis (Characterization) and the Performance Modeling analysis in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Compute Estimates column group.

GPU Shader Atomics

Description: Total number of shader atomic memory accesses.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.

GPU Shader Barriers

Description: Total number of shader barrier messages.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.


Idle

Description: Percentage of cycles on all execution units (EUs) during which no threads are scheduled on an EU.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the EU Array column group in the GPU pane of the GPU Roofline Regions tab.

Instances

Description: Total estimated number of times a loop executes on a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Compute Estimates column group.

Instance Count

Description: Total number of times a task is executed.
Collected during the Trip Counts analysis (Characterization) in the GPU Roofline Insights perspective and found in the Compute Task Details column group in the GPU pane of the GPU Roofline Regions tab.

INT AI

Description: Ratio of INTOP to the number of transferred bytes.
Collected during the FLOP analysis (Characterization) in the GPU Roofline Insights perspective and found in the GPU Compute Performance column group in the GPU pane of the GPU Roofline Regions tab.

IPC Rate

Description: Average rate of instructions per cycle (IPC) calculated for two FPU pipelines.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the EU Instructions column group in the GPU pane of the GPU Roofline Regions tab.


Kernel Launch Tax

Description: Total estimated time cost for invoking a kernel when offloading a loop to a target platform. Does not include data transfer costs.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Estimated Bounded By column group.

Latencies

Description: Top uncovered latency in a loop/function, in milliseconds.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.

L3 BW

Description: L3 bandwidth. Estimated execution time, in seconds, assuming an offloaded loop is bound only by L3 cache throughput.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Estimated Bounded By column group.

L3 Cache

Description: Summary of L3 cache usage, including L3 cache bandwidth (in gigabytes per second) and L3 cache traffic, which is a sum of read and write traffic.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for collection:

L3 Cache BW

Description: Average rate at which data is transferred to and from the L3 cache, in gigabytes per second.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.

L3 Cache BW Utilization

Description: L3 cache bandwidth utilization, in percent.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.

L3 Cache Read Traffic

Description: Total data read from the L3 cache.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.

L3 Cache Traffic

Description: Sum of data read from and written to the L3 cache.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.

L3 Cache Write Traffic

Description: Total data written to the L3 cache.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.

L3 Shader Bandwidth, GB/sec

Description: Rate at which data is transferred between execution units and L3 caches, in gigabytes per second.
Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the GPU pane of the GPU Roofline Regions tab.

LLC BW

Description: Last-level cache (LLC) bandwidth. Estimated execution time, in seconds, assuming an offloaded loop is bound only by LLC throughput.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Estimated Bounded By column group.

LLC Cache

Description: Last-level cache (LLC) usage, including LLC cache bandwidth (in gigabytes per second) and total LLC cache traffic, which is a sum of read and write traffic.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:

LLC Cache BW

Description: Rate at which data is transferred to and from the LLC cache, in gigabytes per second.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.

LLC Cache BW Utilization

Description: LLC cache bandwidth utilization, in percent.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.

LLC Cache Read Traffic

Description: Total data read from the LLC cache.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.

LLC Cache Traffic

Description: Sum of data read from and written to the LLC cache.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.

LLC Cache Write Traffic

Description: Total data written to the LLC cache.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Memory Estimations column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display: Expand the Memory Estimations column group.

Load Latency

Description: Uncovered cache or memory load latencies in a code region, in milliseconds.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane.
Prerequisite for display: Expand the Estimated Bounded By column group.

Local

Description: Number of work items in one work group.
Collected during the Survey analysis in the GPU Roofline Insights perspective and found in the Work Size column group in the GPU pane of the GPU Roofline Regions tab.

Local Size

Description: Total estimated number of work items in one work group of a loop after it is offloaded to a target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Compute Estimates column group.

Loop/Function

Description: Name and source location of a loop/function in a region, where a region is a sub-tree of loops/functions in a call tree.
Collected during the Survey analysis in the Offload Modeling perspective.


Offload Tax

Description: Total time spent transferring data and launching the kernel, in milliseconds.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Estimated Bounded By column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for display: Expand the Estimated Bounded By column group.

Offload Summary

Description: Recommendation that indicates whether a loop is profitable for offloading to a target platform.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Basic Estimated Metrics column group in the CPU+GPU pane of the Accelerated Regions tab.

Parallel Factor

Description: Number of loop iterations or kernel work items executed in parallel on a target device for a loop/function.
Collected during the Performance Modeling analysis in the Offload Modeling perspective and found in the Compute Estimates column group in the CPU+GPU pane of the Accelerated Regions tab.

Private

Description: Total estimated data transferred to a private memory from a target platform by a loop. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on the target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Data Transfers with Reuse column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
  • GUI: From the Analysis Workflow pane, set the Data Transfer Simulation to Light, Medium, or Full.
  • CLI: Run the --collect=tripcounts action with the --data-transfer=[full | medium | light] action option.
Prerequisite for display: Expand the Estimated Data Transfers with Reuse column group.

Programming Model

Description: Programming model used in a loop/function, if any.
Collected during the Survey analysis in the Offload Modeling perspective and found in the Measured column group in the CPU+GPU pane.
Prerequisite for display: Expand the Measured column group.


Read

Description: Estimated data read from a target platform by an offload region, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on the target platform.
Collected during the Trip Counts analysis (Characterization) in the Offload Modeling perspective and found in the Estimated Data Transfers with Reuse column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
  • GUI: From the Analysis Workflow pane, set the Data Transfer Simulation to Light, Medium, or Full.
  • CLI: Run the --collect=tripcounts action with the --data-transfer=[full | medium | light] action option.
Prerequisite for display: Expand the Estimated Data Transfers with Reuse column group.

Read (GPU Memory Bandwidth)

Description: Rate at which data is read from GPU, chip uncore (LLC), and main memory, in gigabytes per second.
Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the GPU Memory Bandwidth, GB/sec column group in the GPU pane of the GPU Roofline Regions tab.

Read (Shared Local Memory Bandwidth)

Description: Rate at which data is read from shared local memory, in gigabytes per second.
Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the Shared Local Memory Bandwidth, GB/sec column group in the GPU pane of the GPU Roofline Regions tab.

Read (Typed Memory Bandwidth)

Description: Rate at which data is read from typed buffers, in gigabytes per second.
Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the Typed Local Memory Bandwidth, GB/sec column group in the GPU pane of the GPU Roofline Regions tab.

Read (Untyped Memory Bandwidth)

Description: Rate at which data is read from untyped buffers, in gigabytes per second.
Collected during the Characterization analysis in the GPU Roofline Insights perspective and found in the Untyped Local Memory Bandwidth, GB/sec column group in the GPU pane of the GPU Roofline Regions tab.

Read without Reuse

Description:
Estimated data read from a target platform by a code region considering no data is reused between kernels, in megabytes. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected
during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Estimated Data Transfers with Reuse
column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for collection
:
  • GUI: From the
    Analysis Workflow
    pane, set the
    Data Transfer Simulation
    under
    Characterization
    to
    Full
    and enable the
    Data Reuse Analysis
    checkbox under
    Performance Modeling
    .
  • CLI: Use the action option with the
    --collect=tripcounts
    action and the
    --data-reuse-analysis
    option with the
    --collect=tripcounts
    and
    --collect=projection
    actions.
Prerequisite for display
: Expand the
Estimated Data Transfer with Reuse
column group.

Send Active

Description:
Percentage of cycles on all execution units when the EU Send pipeline is actively processed.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
EU Instructions
column group in the GPU pane of the GPU Roofline Regions report.

SIMD Width

Description:
The number of work items processed by a single GPU thread.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
Compute Task Details
column group in the GPU pane of the GPU Roofline Regions report.

Stalled

Description:
Percentage of cycles on all execution units (EUs) when at least one thread is scheduled, but the EU is stalled.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
EU Array
column group in the GPU pane of the GPU Roofline Regions report.

SVM Usage Type

Collected
during the Survey analysis in the
GPU Roofline Insights
perspective and
found
in the
Compute Task Details
column group in the GPU pane of the GPU Roofline Regions report.

Speed-up

Description:
Estimated speedup after a loop is offloaded to a target device, in comparison to the original elapsed time.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Basic Estimated Metrics
column group in the CPU+GPU pane of the Accelerated Regions tab.
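Conceptually, the speed-up relates the measured host time to the estimated time on the target. The sketch below is an illustration with made-up values, not the tool's internal computation:

```python
# Illustrative speed-up ratio (hypothetical values, not Advisor's
# internal computation).
measured_time_s = 4.0    # elapsed time measured on the host
estimated_time_s = 0.5   # projected time on the target after offload

speedup = measured_time_s / estimated_time_s
print(f"{speedup:.1f}x")  # → 8.0x
```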

Taxes with Reuse

Description:
The highest estimated time cost and the sum of all other costs for offloading a loop from a host to a target platform. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform. A
triangle
icon in a table cell indicates that this region reused data.
This decreases the estimated data transfer tax.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the CPU+GPU pane of the Accelerated Regions tab.

Throughput

Description:
Top two factors that a loop/function is bounded by, in milliseconds.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the
CPU+GPU
pane.

Time (Estimated)

Description:
Estimated elapsed wall-clock time from beginning to end of loop execution on a target platform after offloading.
Collected
during the Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Basic Estimated Metrics
column group in the CPU+GPU pane of the Accelerated Regions tab.

Time (Measured)

Description:
Elapsed wall-clock time from beginning to end of loop execution measured on a host platform.
Collected
during the Survey analysis in the
Offload Modeling
perspective and
found
in the
Measured
column group in the CPU+GPU pane of the Accelerated Regions tab.

Time by DRAM BW

Description
: Loop/function execution time bounded by DRAM bandwidth, in seconds.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.

Time by L3 Cache BW

Description
: Loop/function execution time bounded by L3 cache bandwidth, in seconds.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.

Time by LLC Cache BW

Description
: Loop/function execution time bounded by LLC cache bandwidth, in seconds.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Memory Estimations
column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
Prerequisites for display
: Expand the
Memory Estimations
column group.

To Target

Description:
Estimated data transferred to a target platform from a shared memory by a loop, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Estimated Data Transfer with Reuse
column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
  • GUI: From the
    Analysis Workflow
    pane, set the
    Data Transfer Simulation
    to
    Light
    ,
    Medium
    , or
    Full
    .
  • CLI: Run the
    --collect=tripcounts
    action with the
    --data-transfer=[full | medium | light]
    action options.
Prerequisite for display
: Expand the
Estimated Data Transfer with Reuse
column group.

ToFrom Target

Description:
Sum of estimated data transferred both to and from a target platform (to and from a shared memory) by a loop, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Estimated Data Transfer with Reuse
column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
  • GUI: From the
    Analysis Workflow
    pane, set the
    Data Transfer Simulation
    to
    Light
    ,
    Medium
    , or
    Full
    .
  • CLI: Run the
    --collect=tripcounts
    action with the
    --data-transfer=[full | medium | light]
    action options.
Prerequisite for display
: Expand the
Estimated Data Transfer with Reuse
column group.

Total

Description:
Sum of the total estimated traffic incoming to a target platform and the total estimated traffic outgoing from the target platform, for an offload loop, in megabytes. It is calculated as
(MappedTo + MappedFrom + 2*MappedToFrom)
. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Estimated Data Transfer with Reuse
column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisites for collection:
  • GUI: From the
    Analysis Workflow
    pane, set the
    Data Transfer Simulation
    to
    Light
    ,
    Medium
    , or
    Full
    .
  • CLI: Run the
    --collect=tripcounts
    action with the
    --data-transfer=[full | medium | light]
    action options.
Prerequisite for display
: Expand the
Estimated Data Transfer with Reuse
column group.
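The formula above can be checked with a small sketch; the variable names are hypothetical stand-ins for MappedTo, MappedFrom, and MappedToFrom:

```python
# Hypothetical per-loop traffic, in megabytes.
mapped_to = 100.0       # MappedTo: data moved to the target
mapped_from = 40.0      # MappedFrom: data moved back from the target
mapped_to_from = 25.0   # MappedToFrom: data moved in both directions

# Total = MappedTo + MappedFrom + 2*MappedToFrom, per the description above.
total_mb = mapped_to + mapped_from + 2 * mapped_to_from
print(total_mb)  # → 190.0
```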

Total, GB/s

Description:
Average data transfer bandwidth between CPU and GPU.
Collected
during the FLOP analysis (Characterization) in the
GPU Roofline Insights
perspective and
found
in the
Data Transferred
column group in the GPU pane of the GPU Roofline Regions tab.
Interpretation
: In some cases, for example,
clEnqueueMapBuffer
, data transfers might generate high bandwidth because memory is not copied but shared using L3 cache.
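As a rough mental model (with made-up numbers, not Advisor's internal accounting), average transfer bandwidth is data moved divided by transfer time:

```python
# Hypothetical numbers: 12 GB transferred over 3 seconds of transfer time.
total_transferred_gb = 12.0
transfer_time_s = 3.0

avg_bandwidth_gb_s = total_transferred_gb / transfer_time_s
print(avg_bandwidth_gb_s)  # → 4.0
```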

Total Size

Description:
Total data processed on a GPU.
Collected
during the FLOP analysis (Characterization) in the
GPU Roofline Insights
perspective and
found
in the
Data Transferred
column group in the GPU pane of the GPU Roofline Regions tab.

Total Time

Description:
Total amount of time spent executing a task.
Collected
during the Survey analysis in the
GPU Roofline Insights
perspective.

Total without Reuse

Description:
Sum of the total estimated traffic incoming to a target platform and the total estimated traffic outgoing from the target platform considering no data is reused, in megabytes. It is calculated as
(MappedTo + MappedFrom + 2*MappedToFrom)
. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected
during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for collection
:
  • GUI: From the
    Analysis Workflow
    pane, set the
    Data Transfer Simulation
    under
    Characterization
    to
    Full
    and enable the
    Data Reuse Analysis
    checkbox under
    Performance Modeling
    .
  • CLI: Use the
    --data-transfer=full
    option with the
    --collect=tripcounts
    action and the
    --data-reuse-analysis
    option with the
    --collect=tripcounts
    and
    --collect=projection
    actions.
Prerequisite for display
: Expand the
Estimated Bounded By
column group.

U

V

Write

Description:
Estimated data written to a target platform by a loop, in megabytes. If you enabled the data reuse analysis for the Performance Modeling, the metric value is calculated considering data reuse between code regions on a target platform.
Collected
during the Trip Counts analysis (Characterization) in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for collection
:
  • GUI: From the
    Analysis Workflow
    pane, set the
    Data Transfer Simulation
    under
    Characterization
    to
    Light
    ,
    Medium
    , or
    Full
    .
  • CLI: Use the
    --data-transfer=[full | medium | light]
    option with the
    --collect=tripcounts
    action.
Prerequisite for display
: Expand the
Estimated Bounded By
column group.

Write (GPU Memory Bandwidth)

Description:
Rate at which data is written to GPU, chip uncore (LLC), and main memory, in gigabytes per second.
Collected
during the Characterization analysis in the
GPU Roofline Insights
perspective and
found
in the
GPU Memory Bandwidth, GB/sec
column group in the GPU pane of the GPU Roofline Regions tab.

Write (Shared Local Memory Bandwidth)

Description:
Rate at which data is written to shared local memory, in gigabytes per second.
Collected
during the Characterization analysis in the
GPU Roofline Insights
perspective and
found
in the
Shared Local Memory Bandwidth, GB/sec
column group in the GPU pane of the GPU Roofline Regions tab.

Write (Typed Memory Bandwidth)

Description:
Rate at which data is written to typed buffers, in gigabytes per second.
Collected
during the Characterization analysis in the
GPU Roofline Insights
perspective and
found
in the
Typed Memory Bandwidth, GB/sec
column group in the GPU pane of the GPU Roofline Regions tab.

Write (Untyped Memory Bandwidth)

Description:
Rate at which data is written to untyped buffers, in gigabytes per second.
Collected
during the Characterization analysis in the
GPU Roofline Insights
perspective and
found
in the
Untyped Memory Bandwidth, GB/sec
column group in the GPU pane of the GPU Roofline Regions tab.

Write without Reuse

Description:
Estimated data written to a target platform by a code region considering no data is reused, in megabytes. This metric is available only if you enabled the data reuse analysis for the Performance Modeling.
Collected
during the Trip Counts analysis (Characterization) and Performance Modeling analysis in the
Offload Modeling
perspective and
found
in the
Estimated Bounded By
column group in the CPU+GPU pane of the Accelerated Regions tab.
Prerequisite for collection
:
  • GUI: From the
    Analysis Workflow
    pane, set the
    Data Transfer Simulation
    under
    Characterization
    to
    Full
    and enable the
    Data Reuse Analysis
    checkbox under
    Performance Modeling
    .
  • CLI: Use the
    --data-transfer=full
    option with the
    --collect=tripcounts
    action and the
    --data-reuse-analysis
    option with the
    --collect=tripcounts
    and
    --collect=projection
    actions.
Prerequisite for display
: Expand the
Estimated Bounded By
column group.

X, Y, Z

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.