User Guide

Explore Performance Gain from GPU-to-GPU Modeling (Preview)

Enabled Analyses

Performance data collection for GPU kernels only + Performance modeling for GPU kernels only.
This is a preview feature available only from the command line with the Intel® Advisor Python* scripts. See Run GPU-to-GPU Offload Modeling.
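For example, a minimal command-line sketch, assuming the oneAPI environment is set so that $APM points to the Offload Modeling scripts directory (the project directory and application names below are illustrative, and exact options may vary by Intel Advisor version):

  # Collect data and model performance for GPU kernels only (GPU-to-GPU):
  advisor-python $APM/run_oa.py ./advi_results --gpu -- ./myApplication

  # Or run the collection and modeling steps separately:
  advisor-python $APM/collect.py ./advi_results --gpu -- ./myApplication
  advisor-python $APM/analyze.py ./advi_results --gpu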

Result Interpretation

By default, the GPU-to-GPU performance modeling results are generated in <project-dir>/e<NNN>/pp<NNN>/data.0. To view the results, go to this directory, or to the directory that you specified with the out-dir option, and open the interactive HTML report report.html.
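For example, a minimal sketch that writes the report to a custom directory (the directory names are illustrative):

  # Rerun the modeling step and place report.html in ./gpu2gpu_report:
  advisor-python $APM/analyze.py ./advi_results --gpu --out-dir ./gpu2gpu_report

  # Then open the report in any web browser, for example:
  xdg-open ./gpu2gpu_report/report.html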
The structure and controls of the HTML report generated for the GPU-to-GPU performance modeling are similar to the HTML report for the CPU-to-GPU offload modeling, but the content is different because for the GPU-to-GPU modeling, Intel Advisor models performance only for the GPU-enabled parts of your application.
The report includes the following tabs: Summary, Offloaded Regions, Non-Offloaded Regions, Call Tree, Configuration, and Logs. You can switch between the tabs using the links in the top left.
The Non-Offloaded Regions tab shows only GPU kernels that cannot be modeled, for example, kernels with some required metrics missing. If all kernels are modeled, the tab is empty. For the GPU-to-GPU modeling, a low estimated speedup is not a reason for not offloading a kernel.
When you open the report, it first shows the Summary tab. In this tab, you can review the summary of the modeling results and the estimated performance metrics for some GPU kernels in your application.
Summary of the GPU-to-GPU Offload Modeling HTML report
  • In the Program Metrics pane, compare the Time on Baseline GPU and Time on Target GPU and examine the Average Speedup to understand if the GPU kernels in your application have better performance on the target GPU. For example, if Time on Baseline GPU is 10 seconds and Time on Target GPU is 4 seconds, the Average Speedup is 2.5x. Time on Baseline GPU includes only the execution time of GPU kernels and ignores the CPU parts of your application. Time on Target GPU includes the estimated execution time for GPU kernels on the target and the offload taxes.
    In the right-side pie chart, review the components of the time on the target GPU and see where the GPU kernels spend most of the time: executing on the target GPU (Estimated Time on GPU), transferring data between the host device and the target GPU (Data Transfer Tax), or scheduling kernels on the target GPU (Kernel Launch Tax).
  • In the Offloads Bounded by pane, examine what the GPU kernels are potentially bounded by on the target GPU. The parameters with the highest percentages are where the GPU kernels spend the most time. Review the detailed metrics for these parameters in the other tabs to understand whether you need to optimize your application for them.
  • In the Target Device Configuration pane (in the top right), examine the target GPU parameters that were used to model the GPU kernel performance. You can also use the sliders to adjust the parameters and create your own custom configuration.
    To model performance for the custom configuration, save the configuration settings from the report and rerun the performance modeling step with analyze.py for the new configuration file, as shown in the command sketch below. For details about using custom configurations, go to the Configuration tab and review the comments.
  • In the Top offloaded pane, review the top five GPU kernels with the highest absolute offload gain (in seconds) estimated on the target GPU. The gain is calculated as (Time measured on the baseline GPU - Time estimated on the target GPU). Because the GPU-to-GPU modeling considers all GPU kernels in your application, this pane might also show kernels with an estimated speedup of less than 1.
    For each kernel in the pane, you can review the speedup, the time on the baseline and target GPUs, the main bounded-by parameters, and the estimated amount of data transferred.
The Top non offloaded pane shows only GPU kernels that cannot be modeled. If all kernels are modeled, the pane is empty. For the GPU-to-GPU modeling, a low estimated speedup is not a reason for not offloading a kernel.
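The following is a minimal sketch of the custom-configuration workflow mentioned above. It assumes a configuration file saved from the report as my_config.toml (the file name is illustrative, and exact options may vary by Intel Advisor version):

  # Rerun only the performance modeling step with a custom target configuration:
  advisor-python $APM/analyze.py ./advi_results --gpu --config ./my_config.toml --out-dir ./custom_report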
To see the details about each GPU kernel, go to the Offloaded Regions or the Call Tree tab. These tabs report the same metrics, but Offloaded Regions shows only the modeled kernels, while Call Tree shows all kernels, including non-modeled ones.
Offloaded Regions tab of the GPU-to-GPU Offload Modeling HTML report
  • In the metrics table, examine the detailed performance metrics for the GPU kernels. The Measured column group shows metrics measured on the baseline GPU. The other column groups show metrics estimated for the target GPU. You can expand column groups to see more metrics.
    For example, to find a potential bottleneck, examine the Offload Information column group, focusing on the Bounded by and Total Execution Time by metrics. For details about the bounding factor, scroll right to the column group corresponding to the value reported in the Bounded by column, for example, L3 Cache, DRAM, or LLC. Expand the column group and examine the Total <name> Bandwidth Utilization column. The utilization is calculated as the ratio of the average memory level bandwidth to its peak bandwidth; for example, an average L3 bandwidth of 300 GB/s against a peak of 400 GB/s gives a utilization of 75%. A high value means that the kernel uses this memory level heavily and it is a potential bottleneck.
    You can also review the following data to find bottlenecks:
    • If you see high cache or memory bandwidth utilization (for example, in the L3 Cache, SLM, or LLC column groups), consider optimizing the cache/memory traffic to improve performance.
    • If you see high latency in the Offload Information column group, consider optimizing the cache/memory latency by scheduling enough parallel work for this kernel to increase its thread occupancy.
    • If you see a high data transfer tax in the Overhead column group, consider optimizing the data transfer taxes or using unified shared memory (USM).
    You can also use the sort and filter controls to focus on the data most relevant to your problem:
    • To filter the data in a column, hover over the column title and click the menu icon, or click the right-side Custom filter button. In the filter tab, deselect the values you want to hide from the table or specify filter criteria. For example, you can apply a filter to the Hierarchy column to see only specific kernels of interest and hide all other kernels.
    • To configure the table metrics, click the right-side Column configurator button and select the columns to show in the table and/or deselect the columns or column groups to hide. For example, if you want to analyze how well your application uses the memory resources on the target GPU, you can show only the memory-related column groups.
  • In the right-side Source pane, see the source code associated with a kernel, if available. Select a kernel in the metrics table to see its source.
  • In the right-side Memory objects pane, see the details about the memory objects transferred between the host device and the target GPU for a kernel. Select a kernel in the metrics table to see its memory objects data. Examine this pane if you see a high data transfer tax for a kernel. The pane includes two parts:
    • The memory object table lists all memory objects accessed by the kernel with details about each object, such as its size, transfer direction (only to the host, only to the target, or from the host to the target and back), and object type. You can also apply filters to the table columns to focus on the most interesting data. Intel Advisor uses this data to estimate the data transfer traffic and the data transfers for each kernel.
    • The memory objects histogram shows the distribution of memory objects by size. A large number of small objects may result in high latency for the kernel, and high latency might cause a high data transfer tax.
Go to the Configuration tab to review, in read-only mode, the detailed target device configuration used for the modeling. You can also review the comments for each parameter and its possible values.
Go to the Logs tab to see the command line used to run the analyses and all output messages reported to the console during the script execution. This tab reports four types of messages: Error, Warning, Info, and Debug, in the order of their appearance in the console during the script execution.
By default, only the Error, Warning, and Info messages are shown. To control the types of messages shown, hover over the Severity column header and click the menu icon to open the filters pane.

Next Steps

  • Based on the collected data, rewrite your code to offload it to a different target GPU to improve performance, and measure its actual performance with the GPU Roofline Insights perspective, as shown in the command sketch below.
  • See optimization tips for oneAPI applications running on GPU in the oneAPI GPU Optimization Guide.
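For reference, a minimal sketch of collecting GPU Roofline data from the command line (the project directory and application names are illustrative, and exact options may vary by Intel Advisor version):

  # Collect GPU Roofline data for the application running on the GPU:
  advisor --collect=roofline --profile-gpu --project-dir=./advi_roofline -- ./myApplication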

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.