User Guide


Intel® Advisor Beta: GPU Roofline

GPU Roofline Analysis Purpose and Usage

To view a Roofline chart of the GPU kernels in your application, run a GPU Roofline analysis.
The GPU Roofline analysis helps you estimate and visualize the actual performance of GPU kernels against hardware-imposed performance ceilings, using benchmarks and hardware metric profiling, and determine the main limiting factor. When you run a GPU Roofline analysis, Intel® Advisor Beta:
  • Collects OpenCL™ kernel timings and memory data using the Survey analysis with GPU profiling.
  • Measures the hardware limitations and collects floating-point and integer operations data using the Trip Counts and FLOP analysis with GPU profiling.
Starting with the 2021.1-beta10 release, Intel® Advisor Beta introduces a new UI with an integrated GPU Roofline workflow. See the Intel® Advisor Beta User Guide for details.
Use the GPU Roofline chart to answer the following questions:
  • What is the maximum achievable performance with your current hardware resources?
  • Does your application work optimally on current hardware resources?
  • If not, what are the best candidates for optimization?
  • Is memory bandwidth or compute capacity limiting performance for each optimization candidate?
Memory Levels
Dots on the Roofline chart correspond to OpenCL kernels, as described further in the GPU Roofline Analysis Chart Data section. You can choose which memory levels (CARM, L3, SLM, GTI, L3 + SLM) to plot dots for on the Roofline chart:
  • CARM: Memory traffic generated by all execution units (EUs). Includes traffic between EUs and the corresponding GPU cache, or direct traffic to main memory. For each retired instruction with memory arguments, the size of each memory operand in bytes is added to this metric.
  • L3: Data transferred directly between execution units and L3 cache.
  • SLM: Memory access to/from Shared Local Memory (SLM), a dedicated structure within the L3 cache.
  • GTI: Represents GTI traffic/GPU memory read bandwidth, that is, the accesses between the GPU, chip uncore (LLC), and main memory. Use this to get a sense of external memory traffic.
  • L3 + SLM: Combined traffic to/from L3 and Shared Local Memory.
Arithmetic intensity determines the horizontal order in which the dots for a kernel are plotted, which can provide some insight into your code’s performance. For example, the CARM dot is typically far to the right of the L3 dot, because L3 read/write access is by whole cache lines, while CARM traffic is the sum of the bytes actually used in operations.
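The relationship between memory traffic and dot position can be illustrated with a minimal Python sketch. It is not part of Intel Advisor: the function names and all numbers are hypothetical, and it only assumes that arithmetic intensity is the operation count divided by the bytes counted at a given memory level.

```python
def arithmetic_intensity(flop, bytes_moved):
    """Return operations per byte for one memory level."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flop / bytes_moved


def dots_left_to_right(flop, traffic_bytes):
    """Order memory levels as their dots would appear along the x axis."""
    intensities = {
        level: arithmetic_intensity(flop, moved)
        for level, moved in traffic_bytes.items()
    }
    return sorted(intensities.items(), key=lambda item: item[1])


# Hypothetical kernel: it uses 4 bytes of every 64-byte cache line it touches,
# so the L3 level counts 16 times more traffic than CARM for the same work.
example_traffic = {"CARM": 4.0e6, "L3": 64.0e6}
example_order = dots_left_to_right(flop=8.0e6, traffic_bytes=example_traffic)
print(example_order)
```

With these illustrative values the L3 dot lands to the left of the CARM dot, because the same number of operations is divided by a larger byte count.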

Run a GPU Roofline Analysis

Set Up Environment
Set-Up Tasks
Intel® Advisor Beta/Linux* OS
On Linux* OS, the GPU Roofline analysis is available only with kernel version 4.14 or higher.
  1. Switch to root mode to run the GPU Roofline analysis with root privileges.
    Alternatively, to run the analysis as a user without root privileges, add your username to the video group. To check if you are already in the video group, run:
    groups | grep video
    If you are not part of the video group, add your username to it:
    sudo usermod -a -G video <username>
    Run the groups command again to verify that you successfully added your username to the video group. If video is not listed, log out and log back in.
  2. Set the value of the dev.i915.perf_stream_paranoid sysctl option to 0:
    sysctl -w dev.i915.perf_stream_paranoid=0
    This command makes a temporary change that is lost on the next reboot. To change this option permanently, run:
    echo dev.i915.perf_stream_paranoid=0 > /etc/sysctl.d/60-mdapi.conf
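The Linux prerequisites above can be checked before a collection with a small script. The following is a sketch, not an Intel Advisor tool: the function names are hypothetical, and it assumes the sysctl option is readable at the matching path under /proc/sys and that the group is named video, as in the steps above.

```python
import grp
import os
import platform
import re
from pathlib import Path

# sysctl names map to paths under /proc/sys (dots become slashes).
PARANOID_PATH = Path("/proc/sys/dev/i915/perf_stream_paranoid")


def kernel_at_least(release, minimum=(4, 14)):
    """Return True if a release string such as '5.4.0-42-generic' meets the minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False
    return (int(match.group(1)), int(match.group(2))) >= minimum


def has_gpu_access():
    """Return True for root, or for a process whose groups include 'video'."""
    if os.geteuid() == 0:
        return True
    try:
        video_gid = grp.getgrnam("video").gr_gid
    except KeyError:
        return False  # no 'video' group on this system
    return video_gid == os.getegid() or video_gid in os.getgroups()


def paranoid_disabled(path=PARANOID_PATH):
    """Return True if the sysctl option currently reads 0."""
    try:
        return path.read_text().strip() == "0"
    except OSError:
        return False  # option missing or unreadable


if __name__ == "__main__":
    checks = {
        "kernel >= 4.14": kernel_at_least(platform.release()),
        "root or video group": has_gpu_access(),
        "perf_stream_paranoid == 0": paranoid_disabled(),
    }
    for name, ok in checks.items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```

The group check reflects the groups of the current process, which is why a newly added video membership only shows up after logging out and back in.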
Intel® Advisor Beta/Windows* OS
Install a GPU driver for your system from
Run GPU Roofline Analysis
  • GPU profiling is applicable only to Intel® Processor Graphics.
  • This page explains how to run the GPU Roofline analysis from the Intel® Advisor Beta command-line interface. Starting with the 2021.1-beta10 release, Intel® Advisor Beta introduces a new UI with an integrated GPU Roofline workflow. See the Intel® Advisor Beta User Guide for details.
Your target application must use OpenCL™ either directly or indirectly, for example, through the current implementation of Data Parallel C++ (DPC++), SYCL*, or OpenMP* for GPU.
To generate a Roofline visual model for GPU kernel performance estimation:
  1. Collect the GPU Roofline data with the --profile-gpu option. The --profile-gpu option enables the analysis of OpenCL and Intel® Media SDK programs running on Intel® Processor Graphics. Do one of the following:
    • Run a shortcut --collect=roofline command, which runs the Survey and Trip Counts with FLOP analyses one by one:
      advixe-cl --collect=roofline --profile-gpu --project-dir=<project-dir> --search-dir src:r=<source-dir> -- <target> [target-options]
    • Run the Survey and Trip Counts with FLOP analyses separately:
      advixe-cl --collect=survey --profile-gpu --project-dir=<project-dir> --search-dir src:r=<source-dir> -- <target> [target-options]
      advixe-cl --collect=tripcounts --flop --profile-gpu --project-dir=<project-dir> --search-dir src:r=<source-dir> -- <target> [target-options]
  2. Generate a GPU Roofline report with the --gpu option:
    advixe-cl --report=roofline --gpu --project-dir=<project-dir> --report-output=<project-dir>/roofline.html
    To generate a GPU Roofline report for integer data, specify the
  3. Review the HTML report with the GPU Roofline model for your application.
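The separate collection commands and the report command above can be scripted. The sketch below only assembles the command lines shown in the steps and prints them unless asked to run them. The helper names and the example paths (./advi_results, ./src, ./myapp, --size 1024) are hypothetical placeholders, and actually running the commands assumes advixe-cl is on your PATH.

```python
import shlex
import subprocess


def roofline_commands(project_dir, source_dir, target, target_options=()):
    """Build the Survey, Trip Counts with FLOP, and report command lines."""
    common = [
        "--profile-gpu",
        f"--project-dir={project_dir}",
        "--search-dir", f"src:r={source_dir}",
        "--", target, *target_options,
    ]
    survey = ["advixe-cl", "--collect=survey", *common]
    tripcounts = ["advixe-cl", "--collect=tripcounts", "--flop", *common]
    report = [
        "advixe-cl", "--report=roofline", "--gpu",
        f"--project-dir={project_dir}",
        f"--report-output={project_dir}/roofline.html",
    ]
    return [survey, tripcounts, report]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the sequence if a step fails.
            subprocess.run(argv, check=True)


# Hypothetical project layout and target, for illustration only.
example_commands = roofline_commands(
    "./advi_results", "./src", "./myapp", ["--size", "1024"]
)
run_all(example_commands)  # dry run: prints the three command lines
```

Passing argument lists to subprocess.run avoids shell quoting issues with target options; call run_all(example_commands, dry_run=False) to execute the steps in order.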
You can use the Intel Advisor Python* API to collect GPU metrics for your project and print them in the terminal:
advixe-python <install-dir>/pythonapi/examples/ <project-dir>

GPU Roofline Chart Controls

There are several controls to help you focus on the GPU Roofline chart data most important to you, including the following.
  • Select by Mouse Rect: Select one or more kernels by tracing a rectangle with your mouse.
  • Zoom by Mouse Rect: Zoom in and out by tracing a rectangle with your mouse. You can also zoom in and out using your mouse wheel.
  • Move View by Mouse: Move the chart left, right, up, and down.
  • Undo: Undo or redo the previous zoom action.
Select the Memory Level(s) to show for each kernel in the chart: CARM, L3, SLM, GTI, L3 + SLM.
Add visual guidance to the GPU Roofline chart to make the interpretation of data easier, including performance limits and whether kernels are memory bound, compute bound, or both.
Use the following checkboxes in the drop-down toolbar:
  • Display roof rulers: Show a vertical line from a kernel to the nearest and topmost performance ceilings. To view the ruler, hover the cursor over a kernel dot. Where the line intersects with each roof, labels display hardware performance limits for the kernel.
  • Show memory level relationships: Visually emphasize the relationships among displayed memory levels and roofs for a selected dot.
  • Show Roofline boundaries: Color the GPU Roofline zones to make it easier to see if enclosed kernels are fundamentally memory bound, compute bound, or bound by both compute and memory roofs.
The preview picture is updated as you select guidance options, allowing you to see how changes will affect the GPU Roofline chart's appearance. Use the corresponding buttons to apply your changes or to return the GPU Roofline chart to its original appearance.
  • Roofline View Settings: Change the default scale setting to show:
    • A scale (default) that adjusts to a chosen GPU Roofline chart view.
    • A scale that adjusts to the tallest or widest view and does not change when a different GPU Roofline chart view is chosen.
  • Roof Settings: Change the visibility and appearance of roofline representations (lines):
    • Click a checkbox to show/hide a roof line.
    • Click a checkbox to change a roof line appearance: display the roof line as a solid or a dashed line.
    • Manually fine-tune roof values in the corresponding column to set hardware limits specific to your code.
  • Loop Weight Representation: Change the appearance of dots:
    • Point Weight Calculation: Change the Base Value for a point weight calculation.
      For a GPU Roofline chart, only Self Elapsed Time is available as a base value.
    • Point Weight Ranges: Change the weight Range (R) of a dot. Use the corresponding buttons to split a point weight range in two or to merge a point weight range with the range below.
    • Point Colorization: Color dots by weight ranges or by whether they are vectorized or scalar. You can also change the color of loops with no self time.
  • Hover your mouse over a dot to display metrics and, if enabled, a roof ruler for it.
  • By default, Intel Advisor Beta generates a GPU Cache-Aware Roofline Model (CARM), which reports memory traffic, in bytes, generated by all execution units.
    When Show memory level relationships is enabled: Double-click a dot, or select a dot and use the keyboard shortcut, to display labeled dots representing memory levels for the selected kernel. Lines connect the dots to indicate that they correspond to the selected kernel.
    If you have chosen to display only
    memory levels in the chart using the
    Memory Level
    toolbar, unselected memory levels are displayed with
    To hide the labeled dots, do one of the following:
    • Select another kernel.
    • Double-click an empty space in the GPU Roofline chart.
    • Press
  • Right-click a kernel dot or a blank area in the Roofline chart to perform more functions, such as:
    • Further simplify the GPU Roofline chart by filtering out (temporarily hiding a dot), filtering in (temporarily hiding all other dots), and clearing filters (showing all originally displayed dots).
    • Add visual guidance to the GPU Roofline chart to make the interpretation of data easier. These options are the same as in the

GPU Roofline Chart Data

Learn More About Roofline Charts and Investigate Kernels
The GPU Roofline chart plots an application's achieved performance and arithmetic intensity against the machine's maximum achievable performance:
  • Arithmetic intensity (x axis): The number of floating-point operations (FLOPs) per byte transferred between the GPU and memory, as determined by the kernel algorithm.
  • Performance (y axis): Measured in billions of floating-point operations per second (GFLOPS).
In general:
  • The size and color of each dot represent relative execution time for each kernel. Large red dots take the most time, so are the best candidates for optimization. Small green dots take less time, so may not be worth optimizing.
  • Diagonal lines indicate memory bandwidth limitations preventing kernels from achieving better performance without some form of optimization.
    • L3 cache roof: Represents the maximal bandwidth of the L3 cache for your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that fits entirely into L3 cache.
    • SLM cache roof: Represents the maximal bandwidth of the Shared Local Memory for your current graphics hardware. Measured using an optimized sequence of load and store operations that work only with SLM.
    • GTI roof: Represents the maximum bandwidth between the GPU and the rest of the SoC. This estimate is calculated via analytical formula based on the maximum frequency of your current graphics hardware.
    • DRAM roof: Represents the maximal bandwidth of the DRAM memory available to your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that does not fit in GPU caches.
  • Horizontal lines indicate compute capacity limitations preventing kernels from achieving better performance without some form of optimization.
  • A dot cannot exceed the topmost rooflines, as these represent the maximum capabilities of the machine. However, not all kernels can utilize maximum machine capabilities.
  • The greater the distance between a dot and the highest achievable roofline, the more opportunity exists for performance improvement.
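How the diagonal and horizontal roofs bound a dot can be sketched with the standard roofline relation: attainable performance is the smaller of the compute peak and arithmetic intensity multiplied by memory bandwidth. The Python below is an illustration under that assumption. The function names and the roof values (100 GB/s, 400 GFLOPS) are hypothetical and are not measurements produced by Intel Advisor.

```python
def attainable_gflops(intensity, bandwidth_gb_s, compute_peak_gflops):
    """Roofline bound: min(compute roof, intensity * memory roof)."""
    return min(compute_peak_gflops, intensity * bandwidth_gb_s)


def limiting_roof(intensity, bandwidth_gb_s, compute_peak_gflops):
    """Say which roof bounds a dot at the given arithmetic intensity."""
    # The ridge point is where the diagonal roof meets the horizontal roof.
    ridge = compute_peak_gflops / bandwidth_gb_s
    return "memory" if intensity < ridge else "compute"


def headroom(achieved_gflops, intensity, bandwidth_gb_s, compute_peak_gflops):
    """Ratio of the roof above a dot to the dot's achieved performance."""
    roof = attainable_gflops(intensity, bandwidth_gb_s, compute_peak_gflops)
    return roof / achieved_gflops


# Hypothetical roofs: one memory roof at 100 GB/s, one compute roof at 400 GFLOPS.
BANDWIDTH, PEAK = 100.0, 400.0

low_ai_bound = attainable_gflops(0.5, BANDWIDTH, PEAK)   # under the diagonal roof
high_ai_bound = attainable_gflops(8.0, BANDWIDTH, PEAK)  # under the horizontal roof
print(low_ai_bound, limiting_roof(0.5, BANDWIDTH, PEAK))
print(high_ai_bound, limiting_roof(8.0, BANDWIDTH, PEAK))
print(headroom(25.0, 0.5, BANDWIDTH, PEAK))
```

A headroom value well above 1 corresponds to a dot far below its roof, which is the case the last bullet above describes as having the most room for improvement.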
The GPU Roofline chart is based on the CPU Roofline chart, but there are some differences:
  • The dots on the chart correspond to OpenCL kernels, while in the CPU version, they correspond to individual loops.
  • Some displayed information and controls (for example, thread/core count) are not relevant to GPU Roofline. For more information, see the table below.
  • The integrated chart shows multiple dots for a single kernel. These dots correspond to the different memory levels used to calculate arithmetic intensity. Hover over a dot to identify its arithmetic intensity. To show or hide certain dots on the chart, use the Memory Level drop-down filter.

Known Issues and Limitations

On Windows OS:
  • GPU Roofline works only with 64-bit applications.
  • GPU Roofline does not support Intel® Math Kernel Library (Intel® MKL) applications.

Product and Performance Information


Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804