Examine Bottlenecks on GPU Roofline Chart

Accuracy Level


Enabled Analyses

Survey with GPU profiling for all memory levels + FLOP (Characterization)
Other analyses and properties control the CPU Roofline part of the report, which shows metrics for loops executed on the CPU. You can add the CPU Roofline panes to the main view using the button on the top pane. For details about CPU Roofline data, see CPU / Memory Roofline Insights.

Result Interpretation

The farther a dot is from the topmost roofs, the more room for improvement there is. In accordance with Amdahl's Law, optimizing the loops that take the largest portion of the program's total run time will lead to greater speedups than optimizing the loops that take a smaller portion of the run time.
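The Amdahl's Law reasoning above can be illustrated with a short calculation. This is a minimal sketch, not part of the tool: the function name `amdahl_speedup` and the run-time fractions are hypothetical values chosen for illustration.

```python
def amdahl_speedup(fraction: float, local_speedup: float) -> float:
    """Overall program speedup when `fraction` of the total run time
    is made `local_speedup` times faster (Amdahl's Law)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Hypothetical example: make one kernel 2x faster.
hot = amdahl_speedup(0.60, 2.0)   # kernel takes 60% of total run time
cold = amdahl_speedup(0.05, 2.0)  # kernel takes 5% of total run time
print(f"hot kernel: {hot:.2f}x overall, cold kernel: {cold:.2f}x overall")
# prints "hot kernel: 1.43x overall, cold kernel: 1.03x overall"
```

The same 2x local improvement yields a far larger overall gain on the kernel that dominates run time, which is why the large dots deserve attention first.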
Example of a GPU Roofline chart
To read the GPU Roofline chart:
  • The dots on the chart correspond to OpenCL™ kernels.
  • By dot size and color, identify kernels that take most of the total program time and/or are located very low in the chart. For example:
    • Small, green dots take up relatively little time, so are likely not worth optimizing.
    • Large, red dots take up the most time, so the best candidates for optimization are the large, red dots with a large amount of space between them and the topmost roofs.
  • Diagonal lines indicate memory bandwidth limitations preventing kernels from achieving better performance without some form of optimization:
    • L3 cache roof: Represents the maximal bandwidth of the L3 cache for your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that fits entirely into L3 cache.
    • SLM cache roof: Represents the maximal bandwidth of the Shared Local Memory for your current graphics hardware. Measured using an optimized sequence of load and store operations that work only with SLM.
    • GTI roof: Represents the maximum bandwidth between the GPU and the rest of the SoC. This estimate is calculated via analytical formula based on the maximum frequency of your current graphics hardware.
    • DRAM roof: Represents the maximal bandwidth of the DRAM memory available to your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that does not fit in GPU caches.
  • Horizontal lines indicate compute capacity limitations preventing kernels from achieving better performance without some form of optimization.
  • The roofs above a dot represent the restrictions preventing it from achieving a higher performance, although the roofs below can contribute somewhat.
  • A dot cannot exceed the topmost rooflines, as these represent the maximum capabilities of the machine. However, not all kernels can utilize maximum machine capabilities.
  • The greater the distance between a dot and the highest achievable roofline, the more opportunity exists for performance improvement.
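The way roofs bound a dot can be sketched with the standard roofline formula: attainable performance is the minimum of the compute roof and arithmetic intensity multiplied by the memory bandwidth roof. The sketch below assumes made-up roof values and kernels; the names `Kernel`, `attainable_gflops`, and `headroom` are hypothetical helpers, not part of the tool.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    name: str
    gflops: float     # measured performance (the dot's y coordinate), GFLOP/s
    intensity: float  # arithmetic intensity (the dot's x coordinate), FLOP/byte


def attainable_gflops(intensity: float, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Roofline ceiling at a given arithmetic intensity:
    the lower of the horizontal compute roof and the diagonal memory roof."""
    return min(peak_gflops, intensity * bandwidth_gbs)


def headroom(kernel: Kernel, peak_gflops: float, bandwidth_gbs: float) -> float:
    """Ratio of the ceiling above the dot to the measured performance."""
    return attainable_gflops(kernel.intensity, peak_gflops, bandwidth_gbs) / kernel.gflops


# Assumed (hypothetical) topmost roofs for some GPU.
PEAK_GFLOPS = 1000.0       # horizontal compute roof
TOP_BANDWIDTH_GBS = 200.0  # topmost diagonal memory roof

kernels = [Kernel("stencil", 25.0, 0.5), Kernel("gemm", 800.0, 10.0)]
for k in kernels:
    ratio = headroom(k, PEAK_GFLOPS, TOP_BANDWIDTH_GBS)
    print(f"{k.name}: {ratio:.2f}x below the topmost roof")
# prints "stencil: 4.00x below the topmost roof"
# then   "gemm: 1.25x below the topmost roof"
```

In this made-up data, the low-intensity kernel sits under the diagonal memory roof with a large gap, while the high-intensity kernel is already close to the compute roof.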

Memory-Level GPU Roofline

By default, GPU Roofline reports data for all memory levels. This allows you to examine each kernel at different cache levels and arithmetic intensities, and provides precise insights into which cache level causes the performance bottleneck.
To configure the Memory-Level GPU Roofline chart:
  1. Expand the filter pane in the GPU Roofline chart toolbar.
  2. In the Memory Level section, select the memory levels you want to see metrics for.
    Select memory levels for a GPU Roofline chart
  3. Click
  4. In the GPU Roofline chart, double-click a loop to examine the relationships between the displayed memory levels and roofs. Labeled dots are displayed, representing memory levels with arithmetic intensity for the selected loop/function; lines connect the dots to indicate that they correspond to the selected loop/function.
Memory-Level GPU Roofline Data
Review the changes in traffic from one memory level to another and compare them to the respective roofs to identify the memory hierarchy bottleneck for the kernel. Use this information to determine optimization steps.
Example of a GPU Roofline chart for all memory levels
  • When you double-click a loop, it is expanded to several dots and/or X marks representing different memory levels:
    • CARM: Memory traffic generated by all execution units (EUs). Includes traffic between EUs and corresponding GPU cache or direct traffic to main memory. For each retired instruction with memory arguments, the size of each memory operand in bytes is added to this metric.
    • L3: Data transferred directly between execution units and L3 cache.
    • SLM: Memory access to/from Shared Local Memory (SLM), a dedicated structure within the L3 cache.
    • GTI: Represents GTI traffic/GPU memory read bandwidth, the accesses between the GPU, chip uncore (LLC), and main memory. Use this to get a sense of external memory traffic.
    • L3 + SLM: Summary traffic to/from L3 and Shared Local Memory.
  • The vertical distance between memory dots and their respective rooflines shows how much you are limited by a given memory subsystem. If a dot is close to its roofline, the kernel is limited by the performance of this memory level.
  • The horizontal distance between memory dots indicates how efficiently the loop/function uses cache. For example, if the L3 and DRAM dots are very close on the horizontal axis for a single loop, the loop/function uses L3 and DRAM similarly. This means that it does not use L3 and DRAM efficiently. Improve data reuse in the code to improve application performance.
  • Arithmetic intensity on the x-axis determines the order in which dots are plotted, which can provide some insight into your code's performance. For example, the CARM dot is typically far to the right of the L3 dot, as read/write access is by cache lines and CARM traffic is the sum of actual bytes used in operations.
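The vertical and horizontal distances described above can be approximated numerically. The sketch below assumes you have a kernel's FLOP count, elapsed time, and bytes transferred at each memory level; every number and the helper name `memory_level_report` are hypothetical. For a diagonal roof, the vertical gap between a dot and its roof at the same arithmetic intensity equals the ratio of achieved bandwidth to roof bandwidth, so the level with the highest ratio is the likeliest bottleneck.

```python
def memory_level_report(flop, seconds, traffic_bytes, roofs_gbs):
    """For each memory level, compute the dot's arithmetic intensity (FLOP/byte)
    and the fraction of that level's bandwidth roof the kernel achieves."""
    report = {}
    for level, nbytes in traffic_bytes.items():
        achieved_gbs = nbytes / seconds / 1e9
        report[level] = {
            "intensity": flop / nbytes,                       # x position of the dot
            "utilization": achieved_gbs / roofs_gbs[level],   # 1.0 means on the roof
        }
    return report


# Hypothetical kernel: 2 GFLOP executed in 10 ms.
flop, seconds = 2e9, 0.01
traffic_bytes = {"L3": 4e9, "GTI": 0.5e9}   # assumed bytes moved per level
roofs_gbs = {"L3": 500.0, "GTI": 100.0}     # assumed bandwidth roofs, GB/s

report = memory_level_report(flop, seconds, traffic_bytes, roofs_gbs)
for level, m in report.items():
    print(f"{level}: AI = {m['intensity']:.2f} FLOP/byte, "
          f"{m['utilization']:.0%} of roof")

# The level whose dot is closest to its roof is the likeliest bottleneck.
bottleneck = max(report, key=lambda lvl: report[lvl]["utilization"])
print("closest to its roof:", bottleneck)
```

With these made-up numbers, the L3 dot sits closest to its roof, and the wide horizontal gap between the L3 and GTI intensities suggests that most data is served from cache rather than external memory.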
