Examine Bottlenecks on GPU Roofline Chart
- Explore the factors that might limit your kernel performance.
- Horizontal lines indicate compute capacity limitations preventing kernels from achieving better performance without some form of optimization.
- Diagonal lines indicate memory bandwidth limitations preventing kernels from achieving better performance without some form of optimization:
- L3 cache roof: Represents the maximum bandwidth of the L3 cache for your current graphics hardware. Measured using an optimized sequence of load operations iterating over an array that fits entirely into the L3 cache.
- SLM cache roof: Represents the maximum bandwidth of the Shared Local Memory for your current graphics hardware. Measured using an optimized sequence of load and store operations that work only with SLM.
- GTI roof: Represents the maximum bandwidth between the GPU and the rest of the SoC. This estimate is calculated via an analytical formula based on the maximum frequency of your current graphics hardware.
- DRAM roof: Represents the maximum bandwidth of the DRAM available to your current graphics hardware. Measured using an optimized sequence of load operations iterating over an array that does not fit into GPU caches.
- Identify top hotspots for optimization. The dots on the chart correspond to kernels running on the GPU. In accordance with Amdahl's Law, optimizing the kernels that take the largest portion of the program's total run time leads to greater speedups than optimizing the kernels that take a smaller portion of the run time. Use the size and color of each dot to identify kernels that take most of the total GPU time and/or sit very low in the chart. For example:
- Small green dots take up relatively little time, so are likely not worth optimizing.
- Large red dots take up the most time. The best candidates for optimization are large red dots with a large gap between them and the topmost roofs.
- Identify headroom for optimization. The roofs above a dot represent the restrictions preventing it from achieving higher performance. A dot cannot exceed the topmost rooflines, because they represent the maximum capabilities of the hardware. The farther a dot is from the topmost roofs, the more room for improvement there is. Highlight the roof that limits the performance of your kernel by double-clicking the dot on the chart.
- GPU Roofline chart:
- Click a dot on the Roofline chart and click the + button that appears next to the dot. The dot expands into several dots representing the instances of the selected kernel.
- Click a dot representing a kernel instance and view details about its global and local work size in the GPU Details pane.
- Hover over dots representing kernel instances to review and compare their performance metrics. Highlight a roofline limiting the performance of a given instance by double-clicking the dot.
- Grid in the GPU pane:
- Expand a source kernel in the grid.
- View information about the work size of the kernel instances by expanding the Work Size column in the grid. To view the count of instances of a given global/local size, expand the Compute Task Details column in the grid and notice the Instance Count metric.
- Compare performance metrics for instances of different global and local sizes using the grid and the GPU Details pane.
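The triage logic described above (weighing a kernel's share of total GPU time per Amdahl's Law against its headroom below the limiting roof) can be sketched numerically. The following Python sketch uses the classic roofline formula; all kernel data and roof values are hypothetical, for illustration only:

```python
# Roofline triage sketch. All numbers are illustrative, not real hardware roofs.
# Classic roofline model: attainable(AI) = min(peak_compute, AI * peak_bandwidth)

def attainable_gflops(ai, peak_compute, peak_bandwidth):
    """Max performance (GFLOPS) the roofs allow at arithmetic intensity `ai` (FLOP/byte)."""
    return min(peak_compute, ai * peak_bandwidth)

def amdahl_speedup(time_fraction, local_speedup):
    """Whole-program speedup if a part taking `time_fraction` of run time is sped up."""
    return 1.0 / ((1.0 - time_fraction) + time_fraction / local_speedup)

# Hypothetical kernels: (name, seconds, arithmetic intensity FLOP/byte, measured GFLOPS)
kernels = [
    ("small_green", 0.2, 4.0, 900.0),
    ("large_red",   3.0, 0.5, 120.0),
]
PEAK_COMPUTE = 1000.0   # GFLOPS, compute roof (illustrative)
PEAK_DRAM_BW = 400.0    # GB/s, DRAM roof (illustrative)

total = sum(t for _, t, _, _ in kernels)
best = None
for name, t, ai, gflops in kernels:
    roof = attainable_gflops(ai, PEAK_COMPUTE, PEAK_DRAM_BW)
    headroom = roof / gflops            # potential local speedup up to the roof
    overall = amdahl_speedup(t / total, headroom)
    bound = "memory" if ai * PEAK_DRAM_BW < PEAK_COMPUTE else "compute"
    print(f"{name}: {bound} bound, headroom {headroom:.2f}x, whole-GPU speedup {overall:.2f}x")
    if best is None or overall > best[1]:
        best = (name, overall)
print("best candidate:", best[0])
```

The large, slow, memory-bound kernel wins even though the small kernel is also below its roof: its larger share of total time dominates the Amdahl's Law estimate.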
Explore Memory-Level GPU Roofline
- Expand the filter pane in the GPU Roofline chart toolbar.
- In the Memory Level section, select the memory levels you want to see metrics for. By default, GPU Roofline reports data for the GTI memory level (for integrated graphics) and the HBM/DRAM memory level (for discrete graphics).
- CARM: Memory traffic generated by all execution units (EUs). Includes traffic between EUs and corresponding GPU cache or direct traffic to main memory. For each retired instruction with memory arguments, the size of each memory operand in bytes is added to this metric.
- L3: Data transferred directly between execution units and L3 cache.
- SLM: Memory access to/from Shared Local Memory (SLM), a dedicated structure within the L3 cache.
- GTI: Represents GTI traffic/GPU memory read bandwidth, the accesses between the GPU, chip uncore (LLC), and main memory. Use this to get a sense of external memory traffic.
- An optimized kernel ideally:
- Spends less time transferring data between the L3 and CARM memory levels
- Uses as much data as possible for actual calculations
- Improves the elapsed time of the kernel and of the entire application
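In a memory-level roofline, the kernel's FLOP count is divided by the bytes moved at each level, giving one arithmetic intensity (and one memory-bound ceiling) per level. A minimal Python sketch of that calculation; all FLOP, traffic, and bandwidth numbers are hypothetical:

```python
# Per-memory-level arithmetic intensity sketch (illustrative numbers only).
flops = 2.0e9                      # total FLOP executed by the kernel (hypothetical)
traffic_bytes = {                  # bytes transferred at each level (hypothetical)
    "CARM": 8.0e9,                 # traffic generated by the execution units
    "L3":   4.0e9,                 # traffic between EUs and the L3 cache
    "GTI":  1.0e9,                 # traffic leaving the GPU toward the SoC
}
peak_bw = {"CARM": 3000.0, "L3": 800.0, "GTI": 300.0}   # GB/s roofs, illustrative

for level, bytes_moved in traffic_bytes.items():
    ai = flops / bytes_moved                      # FLOP/byte at this level
    ceiling = ai * peak_bw[level]                 # memory-bound ceiling at this level
    print(f"{level}: AI={ai:.2f} FLOP/byte, ceiling={ceiling:.0f} GFLOPS")

# The level with the lowest ceiling is the candidate memory bottleneck:
bottleneck = min(peak_bw, key=lambda lv: (flops / traffic_bytes[lv]) * peak_bw[lv])
print("likely bottleneck:", bottleneck)
```

With these numbers the L3 level has the lowest ceiling, so the kernel would be L3 bound, matching the goal above of reducing time spent moving data between the L3 and CARM levels.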
Examine Kernel Details
- Pointer arrow that shows the exact roof limiting the kernel performance. The arrow points to what you should optimize the kernel for and shows the potential speedup after the optimization in the callout. If the arrow points to a diagonal line, the kernel is mostly memory bound. If the arrow points to a horizontal line, the kernel is mostly compute bound. Intel® Advisor displays a compute roof limiting the performance of your kernel based on the instruction mix used. For example, in the screenshot below, the kernel is bound by the L3 bandwidth. If you optimize the memory access patterns in the kernel, it gets up to a 5.1x speedup.
- Guidance on possible optimization steps depending on the factor limiting performance. Click on the bounding factor to expand the hint.
- Amount of data transferred for each cache memory level. The chart is plotted for the dominant type of operations in the code (FLOAT or INT) and shows only roofs with the cache memory levels, data types, and instruction mix used in the kernel. If there are no FLOP or INTOP operations in the kernel, the single-kernel Roofline chart is not shown.
- The total number of floating-point and integer operations executed by the kernel per second, as a percentage of the maximum compute capacity of your hardware. The red bar represents the dominant operation data type used in the kernel.
- The amount of data transferred by the kernel at each cache memory level per second, as a percentage of the memory level bandwidth. Cache memory level utilization (in percent) is the ratio of effective bandwidth to the maximum bandwidth of a given memory level. This metric shows how well the kernel uses the capabilities of your hardware and can help you identify bottlenecks. For example, in the screenshot below, the kernel utilizes 19% of the L3 bandwidth. Compared to the utilization metrics for other memory levels and compute capacity, L3 bandwidth is the main factor limiting the performance of the kernel.
- Review how much time the kernel spends processing requests for each memory level, in relation to the total time, reported in the Impacts histogram. A large value indicates the memory level that bounds the selected kernel. Examine the difference between the two largest bars to see how much throughput you can gain by reducing the impact of your main bottleneck. This also gives you a long-term plan for reducing memory-bound limitations: once you resolve the problems coming from the widest bar, your next issue comes from the second-widest bar, and so on. Ideally, you should see L3 or SLM as the most impactful memory level.
- Review the amount of data that passes through each memory level, reported in the Shares histogram.
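The utilization metric described above (effective bandwidth as a share of a level's maximum bandwidth) reduces to a one-line calculation. A minimal Python sketch with hypothetical traffic, time, and roof values, reproducing the 19% L3 example:

```python
# Cache-level bandwidth utilization sketch (numbers are illustrative).
# utilization = effective bandwidth / maximum bandwidth of the memory level

def utilization_percent(bytes_moved, elapsed_s, peak_bw_gbs):
    """Percent of a memory level's peak bandwidth actually used by the kernel."""
    effective_gbs = bytes_moved / elapsed_s / 1e9   # achieved GB/s
    return 100.0 * effective_gbs / peak_bw_gbs

# Hypothetical kernel: 3.04 GB through L3 in 0.02 s against an 800 GB/s L3 roof.
u = utilization_percent(3.04e9, 0.02, 800.0)
print(f"L3 utilization: {u:.0f}%")   # -> 19%
```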
Instruction categories:
- Compute (FLOP and INTOP)
- Memory: LOAD, STORE, SLM_LOAD, SLM_STORE types, depending on the argument of the send, sendc, sends, and sendsc instructions
- Examine the instruction count for each category, as well as its percentage in the overall instruction mix, to determine the dominating category of instructions in the kernel.
- Examine the instruction count for each type of compute, memory, atomic, and other instructions.
- For compute instructions, view the dominating data type for each type of instruction. The data type dominating in the entire kernel is highlighted in blue.
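The per-category counts and percentages described above can be reproduced from raw instruction counts. A minimal Python sketch; the category names follow the list above, and the counts are hypothetical:

```python
# Instruction-mix summary sketch (hypothetical counts for a GPU kernel).
from collections import Counter

counts = Counter({"Compute": 620, "Memory": 310, "Atomics": 10, "Other": 60})
total = sum(counts.values())

# Report each category's count and its share of the overall instruction mix.
for category, n in counts.most_common():
    print(f"{category}: {n} ({100.0 * n / total:.1f}% of instruction mix)")

dominant = counts.most_common(1)[0][0]
print("dominating category:", dominant)
```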