Intel® Advisor Beta: GPU Roofline
GPU Roofline Analysis Purpose and Usage
- Collects OpenCL™ kernel timings and memory data using the Survey analysis with GPU profiling.
- Measures the hardware limitations and collects floating-point and integer operations data using the Trip Counts and FLOP analysis with GPU profiling.
- What is the maximum achievable performance with your current hardware resources?
- Does your application work optimally on current hardware resources?
- If not, what are the best candidates for optimization?
- Is memory bandwidth or compute capacity limiting performance for each optimization candidate?
- CARM: Memory traffic generated by all execution units (EUs). Includes traffic between EUs and corresponding GPU cache or direct traffic to main memory. For each retired instruction with memory arguments, the size of each memory operand in bytes is added to this metric.
- L3: Data transferred directly between execution units and L3 cache.
- SLM: Memory access to/from Shared Local Memory (SLM), a dedicated structure within the L3 cache.
- GTI: Represents GTI traffic/GPU memory read bandwidth, the accesses between the GPU, chip uncore (LLC), and main memory. Use this to get a sense of external memory traffic.
- L3 + SLM: Summary traffic to/from L3 and Shared Local Memory.
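
As a rough illustration of why one kernel can have a different arithmetic intensity at each memory level listed above, the Python sketch below divides a single FLOP count by the traffic seen at each level. The FLOP and byte counts are invented numbers and `arithmetic_intensity` is a hypothetical helper; none of this is Intel Advisor output or API.

```python
# Hypothetical per-kernel measurements; the numbers are invented for illustration.
KERNEL_FLOP = 4.0e9  # floating-point operations executed by the kernel

# Bytes moved at each memory level named in the list above (invented values).
TRAFFIC_BYTES = {
    "CARM": 8.0e9,
    "L3": 2.0e9,
    "SLM": 1.0e9,
    "GTI": 0.5e9,
}


def arithmetic_intensity(flop, traffic_bytes):
    """Return FLOP per byte for every memory level, adding the L3 + SLM sum."""
    levels = dict(traffic_bytes)
    # "L3 + SLM" is the summed traffic of the two levels.
    levels["L3 + SLM"] = traffic_bytes["L3"] + traffic_bytes["SLM"]
    return {name: flop / nbytes for name, nbytes in levels.items()}


intensity = arithmetic_intensity(KERNEL_FLOP, TRAFFIC_BYTES)
for level, value in intensity.items():
    print(f"{level:9s} {value:6.2f} FLOP/byte")
```

The same kernel lands at a different x position for each level because the denominator changes while the FLOP count stays fixed.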
Run a GPU Roofline Analysis
Intel® Advisor Beta/Linux* OS
The GPU Roofline analysis is available only on Linux* OS with kernel version 4.14 or higher.
Intel® Advisor Beta/Windows* OS
- GPU profiling is applicable only to Intel® Processor Graphics.
- This page explains how to run the GPU Roofline analysis from the Intel® Advisor Beta command-line interface. To run the GPU Roofline analysis from the Intel® Advisor Beta GUI (technical preview), set the ADVIXE_EXPERIMENTAL=beta_gui variable. For more information, see Intel® Advisor User Guide (PDF).
- Collect the GPU Roofline data with the --profile-gpu option. The --profile-gpu option enables the analysis of OpenCL and Intel® Media SDK programs running on Intel® Processor Graphics. Do one of the following:
- Run a shortcut --collect=roofline command, which runs the Survey and Trip Counts with FLOP analyses one by one:
    advixe-cl --collect=roofline --profile-gpu --project-dir=<project-dir> --search-dir src:r=<source-dir> -- <target> [target-options]
- Run the Survey and Trip Counts with FLOP analyses separately:
    advixe-cl --collect=survey --profile-gpu --project-dir=<project-dir> --search-dir src:r=<source-dir> -- <target> [target-options]
    advixe-cl --collect=tripcounts --flop --profile-gpu --project-dir=<project-dir> --search-dir src:r=<source-dir> -- <target> [target-options]
- Generate a GPU Roofline report with the --gpu option:
    advixe-cl --report=roofline --gpu --project-dir=<project-dir> --report-output=<project-dir>/roofline.html
  To generate a GPU Roofline report for integer data, specify the --data-type=int option.
- Review the HTML report with the GPU Roofline model for your application.
advixe-python <install-dir>/pythonapi/examples/survey_gpu.py <project-dir>
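
If you script these steps, it can help to assemble the command lines in one place. The Python sketch below builds the separate Survey, Trip Counts with FLOP, and report commands as argument lists, using only the options shown above. The helper name `roofline_commands` and the placeholder paths are hypothetical, and the sketch only prints the commands rather than running them.

```python
import shlex


def roofline_commands(project_dir, source_dir, target, target_options=()):
    """Build the collection and report command lines as argument lists."""
    common = [
        "--profile-gpu",
        f"--project-dir={project_dir}",
        "--search-dir", f"src:r={source_dir}",
        "--", target, *target_options,
    ]
    survey = ["advixe-cl", "--collect=survey", *common]
    tripcounts = ["advixe-cl", "--collect=tripcounts", "--flop", *common]
    report = [
        "advixe-cl", "--report=roofline", "--gpu",
        f"--project-dir={project_dir}",
        f"--report-output={project_dir}/roofline.html",
    ]
    return [survey, tripcounts, report]


# Placeholder paths; replace them with your own project, sources, and binary.
for cmd in roofline_commands("./advi_results", "./src", "./my_app"):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to execute the steps in order. Keeping the arguments as a list avoids shell quoting problems in paths and target options.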
GPU Roofline Chart Controls
Memory Level(s) to show for each kernel in the chart: CARM, L3, SLM, GTI, L3 + SLM.
Guidance to the GPU Roofline chart to make the interpretation of data easier, including performance limits and whether kernels are memory bound, compute bound, or both.
Use the following checkboxes in the
The preview picture is updated as you select guidance options, allowing you to see how changes will affect the GPU Roofline chart's appearance. Click Apply to apply your changes or Default to return the GPU Roofline chart to its original appearance.
GPU Roofline Chart Data
- Arithmetic intensity (x axis) - measured in floating-point operations (FLOPs) per byte transferred between the GPU and memory, based on the kernel algorithm
- Performance (y axis) - measured in billions of floating-point operations per second (GFLOPS)
- The size and color of each dot represent relative execution time for each kernel. Large red dots take the most time, so are the best candidates for optimization. Small green dots take less time, so may not be worth optimizing.
- Diagonal lines indicate memory bandwidth limitations preventing kernels from achieving better performance without some form of optimization.
- L3 cache roof: Represents the maximal bandwidth of the L3 cache for your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that fits entirely into L3 cache.
- SLM cache roof: Represents the maximal bandwidth of the Shared Local Memory for your current graphics hardware. Measured using an optimized sequence of load and store operations that work only with SLM.
- GTI roof: Represents the maximum bandwidth between the GPU and the rest of the SoC. This estimate is calculated via analytical formula based on the maximum frequency of your current graphics hardware.
- DRAM roof: Represents the maximal bandwidth of the DRAM memory available to your current graphics hardware. Measured using an optimized sequence of load operations, iterating over an array that does not fit in GPU caches.
- Horizontal lines indicate compute capacity limitations preventing kernels from achieving better performance without some form of optimization.
- A dot cannot exceed the topmost rooflines, as these represent the maximum capabilities of the machine. However, not all kernels can utilize maximum machine capabilities.
- The greater the distance between a dot and the highest achievable roofline, the more opportunity exists for performance improvement.
- The dots on the chart correspond to OpenCL kernels, while in the CPU version, they correspond to individual loops.
- Some displayed information and controls (for example, thread/core count) are not relevant to GPU Roofline. For more information, see the table below.
- The integrated chart shows multiple dots for a single kernel. These dots correspond to different memory levels used to calculate arithmetic intensity. Hover over a dot to identify its arithmetic intensity. To show or hide certain dots from a chart, use the Memory Level drop-down filter.
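
To make the relationship between a dot and the roofs more concrete, the Python sketch below applies the standard Roofline bound: attainable performance is the lower of arithmetic intensity times memory bandwidth (a diagonal roof) and peak compute (a horizontal roof). The roof values, kernel measurements, and function names are invented for illustration and are not taken from Intel Advisor, which may compute its chart differently.

```python
def attainable_gflops(intensity, bandwidth_gb_s, peak_gflops):
    """Roofline bound: the lower of the memory roof and the compute roof."""
    return min(intensity * bandwidth_gb_s, peak_gflops)


def classify(intensity, bandwidth_gb_s, peak_gflops):
    """Report which roof limits a kernel at the given arithmetic intensity."""
    ridge = peak_gflops / bandwidth_gb_s  # intensity where the two roofs meet
    return "memory bound" if intensity < ridge else "compute bound"


# Invented hardware roofs and kernel measurements, for illustration only.
PEAK_GFLOPS = 400.0        # horizontal (compute) roof
L3_BANDWIDTH_GB_S = 200.0  # diagonal (memory) roof

kernel_intensity = 0.5     # FLOP/byte (the dot's x coordinate)
kernel_gflops = 60.0       # measured performance (the dot's y coordinate)

limit = attainable_gflops(kernel_intensity, L3_BANDWIDTH_GB_S, PEAK_GFLOPS)
headroom = limit / kernel_gflops
print(classify(kernel_intensity, L3_BANDWIDTH_GB_S, PEAK_GFLOPS))
print(f"roof at this intensity: {limit:.1f} GFLOPS, headroom: {headroom:.2f}x")
```

The headroom ratio corresponds to the vertical distance between a dot and the roof above it: the larger the ratio, the more room there is for improvement at that memory level.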
Known Issues and Limitations
- GPU Roofline works only with 64-bit applications.
- GPU Roofline does not support Intel® Math Kernel Library (Intel® MKL) applications.