Developer Guide

Intel® Advisor GPU Analysis

Intel Advisor has two features that can help you analyze the performance of your application running on a GPU:
  • Use Offload Modeling to identify kernels in your CPU-based code and predict their performance when run on a GPU. It also allows you to explore different GPU configurations for GPUs that do not exist yet.
  • Use GPU Roofline Insights to see how your application is performing when compared to the limitations of your GPU.
Identify Regions to Offload to GPU with Offload Modeling
The Offload Modeling feature, a part of Intel Advisor, can be used to:
  • Identify the portions of a code that are profitable to be offloaded to a GPU.
  • Predict the code’s performance if run on a GPU.
  • Experiment with accelerator configuration parameters.
Offload Modeling produces upper-bound speedup estimates using a bounds-and-bottlenecks performance model. It takes measured x86 CPU metrics and application characteristics as input and applies an analytical model to estimate execution time and characteristics on a target GPU.
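For intuition, such an estimate can be thought of as a simple bounds-and-bottlenecks calculation: kernel time is limited by whichever of compute throughput or memory bandwidth is the tighter bound, plus the offload taxes (data transfer and kernel launch overhead). The following C++ sketch illustrates the idea with assumed device numbers and a hypothetical estimate_offload_time() helper; it is not Intel Advisor's actual model or any Advisor API.

#include <algorithm>
#include <cstdio>

// Simplified bounds-and-bottlenecks estimate (illustrative only).
// Hypothetical target parameters loosely resembling an integrated GPU.
struct TargetGpu {
  double peak_gflops;      // peak compute throughput, GFLOP/s (assumed)
  double peak_bw_gbytes;   // peak memory bandwidth, GB/s (assumed)
  double xfer_bw_gbytes;   // host<->device transfer bandwidth, GB/s (assumed)
  double kernel_launch_s;  // fixed per-kernel invocation overhead, s (assumed)
};

// Estimate execution time of one offloaded region: the kernel is bounded by
// whichever is slower, compute or memory traffic, plus the offload "taxes".
double estimate_offload_time(double gflops_executed, double gbytes_moved,
                             double gbytes_transferred, const TargetGpu &gpu) {
  double compute_s = gflops_executed / gpu.peak_gflops;
  double memory_s  = gbytes_moved / gpu.peak_bw_gbytes;
  double kernel_s  = std::max(compute_s, memory_s);           // bottleneck
  double tax_s     = gbytes_transferred / gpu.xfer_bw_gbytes  // data transfer
                     + gpu.kernel_launch_s;                   // launch overhead
  return kernel_s + tax_s;
}

int main() {
  TargetGpu gpu{460.0, 25.0, 12.0, 20e-6};  // assumed numbers for the sketch
  double cpu_time_s = 2.0;                  // time measured on the CPU (assumed)
  double est = estimate_offload_time(8.6, 17.2, 0.4, gpu);
  std::printf("Estimated GPU time: %.3f s, speedup: %.1fx\n",
              est, cpu_time_s / est);
}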
You can run the Offload Modeling perspective from the Intel Advisor GUI, with the advisor command line interface, or with the dedicated Python* scripts delivered with Intel Advisor. This topic describes how to run Offload Modeling with the scripts. For a detailed description of the other ways to run the perspective, see the Intel Advisor User Guide.
Prerequisites: Set up Intel Advisor environment variables:
  • On Linux* OS:
    source <install-dir>/advisor-vars.sh
  • On Windows* OS:
    <install-dir>/advisor-vars.bat
To run Offload Modeling for a C++ Matrix Multiply application on Linux* OS:
  1. Collect application performance metrics with collect.py:
     advisor-python $APM/collect.py ./advisor_project --config gen9_gt2 -- matrix_multiply
  2. Model your application performance on a GPU with analyze.py:
     advisor-python $APM/analyze.py ./advisor_project --config gen9_gt2
Once you have run the performance modeling, you can open the results in the Intel Advisor GUI or see CSV metric reports and an interactive HTML report generated in the advisor_project/e000/pp000/data.0 directory.
Intel Advisor GUI, Offload Advisor
For example, in the Summary section of the report, review the following:
  • The original execution time on a CPU, the predicted execution time on a GPU accelerator, the number of offloaded regions, and the estimated speedup in the Program metrics pane. For Matrix Multiply, Intel Advisor reports 4.4x potential speedup.
  • What the offloads are bounded by. This pane reports the main limiting factors that prevent your application from achieving better performance on the target device. The Matrix Multiply application is 99% bounded by last-level cache (LLC) bandwidth.
  • Exact source lines of the Top Offloaded code regions that can benefit from offloading to the GPU and the estimated performance of each code region. For Matrix Multiply, there is one code region recommended for offloading.
  • Exact source lines of the Top Non-Offloaded code regions that are not recommended for offloading and the specific reasons why.
Go to the Offloaded Regions tab to view the detailed measured and estimated metrics for the code regions recommended for offloading. It also reports the estimated amount of data transferred for each code region and the corresponding offload taxes.
Use the data in the report to decide which regions of your code to port to DPC++. For example, you can port the C++ Matrix Multiply application to DPC++ as follows:
// Basic matrix multiply
void multiply1(int msize, int tidx, int numt, TYPE a[][NUM], TYPE b[][NUM],
               TYPE c[][NUM], TYPE t[][NUM]) {
  int i, j, k;

  // Declare a deviceQueue
  sycl::default_selector device;
  sycl::queue q(device, exception_handler);
  cout << "Running on "
       << q.get_device().get_info<sycl::info::device::name>() << "\n";

  // Declare a 2 dimensional range
  sycl::range<2> matrix_range{NUM, NUM};

  // Declare 3 buffers and Initialize them
  sycl::buffer<TYPE, 2> bufferA((TYPE *)a, matrix_range);
  sycl::buffer<TYPE, 2> bufferB((TYPE *)b, matrix_range);
  sycl::buffer<TYPE, 2> bufferC((TYPE *)c, matrix_range);

  // Submit our job to the queue
  q.submit([&](auto &h) {
     // Declare 3 accessors to our buffers. The first 2 read and the last
     // read_write
     sycl::accessor accessorA(bufferA, h, sycl::read_only);
     sycl::accessor accessorB(bufferB, h, sycl::read_only);
     sycl::accessor accessorC(bufferC, h);

     // Execute matrix multiply in parallel over our matrix_range
     // ind is an index into this range
     h.parallel_for(matrix_range, [=](sycl::id<2> ind) {
       int k;
       for (k = 0; k < NUM; k++) {
         // Perform computation ind[0] is row, ind[1] is col
         accessorC[ind[0]][ind[1]] +=
             accessorA[ind[0]][k] * accessorB[k][ind[1]];
       }
     });
   }).wait_and_throw();
} // multiply1
Run a GPU Roofline Analysis
To estimate performance of an application running on a GPU against hardware limitations, you can use the GPU Roofline Insights feature. Intel Advisor can generate a Roofline model for kernels running on Intel® GPUs. The GPU Roofline model offers a very efficient way to characterize your kernels and visualize how far you are from ideal performance. For details about the GPU Roofline, see the Intel Advisor User Guide.
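The Roofline model itself is easy to reason about by hand: a kernel's attainable performance is capped either by the compute peak or by its arithmetic intensity (FLOP per byte moved) times the memory bandwidth, whichever is lower. The short C++ sketch below illustrates this with assumed roof values, not the peaks Intel Advisor measures for your hardware.

#include <algorithm>
#include <cstdio>
#include <initializer_list>

// Minimal Roofline sketch: attainable GFLOP/s = min(compute roof, AI * memory roof).
// The roof values below are assumptions for illustration only.
int main() {
  const double peak_gflops = 460.0;  // assumed compute roof, GFLOP/s
  const double peak_bw     = 34.0;   // assumed memory roof (e.g. L3), GB/s

  // Naive matrix multiply streams two operands per multiply-add, so its
  // arithmetic intensity is low; tiling in SLM raises it by reusing data.
  for (double ai : {0.125, 1.0, 8.0, 32.0}) {  // arithmetic intensity, FLOP/byte
    double attainable = std::min(peak_gflops, ai * peak_bw);
    std::printf("AI = %6.3f FLOP/byte -> attainable %.1f GFLOP/s (%s-bound)\n",
                ai, attainable,
                ai * peak_bw < peak_gflops ? "memory" : "compute");
  }
}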
Prerequisites: It is recommended to run the GPU Roofline with root privileges on Linux* OS or as an Administrator on Windows* OS.
If you do not have root permissions on Linux, configure your system to enable collecting GPU metrics for non-root users:
  1. Add your username to the video group. To check if you are already in the video group:
     groups | grep video
     If you are not part of the video group, add your username to it:
     sudo usermod -a -G video <username>
  2. Set the dev.i915.perf_stream_paranoid sysctl option to 0:
     sysctl -w dev.i915.perf_stream_paranoid=0
  3. Disable the time limit so that OpenCL kernels can run for a longer period:
     sudo sh -c "echo N > /sys/module/i915/parameters/enable_hangcheck"
For all users:
  1. Make sure that your DPC++ code runs correctly on the GPU. To check which hardware you are running on, add the following to your DPC++ code and run it:
     sycl::default_selector selector;
     sycl::queue queue(selector);
     auto d = queue.get_device();
     std::cout << "Running on: " << d.get_info<cl::sycl::info::device::name>() << std::endl;
  2. Set up the Intel Advisor environment. On Linux OS:
     source <advisor_install_dir>/env/vars.sh
     On Windows OS:
     <install-dir>/advisor-vars.bat
To run the GPU Roofline analysis in the Intel Advisor CLI:
  1. Run the Survey analysis with the --profile-gpu option:
     advisor --collect=survey --profile-gpu --project-dir=./advisor-project --search-dir src:r=./matrix_multiply -- matrix_multiply
  2. Run the Trip Counts and FLOP analysis with --profile-gpu:
     advisor --collect=tripcounts --stacks --flop --profile-gpu --project-dir=./advisor-project --search-dir src:r=./matrix_multiply -- matrix_multiply
  3. Open the generated GPU Roofline report in the Intel Advisor GUI. Review the following metrics for the DPC++ Matrix Multiply application:
  • In the Summary tab, view top hotspots and the memory layout in the Top Hotspots pane.
  • See how efficiently your application uses execution units in the Performance Characteristics pane.
  • In the GPU Roofline Regions tab, see the GPU Roofline chart and performance metrics.
  • The Matrix Multiply application gets 10.98 GFLOPS. It uses global memory and is not optimized for local (SLM) memory. Since the application uses a global accessor, this makes sense.
  • The application is far from the maximum bandwidth of the GTI, represented by the red dot on the right.
  • The dot on the left represents the L3 bandwidth. As the chart shows, the application is also far from the L3 bandwidth maximum.
As the GPU Roofline chart suggests, there are several possible optimizations we can make to improve memory bandwidth usage:
  • Use local memory (SLM).
  • Use a cache-blocking technique to make better use of the SLM/L3 cache.
The following code is the optimized version of the Matrix Multiply application. In this version, we declare two tiles and define them as sycl::access::target::local, and we modify the kernel to process these tiles in inner loops.
// Replaces accessorC reference with a local variable
void multiply1_1(int msize, int tidx, int numt, TYPE a[][NUM], TYPE b[][NUM],
                 TYPE c[][NUM], TYPE t[][NUM]) {
  int i, j, k;

  // Declare a deviceQueue
  sycl::default_selector device;
  sycl::queue q(device, exception_handler);
  cout << "Running on "
       << q.get_device().get_info<sycl::info::device::name>() << "\n";

  // Declare a 2 dimensional range
  sycl::range<2> matrix_range{NUM, NUM};

  // Declare 3 buffers and Initialize them
  sycl::buffer<TYPE, 2> bufferA((TYPE *)a, matrix_range);
  sycl::buffer<TYPE, 2> bufferB((TYPE *)b, matrix_range);
  sycl::buffer<TYPE, 2> bufferC((TYPE *)c, matrix_range);

  // Submit our job to the queue
  q.submit([&](auto &h) {
     // Declare 3 accessors to our buffers. The first 2 read and the last
     // read_write
     sycl::accessor accessorA(bufferA, h, sycl::read_only);
     sycl::accessor accessorB(bufferB, h, sycl::read_only);
     sycl::accessor accessorC(bufferC, h);

     // Execute matrix multiply in parallel over our matrix_range
     // ind is an index into this range
     h.parallel_for(matrix_range, [=](sycl::id<2> ind) {
       int k;
       TYPE acc = 0.0;
       for (k = 0; k < NUM; k++) {
         // Perform computation ind[0] is row, ind[1] is col
         acc += accessorA[ind[0]][k] * accessorB[k][ind[1]];
       }
       accessorC[ind[0]][ind[1]] = acc;
     });
   }).wait_and_throw();
}

// Replaces accessorC reference with a local variable and adds matrix tiling
void multiply1_2(int msize, int tidx, int numt, TYPE a[][NUM], TYPE b[][NUM],
                 TYPE c[][NUM], TYPE t[][NUM]) {
  int i, j, k;

  // Declare a deviceQueue
  sycl::default_selector device;
  sycl::queue q(device, exception_handler);
  cout << "Running on "
       << q.get_device().get_info<sycl::info::device::name>() << "\n";

  // Declare a 2 dimensional range
  sycl::range<2> matrix_range{NUM, NUM};
  sycl::range<2> tile_range{MATRIXTILESIZE, MATRIXTILESIZE};

  // Declare 3 buffers and Initialize them
  sycl::buffer<TYPE, 2> bufferA((TYPE *)a, matrix_range);
  sycl::buffer<TYPE, 2> bufferB((TYPE *)b, matrix_range);
  sycl::buffer<TYPE, 2> bufferC((TYPE *)c, matrix_range);

  // Submit our job to the queue
  q.submit([&](auto &h) {
     // Declare 3 accessors to our buffers. The first 2 read and the last
     // read_write
     sycl::accessor accessorA(bufferA, h, sycl::read_only);
     sycl::accessor accessorB(bufferB, h, sycl::read_only);
     sycl::accessor accessorC(bufferC, h);

     // Create matrix tiles in local (SLM) memory
     sycl::accessor<TYPE, 2, sycl::access::mode::read_write,
                    sycl::access::target::local>
         aTile(sycl::range<2>(MATRIXTILESIZE, MATRIXTILESIZE), h);
     sycl::accessor<TYPE, 2, sycl::access::mode::read_write,
                    sycl::access::target::local>
         bTile(sycl::range<2>(MATRIXTILESIZE, MATRIXTILESIZE), h);

     // Execute matrix multiply in parallel over our matrix_range
     // ind is an index into this range
     h.parallel_for(sycl::nd_range<2>(matrix_range, tile_range),
                    [=](cl::sycl::nd_item<2> it) {
       int k;
       const int numTiles = NUM / MATRIXTILESIZE;
       const int row = it.get_local_id(0);
       const int col = it.get_local_id(1);
       const int globalRow = MATRIXTILESIZE * it.get_group(0) + row;
       const int globalCol = MATRIXTILESIZE * it.get_group(1) + col;
       TYPE acc = 0.0;
       for (int t = 0; t < numTiles; t++) {
         const int tiledRow = MATRIXTILESIZE * t + row;
         const int tiledCol = MATRIXTILESIZE * t + col;
         aTile[row][col] = accessorA[globalRow][tiledCol];
         bTile[row][col] = accessorB[tiledRow][globalCol];
         it.barrier(sycl::access::fence_space::local_space);
         for (k = 0; k < MATRIXTILESIZE; k++) {
           // Perform computation ind[0] is row, ind[1] is col
           acc += aTile[row][k] * bTile[k][col];
         }
         it.barrier(sycl::access::fence_space::local_space);
       }
       accessorC[globalRow][globalCol] = acc;
     });
   }).wait_and_throw();
} // multiply1_2
Save the optimized version as multiply1_2 and rerun the GPU Roofline. As the new chart shows:
  • The optimized application gets 19.985 GFLOPS.
  • The application now uses both global and SLM memory; the SLM traffic corresponds to the 16x16 tiles, which increases the effective memory bandwidth (see the device-query sketch below for checking how large a tile a device can hold).
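When experimenting with tile sizes for the SLM version, it can help to confirm that the tiles and the corresponding work-group fit the device limits. The following sketch uses standard SYCL device queries; the tile size of 16 and TYPE = float are assumptions matching the example above, not values reported by Intel Advisor.

#include <CL/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q{sycl::default_selector{}};
  auto d = q.get_device();

  // Standard SYCL device queries: available local (SLM) memory per work-group
  // and the largest work-group the device accepts.
  auto slm_bytes = d.get_info<sycl::info::device::local_mem_size>();
  auto max_wg    = d.get_info<sycl::info::device::max_work_group_size>();

  constexpr size_t tile = 16;                            // MATRIXTILESIZE in the example
  size_t tile_bytes = 2 * tile * tile * sizeof(float);   // aTile + bTile, assuming TYPE = float

  std::cout << "SLM per work-group: " << slm_bytes << " bytes, "
            << "max work-group size: " << max_wg << "\n"
            << "Two " << tile << "x" << tile << " float tiles need "
            << tile_bytes << " bytes and a work-group of "
            << tile * tile << " items\n";
  if (tile_bytes <= slm_bytes && tile * tile <= max_wg)
    std::cout << "Tile size fits this device.\n";
}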

Product and Performance Information


Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.