Cookbook

  • 2021.1
  • 02/24/2021
  • Public Content

Identify Code Regions to Offload to GPU and Visualize GPU Usage

This recipe explains how to identify regions to offload to the GPU, visualize performance of GPU kernels, and identify bottlenecks in your application using the Offload Modeling and GPU Roofline Insights features of the Intel® Advisor.

Scenario

Some of the most common problems in computer science today, such as artificial intelligence, simulation, and modeling, involve matrix multiplication. The algorithm is a triply nested loop that performs one multiply and one add operation in each iteration. It is computationally intensive and also generates a large number of memory accesses.
for(i=0; i<msize; i++) {
    for(j=0; j<msize; j++) {
        for(k=0; k<msize; k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}
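To make "computationally intensive" concrete, the sketch below counts the work done by the loop above. It is only an illustration: the `MatmulCost` type and the `estimate_matmul_cost` function are hypothetical names, and the code assumes square `msize` x `msize` matrices with a caller-supplied element size, since the recipe does not state the element type.

```cpp
#include <cstdint>

// Hypothetical helper type for illustration; not part of the matrix multiply sample.
struct MatmulCost {
    std::uint64_t flops;            // floating-point operations in the triple loop
    std::uint64_t footprint_bytes;  // storage needed for the three matrices a, b, and c
    double flops_per_byte;          // FLOP per byte if every element moved exactly once
};

// Assumes square msize x msize matrices; elem_size is the element size in bytes.
MatmulCost estimate_matmul_cost(std::uint64_t msize, std::uint64_t elem_size) {
    MatmulCost cost{};
    // One multiply and one add per innermost iteration, msize^3 iterations.
    cost.flops = 2 * msize * msize * msize;
    // Three matrices of msize^2 elements each.
    cost.footprint_bytes = 3 * msize * msize * elem_size;
    cost.flops_per_byte = static_cast<double>(cost.flops) /
                          static_cast<double>(cost.footprint_bytes);
    return cost;
}
```

For a hypothetical `msize` of 2048 with 8-byte elements, this gives about 17.2 billion floating-point operations over roughly 100 MB of matrix data. The FLOP-per-byte figure assumes perfect data reuse, so it is a best case; a naive loop typically moves far more data than the footprint because elements are reloaded many times.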

Ingredients

This section lists the hardware and software used to produce the specific result shown in this recipe:

Identify Regions to Offload to GPU with Offload Modeling

Use the Offload Modeling feature of the Intel Advisor to identify the portions of your code that are profitable to offload to a GPU. Offload Modeling can predict the code's performance if run on a GPU and lets you experiment with accelerator configuration parameters.
The Intel Advisor produces upper-bound speedup estimates using a bounds and bottlenecks performance model. It takes measured x86 CPU metrics and application characteristics as input and applies an analytical model to estimate execution time and speedup on a target GPU.
Offload Advisor: estimated time on GPU
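The idea behind a bounds and bottlenecks estimate can be sketched in a few lines. The code below is a deliberately simplified illustration and not the model that Intel Advisor implements: it assumes that the time on the target is set by the slowest resource (compute or one memory level), and every name in it (`Bound`, `Estimate`, `estimate_time`, `speedup`) is hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical description of one resource that can limit a kernel.
struct Bound {
    std::string name;  // for example "compute" or "LLC bandwidth"
    double work;       // FLOP for compute, bytes for a memory level
    double peak_rate;  // FLOP/s or bytes/s that the target can sustain
};

// Result of the simplified estimate.
struct Estimate {
    double seconds;              // estimated time on the target
    std::string limiting_bound;  // name of the resource that sets that time
};

// Simplifying assumption: the kernel cannot finish faster than its slowest
// resource allows, so the estimate is the maximum of the per-resource times.
Estimate estimate_time(const std::vector<Bound>& bounds) {
    Estimate est{0.0, "none"};
    for (const Bound& b : bounds) {
        const double t = b.work / b.peak_rate;
        if (t > est.seconds) {
            est.seconds = t;
            est.limiting_bound = b.name;
        }
    }
    return est;
}

// Speedup of the modeled accelerator run over the measured CPU run.
double speedup(double cpu_seconds, double accelerator_seconds) {
    return cpu_seconds / accelerator_seconds;
}
```

With made-up inputs where the cache-bandwidth term dominates, `estimate_time` reports that bound as the limiter, which mirrors the kind of "bounded by" attribution shown in the report.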
Prerequisites: Set up the Intel Advisor environment variables to enable the command line interface (CLI):
source <advisor-install-dir>/advisor-vars.sh
To analyze your code with Offload Modeling:
  1. Collect application performance metrics with Survey analysis:
    advisor --collect=survey --project-dir=./mmult --stackwalk-mode=online --static-instruction-mix -- /home/test/mmult
  2. Collect Trip Counts and FLOP data:
    advisor --collect=tripcounts --project-dir=./mmult --flop --target-device=gen9_gt2 -- /home/test/mmult
  3. Model the application performance for a gen9_gt2 configuration:
    advisor --collect=projection --project-dir=./mmult --config=gen9_gt2 --no-assume-dependencies
    In the Intel Advisor GUI, this corresponds to a low-accuracy configuration of Offload Modeling. See User Guide: Offload Modeling Accuracy Presets for details.
  4. Go to mmult/e000/pp000/data.0 and open the interactive HTML report report.html in a web browser to see the performance projection results.
    Example of a report.html file
    In the Summary tab of the report, review the following:
    • In the Program Metrics pane: Intel Advisor predicts a 4.4x speedup if you offload the application to a GPU with the gen9_gt2 configuration. The estimated execution time is 5.85 seconds compared to the original 25.07 seconds.
    • In the Offload Bounded by pane: The offloads are 99% bounded by the last-level cache (LLC) bandwidth.
    • In the Top Offloaded pane: Intel Advisor recommends offloading the loop at multiply.c:202. Click the loop location to go to the Offloaded Regions tab and see more details.
    • In the Top Non-Offloaded pane: The time spent in other loops is too small to be modeled accurately, and one of the loops is outside of the code region marked for offloading. For this reason, they are not recommended for offloading.
Use this information to rewrite the matrix multiply application in DPC++.

Rewrite the Matrix Multiply Code in Data Parallel C++ (DPC++)

The Intel Advisor recommends offloading the multiply.c:202 code region of the matrix multiply application to the GPU. To do this, rewrite the matrix multiply code in Data Parallel C++ (DPC++) as follows:
  1. Select a device.
  2. Declare a device queue.
  3. Declare buffers to hold the matrices.
  4. Submit a job to the device queue.
  5. Execute the matrix multiply in parallel.
void multiply1(int msize, int tidx, int numt, TYPE a[][NUM], TYPE b[][NUM], TYPE c[][NUM], TYPE t[][NUM])
{
    // Select a device
    cl::sycl::gpu_selector device;

    // Declare a device queue
    cl::sycl::queue deviceQueue(device);

    // Declare a two-dimensional range
    cl::sycl::range<2> matrix_range{NUM, NUM};

    // Declare three buffers and initialize them
    cl::sycl::buffer<TYPE, 2> bufferA((TYPE*)a, matrix_range);
    cl::sycl::buffer<TYPE, 2> bufferB((TYPE*)b, matrix_range);
    cl::sycl::buffer<TYPE, 2> bufferC((TYPE*)c, matrix_range);

    // Submit our job to the queue
    deviceQueue.submit([&](cl::sycl::handler& cgh) {
        // Declare three accessors to our buffers. The first two are read-only,
        // and the last one is read-write.
        // sycl_read and sycl_read_write are presumably access-mode constants
        // defined elsewhere in the sample; they are not shown in this recipe.
        auto accessorA = bufferA.template get_access<sycl_read>(cgh);
        auto accessorB = bufferB.template get_access<sycl_read>(cgh);
        auto accessorC = bufferC.template get_access<sycl_read_write>(cgh);

        // Execute the matrix multiply code in parallel over our matrix_range.
        // ind is an index into this range.
        cgh.parallel_for<class Matrix<TYPE>>(matrix_range, [=](cl::sycl::id<2> ind) {
            for (int k = 0; k < NUM; k++) {
                // Perform computation, where ind[0] is a row and ind[1] is a column
                accessorC[ind[0]][ind[1]] += accessorA[ind[0]][k] * accessorB[k][ind[1]];
            }
        });
    });
}
Save the file and rebuild the application.
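The kernel above assigns one element of the result matrix to each index of the two-dimensional range, which is why the iterations can run in parallel: no two indices write to the same element. The host-only sketch below emulates that decomposition in standard C++ so it can be checked against the original triple loop without a GPU. It is an illustration under stated assumptions, not SYCL code: `for_each_index` is a hypothetical, sequential stand-in for `parallel_for`, and `Matrix`, `multiply_serial`, and `multiply_by_index` are names introduced only for this example.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Reference version: the triple loop from the Scenario section.
void multiply_serial(const Matrix& a, const Matrix& b, Matrix& c) {
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i][j] += a[i][k] * b[k][j];
}

// Hypothetical host-side stand-in for a parallel_for over a 2D range:
// it calls kernel(row, col) once per index. The calls run sequentially here;
// on a device they may run concurrently and in any order.
template <typename Kernel>
void for_each_index(std::size_t rows, std::size_t cols, Kernel kernel) {
    for (std::size_t row = 0; row < rows; ++row)
        for (std::size_t col = 0; col < cols; ++col)
            kernel(row, col);
}

// Index-based version: each (row, col) call updates only c[row][col],
// so the calls do not depend on each other.
void multiply_by_index(const Matrix& a, const Matrix& b, Matrix& c) {
    const std::size_t n = a.size();
    for_each_index(n, n, [&](std::size_t row, std::size_t col) {
        for (std::size_t k = 0; k < n; ++k)
            c[row][col] += a[row][k] * b[k][col];
    });
}
```

Comparing the output of the ported kernel with a serial reference in this way is a simple check that the rewrite preserved the original result.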

Run GPU Roofline

To estimate the performance of the GPU version of the matrix multiply application, you can use the new GPU Roofline Insights feature. Intel Advisor can generate a Roofline model for kernels running on an Intel® GPU. The Roofline model is an efficient way to characterize your kernels and visualize how far they are from ideal performance.
Prerequisites: Before running GPU Roofline Insights, make sure your system is properly configured to analyze GPU kernels.
  1. Add your username to the video group. To check if you are already in the video group, run:
    groups | grep video
    If you are not part of the video group, add your username to it:
    sudo usermod -a -G video <username>
  2. Enable GPU metrics collection:
    sudo su
    echo 0 > /proc/sys/dev/i915/perf_stream_paranoid
  3. Make sure that your DPC++ code runs correctly on the GPU. To check which hardware you are running on, add the following to your DPC++ code and run it:
    cl::sycl::default_selector selector;
    cl::sycl::queue queue(selector);
    auto d = queue.get_device();
    std::cout << "Running on: " << d.get_info<cl::sycl::info::device::name>() << std::endl;
  4. Set up the Intel Advisor environment variables:
    source <advisor-install-dir>/advisor-vars.sh
To run GPU Roofline Insights from the Intel Advisor CLI:
  1. Run the Survey analysis with the --profile-gpu option:
    advisor --collect=survey --project-dir=./mmult_dpcpp --profile-gpu -- /home/test/mmult_dpcpp
  2. Run the Trip Counts and FLOP analysis with the --profile-gpu option:
    advisor --collect=tripcounts --project-dir=./mmult_dpcpp --flop --profile-gpu -- /home/test/mmult_dpcpp
  3. Generate an HTML report with a GPU Roofline chart:
    advisor --report=roofline --gpu --project-dir=./mmult_dpcpp --report-output=roofline.html
  4. Open the generated roofline.html in a web browser.
    GPU Roofline chart
    • To get more information on different parts of memory, you can display different dots depending on which memory subsystem is used for the arithmetic intensity calculation. In this case, choose the GTI (Memory) and L3 + SLM memory levels.
      GPU Roofline chart: memory levels
    • Double-click a dot to see more information about it:
      • The L3 dot is very close to the L3 maximum bandwidth. To get more FLOPS, you need to optimize cache usage further. A cache-blocking optimization strategy can make better use of memory and should increase performance.
      • The GTI dot, which represents traffic between the GPU, GPU uncore (LLC), and main memory, is far from the GTI roofline. Transfer costs between CPU and GPU do not seem to be an issue.
        Roofline chart with L3 and GTI dots
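The position of a dot relative to its roof follows from the standard Roofline relation: attainable performance is the smaller of the compute peak and the product of arithmetic intensity and bandwidth. The sketch below works through that arithmetic. The function names are hypothetical, and any numbers you pass in are your own inputs; none of them are taken from the chart in this recipe.

```cpp
#include <algorithm>

// Arithmetic intensity: floating-point operations per byte moved
// through the chosen memory level (for example L3 or GTI).
double arithmetic_intensity(double flop, double bytes) {
    return flop / bytes;
}

// Roofline bound for one memory level, in GFLOPS:
// min(compute peak, arithmetic intensity * bandwidth of that level).
double attainable_gflops(double peak_gflops, double bandwidth_gb_per_s, double intensity) {
    return std::min(peak_gflops, intensity * bandwidth_gb_per_s);
}

// Fraction of the roof that the kernel reaches. A value near 1.0 means the
// dot sits on the roof, so that level is the likely bottleneck.
double roof_utilization(double measured_gflops, double roof_gflops) {
    return measured_gflops / roof_gflops;
}
```

As a made-up example, a kernel measured at 45 GFLOPS with an L3 intensity of 0.25 FLOP/byte on a device with 200 GB/s of L3 bandwidth reaches 90% of its 50 GFLOPS L3 roof. The same kernel with a GTI intensity of 8 FLOP/byte and 30 GB/s of GTI bandwidth reaches under 20% of its 240 GFLOPS GTI roof. That pattern matches the situation described above: close to the L3 roof, far from the GTI roof.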

Next Steps

Refactor the DPC++ code to optimize memory usage. You can use the cache-blocking technique to significantly improve performance.
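A minimal sketch of cache blocking is shown below in standard C++ rather than DPC++, so it can be compiled and checked on any host. It assumes row-major square matrices stored in flat vectors. The tile size is a tuning parameter chosen by the caller, and the function names are hypothetical. A GPU version would typically map tiles to work-groups and stage them in shared local memory, which this sketch does not attempt.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference version: plain triple loop on row-major n x n matrices,
// where element (i, j) is stored at index i * n + j.
void multiply_naive(const std::vector<double>& a, const std::vector<double>& b,
                    std::vector<double>& c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

// Cache-blocked version: the iteration space is split into tile x tile x tile
// blocks so that the parts of a, b, and c touched by one block are reused
// while they are still likely to be in cache.
void multiply_blocked(const std::vector<double>& a, const std::vector<double>& b,
                      std::vector<double>& c, std::size_t n, std::size_t tile) {
    assert(tile > 0);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        const std::size_t i_end = std::min(ii + tile, n);
        for (std::size_t kk = 0; kk < n; kk += tile) {
            const std::size_t k_end = std::min(kk + tile, n);
            for (std::size_t jj = 0; jj < n; jj += tile) {
                const std::size_t j_end = std::min(jj + tile, n);
                // Multiply one tile of a by one tile of b and
                // accumulate the result into one tile of c.
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const double a_ik = a[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j)
                            c[i * n + j] += a_ik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

Both functions accumulate into `c`, so the caller is expected to zero it first. The `std::min` calls handle matrix sizes that are not a multiple of the tile size.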

Key Take-Aways

  • Use the Offload Modeling feature of the Intel Advisor to find the best candidates for code to offload to the GPU, estimate the outcome of porting to GPU, and identify performance bottlenecks.
  • Use the GPU Roofline Insights feature of the Intel Advisor to identify bottlenecks in code already ported to GPU and see how close its performance is to the system maximums.

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.