Identify Code Regions to Offload to GPU and Visualize GPU Usage (Beta)

This recipe explains how to identify regions to offload to the GPU, visualize performance of GPU kernels, and identify bottlenecks in your application using the Offload Advisor and the GPU Roofline analysis features of the Intel® Advisor Beta.

Scenario

Some of the most common problems in today's computer science domains, such as artificial intelligence, simulation, and modeling, involve matrix multiplication. The algorithm is a triply-nested loop with a multiply and an add operation in each iteration. It is computationally intensive and it also accesses a large amount of memory.
for (i = 0; i < msize; i++) {
    for (j = 0; j < msize; j++) {
        for (k = 0; k < msize; k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}

Ingredients

This section lists the hardware and software used to produce the specific result shown in this recipe:

Identify Regions to Offload to GPU with Offload Advisor

The Offload Advisor tool, part of the Intel® Advisor Beta, can identify the portions of code that are profitable to offload to a GPU. It can also predict the code's performance if run on a GPU and lets you experiment with accelerator configuration parameters.
The Offload Advisor produces upper-bound speedup estimates using a bounds-and-bottlenecks performance model: it takes measured x86 CPU metrics and application characteristics as input and applies an analytical model to estimate execution time and characteristics on a target GPU.
Offload Advisor: estimated time on GPU
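To make the bounds-and-bottlenecks idea concrete, here is a toy sketch of such an estimate. It is illustrative only, not Intel's actual model, and the function and parameter names are invented for this example:

#include <algorithm>

// Illustrative only: a kernel can finish no faster than its compute
// limit or its slowest memory limit allows, so the projected time is
// the larger of the two.
double estimate_gpu_time_seconds(double flop, double peak_flops,
                                 double bytes, double peak_bytes_per_sec) {
    double t_compute = flop / peak_flops;           // compute-bound limit
    double t_memory  = bytes / peak_bytes_per_sec;  // bandwidth-bound limit
    return std::max(t_compute, t_memory);           // the bottleneck wins
}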
Prerequisites: Run the shell script to set up the Intel® Advisor Beta environment variables:
source <advisor_install_dir>/env/vars.sh
To analyze your code with the Offload Advisor:
  1. Collect application performance metrics with collect.py:
    advixe-python $APM/collect.py advisor_project --config gen9 -- /home/test/matrix
  2. Project your application performance on the GPU with analyze.py:
    advixe-python $APM/analyze.py advisor_project --config gen9 --out-dir /home/test/analyze
  3. Open the generated report.html file in the /home/test/analyze directory in a web browser to see the performance projection results.
    Example of a report.html file
    In the Summary section of the report, note the following:
    • The original CPU execution time, the predicted execution time on the GPU accelerator, the number of offloaded regions, and the estimated speedup in the Program metrics pane.
    • What the offloads are bounded by. In our case, the offloads are 99% bounded by the last-level cache (LLC) bandwidth.
    • Exact source lines of the Top Offloaded code regions that will benefit from offloading to the GPU. In our case, there is only one code region recommended for offload.
    • Exact source lines of the Top Non-Offloaded code regions that are not recommended for offload for various reasons. In our case, the time spent in the loops is too small to be modeled accurately, and one of the loops is outside of the code region marked for offloading.
Use this information to rewrite the matrix multiply application in DPC++.

Rewrite the Matrix Multiply Code in Data Parallel C++ (DPC++)

The Offload Advisor recommends offloading the matrix multiply code region to the GPU. To do this, you need to rewrite the matrix multiply code in Data Parallel C++ (DPC++).
To rewrite the matrix multiply code, perform the following high-level steps, as shown in the code below:
  1. Select a device.
  2. Declare a device queue.
  3. Declare buffers to hold the matrix.
  4. Submit a job to the device queue.
  5. Execute the matrix multiply in parallel.
void multiply1(int msize, int tidx, int numt, TYPE a[][NUM], TYPE b[][NUM], TYPE c[][NUM], TYPE t[][NUM])
{
    int i, j, k;

    // Select a device
    cl::sycl::gpu_selector device;

    // Declare a device queue
    cl::sycl::queue deviceQueue(device);

    // Declare a two-dimensional range
    cl::sycl::range<2> matrix_range{NUM, NUM};

    // Declare three buffers and initialize them
    cl::sycl::buffer<TYPE, 2> bufferA((TYPE*)a, matrix_range);
    cl::sycl::buffer<TYPE, 2> bufferB((TYPE*)b, matrix_range);
    cl::sycl::buffer<TYPE, 2> bufferC((TYPE*)c, matrix_range);

    // Submit our job to the queue
    deviceQueue.submit([&](cl::sycl::handler& cgh) {
        // Declare three accessors to our buffers. The first two are read,
        // and the last one is read_write
        auto accessorA = bufferA.template get_access<sycl_read>(cgh);
        auto accessorB = bufferB.template get_access<sycl_read>(cgh);
        auto accessorC = bufferC.template get_access<sycl_read_write>(cgh);

        // Execute the matrix multiply code in parallel over our matrix_range
        // ind is an index into this range
        cgh.parallel_for<class Matrix<TYPE>>(matrix_range,
            [=](cl::sycl::id<2> ind) {
                int k;
                for (k = 0; k < NUM; k++) {
                    // Perform computation: ind[0] is a row, ind[1] is a column
                    accessorC[ind[0]][ind[1]] += accessorA[ind[0]][k] * accessorB[k][ind[1]];
                }
            });
    });
}
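To build and run the DPC++ version, you can use the oneAPI Beta DPC++ compiler driver, dpcpp. The source file name matrix.cpp below is an assumption; substitute your own:
dpcpp -O2 -o matrix matrix.cpp
./matrix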

Run a GPU Roofline Analysis

To estimate performance of the GPU version of the matrix multiply application, you can use the new GPU Roofline feature. Intel® Advisor Beta can generate a Roofline model for kernels running on Intel® GPUs. The Roofline model is a compact way to characterize your kernels and to visualize how far their performance is from ideal.
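The model behind the chart is simple: a kernel's attainable performance is capped by the smaller of the machine's compute peak and its memory bandwidth multiplied by the kernel's arithmetic intensity (FLOP per byte moved). A minimal sketch of that relationship, with hypothetical names:

#include <algorithm>

// Roofline ceiling: a kernel with arithmetic intensity `ai` (FLOP/byte)
// cannot exceed either the compute peak or bandwidth * ai.
double roofline_gflops(double peak_gflops, double peak_gbps, double ai) {
    return std::min(peak_gflops, peak_gbps * ai);
}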
Prerequisites
: Before running the Roofline analysis on a GPU, you must ensure your system is properly configured.
  1. Add your username to the video group. To check if you are already in the video group, run:
    groups | grep video
    If you are not part of the video group, add your username to it:
    sudo usermod -a -G video <username>
  2. Enable GPU metrics collection:
    sudo su
    echo 0 > /proc/sys/kernel/perf_stream_paranoid
The Roofline model on GPU is a technical preview feature and is not available by default. To enable it:
  1. Make sure that your DPC++ code runs correctly on the GPU. To check which hardware you are running on, add the following to your DPC++ code and run it:
    cl::sycl::default_selector selector;
    cl::sycl::queue queue(selector);
    auto d = queue.get_device();
    std::cout << "Running on: " << d.get_info<cl::sycl::info::device::name>() << std::endl;
  2. Set up the Intel® Advisor Beta environment:
    source <advisor_install_dir>/env/vars.sh
  3. Enable GPU profiling:
    export ADVIXE_EXPERIMENTAL=gpu-profiling
To run the GPU Roofline analysis in the Intel® Advisor CLI:
  1. Run the Survey analysis with the --enable-gpu-profiling option:
    advixe-cl --collect=survey --enable-gpu-profiling --project-dir=<my_project_directory> --search-dir src:r=<my_source_directory> -- /home/test/matrix [app_parameters]
  2. Run the Trip Count and FLOP analysis with --enable-gpu-profiling:
    advixe-cl --collect=tripcounts --stacks --flop --enable-gpu-profiling --project-dir=<my_project_directory> --search-dir src:r=<my_source_directory> -- /home/test/matrix [app_parameters]
  3. Generate a GPU Roofline report:
    advixe-cl --report=roofline --gpu --project-dir=<my_project_directory> --report-output=roofline.html
  4. Open the generated roofline.html in a web browser to visualize GPU performance.
    GPU Roofline chart
    • To get more information on different parts of memory, you can choose to display different dots based on which memory subsystem is used for the arithmetic intensity calculation. In this case, choose the GTI (Memory) and L3 + SLM memory levels.
      GPU Roofline chart: memory levels
    • Double-click a dot to see more information about it. In the GPU Roofline chart, note the following:
      • The L3 dot is very close to the L3 maximum bandwidth. To get more FLOPS, you need to use the caches more efficiently. A cache-blocking optimization strategy can make better use of memory and should increase performance.
      • The GTI dot, which represents traffic between the GPU, GPU uncore (LLC), and main memory, is far from the GTI roofline. Transfer costs between the CPU and GPU do not seem to be an issue.
        Roofline chart with L3 and GTI dots

Next Steps

Refactor the DPC++ code to make better use of memory. A cache-blocking (tiling) technique can significantly improve performance, as in the sketch below.
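Here is a minimal sketch of one way to tile the kernel using SYCL work-groups and local memory. It is illustrative rather than measured code from this recipe: the tile size TS, the kernel name MatrixTiled, and the assumption that TS divides NUM evenly are choices made for this example, while TYPE, NUM, sycl_read, and sycl_read_write come from the original sample:

// Illustrative cache-blocked (tiled) matrix multiply. Assumes NUM % TS == 0.
constexpr size_t TS = 16; // tile size: chosen for illustration, tune per device

void multiply_tiled(TYPE a[][NUM], TYPE b[][NUM], TYPE c[][NUM])
{
    cl::sycl::gpu_selector device;
    cl::sycl::queue deviceQueue(device);
    cl::sycl::range<2> matrix_range{NUM, NUM};
    cl::sycl::buffer<TYPE, 2> bufferA((TYPE*)a, matrix_range);
    cl::sycl::buffer<TYPE, 2> bufferB((TYPE*)b, matrix_range);
    cl::sycl::buffer<TYPE, 2> bufferC((TYPE*)c, matrix_range);

    deviceQueue.submit([&](cl::sycl::handler& cgh) {
        auto A = bufferA.template get_access<sycl_read>(cgh);
        auto B = bufferB.template get_access<sycl_read>(cgh);
        auto C = bufferC.template get_access<sycl_read_write>(cgh);

        // Per-work-group tiles held in fast local (shared) memory
        cl::sycl::accessor<TYPE, 2, cl::sycl::access::mode::read_write,
                           cl::sycl::access::target::local>
            tileA(cl::sycl::range<2>{TS, TS}, cgh),
            tileB(cl::sycl::range<2>{TS, TS}, cgh);

        cgh.parallel_for<class MatrixTiled>(
            cl::sycl::nd_range<2>{{NUM, NUM}, {TS, TS}},
            [=](cl::sycl::nd_item<2> it) {
                const size_t row = it.get_global_id(0);
                const size_t col = it.get_global_id(1);
                const size_t lr  = it.get_local_id(0);
                const size_t lc  = it.get_local_id(1);
                TYPE sum = 0;
                // Walk the k dimension one TS x TS tile at a time, so each
                // tile is loaded from global memory once and reused TS times
                for (size_t t = 0; t < NUM / TS; t++) {
                    tileA[lr][lc] = A[row][t * TS + lc];
                    tileB[lr][lc] = B[t * TS + lr][col];
                    it.barrier(cl::sycl::access::fence_space::local_space);
                    for (size_t k = 0; k < TS; k++)
                        sum += tileA[lr][k] * tileB[k][lc];
                    it.barrier(cl::sycl::access::fence_space::local_space);
                }
                C[row][col] += sum;
            });
    });
}

With this structure, each element of a and b is read from global memory NUM/TS times instead of NUM times, which reduces the memory traffic behind the L3 bottleneck identified on the Roofline chart.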

Key Take-Aways

  • Intel® Advisor Beta can help you find the best candidates for code to offload to the GPU, estimate the outcome of porting to the GPU, and identify performance bottlenecks.
  • Use the GPU Roofline feature of the Intel® Advisor Beta to identify bottlenecks in code already ported to the GPU and to see how close its performance is to the system maximums.
