Optimize Your GPU Application with Intel® oneAPI Base Toolkit

Published: 04/15/2021

What you'll learn

Use this step-by-step guide to learn graphics processing unit (GPU) optimization with Intel’s latest discrete and integrated GPUs and Intel’s oneAPI software. The primary path focuses on Data Parallel C++ (DPC++), which is based on ISO C++ and incorporates standard SYCL* and community extensions to simplify data parallel programming. Advanced users can follow the resources and tools referenced in each section to find additional options for customizing the workflow to their needs.

Who is this for

Software developers interested in accelerating their code using Intel's latest integrated or discrete GPUs and Data Parallel C++ (DPC++) for code reuse across architectures. Start from your own C++ or CUDA* code, or use one of Intel’s many sample applications.

This guide assumes you have an Intel GPU and the Intel® oneAPI Base Toolkit software installed on your development environment. Alternatively, you can use the Intel® DevCloud virtual sandbox where you’ll have free access to Intel GPUs and the oneAPI software tools. To help you choose, check out the first step below, Choose GPU Hardware Access.

Not sure if you want to offload to a GPU? This article compares different hardware to help you decide which hardware best suits your needs. 


The Workflow

Workflow diagram illustrating the six steps of GPU optimization from selecting hardware through using oneAPI Tools to assist in optimization


Step 1: Choose GPU Hardware Access

Whether you have the latest Intel architecture or need cloud resources, your optimization strategy starts with hardware. To get the most from GPU offload, you need to understand which hardware resources you are optimizing for.

Tip: You can use a feature in Intel Advisor called Offload Modeling to help you choose what hardware is best for your needs by estimating performance improvements and ROI across different hardware solutions. 

Set Up Requirements

Operating System
  • Intel DevCloud: your operating system can be Windows*, Linux*, or macOS*; the Intel DevCloud operating system is Linux
  • Intel Graphics Processor: Linux

Software
  • Intel DevCloud: Intel oneAPI Base Toolkit (pre-installed on Intel DevCloud)
  • Intel Graphics Processor: Intel oneAPI Base Toolkit

GPU
  • Intel DevCloud, GPUs available: Intel® Iris® Xe MAX (nodes available on Intel DevCloud)
  • Intel Graphics Processor, GPUs supported: Intel® Iris® Xe MAX, Intel® Iris® Xe Graphics (integrated GPU), Intel® UHD Graphics

Language
  • Intel DevCloud: Data Parallel C++ (DPC++)
  • Intel Graphics Processor: Data Parallel C++ (DPC++)

Interface
  • Intel DevCloud: Command Line Interface (CLI)
  • Intel Graphics Processor: Command Line Interface (CLI)

Step 2: Choose Sample Code

You can start from an Intel sample, your existing CUDA* source code, or your own C++ application.

The Intel oneAPI Base Toolkit includes a collection of samples that reflect a wide range of techniques to express parallelism on the GPU using DPC++. A good starting point for understanding GPU optimization with DPC++ is a compute-intensive sample that simulates acoustic isotropic wave propagation in a 3D medium (download the ISO3DFD sample). Use this OpenMP* sample to step through the tools and resources available for your DPC++ optimization strategy, or browse the oneAPI samples.
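
To give a feel for the kind of computation such a wave propagation sample performs, here is a minimal sketch of one finite-difference time step, reduced to one dimension and written in plain C++. It is not taken from the ISO3DFD sample: the function name `wave_step`, the second-order stencil, and the parameter `c2` (standing for the squared product of velocity and time step divided by grid spacing) are assumptions made for illustration, and both input vectors are assumed to have the same length.

```cpp
#include <cstddef>
#include <vector>

// One time step of a 1D acoustic wave equation using a second-order
// finite-difference stencil. Hypothetical helper for illustration only;
// it is not code from the ISO3DFD sample.
// Assumes prev.size() == cur.size().
std::vector<float> wave_step(const std::vector<float>& prev,
                             const std::vector<float>& cur,
                             float c2) {
    std::vector<float> next(cur);  // boundary points keep their current value
    for (std::size_t i = 1; i + 1 < cur.size(); ++i) {
        // Discrete Laplacian: each point reads its two neighbors.
        const float laplacian = cur[i - 1] - 2.0f * cur[i] + cur[i + 1];
        // Leapfrog update in time.
        next[i] = 2.0f * cur[i] - prev[i] + c2 * laplacian;
    }
    return next;
}
```

Every interior point is updated independently from values of the previous two time steps, which is why stencil loops like this are common candidates for data parallel offload.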

Download the sample from GitHub to Intel DevCloud or your development environment. The sample readme includes detailed instructions for building and running the sample.

ISO3DFD Sample on GitHub
View all oneAPI samples on GitHub

 

Migrate Existing CUDA* Code

You can migrate your existing CUDA code to a multi-platform program in DPC++. The Intel® DPC++ Compatibility Tool ports both CUDA language kernels and library API calls, migrating 80%-90% of CUDA code automatically to architecture- and vendor-portable DPC++ code. Inline comments help you finish writing and tuning your DPC++ code. Sample CUDA projects are also available to help you with the entire porting process.

  1. Migrate your CUDA code with the Intel DPC++ Compatibility Tool.
  2. Proceed to Step 3 to build your application and use Offload Modeling to evaluate your DPC++ code for further offload opportunities.

Optimize Your Own Projects

To use your own C++ code, set up your development environment and continue with this workflow. You will be able to apply these optimization techniques directly on your existing projects.

  • Intel DevCloud: copy your C++ application to the Intel DevCloud and proceed to Step 3 to build your application and use Offload Modeling to evaluate your code for further offload opportunities.
  • Your development environment: proceed to Step 3 to build your application and use Offload Modeling to evaluate your C++ code for further offload opportunities.
  • Use the command line to browse and download samples - Guide | Video
  • Explore samples using Eclipse* - Guide | Video
  • Use the VS Code extension to browse for samples - Guide

Step 3: Assess Code for Offload Opportunities with Intel® Advisor

The Intel® Advisor tool analyzes your code and helps you identify the best opportunities for GPU offload. The Offload Modeling feature provides performance speedup projections, estimates offload overhead, and pinpoints performance bottlenecks. Offload Modeling enables you to improve ROI by modeling different hardware solutions to maximize your performance.

Intel Advisor measures the data movement in your functions, the memory access patterns, and the amount of computation to project how code will perform on Intel GPUs. The code regions with highest potential benefit should be your first targets for offload.
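
The trade-off between computation and data movement can be illustrated with a back-of-the-envelope estimate. The sketch below is a toy model and not Intel Advisor's projection method: the names `OffloadEstimate` and `estimate_offload` are hypothetical, and it assumes that transfers and kernel execution do not overlap and that the link sustains a constant bandwidth.

```cpp
// Toy offload estimate; NOT the model Intel Advisor uses.
// All names here are hypothetical and exist only for illustration.
struct OffloadEstimate {
    double offload_seconds;  // projected time on the GPU, including transfers
    double speedup;          // CPU time divided by projected offload time
    bool worth_offloading;   // true when the projected speedup exceeds 1
};

OffloadEstimate estimate_offload(double cpu_seconds,
                                 double gpu_kernel_seconds,
                                 double bytes_moved,
                                 double link_bytes_per_second) {
    // Assumption: transfers and compute are serialized (no overlap).
    const double transfer_seconds = bytes_moved / link_bytes_per_second;
    const double offload_seconds = gpu_kernel_seconds + transfer_seconds;
    const double speedup = cpu_seconds / offload_seconds;
    return {offload_seconds, speedup, speedup > 1.0};
}
```

With made-up inputs, a region that takes 2.0 s on the CPU and 0.25 s on the GPU, moving 1e9 bytes over a 4e9 bytes/s link, yields a projected 0.5 s and a 4x speedup. A region that takes only 0.1 s on the CPU with the same data volume would be slower after offload, because the transfer alone costs 0.25 s.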

Follow the steps found in each of these links:  

  1. Build your sample application using appropriate environment variables. (Required for DPC++, OpenMP*, and OpenCL™ applications)
  2. Run Offload Modeling Analysis.
  3. Review results.

Tip: Intel Advisor also offers a graphical user interface for creating projects and running analysis.


Step 4: Offload and Optimize with Intel® oneAPI Compilers and Libraries

Select the best optimization strategy to modify your code based on your application needs, advice from Intel Advisor, and available hardware. Documentation, samples, and training will help you make design decisions to maximize performance.

To learn how to build your GPU optimization strategy, use the Intel oneAPI GPU Optimization Guide.

Consider these concepts first: 

  • Remember Amdahl's Law
  • Locality Matters
  • Rightsize your work

Get started.

oneAPI recommends a combination of three techniques to develop your parallelism strategy:

  1. Intel Optimized Libraries
    Intel oneAPI Programming Guide: oneAPI Library Overview
  2. Intel Compilers and Optimizations
    Intel oneAPI DPC++/C++ Compiler Developer Guide and Reference
  3. Parallel Programming Language or API
    Intel oneAPI Programming Guide: DPC++

Key phases in the optimization workflow:

  1. Understand CPU and GPU overlap.
  2. Keep all the compute resources busy.
  3. Minimize the synchronization between the host and the device.
  4. Minimize the data transfer between the host and the device.
  5. Keep the data in faster memory and use an appropriate access pattern.
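
To make the last point about access patterns more concrete, the sketch below counts how many distinct memory lines a group of consecutive work-items touches for a given stride. It is a simplified model, not a description of any specific Intel GPU: the function name `lines_touched` is hypothetical, the line size is a parameter you supply, and the array is assumed to start on a line boundary.

```cpp
#include <cstddef>
#include <set>

// Counts the distinct memory lines touched when `work_items` consecutive
// work-items each read one element at index (id * stride_elems).
// Simplified illustration; assumes the array starts on a line boundary.
std::size_t lines_touched(std::size_t work_items,
                          std::size_t stride_elems,
                          std::size_t elem_bytes,
                          std::size_t line_bytes) {
    std::set<std::size_t> lines;
    for (std::size_t id = 0; id < work_items; ++id) {
        const std::size_t byte_offset = id * stride_elems * elem_bytes;
        lines.insert(byte_offset / line_bytes);  // index of the line holding this element
    }
    return lines.size();
}
```

Assuming a 64-byte line and 4-byte elements, 16 work-items reading with unit stride touch a single line, while the same work-items reading with a stride of 16 elements touch 16 separate lines. Fewer lines for the same useful data generally means less memory traffic.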

Step 5: Evaluate Offload Efficiency with Intel® Advisor

Once you've modified your application, return to Intel Advisor to help you measure the actual performance of offloaded code using the GPU Roofline Insights analysis. Advisor uses benchmarks and hardware metric profiling to measure GPU kernel performance. It points out limitations and identifies areas of your code where further optimization will have the most payoff.
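
The idea behind a roofline chart can be sketched in a few lines. In the basic roofline model, attainable performance is the smaller of the compute peak and the product of memory bandwidth and arithmetic intensity (operations per byte moved). The code below shows that basic model only and is not Advisor's implementation; the function names and any numbers you pass in are assumptions for illustration.

```cpp
#include <algorithm>

// Basic roofline model: attainable GFLOP/s is capped either by the compute
// peak or by memory bandwidth times arithmetic intensity (FLOP per byte).
// Illustrative only; not how Intel Advisor computes its charts.
double roofline_gflops(double peak_gflops,
                       double bandwidth_gb_per_s,
                       double flops_per_byte) {
    return std::min(peak_gflops, bandwidth_gb_per_s * flops_per_byte);
}

// A kernel is memory bound in this model when the bandwidth roof
// lies below the compute roof at its arithmetic intensity.
bool is_memory_bound(double peak_gflops,
                     double bandwidth_gb_per_s,
                     double flops_per_byte) {
    return bandwidth_gb_per_s * flops_per_byte < peak_gflops;
}
```

With hypothetical figures of 2000 GFLOP/s peak and 50 GB/s bandwidth, a kernel performing 0.5 FLOP per byte is capped at 25 GFLOP/s and is memory bound, so reducing data movement would pay off more than tuning arithmetic.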

If bottlenecks are identified, return to the Offload and Optimize step and rewrite the code to address issues.
Intel oneAPI GPU Optimization Guide


Step 6: Review Overall Application Performance with Intel® VTune™ Profiler

After optimizing your GPU offload code, use Intel® VTune™ Profiler to optimize overall application performance on all devices. VTune Profiler offers helpful optimization guidance within the analysis results.

Start with a quick baseline of application performance with the Performance Snapshot to identify focus areas for further analysis.

  1. Set up your system for GPU analysis.
  2. Launch the Intel VTune Profiler command line interface.
  3. Run the Performance Snapshot analysis.
  4. View the results.
    Intel DevCloud Users: view the summary report.
    Intel DevCloud Users with VTune Profiler installed locally: copy the results to your local system, create a project, and import into VTune Profiler.
    Intel oneAPI Base Toolkit Users: view results in VTune Profiler.

Tip: Intel VTune Profiler also offers a graphical user interface for creating projects and running an analysis.

Start by optimizing the CPU side: review how much time was spent on transfer operations between the host and device. Next, optimize the GPU side by identifying areas of inefficient GPU usage.

  1. Run the GPU Offload analysis.
  2. Optimize CPU performance in your application.
  3. Run the GPU Compute/Media Hotspots analysis.
  4. Return to the Offload and Optimize step to further optimize GPU performance in your application.

 

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.