What you'll learn
Use this step-by-step guide to learn graphics processing unit (GPU) optimization with Intel’s latest discrete and integrated GPUs and Intel’s oneAPI software. The primary path focuses on Data Parallel C++ (DPC++), which is based on ISO C++ and incorporates standard SYCL* and community extensions to simplify data parallel programming. Advanced users can reference the resources and tools in each section for additional options to customize the workflow to their needs.
Who is this for
Software developers interested in accelerating their code using Intel's latest integrated or discrete GPUs and Data Parallel C++ (DPC++) for cross-architecture code reuse. Start from your own C++ or CUDA* code, or use one of Intel’s many sample applications.
This guide assumes you have an Intel GPU and the Intel® oneAPI Base Toolkit installed in your development environment. Alternatively, you can use the Intel® DevCloud, a virtual sandbox with free access to Intel GPUs and the oneAPI software tools. To help you choose, check out the first step below, Choose GPU Hardware Access.
Not sure if you want to offload to a GPU? This article compares different hardware to help you decide which hardware best suits your needs.
Step 1: Choose GPU Hardware Access
Whether you have the latest Intel architecture or need cloud resources, your optimization strategy starts with hardware. To make the best use of your GPU offload, you need to understand which hardware resources you are optimizing for.
Sign up now for full access to the latest Intel CPUs, GPUs, and Intel oneAPI software tools. The Intel DevCloud is a free, virtual development sandbox to learn about and program oneAPI cross-architecture applications. The environment enables you to optimize code on real Intel GPU hardware.
Sign up to get access. Then connect to the Intel DevCloud with the OS of your choice.
Take advantage of Intel’s first discrete GPU with Intel® Iris® Xe MAX Graphics to explore offload and optimization strategies on the latest Intel hardware. Learn more about Intel Iris Xe MAX.
Many Intel platforms include integrated GPUs that can be used for basic GPU offload proofs of concept and pathfinding. Find out what GPU architecture you have.
The Intel oneAPI Base Toolkit provides the core set of tools and libraries you need to develop high-performance applications across diverse architectures. Download the Intel oneAPI Base Toolkit for free.
Tip: You can use a feature in Intel Advisor called Offload Modeling to help you choose what hardware is best for your needs by estimating performance improvements and ROI across different hardware solutions.
Set Up Requirements
| Set Up Requirements | Intel DevCloud | Intel Graphics Processor |
| --- | --- | --- |
| Software | Intel oneAPI Base Toolkit (pre-installed on Intel DevCloud) | Intel oneAPI Base Toolkit |
| GPU | GPUs available: Intel® Iris® Xe MAX (nodes available on Intel DevCloud) | |
| Language | Data Parallel C++ (DPC++) | Data Parallel C++ (DPC++) |
| Interface | Command Line Interface (CLI) | Command Line Interface (CLI) |
Step 2: Choose Sample Code
You can start from an Intel sample, your existing CUDA* source code, or your own C++ application.
The Intel oneAPI Base Toolkit includes a collection of samples that reflect a wide range of techniques for expressing parallelism on the GPU using DPC++. A good starting point for understanding GPU optimization with DPC++ is a compute-intensive sample that simulates acoustic isotropic wave propagation in a 3D medium (download the ISO3DFD sample). Use this sample to step through the various tools and resources available for your optimization strategy with DPC++. Or browse the oneAPI samples.
Download the sample from GitHub to Intel DevCloud or your development environment. The sample readme includes detailed instructions for building and running the sample.
Migrate Existing CUDA* Code
To migrate your existing CUDA code to a multi-platform DPC++ program, use the Intel® DPC++ Compatibility Tool. It ports both CUDA language kernels and library API calls, automatically migrating 80%–90% of CUDA code to architecture- and vendor-portable DPC++ code. Inline comments help you finish writing and tuning your DPC++ code. Sample CUDA projects are also available to help you with the entire porting process.
- Migrate your CUDA code with the Intel DPC++ Compatibility Tool.
- Proceed to Step 3 to build your application and use Offload Modeling to evaluate your DPC++ code for further offload opportunities.
Optimize Your Own Projects
To use your own C++ code, set up your development environment and continue with this workflow. You will be able to apply these optimization techniques directly on your existing projects.
- Intel DevCloud: copy your C++ application to the Intel DevCloud and proceed to Step 3 to build your application and use Offload Modeling to evaluate your code for further offload opportunities.
- Your development environment: proceed to Step 3 to build your application and use Offload Modeling to evaluate your C++ code for further offload opportunities.
Step 3: Evaluate Offload Opportunities with Offload Modeling
The Intel® Advisor tool analyzes your code and helps you identify the best opportunities for GPU offload. The Offload Modeling feature provides performance speedup projections, estimates offload overhead, and pinpoints performance bottlenecks. Offload Modeling also lets you improve ROI by modeling different hardware solutions to maximize your performance.
Intel Advisor measures the data movement in your functions, the memory access patterns, and the amount of computation to project how your code will perform on Intel GPUs. The code regions with the highest potential benefit should be your first targets for offload.
Follow the steps found in each of these links:
- Build your sample application using appropriate environment variables. (Required for DPC++, OpenMP*, and OpenCL™ applications)
- Run Offload Modeling Analysis.
- Review results.
Tip: Intel Advisor also offers a graphical user interface for creating projects and running analysis.
Step 4: Choose Your Optimization Strategy
Select the best optimization strategy to modify your code based on your application needs, advice from Intel Advisor, and available hardware. Documentation, samples, and training will help you make design decisions that maximize performance.
oneAPI recommends combining three techniques to develop your parallelism strategy:
- Intel Optimized Libraries
Intel oneAPI Programming Guide: oneAPI Library Overview
- Intel Compilers and Optimizations
Intel oneAPI DPC++/C++ Compiler Developer Guide and Reference
- Parallel Programming Language or API
Intel oneAPI Programming Guide: DPC++
Basic DPC++ framework:
- Select device
- Declare device queue
- Declare buffers
- Submit job
Key phases in the optimization workflow:
- Understand CPU and GPU overlap.
- Keep all the compute resources busy.
- Minimize the synchronization between the host and the device.
- Minimize the data transfer between the host and the device.
- Keep the data in faster memory and use an appropriate access pattern.
Step 5: Evaluate Offload Efficiency with Intel® Advisor
Once you've modified your application, return to Intel Advisor to help you measure the actual performance of offloaded code using the GPU Roofline Insights analysis. Advisor uses benchmarks and hardware metric profiling to measure GPU kernel performance. It points out limitations and identifies areas of your code where further optimization will have the most payoff.
Evaluate your GPU code to see how close its performance is to hardware maximums.
Step 6: Review Overall Application Performance with Intel® VTune™ Profiler
After optimizing your GPU offload code, use Intel® VTune™ Profiler to optimize overall application performance on all devices. VTune Profiler offers helpful optimization guidance within the analysis results.
Start with a quick baseline of application performance with the Performance Snapshot to identify focus areas for further analysis.
- Set up your system for GPU analysis.
- Launch the Intel VTune Profiler command line interface.
- Run the Performance Snapshot analysis.
- View the results.
Intel DevCloud Users: view the summary report.
Intel DevCloud Users with VTune Profiler installed locally: copy the results to your local system, create a project, and import into VTune Profiler.
Intel oneAPI Base Toolkit Users: view results in VTune Profiler.
Tip: Intel VTune Profiler also offers a graphical user interface for creating projects and running an analysis.
Start optimizing for the CPU by reviewing how much time was spent in transfer operations between the host and the device. Next, further optimize for the GPU by identifying areas of inefficient GPU usage.
Product and Performance Information
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.