User Guide


GPU Offload Analysis (Preview)

Explore code execution on various CPU and GPU cores on your platform, correlate CPU and GPU activity, and identify whether your application is GPU or CPU bound.
This is a preview feature. A preview feature may or may not appear in a future production release. It is available for your use in the hopes that you will provide feedback on its usefulness and help determine its future. Data collected with a preview feature is not guaranteed to be backward compatible with future releases. Please share your feedback with Intel.
Run the GPU Offload analysis for applications that use a Graphics Processing Unit (GPU) for rendering, video processing, and computations, with explicit support for DPC++, Intel® Media SDK, and OpenCL™ software technology.
The tool infrastructure automatically aligns clocks across all cores in the entire system so that you can analyze some CPU-based workloads together with GPU-based workloads within a unified time domain.
This analysis enables you to:
  • Identify how effectively your application uses DPC++ or OpenCL kernels and explore them further with GPU Compute/Media Hotspots analysis
  • Analyze execution of Intel Media SDK tasks over time (for Linux targets only)
  • Explore GPU usage and analyze a software queue for GPU engines at each moment of time
For the GPU Offload analysis, Intel® VTune™ Profiler instruments your code executing both on CPU and GPU. Depending on your configuration settings, VTune Profiler provides performance metrics that serve as insights on the efficiency of GPU hardware use as well as next steps in your analysis.

Configure and Run Analysis

By default, the GPU Offload analysis enables the GPU Utilization option to explore GPU busyness over time and understand whether your application is CPU or GPU bound. With this configuration, if the Timeline view shows that the GPU is busy most of the time, with only small idle gaps between busy intervals, and the GPU software queue rarely drops to zero, you can conclude that your application is GPU bound. If the gaps between busy intervals are large and the CPU is busy during these gaps, your application is CPU bound. Such clear-cut situations are rare, however, and a detailed analysis is usually needed to understand all dependencies. For example, an application may be mistakenly considered GPU bound when GPU engine usage is serialized (for example, when the GPU engines responsible for video processing and for rendering are loaded in turns). In this case, ineffective scheduling on the GPU results from the application code running on the CPU.
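This interpretation can be sketched numerically. The following Python snippet is an illustrative sketch only, not VTune Profiler functionality; the busy intervals and the 80% threshold are invented for the example:

```python
# Illustrative sketch only: VTune Profiler performs this interpretation for you.
# The intervals and the 80% threshold below are hypothetical assumptions.

def gpu_busy_fraction(busy_intervals, total_time):
    """Fraction of the profiled window in which the GPU was busy."""
    return sum(end - start for start, end in busy_intervals) / total_time

# Hypothetical trace: (start, end) times in milliseconds over a 100 ms window.
busy = [(0, 45), (50, 95)]
fraction = gpu_busy_fraction(busy, total_time=100.0)

# Small idle gaps and a mostly-busy GPU suggest a GPU-bound application;
# large gaps with CPU activity inside them suggest a CPU-bound one.
verdict = "likely GPU bound" if fraction > 0.8 else "inspect CPU activity in the gaps"
print(f"GPU busy {fraction:.0%}: {verdict}")
```

Remember the caveat above: a high busy fraction alone does not rule out serialized engine usage caused by CPU-side scheduling.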
On Windows systems, to monitor general GPU usage over time, run VTune Profiler as an Administrator.
To configure and run the GPU Offload analysis:
  1. Click the Configure Analysis toolbar button. The Configure Analysis window opens.
  2. From the HOW pane, select Platform Analysis > GPU Offload.
    The GPU Offload analysis is pre-configured to collect GPU usage data and Processor Graphics hardware events (Compute Basic preset).
  3. Configure the following GPU analysis options:
    • Use the Trace GPU programming APIs option to analyze DPC++, OpenCL™, and Intel Media SDK programs running on Intel Processor Graphics. This option may affect the performance of your application on the CPU side.
    • Use the Analyze memory bandwidth option to collect the data required to compute memory bandwidth. This analysis requires the Intel sampling drivers to be installed.
    • Use the Collect CPU-side stacks option to analyze call stacks executed on the CPU and identify critical paths.
    • Use the Show GPU performance insights option to get metrics (based on the analysis of Processor Graphics events) that help you estimate the efficiency of hardware usage and learn next steps. The following Insights metrics are collected:
      • The EU Array metric shows the breakdown of GPU core array cycles, where:
        • Active: the normalized sum of all cycles on all cores spent actively executing instructions.
        • Stalled: the normalized sum of all cycles on all cores spent stalled. At least one thread is loaded, but the core is stalled for some reason.
        • Idle: the normalized sum of all cycles on all cores when no threads were scheduled on a core.
      • The EU Threads Occupancy metric shows the normalized sum of all cycles on all cores and thread slots when a slot has a thread scheduled.
      • The Computing Threads Started metric shows the number of threads started across all EUs for compute work.
  4. Click Start to run the analysis.
To run the GPU Offload analysis from the command line, enter:
vtune -collect gpu-offload [-knob <knob_name=knob_value>] -- <target> [target_options]
To generate the command line for any analysis configuration, use the Command Line... button at the bottom of the interface.
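As a convenience sketch of the command shape, the snippet below composes the invocation for a hypothetical target binary (./my_app is a placeholder; no knob names from this guide are assumed):

```python
import shlex

def vtune_gpu_offload_cmd(target, target_args=(), knobs=()):
    """Compose a vtune gpu-offload command line; knobs are name=value strings."""
    cmd = ["vtune", "-collect", "gpu-offload"]
    for knob in knobs:          # each knob becomes its own -knob argument
        cmd += ["-knob", knob]
    cmd += ["--", target, *target_args]
    return shlex.join(cmd)

# './my_app' is a hypothetical application under analysis.
print(vtune_gpu_offload_cmd("./my_app"))
# prints: vtune -collect gpu-offload -- ./my_app
```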
Once the GPU Offload analysis completes data collection, the Summary window displays several metrics that describe:
  • GPU usage
  • GPU idle time
  • The most active computing tasks that ran on the GPU
Summary window of GPU Offload Analysis
You also see performance insights and guidance for next steps.

Analyze Data Transfer Between Host and Device

To understand the efficiency of data transfer between the CPU host and GPU device, see metrics in the Summary and Graphics windows.
The Summary window displays the total time spent on computing tasks as well as the execution time per task. The difference indicates the amount of time spent on data transfers between host and device. If the execution time is lower than the data transfer time, your offload schema could benefit from optimization.
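That comparison amounts to simple arithmetic on the reported times. A sketch with invented numbers (VTune reports these values; only the subtraction and comparison are shown here):

```python
# Hypothetical per-task timings, in milliseconds.
total_time = 12.0      # total time attributed to the computing task
execution_time = 4.0   # time actually spent executing on the GPU

# The remainder is host<->device data movement.
transfer_time = total_time - execution_time

# When transfers dominate execution, the offload schema is a candidate
# for optimization (for example, keeping data resident on the device).
needs_attention = execution_time < transfer_time
print(f"transfer {transfer_time} ms vs execution {execution_time} ms "
      f"-> optimize: {needs_attention}")
```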
In the Summary window, look for offload cost metrics, including Host-to-Device Transfer and Device-to-Host Transfer. These metrics can help you locate unnecessary memory transfers that reduce performance.
In the Graphics window, see the Total Time by GPU Data Transfer Type column, which displays the total time for each computation task.
Data Transfer Metrics
The total time is broken down into:
  • Time for data transfer from host to device
  • Time for computation
  • Time for data transfer from device to host
This breakdown can help you better understand the balance between data transfer and GPU execution time.
The Graphics window also displays the size of the data transfer between host and device per computation task.
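Combining the per-task time breakdown with the transfer sizes yields an effective transfer bandwidth, which is often useful when judging whether a transfer is worth eliminating. The numbers below are invented for illustration; VTune reports the times and sizes, and only the bandwidth arithmetic is added here:

```python
# Hypothetical breakdown of one computing task, as in the Graphics window.
host_to_device_ms = 3.0
compute_ms = 5.0
device_to_host_ms = 2.0
bytes_moved = 64 * 1024 * 1024  # 64 MiB transferred host-to-device

total_ms = host_to_device_ms + compute_ms + device_to_host_ms
transfer_share = (host_to_device_ms + device_to_host_ms) / total_ms

# Effective host-to-device bandwidth in GiB/s for this task.
h2d_gib_per_s = (bytes_moved / (1024 ** 3)) / (host_to_device_ms / 1000.0)

print(f"{transfer_share:.0%} of task time is data transfer, "
      f"H2D at {h2d_gib_per_s:.1f} GiB/s")
```

An effective bandwidth far below what the interconnect can sustain may indicate many small transfers rather than one large one.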

Support for DPC++ Applications using oneAPI Level Zero API

This section describes support in the GPU Offload analysis for DPC++ applications that run the OpenCL or oneAPI Level Zero API in the back end. VTune Profiler supports version 1.0.4 of the oneAPI Level Zero API.
| Support Aspect | DPC++ application with OpenCL as back end | DPC++ application with Level Zero as back end |
|---|---|---|
| Operating System | Linux OS, Windows OS | Linux OS, Windows OS |
| Data collection | VTune Profiler collects and shows GPU computing tasks and the GPU computing queue. | VTune Profiler collects and shows GPU computing tasks and the GPU computing queue. |
| Data display | VTune Profiler maps the collected GPU HW metrics to specific kernels and displays them on a diagram. | VTune Profiler maps the collected GPU HW metrics to specific kernels and displays them on a diagram. |
| Display Host side API calls | | |
| Source Assembler for computing tasks | | Can drill down to Level Zero computing tasks using Source Assembler. |

Product and Performance Information


Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804