User Guide

GPU Offload Analysis (Preview)

Explore code execution on various CPU and GPU cores on your platform, correlate CPU and GPU activity, and identify whether your application is GPU or CPU bound.
This is a PREVIEW FEATURE. A preview feature may or may not appear in a future production release. It is available for your use in the hope that you will provide feedback on its usefulness and help determine its future. Data collected with a preview feature is not guaranteed to be backward compatible with future releases. Please send your feedback to parallel.studio.support@intel.com or to intelsystemstudio@intel.com.
Run the GPU Offload analysis for applications that use a Graphics Processing Unit (GPU) for rendering, video processing, and computations with explicit support of DPC++, Intel® Media SDK and OpenCL™ software technology.
The tool infrastructure automatically aligns clocks across all cores in the entire system so that you can analyze some CPU-based workloads together with GPU-based workloads within a unified time domain.
This analysis enables you to:
  • Identify how effectively your application uses DPC++ or OpenCL kernels and explore them further with GPU Compute/Media Hotspots analysis
  • Analyze execution of Intel Media SDK tasks over time (for Linux targets only)
  • Explore GPU usage and analyze a software queue for GPU engines at each moment of time
For the GPU Offload analysis, Intel® VTune™ Profiler instruments your code executing on both the CPU and the GPU and, depending on your configuration settings, provides performance metrics that give insight into how efficiently the GPU hardware is used and suggest your next analysis steps.

Configure and Run Analysis

By default, the GPU Offload analysis enables the Analyze GPU usage option to explore GPU busyness over time and understand whether your application is CPU or GPU bound. With this configuration, if the Timeline view in the Graphics window shows that the GPU is busy most of the time with only small idle gaps between busy intervals, and the GPU software queue rarely drops to zero, you can conclude that your application is GPU bound. If the gaps between busy intervals are large and the CPU is busy during these gaps, your application is CPU bound. Such clear-cut situations are rare, however, and you usually need a detailed analysis to understand all dependencies. For example, an application may be mistakenly considered GPU bound when the use of GPU engines is serialized (for example, when the GPU engines responsible for video processing and for rendering are loaded in turns). In this case, the ineffective scheduling on the GPU results from the application code running on the CPU.
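The reading of the Timeline view described above can be sketched as a small helper. This is only an illustration: the function name `classify_bound`, its three inputs, and all thresholds (0.9, 0.1, 0.5, 0.8) are hypothetical values chosen for the example, not numbers used by VTune Profiler. You would estimate the inputs by eye from the Timeline view.

```shell
# Hypothetical helper, not part of VTune Profiler.
#   $1 = fraction of wall time the GPU is busy (0..1)
#   $2 = fraction of GPU idle time during which the CPU is busy (0..1)
#   $3 = fraction of wall time the GPU software queue is empty (0..1)
# All thresholds below are illustrative assumptions.
classify_bound() {
  awk -v gpu="$1" -v cpu_in_gaps="$2" -v q_empty="$3" 'BEGIN {
    if (gpu >= 0.9 && q_empty <= 0.1)
      print "likely GPU bound"
    else if (gpu <= 0.5 && cpu_in_gaps >= 0.8)
      print "likely CPU bound"
    else
      print "inconclusive: needs detailed analysis"
  }'
}

classify_bound 0.95 0.20 0.02
classify_bound 0.30 0.90 0.60
classify_bound 0.70 0.50 0.30
```

The third call lands in the "inconclusive" branch on purpose: as noted above, cases such as serialized GPU engine usage cannot be told apart from busy fractions alone.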
On Windows systems, to monitor general GPU usage over time, run VTune Profiler as an Administrator.
To configure and run the GPU Offload analysis:
Prerequisites:
  1. Click the Configure Analysis toolbar button (standalone GUI or Visual Studio IDE).
    The Configure Analysis window opens.
  2. From the HOW pane, click the Browse button and select Platform Analysis > GPU Offload.
    The GPU Offload analysis is pre-configured to collect GPU usage data and Processor Graphics hardware events (Compute Basic preset).
  3. Configure the following GPU analysis options:
    • Use the Trace GPU programming APIs option to analyze DPC++, OpenCL™ and Intel Media SDK programs running on Intel Processor Graphics. This option may affect the performance of your application on the CPU side.
    • Use the Analyze memory bandwidth option to collect the data required to compute memory bandwidth. This type of analysis requires Intel sampling drivers to be installed.
    • Use the Collect CPU-side stacks option to analyze call stacks executed on the CPU and identify critical paths.
    • Use the Show GPU performance insights option to get metrics (based on the analysis of Processor Graphics events) that help you estimate the efficiency of hardware usage and learn next steps. The following Insights metrics are collected:
      • The EU Array metric shows the breakdown of GPU core array cycles, where:
        • Active: The normalized sum of all cycles on all cores spent actively executing instructions.
        • Stalled: The normalized sum of all cycles on all cores spent stalled. At least one thread is loaded, but the core is stalled for some reason.
        • Idle: The normalized sum of all cycles on all cores when no threads were scheduled on a core.
      • The EU Threads Occupancy metric shows the normalized sum of all cycles on all cores and thread slots when a slot has a thread scheduled.
      • The Computing Threads Started metric shows the number of threads started across all EUs for compute work.
  4. Click Start to run the analysis.
To run the GPU Offload analysis from the command line, enter:
$ vtune -collect gpu-offload [-knob <knob_name=knob_option>] -- <target> [target_options]
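As a concrete instance of the syntax above, the following sketch assembles a full command line and prints it instead of running it, so you can review it first. Several names here are assumptions: the knob names `collect-programming-api` and `enable-stack-collection` are recalled from later VTune Profiler releases and are assumed to correspond to the Trace GPU programming APIs and Collect CPU-side stacks options; they may differ in this preview. The helper `build_gpu_offload_cmd`, the result directory `r001go`, and the target `./my_app --frames 100` are hypothetical. To check the knobs your installation actually supports, run `vtune -help collect gpu-offload`.

```shell
# Dry-run sketch: prints a vtune command line, does not execute it.
# Knob names are assumptions; verify with: vtune -help collect gpu-offload
build_gpu_offload_cmd() {
  result_dir=$1
  shift
  printf 'vtune -collect gpu-offload'
  # printf reuses the format once per argument, giving one -knob per entry.
  printf ' -knob %s' collect-programming-api=true enable-stack-collection=true
  printf ' -result-dir %s --' "$result_dir"
  # Remaining arguments are the target and its options.
  printf ' %s' "$@"
  printf '\n'
}

build_gpu_offload_cmd r001go ./my_app --frames 100
```

Once the printed command looks right, you can paste it into a terminal. After collection finishes, `vtune -report summary -result-dir r001go` should print a text summary of the result.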
To generate the command line for any analysis configuration, use the Command Line... button at the bottom of the interface.

Support for DPC++ Applications using oneAPI Level Zero API

This section describes support in the GPU Offload analysis for DPC++ applications that run OpenCL or oneAPI Level Zero API in the back end.
VTune Profiler supports version 0.91.10 of the oneAPI Level Zero API.
| Support Aspect | DPC++ application with OpenCL as back end | DPC++ application with Level Zero as back end |
| --- | --- | --- |
| Operating System | Linux OS, Windows OS | Linux OS only |
| Data collection | VTune Profiler collects and shows GPU computing tasks and the GPU computing queue. | VTune Profiler collects and shows GPU computing tasks and the GPU computing queue. |
| Data display | VTune Profiler maps the collected GPU HW metrics to specific kernels and displays them on a diagram. | VTune Profiler maps the collected GPU HW metrics to specific kernels and displays them on a diagram. |
| Display Host side API calls | Yes | Yes |
| Source Assembler for computing tasks | Can drill down to Level Zero computing tasks using Source Assembler. | Unavailable |
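The operating system row of the comparison above can be encoded as a quick pre-flight check in a profiling script. This is a minimal sketch: the function name `backend_supported` and the identifiers `opencl`, `level_zero`, `linux`, and `windows` are hypothetical labels chosen for this example, not values recognized by VTune Profiler.

```shell
# Hypothetical helper encoding the Operating System row of the comparison.
#   $1 = back end: opencl | level_zero
#   $2 = operating system: linux | windows
backend_supported() {
  case "$1:$2" in
    opencl:linux|opencl:windows|level_zero:linux) echo yes ;;
    *) echo no ;;
  esac
}

backend_supported opencl windows
backend_supported level_zero windows
```

Any combination not listed in the comparison, including unknown back ends, is reported as unsupported.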

Product and Performance Information

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804