• 2020
  • 06/18/2020
  • Public Content

Profiling a DPC++ Application

This recipe illustrates how you can build and compile a DPC++ (Data Parallel C++) application for CPU and GPU utilization analysis with Intel® VTune™ Profiler.


Here are the minimum hardware and software requirements for this performance recipe.
  • Application
  • Compiler: To profile a DPC++ application, you need the Intel oneAPI DPC++ Compiler, which is available with Intel oneAPI toolkits (Beta).
  • Tools: Intel® VTune™ Profiler (Beta) - GPU Compute/Media Hotspots Analysis
    • For
      downloads and product support, visit .
    • All the Cookbook recipes are scalable and can be applied to Intel VTune Amplifier 2018 and higher. Slight version-specific configuration changes are possible.
    • Intel® VTune™ Amplifier has been renamed to Intel® VTune™ Profiler starting with its version for Intel® oneAPI Base Toolkit (Beta). You can still use a standalone version of the VTune Profiler, or its versions integrated into Intel Parallel Studio XE or Intel System Studio.
  • Microarchitecture
    • Intel Processor Graphics Gen9
    • Intel microarchitecture code name Kaby Lake or Coffee Lake
  • Operating system: Linux*. Run GPU target profiling on Linux kernel 4.14 or newer.
  • Graphical User Interface
    • GTK+ (2.10 or higher; ideally, use 2.18 or higher)
    • Pango (1.14 or higher)
    • X.Org (1.0 or higher; ideally, use 1.7 or higher)

Build and Compile a DPC++ Application

When you compile your own DPC++ application, make sure to apply the
oneAPI DPC++ Compiler options. These settings generate debug information required for performance analysis.
To build the
sample, do the following:
  1. Go to the sample directory.
    cd <
  2. The
    file in the
    directory contains several DPC++ versions of matrix multiplication. Select a version by editing the corresponding
    #define MULTIPLY
    line in
  3. Compile your sample DPC++ application:
    cmake .
    make
    This generates a
    To delete the program, type:
    make clean
    This removes the executable and object files that were created by the
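The `#define MULTIPLY` selection described in step 2 might work along the following lines. This is a minimal sketch in plain standard C++ (no SYCL), not the sample's actual source: the names `multiply_ijk`, `multiply_ikj`, `run_multiply`, and the `Matrix` alias are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical matrix type for this sketch; the sample may store data differently.
using Matrix = std::vector<std::vector<float>>;

// Hypothetical variant 1: i-j-k loop order (column-wise reads of b).
inline void multiply_ijk(const Matrix& a, const Matrix& b, Matrix& c) {
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b[0].size(); ++j)
            for (std::size_t k = 0; k < b.size(); ++k)
                c[i][j] += a[i][k] * b[k][j];
}

// Hypothetical variant 2: i-k-j loop order (row-wise reads of b).
inline void multiply_ikj(const Matrix& a, const Matrix& b, Matrix& c) {
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t k = 0; k < b.size(); ++k)
            for (std::size_t j = 0; j < b[0].size(); ++j)
                c[i][j] += a[i][k] * b[k][j];
}

// Edit this single line to choose which variant is compiled and profiled.
#define MULTIPLY multiply_ikj

// The call site stays the same regardless of the selected variant.
inline Matrix run_multiply(const Matrix& a, const Matrix& b) {
    // Assumes non-empty, rectangular inputs with matching inner dimensions.
    assert(!a.empty() && !b.empty() && a[0].size() == b.size());
    Matrix c(a.size(), std::vector<float>(b[0].size(), 0.0f));
    MULTIPLY(a, b, c);
    return c;
}
```

Because only the macro definition changes between builds, each rebuild produces a binary that exercises exactly one variant, which keeps profiling results for different versions easy to compare.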

Run GPU Analysis on a DPC++ Application

Prepare the system to run a GPU analysis. See Configure and Run Analysis in the VTune Profiler User Guide.
  1. Launch the VTune Profiler (Beta) and click New Project from the Welcome page. The Create a Project dialog box opens.
  2. Specify a project name and a location for your project and click Create Project. The Configure Analysis window opens.
  3. Make sure the
    Local Host
    is selected in the
  4. In the
    pane, make sure the
    Launch Application
    target is selected and specify the
    binary as an
    to profile.
  5. In the
    pane, select
    GPU Compute/Media Hotspots
    from the
    Platform Analysis
    This is the least intrusive analysis for applications running on platforms with Intel Graphics as well as on third-party GPUs supported by Intel® VTune™ Profiler.
  6. For the initial GPU analysis, make sure you have the following default options enabled:
    • Characterization
      mode with an
      metric preset selected;
    • Trace GPU Programming APIs
      check box selected.
  7. Click the
    button at the bottom to launch the analysis.
    To run the same configuration from the command line, enter:
    vtune -collect gpu-hotspots -- ./matrix.dpcpp

Analyze Collected Data

You can start your analysis with the
GPU Offload
viewpoint. In the
window, see statistics on CPU and GPU resource usage to determine if your application is GPU-bound. In this example, the GPU usage is substantial.
Switch to the
window to see basic CPU and GPU metrics that help analyze GPU usage on a software queue. This data is correlated with CPU usage on the timeline. Observe the information in the
window for these inferences:
GPU Bound Applications
  • The GPU is busy for a majority of the profiling time.
  • There are small idle gaps between busy intervals.
  • The GPU software queue is rarely reduced to zero.
CPU Bound Applications
  • The CPU is busy for a majority of the profiling time.
  • There are large idle gaps between busy intervals.
Most applications do not fall as clearly into one of the categories described above, so a detailed analysis is important to understand all dependencies. For example, GPU engines that are responsible for video processing and rendering may be loaded in turns, so that they are used serially. In this case, application code running on the CPU can cause ineffective scheduling on the GPU, and the resulting behavior can mislead you into interpreting the application as GPU bound.
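To make the characteristics listed above concrete, the following sketch computes two of them (the share of time the GPU is busy and the longest idle gap) from a list of busy intervals. It is a rough illustration only, not how VTune Profiler derives its metrics: the `BusyInterval` and `GpuUsageSummary` types, the function names, and the thresholds (busy more than half the time, no gap longer than 10% of the run) are all assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One GPU busy interval on the timeline, in seconds from the start of collection.
struct BusyInterval {
    double start;
    double end;
};

struct GpuUsageSummary {
    double busy_fraction;  // share of the profiling time the GPU was busy
    double largest_gap;    // longest idle stretch, in seconds
};

// Summarizes GPU activity over [0, total_time]. Overlapping intervals are merged.
inline GpuUsageSummary summarize(std::vector<BusyInterval> intervals, double total_time) {
    assert(total_time > 0.0);
    std::sort(intervals.begin(), intervals.end(),
              [](const BusyInterval& x, const BusyInterval& y) { return x.start < y.start; });
    double busy = 0.0;
    double largest_gap = 0.0;
    double cursor = 0.0;  // end of the busy time accounted for so far
    for (const BusyInterval& iv : intervals) {
        if (iv.start > cursor) {
            largest_gap = std::max(largest_gap, iv.start - cursor);
        }
        if (iv.end > cursor) {
            busy += iv.end - std::max(iv.start, cursor);
            cursor = iv.end;
        }
    }
    if (total_time > cursor) {
        largest_gap = std::max(largest_gap, total_time - cursor);
    }
    return {busy / total_time, largest_gap};
}

// Hypothetical thresholds: "majority" is taken as more than 50% busy, and a
// "small" gap as shorter than 10% of the profiling time.
inline bool looks_gpu_bound(const GpuUsageSummary& s, double total_time) {
    return s.busy_fraction > 0.5 && s.largest_gap < 0.1 * total_time;
}
```

A result of `true` is only a first hint. As noted above, serialized engine usage or CPU-side scheduling delays can produce a similar picture, so the timeline still needs to be inspected.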
When the GPU is intensely busy over time, switch to the
GPU Compute/Media Hotspots
viewpoint and open the
window >
tab to examine the type of work (rendering or computations) running on the GPU per thread.
Use the README file in the sample to profile other implementations of
