10/23/2020 | Public Content

Profiling a DPC++ Application running on a GPU

This recipe illustrates how to build and compile a DPC++ (Data Parallel C++) application. The recipe also describes how to use Intel® VTune™ Profiler to run a GPU analysis on the DPC++ application and examine the results.


Here are the minimum hardware and software requirements for this performance analysis.
  • Application: matrix_multiply. This sample application is available as part of the code sample package for Intel® oneAPI toolkits.
  • Compiler: To profile a DPC++ application, you need the Intel® oneAPI DPC++ Compiler, available with Intel oneAPI toolkits (Beta).
  • Tools: Intel® VTune™ Profiler - GPU Offload and GPU Compute/Media Hotspots analyses. For downloads and product support, visit the VTune Profiler product page.
  • Microarchitecture:
    • Intel Processor Graphics Gen8, Gen9, or Gen11.
    • Intel microarchitectures code-named Kaby Lake, Coffee Lake, or Ice Lake.
  • Operating system:
    • Linux* OS, kernel version 4.14 or newer.
    • Windows* 10 OS.
  • Graphical user interface for Linux:
    • GTK+ (2.10 or higher; ideally 2.18 or higher)
    • Pango (1.14 or higher)
    • X.Org (1.0 or higher; ideally 1.7 or higher)

Build and Compile a DPC++ Application

On Linux OS:
  1. Go to the sample directory:
    cd <sample_dir>/VtuneProfiler/matrix_multiply_vtune
  2. The source file in the sample directory contains several DPC++ versions of matrix multiplication. Select a version by editing the corresponding
    #define MULTIPLY
    line in the source file.
  3. Compile your sample DPC++ application:
    cmake .
    make
    This generates a matrix.dpcpp executable.
    To delete the program, type:
    make clean
    This removes the executable and object files that were created by the make command.
On Windows OS:
  1. Open the sample directory.
  2. In this directory, open the Visual Studio* project file for the sample.
  3. The project source file contains several DPC++ versions of matrix multiplication. Select a version by editing the corresponding
    #define MULTIPLY
    line in the source file.
  4. Build the entire project with a Release configuration.
    This generates an executable called matrix_multiply.exe.

Run GPU Offload Analysis on a DPC++ Application

Prerequisite: Prepare the system to run a GPU analysis. See Set Up System for GPU Analysis.
  1. Launch VTune Profiler and click
    New Project
    from the Welcome page.
    The
    Create a Project
    dialog box opens.
  2. Specify a project name and a location for your project and click
    Create Project
    . The
    Configure Analysis
    window opens.
  3. Make sure that
    Local Host
    is selected in the WHERE pane.
  4. In the WHAT pane, make sure that the
    Launch Application
    target is selected and specify the matrix.dpcpp binary as the application to profile.
  5. In the HOW pane, select the
    GPU Offload
    analysis type from the Accelerators group.
    This is the least intrusive analysis for applications running on platforms with Intel Graphics, as well as on other third-party GPUs supported by VTune Profiler.
  6. Click the
    Start
    button to launch the analysis.
Run Analysis from Command Line
To run the analysis from the command line:
  • On Linux OS:
    1. Set VTune Profiler environment variables by exporting the script:
      export <
    2. Run the analysis command:
      vtune -collect gpu-offload -- ./matrix.dpcpp
  • On Windows OS:
    1. Set VTune Profiler environment variables by running the batch file:
      export <
    2. Run the analysis command:
      vtune.exe -collect gpu-offload -- matrix_multiply.exe

Analyze Collected Data

Start your analysis with the
GPU Offload
viewpoint. In the Summary window, see statistics on CPU and GPU resource usage to determine whether your application is GPU-bound, CPU-bound, or not effectively utilizing the compute capabilities of the system. In this example, the application should use the GPU for intensive computation. However, the result summary shows that GPU usage is actually low.
Switch to the Graphics window. Here, you can see basic CPU and GPU metrics that help you analyze GPU usage on a software queue. This data is correlated with CPU usage on the timeline. The information in the Graphics window can help you make some inferences.
GPU Bound Applications                                | CPU Bound Applications
The GPU is busy for a majority of the profiling time. | The CPU is busy for a majority of the profiling time.
There are small idle gaps between busy intervals.     | There are large idle gaps between busy intervals.
The GPU software queue is rarely reduced to zero.     |
Most applications may not present such obvious situations as those described above. A detailed analysis is important to understand all dependencies. For example, GPU engines that are responsible for video processing and rendering are loaded in turns; in this case, they are used in a serial manner. When the application code runs on the CPU, this can cause ineffective scheduling on the GPU and mislead you into interpreting the application as GPU bound.
Identify the GPU execution phase based on the computing task reference and
GPU Utilization
metrics. Then, you can define the overhead for creating the task and placing it into a queue.
To investigate a computing task, switch to the Graphics window to examine the type of work (rendering or computation) running on the GPU per thread. Select the
Computing Task
grouping and use the table to study the performance characterization of your task.
To further analyze your computing task, run the GPU Compute/Media Hotspots analysis type.
Use the README file in the sample to profile other implementations of matrix multiplication.

Run GPU Compute/Media Hotspots Analysis

Prerequisite: If you have not already done so, prepare the system to run a GPU analysis. See Set Up System for GPU Analysis.
To run the analysis:
  1. In the Accelerators group, select the
    GPU Compute/Media Hotspots
    analysis type.
  2. Configure analysis options as described in the previous section.
  3. Click the
    Start
    button to run the analysis.
Run Analysis from Command Line
To run the analysis from the command line:
  • On Linux OS:
    vtune -collect gpu-hotspots -- ./matrix.dpcpp
  • On Windows OS:
    vtune.exe -collect gpu-hotspots -- matrix_multiply.exe

Analyze Your Compute Task

The default analysis configuration invokes the Characterization profile with the Overview metric set. In addition to the individual compute task characterization that is available through the
GPU Offload
analysis, the GPU Compute/Media Hotspots analysis provides memory bandwidth metrics that are categorized by different levels of the GPU memory hierarchy.
For a visual representation of the memory hierarchy, see the
Memory Hierarchy Diagram
. This diagram reflects the microarchitecture of the current GPU and shows memory bandwidth metrics. Use the diagram to understand the data traffic between memory units and execution units. You can also identify potential bottlenecks that cause EU stalls.
You can also analyze compute tasks at the source code level. For example, to count GPU clock cycles spent on a particular task or due to memory latency, use the
Source Analysis
option.
Discuss this recipe in the developer forum.

Product and Performance Information


Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804