• 2020
  • 04/16/2021
  • Public Content

Using the Command-Line Interface to Analyze the Performance of a DPC++ Application Running on a GPU

This recipe illustrates how to use the command-line interface (CLI) in Intel® VTune™ to analyze the performance of a Data Parallel C++ (DPC++) application offloaded to an Intel GPU. The recipe also describes how you can customize your reports with the collected data.
Content experts: Egor Suldin, Mariya Petrova
Intel® VTune™ provides a command-line interface (the vtune tool) for remote analysis, scripted commands, and performance regression checks that monitor software performance over time. The vtune CLI provides an extensive set of options with which you can perform almost every task that is possible through the GUI. You can start an analysis from the command line (running it as a background task or on a remote system) and then view the result or generate a report.
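As an illustration of the regression-check use case, here is a minimal sketch of a comparison helper. It assumes you have already obtained two elapsed-time values (a baseline and a current measurement), for example by reading them from summary reports. The function name within_tolerance and the percentage threshold are hypothetical and are not part of vtune.

```shell
# within_tolerance BASELINE CURRENT TOLERANCE_PCT
# Succeeds (exit status 0) when CURRENT is no more than TOLERANCE_PCT percent
# slower than BASELINE; fails (exit status 1) otherwise.
# Hypothetical helper: the values must be plain numbers supplied by the caller.
within_tolerance() {
    awk -v b="$1" -v c="$2" -v t="$3" \
        'BEGIN { exit !(c <= b * (1 + t / 100)) }'
}
```

A script could call within_tolerance 10.0 10.4 5 and fail the build when the function returns a non-zero status.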
This recipe explores how you can use the CLI efficiently to generate reports on hotspots for these purposes:
  • Explore hotspots on the CPU and GPU sides by running the GPU Offload and GPU Compute/Media Hotspots analyses
  • View the hottest GPU computing tasks annotated with:
    • Execution time
    • Data transfers
    • Working group sizes
    • SIMD width
    • Average GPU hardware metrics
  • Generate Source/Assembly code views to analyze instructions that possibly contributed to performance issues.
Here are the ingredients and instructions you need to explore efficient CLI use for GPU performance analysis.


Here are the minimum hardware and software requirements for this performance analysis.

Build and Compile the DPC++ Application

  1. Go to the sample directory.
    cd <sample_dir>/VtuneProfiler/matrix_multiply_vtune
  2. The sample sources contain several DPC++ versions of matrix multiplication. Select a version by editing the corresponding #define MULTIPLY line.
  3. Compile your sample DPC++ application:
    cmake . && make
    This command generates the matrix.dpcpp executable.
    To delete the program, type:
    make clean
    This command removes the executable and object files that were created by the make command.

Ensure Prerequisites for GPU Analyses

Complete these steps before you run the GPU Offload Analysis or the GPU Compute/Media Hotspots Analysis:
  1. Prepare the system to run a GPU analysis. See Set Up System for GPU Analysis.
  2. Set up environment variables for Intel software tools:
    source $ONEAPI_ROOT/
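Before starting a collection, a script can confirm that the environment setup took effect. The following is a minimal sketch that uses the standard command -v shell builtin; the helper name require_tool is hypothetical.

```shell
# require_tool NAME
# Succeeds when NAME resolves to a command in the current shell environment.
# Hypothetical helper for use at the top of an analysis script.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "error: '$1' not found on PATH; set up the tool environment first" >&2
    return 1
}
```

For example, a script might begin with require_tool vtune || exit 1.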

Run GPU Offload Analysis on the DPC++ Application

Use the GPU Offload Analysis as a starting point to identify whether an application is CPU bound or GPU bound. Explore GPU offload efficiency through data transfer analysis, and find performance-critical kernels for further analysis and optimization.
Run GPU Offload Analysis
In the CLI, type:
vtune -collect gpu-offload -r ./result_gpu-offload -- ./matrix.dpcpp
By default, vtune generates a summary report after collecting data. This report includes:
  • Elapsed time
  • GPU utilization information
  • Information about the hottest computing tasks
  • Recommendations
To see the summary report, type:
vtune -report summary -r ./result_gpu-offload
If you do not need to see the summary report immediately after data collection, change this behavior by adding an option to the collection command. This example uses -finalization-mode=none:
vtune -collect gpu-offload -finalization-mode=none -r ./result_gpu-offload -- ./matrix.dpcpp
Figure: Summary Report in GPU Offload CLI Analysis
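The collect-then-report sequence above can be wrapped in a small script. This is a sketch, not a definitive workflow: the function name collect_and_summarize and the VTUNE variable are hypothetical conveniences, and only the vtune options already shown in this recipe are used. Setting VTUNE=echo prints the commands instead of running them, which is useful for checking the script.

```shell
# Command used to invoke the tool; override (for example VTUNE=echo) to preview.
VTUNE=${VTUNE:-vtune}

# collect_and_summarize ANALYSIS RESULT_DIR APP [APP_ARGS...]
# Runs a collection and, only if it succeeds, prints the summary report.
collect_and_summarize() {
    analysis=$1      # for example gpu-offload
    result_dir=$2    # for example ./result_gpu-offload
    shift 2          # remaining arguments form the application command line
    "$VTUNE" -collect "$analysis" -r "$result_dir" -- "$@" || return 1
    "$VTUNE" -report summary -r "$result_dir"
}
```

A call that mirrors the commands in this section would be collect_and_summarize gpu-offload ./result_gpu-offload ./matrix.dpcpp.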
Generate Additional Reports to View Collected Data
  • CPU Hotspots Report
    This report displays a list of executed functions with CPU Time metrics, module names, source file paths, and other parameters. The report also lists the hottest program units, starting with the most performance-critical unit. The grouping, filtering, and sorting options described in the following items arrange the data in the tabular view. The basic command is:
    vtune -report hotspots -r ./result_gpu-offload
    Figure: Hotspots Report
  • CPU Hotspots Report Filtered by Module and Grouped by Function
    Use the -filter option to focus on a specific part of the report, such as a particular module. You can then use the -group-by option to group the results.
    vtune -report hotspots -r ./result_gpu-offload -group-by=function -filter module=matrix.dpcpp -q
    You can group the generated data in several ways, such as by function name, module, source file path, or computing task.
    To see available groupings for a specific result, type:
    vtune -report hotspots -r ./result_gpu-offload -group-by=?
  • CPU Hotspots Report Sorted by Order
    Use the -sort-desc and -sort-asc options to sort specific information about hotspots in descending or ascending order. You can specify an order for up to three columns.
    vtune -report hotspots -r result_gpu-offload -group-by module -sort-desc="CPU Time:Execution" -q
    Here is another example:
    vtune -report hotspots -r result_gpu-offload -group-by module -sort-asc="CPU Time:Idle" -q
    To see available columns for a specific result, type:
    vtune -report hotspots -r ./result_gpu-offload -column=?
    The report data can contain columns such as CPU Time:Self and Source File.
  • Report of Top 'n' Time-Intensive Program Modules
    Use the -limit option to see information about the top 'n' hotspots. For example, to see details about the top five time-intensive program modules in your application, type:
    vtune -report hotspots -r result_gpu-offload -group-by module -sort-desc="CPU Time" -limit=5 -q
  • Hotspots Report Grouped by Computing Task (offloaded on GPU) with Transfer Columns
    This command displays hotspots information grouped by GPU computing task and also lists details about transfer sizes and transfer times between CPU and GPU:
    vtune -report hotspots -r ./result_gpu-offload -group-by=computing-task -column=Transfer -q
    The report contains data transfers that are attributed to the respective computing task.
  • Hotspots Report Grouped by GPU Offload Computing Task and Time Columns
    This command displays hotspots information grouped by offload computing tasks and also lists details about transfer times between CPU and GPU:
    vtune -report hotspots -r ./result_gpu-offload -group-by=computing-task-offload -column='Time' -q
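When you want the same result viewed under several groupings, the report commands above can be driven from a loop. The following is a sketch under stated assumptions: the function name hotspots_by_groupings and the VTUNE variable are hypothetical, and the grouping names passed in must be ones that -group-by=? lists for your result.

```shell
# Command used to invoke the tool; override (for example VTUNE=echo) to preview.
VTUNE=${VTUNE:-vtune}

# hotspots_by_groupings RESULT_DIR GROUPING [GROUPING...]
# Prints one hotspots report per grouping, each preceded by a separator line.
# Stops at the first report command that fails.
hotspots_by_groupings() {
    result_dir=$1
    shift
    for grouping in "$@"; do
        echo "== group-by: $grouping =="
        "$VTUNE" -report hotspots -r "$result_dir" -group-by="$grouping" -q || return 1
    done
}
```

For example, hotspots_by_groupings ./result_gpu-offload module function computing-task produces three reports in sequence.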

Run GPU Compute/Media Hotspots Analysis

Our next step is to run the GPU Compute/Media Hotspots analysis. This analysis can help us further explore performance improvements for the GPU-bound application or its stages.
Type this command to run the analysis:
vtune -collect gpu-hotspots -r ./result_gpu-hotspots -- ./matrix.dpcpp
To see the summary report, type:
vtune -report summary -r ./result_gpu-hotspots
  • Computing Tasks with L3 Metrics
    Use this command to generate a report that lists only L3 metrics for computing tasks:
    vtune -report hotspots -r result_gpu-hotspots -group-by=computing-task -column='L3' -q
  • Analysis with Dynamic Instruction Count and SIMD Utilization
    Run the GPU Compute/Media Hotspots Analysis in the Characterization mode to collect data on dynamic instruction count and SIMD utilization:
    vtune -collect gpu-hotspots -knob characterization-mode=instruction-count -r ./result_gpu-hotspots_inst-count -- ./matrix.dpcpp
  • Source Code for Specific Computing Tasks
    Use this command to get the source code for a specific computing task:
    vtune -report hotspots -r result_gpu-hotspots_inst-count -source-object computing-task="Matrix1_1<float>" -group-by=gpu-source-line -column="Source","GPU Instructions Executed:Int32 & SP Float" -q
  • Assembly Code for Specific Computing Tasks
    Use this command to get the assembly code for a specific computing task:
    vtune -report hotspots -r result_gpu-hotspots_inst-count -source-object computing-task="Matrix1_1<float>" -group-by=address -limit=5 -q
  • Save Report as CSV File
    Use the -report-output option to save the generated report as a file. To generate a CSV report, use the -format and -csv-delimiter options:
    vtune -report hotspots -r result_gpu-hotspots_inst-count -source-object computing-task="Matrix1_1<float>" -group-by=address -limit=5 -report-output=result.csv -format=csv -csv-delimiter=comma -q
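Once a report is saved as CSV, standard text tools can post-process it. The following is a minimal sketch that extracts one column by its header name with awk. It assumes a comma-delimited file with a single header row and no quoted fields that contain commas; the function name csv_column is hypothetical, and the column names in your file depend on the report you generated.

```shell
# csv_column FILE HEADER
# Prints the values of the column whose header equals HEADER, one per line.
# Exits with a non-zero status when no such header exists.
# Assumption: simple CSV, comma-delimited, no commas inside quoted fields.
csv_column() {
    awk -F',' -v name="$2" '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            if (!col) exit 1
            next
        }
        { print $col }
    ' "$1"
}
```

For example, csv_column result.csv "Source" would print the Source column if the report contains a column with that exact header.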