Programming Guide

oneAPI Debug Tools

The following tools are available to help with debugging the DPC++ and OpenMP* offload process.
  • Environment variables - allow you to gather diagnostic information from the OpenMP and DPC++ runtimes at program execution with no modifications to your program.
  • ze_tracer tool from Profiling Tools Interfaces for GPU (PTI for GPU) - when using the oneAPI Level Zero backend for DPC++ and OpenMP Offload, this tool can be used to debug backend errors and for performance profiling on both the host and device.
  • Intercept Layer for OpenCL™ Applications - when using the OpenCL backend for DPC++ and OpenMP Offload, this library can be used to debug backend errors and for performance profiling on both the host and device.
  • Intel® Distribution for GDB* - used for source-level debugging of the application, typically to inspect logical bugs, on the host and any devices you are using (CPU, GPU, FPGA emulation).
  • Intel® Inspector - helps locate and debug memory and threading problems, including those that can cause offloading to fail. Intel Inspector is included in the Intel oneAPI HPC Toolkit or the Intel oneAPI IoT Toolkit.
In-application debugging
In addition to these tools and runtime based approaches, the developer can locate problems using other approaches. For example:
  • Comparing kernel output to expected output
  • Sending intermediate results back in variables created for debugging purposes
  • Printing results from within kernels. Both DPC++ and OpenMP allow printing to stdout from within an offload region - be sure to note which SIMD lane or thread is providing the output.

Debug Environment Variables

Both the OpenMP* and DPC++ offload runtimes provide environment variables that let you understand the communication between the host and offload device. The variables also allow you to discover or control the runtime chosen for offload computations.
OpenMP* Offload Environment Variables
There are several environment variables that you can use to understand how OpenMP Offload works and control which backend it uses.
OpenMP is not supported for FPGA devices.
Environment Variable
Description
LIBOMPTARGET_DEBUG
This environment variable enables debug output from the OpenMP Offload runtime. It reports:
  • The available runtimes detected and used (1,2)
  • When the chosen runtime is started and stopped (1,2)
  • Details on the offload device used (1,2)
  • Support libraries loaded (1,2)
  • Size and address of all memory allocations and deallocations (1,2)
  • Information on every data copy to and from the device, or device mapping in the case of unified shared memory (1,2)
  • When each kernel is launched and details on the launch (arguments, SIMD width, group information, etc.) (1,2)
  • Which Level Zero/OpenCL API functions are invoked (function name, arguments/parameters) (2)
Values: 0, 1, 2 (the numbers in parentheses above indicate which values produce each category of output)
Default: 0 (no debug output)
LIBOMPTARGET_PLUGIN_PROFILE
This variable enables the display of performance data for offloaded OpenMP code. It displays:
  • Total data transfer times (read and write)
  • Data allocation times
  • Module build times (just-in-time compile)
  • The execution time of each kernel
Values:
  • F - disabled
  • T - enabled, with timings in milliseconds
  • T,usec - enabled, with timings in microseconds
Default: F
Example: export LIBOMPTARGET_PLUGIN_PROFILE=T,usec
LIBOMPTARGET_PLUGIN
This environment variable allows you to choose the backend used for OpenMP offload execution.
The Level Zero backend is only supported for GPU devices.
Values:
  • LEVEL0 - uses the Level Zero backend
  • OPENCL - uses the OpenCL™ backend
Default:
  • For GPU offload devices: LEVEL0
  • For CPU or FPGA offload devices: OPENCL
SYCL* and DPC++ Environment Variables
DPC++ programs support all standard SYCL environment variables. The full list is available from GitHub. Of interest for debugging are the following SYCL environment variables, plus an additional Level Zero environment variable.
Environment Variable
Description
SYCL_DEVICE_FILTER
This complex environment variable allows you to limit the runtimes, compute device types, and compute device IDs used by the DPC++ runtime to a subset of all available combinations.
The compute device IDs correspond to those returned by the SYCL API, clinfo, or sycl-ls (with the numbering starting at 0) and have no relation to whether the device with that ID is of a certain type or supports a specific runtime. Using a programmatic special selector (like gpu_selector) to request a device filtered out by SYCL_DEVICE_FILTER will cause an exception to be thrown.
Refer to the Environment Variables descriptions in GitHub for additional details: https://github.com/intel/llvm/blob/sycl/sycl/doc/EnvironmentVariables.md
Example values include:
  • opencl:cpu - use only the OpenCL™ runtime on all available CPU devices
  • opencl:gpu - use only the OpenCL runtime on all available GPU devices
  • opencl:gpu:2 - use only the OpenCL runtime on only the third device (ID 2), which also has to be a GPU
  • level_zero:gpu:1 - use only the Level Zero runtime on only the second device (ID 1), which also has to be a GPU
  • opencl:cpu,level_zero - use only the OpenCL runtime on the CPU device, or the Level Zero runtime on any supported compute device
Default: use all available runtimes and devices
SYCL_PI_TRACE
This environment variable enables debug output from the DPC++ runtime.
Values:
  • 1 - report SYCL plugins and devices discovered and used
  • 2 - report SYCL API calls made, including arguments and result values
  • -1 - provides all available tracing
Default: disabled
ZE_DEBUG
This environment variable enables debug output from the Level Zero backend when used with the DPC++ runtime. It reports:
  • Level Zero APIs called
  • Level Zero event information
Values: any; defining the variable with any value enables the output
Default: disabled
Environment Variables that Produce Diagnostic Information for Support
The Level Zero backend provides a few environment variables that can be used to control behavior and aid in diagnosis.
An additional source of debug information comes from the Intel® Graphics Compiler, which is called by the Level Zero or OpenCL backends (used by both the OpenMP Offload and DPC++ Runtimes) at runtime or during Ahead-of-Time (AOT) compilation. Intel Graphics Compiler creates the appropriate executable code for the target offload device. The full list of these environment variables can be found at https://github.com/intel/intel-graphics-compiler/blob/master/documentation/configuration_flags.md. The two that are most often needed to debug performance issues are:
  • IGC_ShaderDumpEnable=1 (default=0) - causes all LLVM, assembly, and ISA code generated by the Intel® Graphics Compiler to be written to /tmp/IntelIGC/<application_name>
  • IGC_DumpToCurrentDir=1 (default=0) - writes all the files created by IGC_ShaderDumpEnable to your current directory instead of /tmp/IntelIGC/<application_name>. Since this is potentially a lot of files, it is recommended to create a temporary directory just for the purpose of holding these files.
If you have a performance issue with your OpenMP offload or DPC++ offload application that arises between different versions of Intel® oneAPI, when using different compiler options, when using the debugger, and so on, you may be asked to enable IGC_ShaderDumpEnable and provide the resulting files. For more information on compatibility, see oneAPI Library Compatibility.

Offload Intercept Tools

In addition to debuggers and diagnostics built into the offload software itself, it can be quite useful to monitor the API calls and data sent through the offload pipeline. For OpenCL™, you can add a library to LD_LIBRARY_PATH that will intercept and report on all OpenCL calls, and then use environment variables to control what diagnostic information to report to a file. For Level Zero, your application is run as an argument to the ze_tracer tool, which intercepts and reports on various aspects of Level Zero API calls based on how ze_tracer is called.
Intercept Layer for OpenCL™ Applications
This library collects debugging and performance data when OpenCL is the backend for your DPC++ or OpenMP offload program. It can help you detect buffer overwrites, memory leaks, and mismatched pointers, and can provide more detailed information about runtime error messages, letting you diagnose these issues whether CPU, FPGA, or GPU devices are used for computation. Note that you will get nothing useful if you use ze_tracer on a program that uses the OpenCL backend, or the Intercept Layer for OpenCL Applications library on a program that uses the Level Zero backend.
ze_tracer
Like the Intercept Layer for OpenCL™ Applications, this tool collects debugging and performance data when Level Zero, rather than OpenCL, is the backend for DPC++ or OpenMP*. Note that Level Zero can only be used as the backend for computations that happen on the GPU (there is no Level Zero backend for the CPU or FPGA at this time). The ze_tracer tool is part of the Profiling Tools Interfaces for GPU (PTI for GPU) project, found at https://github.com/intel/pti-gpu.
The ze_tracer tool is distributed as source. Instructions for how to build the tool are available from https://github.com/intel/pti-gpu/tree/master/tools/ze_tracer. The tool provides the following features:
  • Call logging: This mode allows you to trace all standard Level Zero (L0) API calls along with their arguments and return values annotated with time stamps. Among other things, this can give you supplemental information on any failures that occur when a host program tries to make use of an attached compute device.
  • Host and device timing: These provide the duration of all API calls, the duration of each kernel, and application runtime for the entire application.
  • Device Timeline mode: Gives time stamps for each device activity. All the time stamps are in the same (CPU) time scale.
  • Browser visualization: results of the Call Logging and Device Timeline modes can be dumped into a trace (JSON) file and visualized in a browser.
This data can help you debug offload failures or performance issues.

Intel® Distribution for GDB*

The Intel Distribution for GDB* is an application debugger that allows you to inspect and modify the program state. With the debugger, both the host part of your application and kernels that are offloaded to a device can be debugged seamlessly in the same debug session. The debugger supports the CPU, GPU, and FPGA-emulation devices. Major features of the tool include:
  • Automatically attaching to the GPU device to listen to debug events
  • Automatically detecting JIT-compiled, or dynamically loaded, kernel code for debugging
  • Defining breakpoints (both inside and outside of a kernel) to halt the execution of the program
  • Listing the threads; switching the current thread context
  • Listing active SIMD lanes; switching the current SIMD lane context per thread
  • Evaluating and printing the values of expressions in multiple thread and SIMD lane contexts
  • Inspecting and changing register values
  • Disassembling the machine instructions
  • Displaying and navigating the function call-stack
  • Source- and instruction-level stepping
  • Non-stop and all-stop debug mode
  • Recording the execution using Intel Processor Trace (CPU only)
For more information and links to full documentation for Intel Distribution for GDB, see
Get Started with Intel Distribution for GDB on Linux* Host | Windows* Host
.

Intel® Inspector for Offload

Intel® Inspector is a dynamic memory and threading error checking tool for users developing serial and multithreaded applications. It can be used to verify correctness of the native part of the application as well as dynamically generated offload code.
Unlike the tools and techniques above, Intel Inspector cannot be used to catch errors in offload code that is communicating with a GPU or an FPGA. Instead, it requires that the DPC++ or OpenMP runtime be configured to execute kernels on a CPU target. In general, this means defining the following environment variables prior to an analysis run.
  • To configure a DPC++ application to run kernels on a CPU device:
    export SYCL_DEVICE_FILTER=opencl:cpu
  • To configure an OpenMP application to run kernels on a CPU device:
    export OMP_TARGET_OFFLOAD=MANDATORY
    export LIBOMPTARGET_DEVICETYPE=cpu
  • To enable code analysis and tracing in JIT compilers or runtimes:
    export CL_CONFIG_USE_VTUNE=True
    export CL_CONFIG_USE_VECTORIZER=false
Use one of the following commands to start analysis from the command line. You can also start from the Intel Inspector graphical user interface.
  • Memory:
    inspxe-cl -c mi3 -- <app> [app_args]
  • Threading:
    inspxe-cl -c ti3 -- <app> [app_args]
View the analysis result using the following command:
inspxe-cl -report=problems -report-all
If your DPC++ or OpenMP Offload program passes bad pointers to the OpenCL™ backend, or passes the wrong pointer to the backend from the wrong thread, Intel Inspector should flag the issue. This may make the problem easier to find than trying to locate it using the intercept layers or the debugger.
Additional details are available from the
Intel Inspector User Guide for Linux* OS | Windows* OS
.
