Programming Guide

Contents

Debug the Offload Process

Run with Different Runtimes or Compute Devices

When an offload program fails to run correctly or produces incorrect results, a relatively quick sanity check is to run the application on a different runtime (OpenCL™ vs. Level Zero) or compute device (CPU vs. GPU) using
LIBOMPTARGET_PLUGIN
and
OMP_TARGET_OFFLOAD
for OpenMP* applications, and
SYCL_DEVICE_FILTER
for DPC++ applications. Errors that reproduce across runtimes mostly eliminate the runtime as being a problem. Errors that reproduce on all available devices mostly eliminates bad hardware as the problem.

Debug CPU Execution

Offload code has two options for CPU execution: either the "host" implementation, or the CPU version of OpenCL. The "host" implementation is a truly native implementation of the offloaded code, meaning it can be debugged like any of the non-offloaded code. The CPU version of OpenCL, while it goes through the OpenCL runtime and code generation process, eventually ends up as normal parallel code running under a TBB runtime. Again, this provides a familiar debugging environment with familiar assembly and parallelism mechanisms. Pointers have meaning through the entire stack, and data can be directly inspected. There are also no memory limits beyond the usual limits for any operating system process.
Finding and fixing errors in CPU offload execution often solves errors seen in GPU offload execution with much less pain, and without requiring use of a system with an attached GPU or other accelerator.
For OpenMP applications, to get the "host" implementation, remove the "target" or "device" constructs, replacing them with normal host OpenMP code. If
LIBOMPTARGET_PLUGIN=OPENCL
and offload to the GPU is disabled, then the offloaded code runs under the OpenMP runtime with TBB providing parallelism.
For DPC++ applications, with
SYCL_DEVICE_FILTER=host
the "host" device is actually single-threaded, which may help you determine if threading issues, such as data races and deadlocks, are the source of execution errors. Setting
SYCL_DEVICE_FILTER=opencl:cpu
uses the CPU OpenCL runtime, which also uses TBB for parallelism.

Debug GPU Execution Using Intel® Distribution for GDB*

Intel® Distribution for GDB* is extensively documented in
Get Started with Intel Distribution for GDB on Linux* Host | Windows* Host
. However, since debugging applications with GDB* on a GPU differs slightly from the process on the host (some commands are used differently and you might see some unfamiliar output), some of those differences are summarized here.
The Debugging with Intel Distribution for GDB on Linux OS Host Tutorial shows a sample debug session where we start a debug session of a DPC++ program, define a breakpoint inside the kernel, run the program to offload to the GPU, print the value of a local variable, switch to the SIMD lane 5 of the current thread, and print the variable again.
As in normal GDB*, for a command
<CMD>
, use the
help <CMD>
command of GDB to read the information text for
<CMD>
. For example:
(gdb) help info threads Display currently known threads. Usage: info threads [OPTION]... [ID]... If ID is given, it is a space-separated list of IDs of threads to display. Otherwise, all threads are displayed. Options: -gid Show global thread IDs.
Inferiors, Threads, and SIMD Lanes Referencing in GDB*
The threads of the application can be listed using the debugger. The printed information includes the thread ids and the locations that the threads are currently stopped at. For the GPU threads, the debugger also prints the active SIMD lanes.
In the example referenced above, you may see some unfamiliar formatting used when threads are displayed via the GDB "info threads" command:
Id Target Id Frame 1.1 Thread <id omitted> <frame omitted> 1.2 Thread <id omitted> <frame omitted> 2.1 Thread 1610612736 (inactive) * 2.2:1 Thread 1073741824 <frame> at array-transform.cpp:61 2.2:[3 5 7] Thread 1073741824 <frame> at array-transform.cpp:61 2.3:[1 3 5 7] Thread 1073741888 <frame> at array-transform.cpp:61 2.4:[1 3 5 7] Thread 1073742080 <frame> at array-transform.cpp:61
Here, GDB is displaying the threads with the following format:
<inferior_number>.<thread_number>:<SIMD Lane/s>
So, for example, the thread id "
2.3:[1 3 5 7]
" refers to SIMD lanes 1, 3, 5, and 7 of thread 3 running on inferior 2.
An "inferior" in the GDB terminology is the process that is being debugged. In the debug session of a program that offloads to the GPU, there will typically be two inferiors; one "native" inferior representing the host part of the program (inferior 1 above), and another "remote" inferior representing the GPU device (inferior 2 above). Intel Distribution for GDB automatically creates the GPU inferior - no extra steps are required.
When you print the value of an expression, the expression is evaluated in the context of the current thread's current SIMD lane. You can switch the thread as well as the SIMD lane to change the context using the "thread" command such as "
thread 3:4
", "
thread :6
", or "
thread 7
". The first command makes a switch to the thread 3 and SIMD lane 4. The second command switches to SIMD lane 6 within the current thread. The third command switches to thread 7. The default lane selected will either be the previously selected lane, if it is active, or the first active lane within the thread.
The "thread apply command" may be similarly broad or focused (which can make it easier to limit the output from, for example, a command to inspect a variable). For more details and examples about debugging with SIMD lanes, see the Debugging with Intel Distribution for GDB on Linux OS Host Tutorial.
Controlling the Scheduler
By default, when a thread hits a breakpoint, the debugger stops all the threads before displaying the breakpoint hit event to the user. This is the all-stop mode of GDB. In the non-stop mode, the stop event of a thread is displayed while the other threads run freely.
In all-stop mode, when a thread is resumed (for example, to resume normally with the
continue
command, or for stepping with the
next
command), all the other threads are also resumed. If you have some breakpoints set in threaded applications, this can quickly get confusing, as the next thread that hits the breakpoint may not be the thread you are following.
You can control this behavior using the
set scheduler-locking
command to prevent resuming other threads when the current thread is resumed. This is useful to avoid intervention of other threads while only the current thread executes instructions. Type
help set scheduler-locking
for the available options, and see https://sourceware.org/gdb/current/onlinedocs/gdb/Thread-Stops.html for more information. Note that SIMD lanes cannot be resumed individually; they are resumed together with their underlying thread.
In non-stop mode, by default, only the current thread is resumed. To resume all threads, pass the "
-a
" flag to the
continue
command.
Dumping Information on One or More Threads/Lanes (Thread Apply)
Commands for inspecting the program state are typically executed in the context of the current thread's current SIMD lane. Sometimes it is desired to inspect a value in multiple contexts. For such needs, the
thread apply
command can be used. For instance, the following executes the
print element
command for the SIMD lanes 3-5 of Thread 2.5:
(gdb) thread apply 2.5:3-5 print element
Similarly, the following runs the same command in the context of SIMD lane 3, 5, and 6 of the current thread:
(gdb) thread apply :3 :5 :6 print element
Stepping GPU Code After a Breakpoint
To stop inside the kernel that is offloaded to the GPU, simply define a breakpoint at a source line inside the kernel. When a GPU thread hits that source line, the debugger stops the execution and shows the breakpoint hit. To single-step a thread over a source-line, use the
step
or
next
commands. The
step
commands steps into functions while
next
steps over calls. Before stepping, we recommend to
set scheduler-locking step
to prevent intervention of other threads.
Building a DPC++ Executable for Use with Intel® Distribution for GDB*
Much like when you want to debug a host application, you need to set some additional flags to create a binary that can be debugged on the GPU. See Get Started with Intel Distribution for GDB on Linux* Host for details.
For a smooth debug experience when using the just-in-time (JIT) compilation flow, enable debug information emission from the compiler via the
-g
flag, and disable optimizations via the
-O0
flag for both the host and JIT-compiled kernel of the application. The flags for the kernel are taken during link time. For example:
  • Compile your program using:
    dpcpp -g -O0 -c myprogram.cpp
  • Link your program using:
    dpcpp -g -O0 myprogram.o
If you are using CMake to configure the build of your program, use the
Debug
type for the
CMAKE_BUILD_TYPE
, and append
-O0
to the
CMAKE_CXX_FLAGS_DEBUG
variable. For example:
set (CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} -O0")
Applications that are built for debugging may take a little longer to start up than when built with the usual "release" level of optimization. Thus, your program may appear to run a little more slowly when started in the debugger. If this causes problems, developers of larger applications may want to use ahead-of-time (AOT) compilation to JIT the offload code when their program is built, rather than when it is run (warning, this may also take longer to build when using
-g -O0
). For more information, see Compilation Flow Overview.
When doing ahead-of-time compilation for GPU, you must use a device type that fits your target device. Run the following command to see the available GPU device options on your current machine:
ocloc compile --help
Additionally, the debug mode for the kernel must be enabled. The following example AoT compilation command targets the KBL device:
dpcpp -g -O0 -fsycl-targets=spir64_gen-unknown-unknown-sycldevice \ -Xs "-device kbl -internal_options -cl-kernel-debug-enable -options -cl-opt-disable" myprogram.cpp
Building an OpenMP* Executable for use with Intel® Distribution for GDB*
Compile and link your program using the
-g -O0
flags. For example:
icpx -fiopenmp -O0 -fopenmp-targets=spir64 -c -g myprogram.cpp icpx -fiopenmp -O0 -fopenmp-targets=spir64 -g myprogram.o
Set the following environment variables to disable optimizations and enable debug info for the kernel:
export LIBOMPTARGET_OPENCL_COMPILATION_OPTIONS="-g -cl-opt-disable" export LIBOMPTARGET_LEVEL0_COMPILATION_OPTIONS="-g -cl-opt-disable"
Ahead-of-time (AOT) compilation is currently not supported for OpenMP.

Debugging GPU Execution Using Offload Intercept Tools

A common issue with offload programs is that they may to fail to run at all, instead giving a generic OpenCL™ error with little additional information. Both the Intercept Layer for OpenCL Applications and ze_tracer can be used to get more information about these errors, often helping the developer identify the source of the problem.
Intercept Layer for OpenCL Applications
Using this library, in particular the
Buildlogging
,
ErrorLogging
, and
USMChecking=1
options, you can often find the source of the error.
  1. Create a
    clintercept.conf
    file in the home directory with the following content:
    SimpleDumpProgramSource=1 CallLogging=1 LogToFile=1 //KernelNameHashTracking=1 BuildLogging=1 ErrorLogging=1 USMChecking=1 //ContextCallbackLogging=1 // Profiling knobs KernelInfoLogging=1 DevicePerformanceTiming=1 DevicePerformanceTimeLWSTracking=1 DevicePerformanceTimeGWSTracking=1
  2. Run the application with cliloader as follows:
    <OCL_Intercept_Install_Dir>/bin/cliloader/cliloader -d ./<app_name> <app_args>
  3. Review the following results in the
    ~CLIntercept_Dump/<app_name>
    directory:
    • clintercept_report.txt: Profiling results
    • clintercept_log.txt: Log of OpenCL™ calls used to debug OpenCL issues
The following snippet is from an example log file generated by a program that returned the runtime error:
CL_INVALID_ARG_VALUE (-50)
... <<<< clSetKernelArgMemPointerINTEL -> CL_SUCCESS >>>> clGetKernelInfo( _ZTSZZ10outer_coreiP5mesh_i16dpct_type_1c0e3516dpct_type_60257cS2_S2_S2_S2_S2_S2_S2_S2_fS2_S2_S2_S2_iENKUlRN2cl4sycl7handlerEE197->45clES6_EUlNS4_7nd_itemILi3EEEE225->13 ): param_name = CL_KERNEL_CONTEXT (1193) <<<< clGetKernelInfo -> CL_SUCCESS >>>> clSetKernelArgMemPointerINTEL( _ZTSZZ10outer_coreiP5mesh_i16dpct_type_1c0e3516dpct_type_60257cS2_S2_S2_S2_S2_S2_S2_S2_fS2_S2_S2_S2_iENKUlRN2cl4sycl7handlerEE197->45clES6_EUlNS4_7nd_itemILi3EEEE225->13 ): kernel = 0xa2d51a0, index = 3, value = 0x41995e0 mem pointer 0x41995e0 is an UNKNOWN pointer and no device support shared system pointers! ERROR! clSetKernelArgMemPointerINTEL returned CL_INVALID_ARG_VALUE (-50) <<<< clSetKernelArgMemPointerINTEL -> CL_INVALID_ARG_VALUE
In this example, the following values help with debugging the error:
  • ZTSZZ10outer_coreiP5mesh
  • index = 3, value = 0x41995e0
Using this data, you can identify which kernel had the problems, what argument was problematic, and why.
ze_tracer
Similar to Intercept Layer for OpenCL Applications, the ze_tracer tool can help find the source of errors for a Level Zero runtime.
To use the ze_tracer tool:
  1. Use Call Logging mode to run the application. Redirecting the tool output to a file is optional, but recommended.
    ./ze_tracer –c ./<app_name> <app_args> [2> log.txt]
  2. Review the call trace to figure out the error (
    log.txt
    ). For example:
    >>>> [102032049] zeKernelCreate: hModule = 0x55a68c762690 desc = 0x7fff865b5570 {29 0 0 GEMM} phKernel = 0x7fff865b5438 (hKernel = 0) <<<< [102060428] zeKernelCreate [28379 ns] hKernel = 0x55a68c790280 -> ZE_RESULT_SUCCESS (0) … >>>> [102249951] zeKernelSetGroupSize: hKernel = 0x55a68c790280 groupSizeX = 256 groupSizeY = 1 groupSizeZ = 1 <<<< [102264632] zeKernelSetGroupSize [14681 ns] -> ZE_RESULT_SUCCESS (0) >>>> [102278558] zeKernelSetArgumentValue: hKernel = 0x55a68c790280 argIndex = 0 argSize = 8 pArgValue = 0x7fff865b5440 <<<< [102294960] zeKernelSetArgumentValue [16402 ns] -> ZE_RESULT_SUCCESS (0) >>>> [102308273] zeKernelSetArgumentValue: hKernel = 0x55a68c790280 argIndex = 1 argSize = 8 pArgValue = 0x7fff865b5458 <<<< [102321981] zeKernelSetArgumentValue [13708 ns] -> ZE_RESULT_ERROR_INVALID_ARGUMENT (2013265924) >>>> [104428764] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 2 argSize = 8 pArgValue = 0x7ffe289c7e60 <<<< [104442529] zeKernelSetArgumentValue [13765 ns] -> ZE_RESULT_SUCCESS (0) >>>> [104455176] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 3 argSize = 4 pArgValue = 0x7ffe289c7e2c <<<< [104468472] zeKernelSetArgumentValue [13296 ns] -> ZE_RESULT_SUCCESS (0) ...
    The example log data shows:
    • A level zero API call that causes the problem (
      zeKernelSetArgumentValue
      )
    • The problem reason (
      ZE_RESULT_ERROR_INVALID_ARGUMENT
      )
    • The argument index (
      argIndex = 1
      )
    • An invalid value location (
      pArgValue = 0x7fff865b5458
      )
    • A kernel handle (
      hKernel = 0x55a68c790280
      ), which provides the name of the kernel for which this issue is observed (GEMM)
More information could be obtained by omitting the "redirection to file" option and dumping all the output (application output + tool output) into one stream. Dumping to one stream may help determine the source of the error in respect to application output (for example, you can find that the error happens between application initialization and the first phase of computations):
Level Zero Matrix Multiplication (matrix size: 1024 x 1024, repeats 4 times) Target device: Intel(R) Graphics [0x3ea5] ... >>>> [104131109] zeKernelCreate: hModule = 0x55af5f39ca10 desc = 0x7ffe289c7f80 {29 0 0 GEMM} phKernel = 0x7ffe289c7e48 (hKernel = 0) <<<< [104158819] zeKernelCreate [27710 ns] hKernel = 0x55af5f3ca600 -> ZE_RESULT_SUCCESS (0) ... >>>> [104345820] zeKernelSetGroupSize: hKernel = 0x55af5f3ca600 groupSizeX = 256 groupSizeY = 1 groupSizeZ = 1 <<<< [104360082] zeKernelSetGroupSize [14262 ns] -> ZE_RESULT_SUCCESS (0) >>>> [104373679] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 0 argSize = 8 pArgValue = 0x7ffe289c7e50 <<<< [104389443] zeKernelSetArgumentValue [15764 ns] -> ZE_RESULT_SUCCESS (0) >>>> [104402448] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 1 argSize = 8 pArgValue = 0x7ffe289c7e68 <<<< [104415871] zeKernelSetArgumentValue [13423 ns] -> ZE_RESULT_ERROR_INVALID_ARGUMENT (2013265924) >>>> [104428764] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 2 argSize = 8 pArgValue = 0x7ffe289c7e60 <<<< [104442529] zeKernelSetArgumentValue [13765 ns] -> ZE_RESULT_SUCCESS (0) >>>> [104455176] zeKernelSetArgumentValue: hKernel = 0x55af5f3ca600 argIndex = 3 argSize = 4 pArgValue = 0x7ffe289c7e2c <<<< [104468472] zeKernelSetArgumentValue [13296 ns] -> ZE_RESULT_SUCCESS (0) ... Matrix multiplication time: 0.0427564 sec Results are INCORRECT with accuracy: 1 ... Matrix multiplication time: 0.0430995 sec Results are INCORRECT with accuracy: 1 ... Total execution time: 0.381558 sec

Correctness

Offload code is often used for kernels that can efficiently process large amounts of information on the attached compute device, or to generate large amounts of information from some input parameters. If these kernels are running without crashing, this can often mean that you learn that they are not producing the correct results much later in program execution.
In these cases, it can be difficult to identify which kernel is producing incorrect results. One technique for finding the kernel producing incorrect data is to run the program twice, once using a purely host-based implementation, and once using an offload implementation, capturing the inputs and outputs from every kernel (often to individual files). Now compare the results and see which kernel call is producing unexpected results (within a certain epsilon - the offload hardware may have a different order of operation or native precision that causes the results to differ from the host code in the last digit or two).
Once you know which kernel is producing incorrect results, use Intel Distribution for GDB to determine the reason. See the Debugging with Intel Distribution for GDB on Linux OS Host Tutorial for basic information and links to more detailed documentation.
Both DPC++ and OpenMP* also allow for the use of standard language print mechanisms (
printf
for DPC++ and C++ OpenMP offload,
print *, …
for Fortran OpenMP offload) within offloaded kernels, which you can use to verify correct operation while they run. Print the thread and SIMD lane the output is coming from and consider adding synchronization mechanisms to ensure printed information is in a consistent state when printed. Examples for how to do this in DPC++ using the stream class can be found in the Intel oneAPI GPU Optimization Guide. You could use a similar approach to the one described for DPC++ for OpenMP offload.
Using
printf
can be verbose in DPC++ kernels. To simplify, add the following macro:
#ifdef __SYCL_DEVICE_ONLY__ #define CL_CONSTANT __attribute__((opencl_constant)) #else #define CL_CONSTANT #endif #define PRINTF(format, ...) { \ static const CL_CONSTANT char _format[] = format; \ sycl::ONEAPI::experimental::printf(_format, ## __VA_ARGS__) }
Usage example:
PRINTF("My integer variable:%d\n, (int) x);

Failures

Just-in-time (JIT) compilation failures that occur at runtime due to incorrect use of the DPC++ or OpenMP* offload languages will cause your program to exit with an error.
In the case of DPC++, if you cannot find these using ahead-of-time compilation of your DPC++ code, selecting the OpenCL backend, setting SimpleDumpProgramSource and BuildLogging, and using the Intercept Layer for OpenCL Applications may help identify the kernel with the syntax error.
Logic errors can also result in crashes or error messages during execution. Such issues can include:
  • Passing a buffer that belongs to the wrong context to a kernel
  • Passing the "this" pointer to a kernel rather than a class element
  • Passing a host buffer rather than a device buffer
  • Passing an uninitialized pointer, even if it is not used in the kernel
Using the Intel® Distribution for GDB* (or even the native GDB), if you watch carefully, you can record the addresses of all contexts created and verify that the address being passed to an offload kernel belongs to the correct context. Likewise, you can verify that the address of a variable passed matches that of the variable itself, and not its containing class.
It may be easier to track buffers and addresses using the Intercept Layer for OpenCL™ allocation or ze_tracer and choosing the appropriate backend. When using the OpenCL backend, setting
CallLogging
,
BuildLogging
,
ErrorLogging
, and
USMChecking
and running your program should produce output that explains what error in your code caused the generic OpenCL error to be produced. Using ze_tracer's Call Logging or Device Timeline should give additional enhanced error information to help you better understand the source of generic errors from the Level Zero backend. This can help locate many of the logic errors mentioned above.
If the code is giving an error when offloading to a device using the Level Zero backend, try using the OpenCL backend. If the program works, report an error against the Level Zero backend. If the error reproduces in the OpenCL backend to the device, try using the OpenCL CPU backend. In OpenMP offload, this can be specified by setting
OMP_TARGET_OFFLOAD
to
CPU
. For DPC++, this can be done by setting
SYCL_DEVICE_FILTER=opencl:cpu
. Debugging with everything on the CPU can be easier, and removes complications caused by data copies and translation of the program to a non-CPU device.
As an example of a logic issue that can get you in trouble, consider what is captured by the lambda function used to implement the
parallel_for
in this DPC++ snippet.
class MyClass { private: int *data; int factor; : void run() { : auto data2 = data; auto factor2 = factor; { dpct::get_default_queue_wait().submit([&](cl::sycl::handler &cgh) { auto dpct_global_range = grid * block; auto dpct_local_range = block; cgh.parallel_for<dpct_kernel_name<class kernel_855a44>>( cl::sycl::nd_range<1>( cl::sycl::range<1> dpct_global_range.get(0)), cl::sycl::range<1>( dpct_local_range.get(0))), [=](cl::sycl::nd_item<3> item_ct1) { kernel(data, b, factor, LEN, item_ct1); // This blows up }); }); } } // run } // MyClass
In the above code snippet, the program crashes because
[=]
will copy by value all variables used inside the lambda. In the example it may not be obvious that "
factor
" is really "
this->factor
" and "
data
" is really "
this->data,
" so "
this
" is the variable that is captured for the use of "
data
" and "
factor
" above. OpenCL or Level Zero will crash with an illegal arguments error in the "
kernel(data, b, factor, LEN, item_ct1)
" call.
The fix is the use of local variables
auto data2
and
auto factor2
. "
auto factor2 = factor
" becomes "
int factor2 = this->factor
" so using
factor2
inside the lambda with
[=]
would capture an "
int
". We would rewrite the inner section as "
kernel(data2, b, factor2, LEN, item_ct1);
" .
This issue is commonly seen when migrating CUDA* kernels. You can also resolve the issue by keeping the same CUDA kernel launch signature and placing the command group and lambda inside the kernel itself.
Using the Intercept Layer for OpenCL™ allocation or ze_tracer, you would see that the kernel was called with two identical addresses, and the extended error information would tell you that you are trying to copy a non-trivial data structure to the offload device.
Note that if you are using unified shared memory (USM), and "
MyClass
" is allocated in USM, the above code will work. However, if only "
data
" is allocated in USM, then the program will crash for the above reason.
In this example, note that you can also re-declare the variables in local scope with the same name so that you don't need to change everything in the kernel call.
Intel® Inspector can also help diagnose these sorts of failures. If you set the following environment variables and then run Memory Error Analysis on offload code using the CPU device, Intel Inspector will flag many of the above issues:
  • OpenMP*
    • export OMP_TARGET_OFFLOAD=CPU
    • export OMP_TARGET_OFFLOAD=MANDATORY
    • export LIBOMPTARGET_PLUGIN=OPENCL
  • DPC++
    • export SYCL_DEVICE_FILTER=opencl:cpu
    • Or initialize your queue with a CPU selector to force use of the OpenCL CPU device:
      cl::sycl::queue Queue(cl::sycl::cpu_selector{});
  • Both
    • export CL_CONFIG_USE_VTUNE=True
    • export CL_CONFIG_USE_VECTORIZER=false
A crash can occur when optimizations are turned on during the compilation process. If turning off optimizations causes your crash to disappear, use
-g -[optimization level]
for debugging. For more information, see the
Intel oneAPI
DPC++/C++
Compiler
Developer Guide and Reference
.

Product and Performance Information

1

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.