Programming Guide

Trace the Offload Process

When a program that offloads computation to a GPU is started, there are a lot of moving parts involved in program execution. Machine-independent code needs to be compiled to machine-dependent code, data and binaries need to be copied to the device, results need to be returned, and so on. This section discusses how to trace all of this activity using the tools described in oneAPI Debug Tools.

Kernel Setup Time

Before offload code can run on the device, the machine-independent version of the kernel needs to be compiled for the target device, and the resulting code needs to be copied to the device. This can complicate/skew benchmarking if this kernel setup time is not considered. Just-in-time compilation can also introduce a noticeable delay when debugging an offload application.
If you have an OpenMP* offload program, setting LIBOMPTARGET_PLUGIN_PROFILE=T[,usec] explicitly reports the amount of time required to build the offload code ("ModuleBuild"), which you can compare to the overall execution time of your program.
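As a minimal sketch (the binary name ./my_omp_app is a placeholder for your own OpenMP offload program):

```shell
# Enable the offload plugin profile; ",usec" reports times in
# microseconds instead of the default milliseconds.
export LIBOMPTARGET_PLUGIN_PROFILE=T,usec

# Run your offload binary (placeholder name). The profile table,
# including the "ModuleBuild" row, is printed when the program exits.
# ./my_omp_app
```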
Kernel setup time is more difficult to determine if you have a DPC++ offload program.
  • If Level Zero is your backend, you can derive kernel setup time from the Device Timing and Device Timeline returned by ze_tracer.
  • If OpenCL™ is your backend, you may be able to derive similar information by setting the BuildLogging, KernelInfoLogging, CallLogging, CallLoggingElapsedTime, HostPerformanceTiming, HostPerformanceTimeLogging, or ChromeCallLogging flags when using the Intercept Layer for OpenCL Applications.
You can also use this technique to supplement the information returned by LIBOMPTARGET_PLUGIN_PROFILE=T.

Monitoring Buffer Creation, Sizes, and Copies

Understanding when buffers are created, how many buffers are created, and whether they are reused or constantly created and destroyed can be key to optimizing the performance of your offload application. This may not always be obvious when using a high-level programming language like OpenMP or DPC++, which can hide a lot of the buffer management from the user.
At a high level, you can track buffer-related activities using the LIBOMPTARGET_DEBUG and SYCL_PI_TRACE environment variables when running your program. LIBOMPTARGET_DEBUG gives you more information than SYCL_PI_TRACE: it reports the addresses and sizes of the buffers created. By contrast, SYCL_PI_TRACE reports only the API calls, with no information you can easily tie to the location or size of individual buffers.
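For example, a run that traces buffer activity at both levels might look like this sketch (the binary name is a placeholder, and the trace levels shown are common choices; check your runtime's documentation for the full set of levels):

```shell
# OpenMP offload runtime: verbose diagnostics, including the addresses
# and sizes of device buffers as they are created and destroyed.
export LIBOMPTARGET_DEBUG=1

# DPC++ runtime: trace plugin-interface (PI) API calls.
export SYCL_PI_TRACE=2

# ./my_offload_app   # placeholder: your OpenMP or DPC++ binary
```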
At a lower level, if you are using Level Zero as your backend, the Call Logging mode of ze_tracer will give you information on all Level Zero API calls, including their arguments. This can be useful because, for example, a buffer-creation call (such as zeMemAllocDevice) shows the size of the buffer being passed to and from the device. ze_tracer also allows you to dump all Level Zero device-side activities (including memory transfers) in Device Timeline mode. For each activity, you get its append (to command list), submit (to queue), start, and end times.
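A sketch of the corresponding ze_tracer invocations (the binary name is a placeholder, and the flag spellings are assumptions; confirm them against the help output of the ze_tracer build in your installation of the Profiling Tools Interfaces for GPU):

```shell
APP=./my_sycl_app   # placeholder: your Level Zero-backed binary

# Call Logging: every Level Zero API call with its arguments, e.g. the
# size passed to zeMemAllocDevice when a device buffer is created:
#   ze_tracer --call-logging $APP
# Device Timeline: append/submit/start/end timestamps for each
# device-side activity, including memory transfers:
#   ze_tracer --device-timeline $APP
```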
If you are using OpenCL as your backend, setting the CallLogging, CallLoggingElapsedTime, and ChromeCallLogging flags when using the Intercept Layer for OpenCL Applications should give you similar information.
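One way to set these controls is through environment variables; the Intercept Layer documents a CLI_ prefix for its controls (the binary name below is a placeholder):

```shell
# Intercept Layer for OpenCL Applications controls, set via the
# documented CLI_ environment-variable prefix.
export CLI_CallLogging=1              # log each OpenCL call and its arguments
export CLI_CallLoggingElapsedTime=1   # add host elapsed time to each logged call
export CLI_ChromeCallLogging=1        # emit a trace viewable in chrome://tracing

# ./my_opencl_backed_app   # placeholder: your OpenCL-backed binary
```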

Total Transfer Time

Comparing total data transfer time to kernel execution time can be important for determining whether it is profitable to offload a computation to a connected device.
If you have an OpenMP offload program, setting LIBOMPTARGET_PLUGIN_PROFILE=T[,usec] explicitly reports the amount of time required to allocate ("DataAlloc"), read ("DataRead"), and write ("DataWrite") data on the offload device (although only in aggregate).
Data transfer times can be more difficult to determine if you have a DPC++ program.
  • If Level Zero is your backend, you can derive total data transfer time from the Device Timing and Device Timeline returned by ze_tracer.
  • If OpenCL is your backend, you may be able to derive the information by setting the BuildLogging, KernelInfoLogging, CallLogging, CallLoggingElapsedTime, HostPerformanceTiming, HostPerformanceTimeLogging, or ChromeCallLogging flags when using the Intercept Layer for OpenCL Applications.

Kernel Execution Time

If you have an OpenMP offload program, setting LIBOMPTARGET_PLUGIN_PROFILE=T[,usec] explicitly reports the total execution time of every offloaded kernel ("Kernel#...").
For DPC++ offload programs:
  • If Level Zero is your backend, the Device Timing mode of ze_tracer will give you the device-side execution time for every kernel.
  • If OpenCL is your backend, you may be able to derive the information by setting the CallLoggingElapsedTime, DevicePerformanceTiming, DevicePerformanceTimeKernelInfoTracking, DevicePerformanceTimeLWSTracking, DevicePerformanceTimeGWSTracking, ChromePerformanceTiming, or ChromePerformanceTimingInStages flags when using the Intercept Layer for OpenCL Applications.
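A sketch of the device-timing controls, again using the Intercept Layer's documented CLI_ environment-variable prefix (the binary name is a placeholder):

```shell
# Per-kernel device-side execution timing from the Intercept Layer
# for OpenCL Applications.
export CLI_DevicePerformanceTiming=1   # aggregate device time per kernel
export CLI_ChromePerformanceTiming=1   # kernel timeline for chrome://tracing

# ./my_opencl_backed_app   # placeholder: your OpenCL-backed binary
```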

When Device Kernels are Called and Threads are Created

On occasion, offload kernels are created and transferred to the device a long time before they actually start executing (usually only after all data required by the kernel has also been transferred, along with control).
You can set a breakpoint in a device kernel using the Intel® Distribution for GDB*. From there, you can query kernel arguments, monitor thread creation and destruction, list the current threads and their current positions in the code (using "info threads"), and so on.
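A sketch of such a session (the file, line, and variable names are placeholders; gdb-oneapi is the debugger driver installed with the Intel Distribution for GDB):

```shell
# Intel Distribution for GDB session sketch (names are placeholders):
#   $ gdb-oneapi ./my_sycl_app
#   (gdb) break my_kernel.cpp:42    # breakpoint inside the device kernel
#   (gdb) run
#   (gdb) info threads              # list threads and where each is stopped
#   (gdb) print my_arg              # inspect a kernel argument
DEBUGGER=gdb-oneapi   # driver shipped with the Intel Distribution for GDB
```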

Product and Performance Information

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.