Developer Guide


Utilizing Hardware Kernel Invocation Queue

The kernel invocation queue is a first-in, first-out (FIFO) buffer that the DPC++ runtime uses to store the arguments for multiple kernel invocations on the device. When a kernel finishes execution, the invocation queue allows the next invocation of that kernel to start immediately. DPC++ kernels are built with an invocation queue to enable this immediate launch of the next invocation.
As illustrated in the following figure, when the invocation queue is used, the system and DPC++ runtime overheads (responding to the finish of one invocation and sending in the next set of invocation arguments) overlap with kernel execution. This allows kernels to execute continuously, which maximizes system-level throughput.
Kernel Execution with and without Invocation Queue
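The overlap described above can be made concrete with a small timing model. The following C++ sketch is illustrative only: the function names, the two-stage pipeline assumption, and the time units are hypothetical and are not part of the DPC++ runtime. It assumes that every invocation has the same execution time and the same host overhead, and that the queue always has room for the next set of arguments.

```cpp
#include <cstdint>

// Toy timing model. All durations are in arbitrary, hypothetical time units.
//   exec:     time one kernel invocation runs on the device
//   overhead: host/runtime time spent responding to a finish and sending
//             the next set of invocation arguments

// Without an invocation queue, each invocation pays the overhead serially.
constexpr std::uint64_t total_time_without_queue(std::uint64_t n,
                                                 std::uint64_t exec,
                                                 std::uint64_t overhead) {
    return n * (overhead + exec);
}

// With an invocation queue, the overhead for the next invocation overlaps
// with the execution of the current one. Only the first overhead is fully
// exposed; every later invocation costs the longer of the two durations.
constexpr std::uint64_t total_time_with_queue(std::uint64_t n,
                                              std::uint64_t exec,
                                              std::uint64_t overhead) {
    return n == 0
        ? 0
        : overhead + exec + (n - 1) * (exec > overhead ? exec : overhead);
}
```

For example, with 4 invocations, an execution time of 10 units, and an overhead of 3 units, this model gives 52 units without the queue and 43 units with it. Real savings depend on the actual overheads of the system and runtime.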
DPC++ kernel invocations are queued in hardware when the same DPC++ kernel function is already running on the device, and the following are true:
  • The DPC++ kernel was not compiled with the hardware kernel invocation buffer disabled.
  • The DPC++ kernel was not compiled with performance counters.
  • No host-to-device synchronization operation (such as a host accessor, buffer destruction, and so on) that requires the first enqueue to finish is performed between sequential kernel enqueues.
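The conditions in this list can be read as a single predicate. The sketch below is a plain C++ restatement for illustration only; KernelBuild, can_queue_in_hardware, and the parameter names are hypothetical and do not exist in the DPC++ API.

```cpp
// Hypothetical description of how a kernel was compiled (not a DPC++ type).
struct KernelBuild {
    bool invocation_queue_disabled;  // built with the invocation buffer disabled
    bool performance_counters;       // built with performance counters
};

// Returns true when a new enqueue could be queued in hardware, assuming the
// conditions listed in this section are the only ones that matter.
constexpr bool can_queue_in_hardware(KernelBuild build,
                                     bool same_kernel_running,
                                     bool host_sync_waits_on_previous) {
    return same_kernel_running
        && !build.invocation_queue_disabled
        && !build.performance_counters
        && !host_sync_waits_on_previous;
}
```

A kernel built without performance counters and with the queue left enabled satisfies the predicate only while the same kernel is already running and no blocking host synchronization sits between the two enqueues.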
Consider the following definitions of simple_kernel() and check_output():


void simple_kernel(queue &deviceQueue, buffer<cl_float, 1> &bufferA,
                   buffer<cl_float, 1> &bufferC) {
  deviceQueue.submit([&](handler& cgh) {
    auto accessorA = bufferA.template get_access<access::mode::read>(cgh);
    auto accessorC = bufferC.template get_access<access::mode::discard_read_write>(cgh);
    cgh.single_task<class SimpleAdd>([=]() {
      for (int i = 0; i < N; i++) {
        accessorC[i] = accessorA[i] + accessorA[i];
      }
    });
  });
}


void check_output(buffer<cl_float, 1> &outBuffer) {
  auto output_buf_acc = outBuffer.template get_access<access::mode::read>();
  ...
  // Check output
  ...
}
Based on the function definitions of simple_kernel() and check_output(), consider the following example code snippet where the kernel enqueue can be queued on the hardware kernel invocation queue:
// Example 1
int main() {
  ...
  simple_kernel(device_queue, bufferA, bufferC);
  simple_kernel(device_queue, bufferX, bufferZ);
  check_output(bufferC);
  check_output(bufferZ);
  ...
}
As soon as the first enqueue of the kernel is running, the second enqueue can be queued in hardware because the two enqueues have no dependency on each other.
Now, consider the following example code where kernel invocation cannot be queued on hardware:
// Example 2
int main() {
  ...
  simple_kernel(device_queue, bufferA, bufferC);
  check_output(bufferC);
  simple_kernel(device_queue, bufferX, bufferZ);
  check_output(bufferZ);
  ...
}
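Creating the host accessor for the output buffer in the check_output() function is a synchronization point that blocks the DPC++ runtime until the first enqueue of the kernel is complete. The second enqueue is therefore submitted only after the first has finished, so it cannot be queued in hardware.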
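The difference between Example 1 and Example 2 can be checked with a small host-side model. The following C++ sketch is not DPC++ code: HostOp, OpKind, count_hardware_queued, and the buffer identifiers are hypothetical names. The model assumes that a kernel remains in flight until a host read of its output buffer forces the runtime to wait for it, and that kernels complete in FIFO order.

```cpp
#include <cstddef>

// Hypothetical host-side trace model; none of these names come from DPC++.
enum class OpKind { Enqueue, HostRead };

struct HostOp {
    OpKind kind;
    int buffer;  // output buffer of the enqueue, or buffer read on the host
};

// Counts enqueues issued while an earlier enqueue may still be in flight,
// which is the situation where hardware queuing can apply.
template <std::size_t N>
constexpr int count_hardware_queued(const HostOp (&trace)[N]) {
    int queued = 0;
    int in_flight = 0;
    std::size_t done_before = 0;  // enqueues at index < done_before have completed
    for (std::size_t i = 0; i < N; ++i) {
        if (trace[i].kind == OpKind::Enqueue) {
            if (in_flight > 0) ++queued;  // an earlier invocation is still running
            ++in_flight;
        } else {
            // A host read blocks until the most recent in-flight writer of the
            // buffer finishes; FIFO order means older enqueues finish as well.
            for (std::size_t j = i; j > done_before; --j) {
                const HostOp& e = trace[j - 1];
                if (e.kind == OpKind::Enqueue && e.buffer == trace[i].buffer) {
                    for (std::size_t k = done_before; k < j; ++k) {
                        if (trace[k].kind == OpKind::Enqueue) --in_flight;
                    }
                    done_before = j;
                    break;
                }
            }
        }
    }
    return queued;
}

constexpr int kBufferC = 0;  // stands in for bufferC
constexpr int kBufferZ = 1;  // stands in for bufferZ

// Example 1: both enqueues are issued before any host read.
constexpr HostOp kExample1[] = {
    {OpKind::Enqueue, kBufferC}, {OpKind::Enqueue, kBufferZ},
    {OpKind::HostRead, kBufferC}, {OpKind::HostRead, kBufferZ}};

// Example 2: a host read sits between the two enqueues.
constexpr HostOp kExample2[] = {
    {OpKind::Enqueue, kBufferC}, {OpKind::HostRead, kBufferC},
    {OpKind::Enqueue, kBufferZ}, {OpKind::HostRead, kBufferZ}};
```

Under these assumptions, the model reports one hardware-queued enqueue for Example 1 and none for Example 2, which matches the behavior described above.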
For additional information, refer to the FPGA tutorial sample "Overlapping Kernel Execution" listed in the Intel® oneAPI Samples Browser on Linux* or Intel® oneAPI Samples Browser on Windows*.
