Developer Guide


Kernel Variable Accesses

This section shows techniques you can use to optimize local and private variables in kernels.

Inferring a Shift Register

The shift register design pattern is important for implementing many applications efficiently on the FPGA. However, the implementation of a shift register design pattern might seem counterintuitive at first.
Consider the following code example:
using InPipe = INTEL::pipe<class PipeIn, int, 4>;
using OutPipe = INTEL::pipe<class PipeOut, int, 4>;

#define SIZE 512 // Shift register size must be statically determinable

// This function is used in a kernel
void foo() {
  int shift_reg[SIZE]; // The key is that the array size is a compile-time constant

  // Initialization loop
  #pragma unroll
  for (int i = 0; i < SIZE; i++) {
    // All elements of the array should be initialized to the same value
    shift_reg[i] = 0;
  }

  while (1) {
    // Fully unrolling the shifting loop produces constant accesses
    #pragma unroll
    for (int j = 0; j < SIZE - 1; j++) {
      shift_reg[j] = shift_reg[j + 1];
    }
    shift_reg[SIZE - 1] = InPipe::read();

    // Using fixed access points of the shift register
    int res = (shift_reg[0] + shift_reg[1]) / 2;

    // 'out' pipe will have running average of the input pipe
    OutPipe::write(res);
  }
}
In each clock cycle, the kernel shifts a new value into the array. By placing this shift register into a block RAM, the Intel® oneAPI DPC++/C++ Compiler can efficiently handle multiple access points into the array. The shift register design pattern is ideal for implementing filters (for example, image filters like a Sobel filter or time-delay filters like a finite impulse response (FIR) filter).
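For reference, the running-average kernel above is effectively a two-tap FIR filter, and the same pattern generalizes to arbitrary coefficients. The following is a plain C++ host-side software model of the pattern, useful for checking functional behavior; the class name and coefficients are illustrative, and on the FPGA the shift and tap loops would be fully unrolled so every array index is a compile-time constant:

```cpp
#include <array>
#include <cstddef>

// Software model of an N-tap FIR filter built on the shift-register
// pattern: shift every sample in, then read fixed tap positions.
template <std::size_t N>
class FirModel {
  std::array<float, N> shift_reg{};  // all elements zero-initialized
  std::array<float, N> coeff;

 public:
  explicit FirModel(const std::array<float, N>& c) : coeff(c) {}

  float push(float sample) {
    // Shift: in kernel code this loop is fully unrolled (#pragma unroll)
    // so that each access index is a constant.
    for (std::size_t j = 0; j + 1 < N; ++j)
      shift_reg[j] = shift_reg[j + 1];
    shift_reg[N - 1] = sample;

    // Fixed access points: one multiply-add per tap.
    float acc = 0.0f;
    for (std::size_t t = 0; t < N; ++t)
      acc += coeff[t] * shift_reg[t];
    return acc;
  }
};
```

With coefficients {0.5, 0.5}, the model reproduces the running average computed by the kernel above.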
When implementing a shift register in your kernel code, remember the following key points:
  • Unroll the shifting loop so that it can access every element of the array.
  • All access points must have constant data accesses. For example, if you write a calculation in nested loops using multiple access points, unroll these loops to establish the constant access points.
  • Initialize all elements of the array to the same value. Alternatively, you may leave the elements uninitialized if you do not require a specific initial value.
  • If some accesses to a large array are not inferable statically, they force the Intel® oneAPI DPC++/C++ Compiler to create inefficient hardware. If these accesses are necessary, use local memory instead of private memory.
  • Do not shift a large shift register conditionally. The shifting must occur in every loop iteration that contains the shifting code; conditionally shifting large shift registers inside pipelined loops leads to inefficient hardware.
  • For example, the following kernel consumes more resources when the (K > 5) condition is present:
int K = ...
constexpr int SHIFT_REG_LEN = 1024;

q.submit([&](handler &cgh) {
  accessor src(src_buf, cgh, read_only);
  accessor dst(dst_buf, cgh, write_only, noinit);
  cgh.single_task<class BadShiftReg>([=]() {
    float shift_reg[SHIFT_REG_LEN];
    int sum = 0;
    for (unsigned i = 0; i < K; i++) {
      sum += shift_reg[0];
      shift_reg[SHIFT_REG_LEN - 1] = src[i];
      // This condition will cause severe area bloat.
      if (K > 5) {
        #pragma unroll
        for (int m = 0; m < SHIFT_REG_LEN - 1; m++)
          shift_reg[m] = shift_reg[m + 1];
      }
      dst[i] = sum;
    }
  });
});
If it is necessary to implement conditional shifting of a large shift register in your kernel, consider modifying your code so that it uses local memory.
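Because the (K > 5) condition is loop-invariant, another option is to hoist the test out of the loop so that each specialized loop either shifts unconditionally or never touches the shift register. The following is a host-side C++ sketch of that restructuring, not the original kernel; the function name is illustrative, and the register length is shortened to 4 here for readability (the example above uses 1024):

```cpp
#include <cstddef>
#include <vector>

// Host-side model: the loop-invariant condition (K > 5) is decided once,
// outside the loop, so each specialized loop either shifts the register
// on every iteration or skips it entirely.
constexpr std::size_t kRegLen = 4;  // 1024 in the kernel example above

std::vector<float> run(const std::vector<float>& src, int K) {
  std::vector<float> dst(src.size());
  float shift_reg[kRegLen] = {};  // initialize every element
  float sum = 0.0f;
  if (K > 5) {
    for (std::size_t i = 0; i < src.size(); ++i) {
      sum += shift_reg[0];
      for (std::size_t m = 0; m + 1 < kRegLen; ++m)  // unconditional shift
        shift_reg[m] = shift_reg[m + 1];
      shift_reg[kRegLen - 1] = src[i];
      dst[i] = sum;
    }
  } else {
    for (std::size_t i = 0; i < src.size(); ++i)
      dst[i] = sum;  // shift register never touched on this path
  }
  return dst;
}
```

On the shifting path, every iteration performs the shift, so the compiler can implement the register efficiently; on the other path, the register is unused and can be optimized away.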

Memory Access Considerations

Intel® recommends the following kernel programming strategies that can improve memory access efficiency and reduce area use of your DPC++ kernel:
  • Minimize the number of access points to external memory to reduce area. The compiler infers an LSU for each access point in your kernel, which consumes area.
    If possible, structure your kernel such that it reads its input from one location, processes the data internally, and then writes the output to another location.
  • Instead of relying on local or global memory accesses, structure your kernel as a single work-item with shift register inference whenever possible.
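The first recommendation can be modeled on the host as a read-once, compute-internally, write-once structure. The function below is a plain C++ sketch with illustrative names; each element of the input is read from external storage exactly once and each result is written exactly once, which on the FPGA corresponds to one load LSU and one store LSU:

```cpp
#include <cstddef>
#include <vector>

// Model of the recommended access pattern: one read of the input,
// processing in private variables, one write of the output.
std::vector<int> scale_then_offset(const std::vector<int>& input) {
  std::vector<int> output(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    int x = input[i];  // single external read per element
    int y = x * 2;     // process internally...
    y = y + 1;         // ...without re-reading external memory
    output[i] = y;     // single external write per element
  }
  return output;
}
```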
