Developer Guide

Avoid Expensive Functions

Some functions are expensive to implement in FPGAs. Expensive functions might decrease kernel performance or require a large amount of hardware to implement.
The following functions are expensive:
  • Integer division and modulo (remainder) operators
  • Most floating-point operators except addition, multiplication, absolute value, and comparison. For more information about optimizing floating-point operations, refer to the Optimize Floating-point Operation section.
  • Atomic functions
In contrast, inexpensive functions have minimal effects on kernel performance, and their implementation consumes minimal hardware.
The following functions are inexpensive:
  • Binary logic operations such as AND, NAND, OR, NOR, XOR, and XNOR
  • Logical operations with one constant argument
  • Shift by constant
  • Integer multiplication and division by a constant that is a power of two
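As a rough illustration of the inexpensive forms listed above, the following sketch rewrites division and modulo by a power of two as a shift and a mask with a constant. The helper names `div_pow2` and `mod_pow2` are hypothetical and are not part of SYCL or any Intel API. The sketch assumes unsigned operands and a divisor of the form 2^k with k < 32; for signed values, a shift does not round the same way as division.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only (not a SYCL or Intel API).
// Assumption: unsigned operand, divisor is 2^k with k < 32.

// x / 2^k expressed as a shift by a constant.
constexpr std::uint32_t div_pow2(std::uint32_t x, unsigned k) {
  return x >> k;
}

// x % 2^k expressed as a logical AND with a constant mask.
constexpr std::uint32_t mod_pow2(std::uint32_t x, unsigned k) {
  return x & ((std::uint32_t{1} << k) - 1u);
}

// Compile-time checks against the general operators.
static_assert(div_pow2(1003u, 3) == 1003u / 8u, "shift matches division by 8");
static_assert(mod_pow2(1003u, 3) == 1003u % 8u, "mask matches modulo 8");
```

When the divisor is a compile-time constant, a compiler can often apply this rewrite on its own. Writing it explicitly is mainly useful when the power-of-two property is known to the programmer but not visible in the code.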
If an expensive function produces a new piece of data for every work-item in a work-group, it is beneficial to code it in a kernel.
In contrast, the following code example shows an expensive floating-point operation (division) that every work-item in the NDRange executes, even though its operands do not depend on the work-item:
// This function is used in kernel code.
#include <CL/sycl.hpp>
using namespace cl::sycl;

void myKernel(accessor<int, 1, access::mode::read, access::target::global_buffer> a,
              accessor<int, 1, access::mode::write, access::target::global_buffer> b,
              cl::sycl::id<1> wiID, const float c, const float d) {
  // Inefficient: each work-item must calculate c divided by d.
  b[wiID] = a[wiID] * (c / d);
}
The result of dividing c by d is the same for every work-item. To avoid this redundant and hardware resource-intensive operation, perform the calculation in the host application and then pass the result to the kernel as an argument for all work-items in the NDRange to use. The following code shows the modified kernel:
void myKernel(accessor<int, 1, access::mode::read, access::target::global_buffer> a,
              accessor<int, 1, access::mode::write, access::target::global_buffer> b,
              cl::sycl::id<1> wiID, const float c_divided_by_d) {
  // The host calculates c divided by d once and passes it into the kernel
  // to avoid redundant expensive calculations.
  b[wiID] = a[wiID] * c_divided_by_d;
}
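The host-side half of this change does not appear in the kernel code. The following plain C++ sketch models it without a SYCL runtime. It assumes a loop can stand in for the NDRange, and the names `scale_item` and `run_on_host` are hypothetical. It is meant only to show where the division moves, not how a real host program submits the kernel.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the kernel body: called once per work-item.
// Only a multiplication remains; the division is no longer here.
inline int scale_item(int a_value, float c_divided_by_d) {
  return static_cast<int>(a_value * c_divided_by_d);
}

// Hypothetical host-side driver (assumption: a loop models the NDRange).
std::vector<int> run_on_host(const std::vector<int>& a, float c, float d) {
  const float c_divided_by_d = c / d;  // computed once, on the host
  std::vector<int> b(a.size());
  for (std::size_t wi = 0; wi < a.size(); ++wi) {
    b[wi] = scale_item(a[wi], c_divided_by_d);  // one call per work-item
  }
  return b;
}
```

In a real application, the precomputed quotient would be passed to the kernel as an argument, as the modified kernel signature suggests.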
The Intel® oneAPI DPC++/C++ Compiler consolidates operations that are not work-item-dependent across the entire NDRange into a single operation and then shares the result across all work-items. In the first code example, the Intel® oneAPI DPC++/C++ Compiler creates a single divider block shared by all work-items because the division of c by d remains constant across all work-items. This optimization helps minimize the amount of redundant hardware. However, even a single floating-point divider requires a significant amount of hardware resources. Therefore, it is beneficial to offload the division operation to the host processor and then pass the result as an argument to the kernel to conserve hardware resources.
The -Xsfpc flag has a limited effect for float types when the FPGA DSP supports floating-point operations, since the intermediate rounding is already part of the DSP block. Therefore, avoiding expensive functions is mostly useful for double types.
