Developer Guide

Contents

Data Types and Operations

Data Type Selection Considerations

Select the appropriate data type to optimize the FPGA area used by your DPC++ application:
  • Select the most appropriate data type for your application. For example, do not define your variable as float if the data type short is sufficient.
  • Ensure that both sides of an arithmetic expression belong to the same data type. Consider an example where one side of an arithmetic expression is a floating-point value and the other side is an integer. The mismatched data types cause the Intel® oneAPI DPC++/C++ Compiler to create implicit conversion operators, which can become expensive if they are present in large numbers.
  • Take advantage of padding if it exists in your data structures. For example, if you only need the float3 data type, which has the same size as float4, you may change the data type to float4 and use the extra dimension to carry an unrelated value.

Arithmetic Operation Considerations

Select the appropriate arithmetic operations for your DPC++ application to avoid excessive FPGA area use:
  • Introduce floating-point arithmetic operations only when necessary.
  • The Intel® oneAPI DPC++/C++ Compiler defaults floating-point constants to the double data type. Add an f suffix to a constant to make the operation single-precision floating point. For example, the arithmetic operation sin(1.0) represents a double-precision floating-point sine function, whereas sin(1.0f) represents a single-precision floating-point sine function.
  • If you do not require a full-precision result for a complex function, compute simpler arithmetic operations to approximate the result. Consider the following example scenarios:
    • Instead of computing the function pow(x, n) where n is a small value, approximate the result by performing repeated squaring operations because they require far fewer hardware resources and less area.
    • Compare the area use of the original and the approximated computations because, in some cases, computing a result via approximation might use more area. For example, the sqrt function is not resource-intensive; other than as a rough approximation, replacing the sqrt function with arithmetic operations that the host must compute at runtime might result in larger area use.
    • If your kernel performs a complex arithmetic operation with a constant that the Intel® oneAPI DPC++/C++ Compiler computes at compilation time (for example, log(PI/2.0)), perform the arithmetic operation on the host instead and pass the result as an argument to the kernel at runtime.
Currently, the SYCL implementation of math functions is not supported on FPGAs.

Optimizing Floating-Point Operations

For floating-point operations, you can manually direct the Intel® oneAPI DPC++/C++ Compiler to perform optimizations that create more efficient pipeline structures in hardware and reduce the overall hardware use. These optimizations can cause small differences in floating-point results.
Tree Balancing
Order-of-operation rules apply in the DPC++ language. In the following example, the Intel® oneAPI DPC++/C++ Compiler performs multiplications and additions in a strict order, beginning with operations within the innermost parentheses:
result = (((A * B) + C) + (D * E)) + (F * G);
By default, the Intel® oneAPI DPC++/C++ Compiler creates an implementation that resembles a long vine for such computations, as illustrated in the following figure:
Default Floating-Point Implementation
Long, unbalanced operations lead to hardware that is more expensive. A more efficient hardware implementation is a balanced tree, as shown below:
Balanced Tree
In a balanced tree implementation, the Intel® oneAPI DPC++/C++ Compiler converts the long vine of floating-point adders into a tree pipeline structure. The Intel® oneAPI DPC++/C++ Compiler does not perform tree balancing of floating-point operations automatically because the outcomes of the floating-point operations might differ. As a result, this optimization is inconsistent with the IEEE Standard 754-2008.
If you want the Intel® oneAPI DPC++/C++ Compiler to optimize floating-point operations using balanced trees and your program can tolerate small differences in floating-point results, include the -Xsfp-relaxed option in the dpcpp command, as shown below:
dpcpp -fintelfpga -Xshardware -Xsfp-relaxed <source_file>.cpp
Rounding Operations
The balanced tree implementation of a floating-point operation includes multiple rounding operations. These rounding operations can require a significant amount of hardware resources in some applications. The Intel® oneAPI DPC++/C++ Compiler does not reduce the number of rounding operations automatically because doing so violates the results required by IEEE Standard 754-2008.
You can reduce the amount of hardware necessary to implement floating-point operations with the -Xsfpc option of the dpcpp command. If your program can tolerate small differences in floating-point results, invoke the following command:
dpcpp -fintelfpga -Xshardware -Xsfpc <source_file>.cpp
The -Xsfpc option directs the Intel® oneAPI DPC++/C++ Compiler to perform the following tasks:
  • Remove floating-point rounding operations and conversions whenever possible. If possible, the -Xsfpc argument directs the Intel® oneAPI DPC++/C++ Compiler to round a floating-point operation only once, at the end of the tree of floating-point operations.
  • Carry additional mantissa bits to maintain precision. The Intel® oneAPI DPC++/C++ Compiler carries additional precision bits through the floating-point calculations and removes these precision bits at the end of the tree of floating-point operations.
This type of optimization results in hardware that performs a fused floating-point operation, a feature of many newer hardware-processing systems. Fusing multiple floating-point operations minimizes the number of rounding steps, which leads to more accurate results. An example of this optimization is the fused multiply-accumulate (FMAC) instruction available in newer processor architectures. The Intel® oneAPI DPC++/C++ Compiler can provide fused floating-point mathematical capabilities for many combinations of floating-point operators in your kernel.

Avoid Expensive Functions

Some functions are expensive to implement in FPGAs. Expensive functions might decrease kernel performance or require a large amount of hardware to implement.
The following functions are expensive:
  • Integer division and modulo (remainder) operators
  • Most floating-point operators except addition, multiplication, absolute value, and comparison. For more information about optimizing floating-point operations, refer to the Optimizing Floating-Point Operations section.
  • Atomic functions
In contrast, inexpensive functions have minimal effects on kernel performance, and their implementation consumes minimal hardware.
The following functions are inexpensive:
  • Binary logic operations such as AND, NAND, OR, NOR, XOR, and XNOR
  • Logical operations with one constant argument
  • Shift by constant
  • Integer multiplication and division by a constant that is a power of two
If an expensive function produces a new piece of data for every work-item in a work-group, it is beneficial to code it in a kernel.
In contrast, the following code example depicts a case of an expensive floating-point operation (division) redundantly executed by every work-item in the NDRange:
// This function is used in kernel code
void myKernel(accessor<int, access::mode::read, access::target::global_buffer> a,
              accessor<int, access::mode::write, access::target::global_buffer> b,
              cl::sycl::id<1> wiID, const float c, const float d)
{
  // Inefficient: each work-item must calculate c divided by d
  b[wiID] = a[wiID] * (c / d);
}
The result of this calculation is always the same. To avoid this redundant and hardware resource-intensive operation, perform the calculation in the host application and then pass the result to the kernel as an argument for all work-items in the NDRange to use. The modified code follows:
void myKernel(accessor<int, access::mode::read, access::target::global_buffer> a,
              accessor<int, access::mode::write, access::target::global_buffer> b,
              cl::sycl::id<1> wiID, const float c_divided_by_d)
{
  /* The host calculates c divided by d once and passes it into the
     kernel to avoid redundant, expensive calculations */
  b[wiID] = a[wiID] * c_divided_by_d;
}
The Intel® oneAPI DPC++/C++ Compiler consolidates operations that are not work-item-dependent across the entire NDRange into a single operation and then shares the result across all work-items. In the first code example, the Intel® oneAPI DPC++/C++ Compiler creates a single divider block shared by all work-items because the division of c by d remains constant across all work-items. This optimization helps minimize the amount of redundant hardware. However, the implementation of the divider still requires a significant amount of hardware resources. Therefore, it is beneficial to offload the division operation to the host processor and then pass the result as an argument to the kernel to conserve hardware resources.
The -Xsfpc flag has a limited effect on float types when the FPGA DSP blocks support floating-point operations, since the intermediate rounding is already part of the DSP block. Therefore, this optimization is mostly useful for double types.

Product and Performance Information

1

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804