# Data Types and Operations

## Data Type Selection Considerations

Select appropriate data types to optimize the FPGA area used by your DPC++ application:

- Select the most appropriate data type for your application. For example, do not define your variable as `float` if the data type `short` is sufficient.
- Ensure that both sides of an arithmetic expression belong to the same data type. Consider an example where one side of an arithmetic expression is a floating-point value and the other side is an integer. The mismatched data types cause the Intel® oneAPI DPC++/C++ Compiler to create implicit conversion operators, which can become expensive if they are present in large numbers.
- Take advantage of padding if it exists in your data structures. For example, if you only need the `float3` data type, which has the same size as `float4`, you may change the data type to `float4` to make use of the extra dimension to carry an unrelated value.

## Arithmetic Operation Considerations

Select the appropriate arithmetic operation for your DPC++ application to avoid excessive FPGA area use:

- Introduce floating-point arithmetic operations only when necessary.
- The Intel® oneAPI DPC++/C++ Compiler defaults floating-point constants to the double data type. Add an `f` designation to the constant to make it a single-precision floating-point operation.
- For example, the arithmetic operation `sin(1.0)` represents a double-precision floating-point sine function. The arithmetic operation `sin(1.0f)` represents a single-precision floating-point sine function.
- If you do not require a full-precision result for a complex function, compute simpler arithmetic operations to approximate the result. Consider the following example scenarios:
- Instead of computing the function `pow(x, n)` where `n` is a small value, approximate the result by performing repeated squaring operations because they require far fewer hardware resources and less area.
- Compare the area use of the original and the approximated computation because, in some cases, computing a result via approximation might use more area. For example, the `sqrt` function is not resource-intensive. Beyond a rough approximation, replacing the `sqrt` function with arithmetic operations that the host must compute at runtime might result in larger area use.
- If your kernel performs a complex arithmetic operation with a constant that the Intel® oneAPI DPC++/C++ Compiler computes at compilation time (for example, `log(PI/2.0)`), perform the arithmetic operation on the host instead and pass the result as an argument to the kernel at runtime.

Currently, the SYCL implementation of math functions is not supported on FPGAs.

## Optimizing Floating-Point Operations

For floating-point operations, you can manually direct the Intel® oneAPI DPC++/C++ Compiler to perform optimizations that create more efficient pipeline structures in hardware and reduce the overall hardware use. These optimizations can cause small differences in floating-point results.

### Tree Balancing

Order-of-operation rules apply in the DPC++ language. In the following example, the Intel® oneAPI DPC++/C++ Compiler performs multiplications and additions in a strict order, beginning with operations within the innermost parentheses:

```cpp
result = (((A * B) + C) + (D * E)) + (F * G);
```

By default, the Intel® oneAPI DPC++/C++ Compiler creates an implementation that resembles a long vine for such computations, as illustrated in the following figure:

Long, unbalanced operations lead to hardware that is more expensive. A more efficient hardware implementation is a balanced tree, as shown below:

In a balanced tree implementation, the Intel® oneAPI DPC++/C++ Compiler converts the long vine of floating-point adders into a tree pipeline structure. The Intel® oneAPI DPC++/C++ Compiler does not perform tree balancing of floating-point operations automatically because the outcomes of the floating-point operations might differ. As a result, this optimization is inconsistent with IEEE Standard 754-2008.

If you want the Intel® oneAPI DPC++/C++ Compiler to optimize floating-point operations using balanced trees and your program can tolerate small differences in floating-point results, include the `-Xsfp-relaxed` option in the `dpcpp` command, as shown below:

```shell
dpcpp -fintelfpga -Xshardware -Xsfp-relaxed <source_file>.cpp
```

### Rounding Operations

The balanced tree implementation of a floating-point operation includes multiple rounding operations. These rounding operations can require a significant amount of hardware resources in some applications. The Intel® oneAPI DPC++/C++ Compiler does not reduce the number of rounding operations automatically because doing so violates the results required by IEEE Standard 754-2008.

You can reduce the amount of hardware necessary to implement floating-point operations with the `-Xsfpc` option of the `dpcpp` command. If your program can tolerate small differences in floating-point results, invoke the following command:

```shell
dpcpp -fintelfpga -Xshardware -Xsfpc <source_file>.cpp
```

The `-Xsfpc` option directs the Intel® oneAPI DPC++/C++ Compiler to perform the following tasks:

- Remove floating-point rounding operations and conversions whenever possible. If possible, the `-Xsfpc` argument directs the Intel® oneAPI DPC++/C++ Compiler to round a floating-point operation only once, at the end of the tree of floating-point operations.
- Carry additional mantissa bits to maintain precision. The Intel® oneAPI DPC++/C++ Compiler carries additional precision bits through the floating-point calculations and removes these precision bits at the end of the tree of floating-point operations.

This type of optimization results in hardware that performs a fused floating-point operation, a feature of many new hardware-processing systems. Fusing multiple floating-point operations minimizes the number of rounding steps, which leads to more accurate results. An example of this optimization is the fused multiply-accumulate (FMAC) instruction available in new processor architectures. The Intel® oneAPI DPC++/C++ Compiler can provide fused floating-point mathematical capabilities for many combinations of floating-point operators in your kernel.
## Avoid Expensive Functions

Some functions are expensive to implement in FPGAs. Expensive functions might decrease kernel performance or require a large amount of hardware to implement.

The following functions are expensive:

- Integer division and modulo (remainder) operators
- Most floating-point operators except addition, multiplication, absolute value, and comparison. For more information about optimizing floating-point operations, refer to the Optimizing Floating-Point Operations section.
- Atomic functions

In contrast, inexpensive functions have minimal effects on kernel performance, and their implementation consumes minimal hardware.

The following functions are inexpensive:

- Binary logic operations such as `AND`, `NAND`, `OR`, `NOR`, `XOR`, and `XNOR`
- Logical operations with one constant argument
- Shift by constant
- Integer multiplication and division by a constant that is a power of two

If an expensive function produces a new piece of data for every work-item in a work-group, it is beneficial to code it in a kernel.

Conversely, the following code example depicts a case of an expensive floating-point operation (division) executed by every work-item in the NDRange even though its result is identical for all of them:

```cpp
// This function is used in kernel code.
void myKernel(accessor<int, access::mode::read, access::target::global_buffer> a,
              accessor<int, access::mode::write, access::target::global_buffer> b,
              cl::sycl::id<1> wiID, const float c, const float d)
{
    // Inefficient: each work-item must calculate c divided by d.
    b[wiID] = a[wiID] * (c / d);
}
```

The result of this calculation is always the same. To avoid this redundant and hardware resource-intensive operation, perform the calculation in the host application and then pass the result to the kernel as an argument for all work-items in the NDRange to use. The modified code is shown in the following:

```cpp
void myKernel(accessor<int, access::mode::read, access::target::global_buffer> a,
              accessor<int, access::mode::write, access::target::global_buffer> b,
              cl::sycl::id<1> wiID, const float c_divided_by_d)
{
    // The host calculates c divided by d once and passes the result into
    // the kernel to avoid redundant, expensive calculations.
    b[wiID] = a[wiID] * c_divided_by_d;
}
```

The Intel® oneAPI DPC++/C++ Compiler consolidates operations that are not work-item-dependent across the entire NDRange into a single operation and then shares the result across all work-items. In the first code example, the Intel® oneAPI DPC++/C++ Compiler creates a single divider block shared by all work-items because the division of `c` by `d` remains constant across all work-items. This optimization helps minimize the amount of redundant hardware. However, implementing a divider still requires a significant amount of hardware resources. Therefore, it is beneficial to offload the division operation to the host processor and then pass the result as an argument to the kernel to conserve hardware resources.
The `-Xsfpc` flag has a limited effect for `float` types when the FPGA DSP blocks support floating-point operations, since the intermediate rounding is already part of the DSP block. Therefore, the `-Xsfpc` flag is mostly useful for `double` types.