# Optimize Floating-point Operations

For floating-point operations, you can manually direct the Intel® oneAPI DPC++/C++ Compiler to perform optimizations that create more efficient pipeline structures in hardware and reduce overall hardware use. These optimizations can cause small differences in floating-point results. You can also apply the `fp contract` and `fp reassociate` floating-point pragmas to handle a kernel's arithmetic and floating-point operations at a finer granularity. For more information about these pragmas, refer to Floating Point Pragmas.

Starting with the oneAPI 2021.2 release, fast math is enabled by default, allowing the Intel® oneAPI DPC++/C++ Compiler to perform various out-of-the-box floating-point math (`float` or `double`) optimizations. With these optimizations enabled, you might observe different bitwise results compared to results from the oneAPI 2021.1 release or from GCC. This tradeoff improves the performance and area of your design. Automatic dot-product inference and floating-point contraction for double-precision math are two key FPGA optimizations that save a large amount of FPGA area and improve performance and latency. To return to the same precise math as the oneAPI 2021.1 release or GCC, use the following compiler options:
- For Linux: `-no-fma -fp-model=precise`
- For Windows: `/Qfma- /fp:precise`

For more information about these options, refer to the *Using the -fp-model (/fp) Option* and *fma, Qfma* topics in the *Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference*.

## Tree Balancing

Order-of-operation rules apply in the DPC++ language. In the following example, the Intel® oneAPI DPC++/C++ Compiler performs multiplications and additions in a strict order, beginning with operations within the innermost parentheses:

```cpp
result = (((A * B) + C) + (D * E)) + (F * G);
```

By default, the Intel® oneAPI DPC++/C++ Compiler creates an implementation that resembles a long vine for such computations, as illustrated in the following figure:

Long, unbalanced operations lead to hardware that is more expensive. A more efficient hardware implementation is a balanced tree, as shown below:

In a balanced tree implementation, the Intel® oneAPI DPC++/C++ Compiler converts the long vine of floating-point adders into a tree pipeline structure. The compiler does not perform tree balancing of floating-point operations automatically because the outcomes of the floating-point operations might differ. As a result, this optimization is inconsistent with the IEEE Standard 754-2008.

If you want the Intel® oneAPI DPC++/C++ Compiler to optimize floating-point operations using balanced trees, and your program can tolerate small differences in floating-point results, include the `-Xsfp-relaxed` option in the `dpcpp` command, as shown below:

```shell
dpcpp -fintelfpga -Xshardware -Xsfp-relaxed <source_file>.cpp
```

## Rounding Operations

The balanced tree implementation of a floating-point operation includes multiple rounding operations. These rounding operations can require a significant amount of hardware resources in some applications. The Intel® oneAPI DPC++/C++ Compiler does not reduce the number of rounding operations automatically because doing so violates the results required by the IEEE Standard 754-2008.

You can reduce the amount of hardware necessary to implement floating-point operations with the `-Xsfpc` option of the `dpcpp` command. If your program can tolerate small differences in floating-point results, invoke the following command:

```shell
dpcpp -fintelfpga -Xshardware -Xsfpc <source_file>.cpp
```

The `-Xsfpc` option directs the Intel® oneAPI DPC++/C++ Compiler to perform the following tasks:

- Remove floating-point rounding operations and conversions whenever possible. If possible, the `-Xsfpc` argument directs the compiler to round a floating-point operation only once, at the end of the tree of floating-point operations.
- Carry additional mantissa bits to maintain precision. The compiler carries additional precision bits through the floating-point calculations and removes these precision bits at the end of the tree of floating-point operations.

This type of optimization results in hardware that performs a fused floating-point operation, a feature of many new hardware-processing systems. Fusing multiple floating-point operations minimizes the number of rounding steps, which leads to more accurate results. An example of this optimization is the fused multiply-accumulate (FMAC) instruction available in new processor architectures. The Intel® oneAPI DPC++/C++ Compiler can provide fused floating-point mathematical capabilities for many combinations of floating-point operators in your kernel.