Developer Guide

Optimize Floating-Point Operations

For floating-point operations, you can manually direct the Intel® oneAPI DPC++/C++ Compiler to perform optimizations that create more efficient pipeline structures in hardware and reduce overall hardware use. These optimizations can cause small differences in floating-point results. You can also apply the fp contract and fp reassociate floating-point pragmas to control a kernel's floating-point arithmetic at a finer granularity. For more information about these pragmas, refer to Floating Point Pragmas.
Starting with the oneAPI 2021.2 release, fast math is enabled by default, allowing the Intel® oneAPI DPC++/C++ Compiler to make various out-of-the-box floating-point math (float or double) optimizations. With these optimizations enabled, you might observe different bitwise results compared to results from the oneAPI 2021.1 release or from GCC. This tradeoff improves the performance and area of your design. Automatic dot-product inference and floating-point contraction for double-precision math are two key FPGA optimizations that save a large amount of FPGA area and improve performance and latency. To return to the same precision of math as the oneAPI 2021.1 release or GCC, use the following compiler options:
  • For Linux:
    -no-fma -fp-model=precise
  • For Windows:
    /Qfma- /fp:precise
For more information about these options, refer to the Using the -fp-model (/fp) Option and fma, Qfma topics in the Intel® oneAPI DPC++/C++ Compiler Developer Guide and Reference.

Tree Balancing

Order-of-operation rules apply in the DPC++ language. In the following example, the Intel® oneAPI DPC++/C++ Compiler performs multiplications and additions in a strict order, beginning with operations within the innermost parentheses:
result = (((A * B) + C) + (D * E)) + (F * G);
By default, the Intel® oneAPI DPC++/C++ Compiler creates an implementation that resembles a long vine for such computations, as illustrated in the following figure:
Default Floating-Point Implementation
Long, unbalanced chains of operations lead to more expensive hardware. A more efficient hardware implementation is a balanced tree, as shown below:
Balanced Tree
In a balanced tree implementation, the Intel® oneAPI DPC++/C++ Compiler converts the long vine of floating-point adders into a tree pipeline structure. The compiler does not perform tree balancing of floating-point operations automatically because the outcomes of the floating-point operations might differ, making this optimization inconsistent with IEEE Standard 754-2008.
If you want the Intel® oneAPI DPC++/C++ Compiler to optimize floating-point operations using balanced trees and your program can tolerate small differences in floating-point results, include the -Xsfp-relaxed option in the dpcpp command, as shown below:
dpcpp -fintelfpga -Xshardware -Xsfp-relaxed <source_file>.cpp

Rounding Operations

The balanced tree implementation of a floating-point operation includes multiple rounding operations. These rounding operations can require a significant amount of hardware resources in some applications. The Intel® oneAPI DPC++/C++ Compiler does not reduce the number of rounding operations automatically because doing so violates the results required by IEEE Standard 754-2008.
You can reduce the amount of hardware necessary to implement floating-point operations with the -Xsfpc option of the dpcpp command. If your program can tolerate small differences in floating-point results, invoke the following command:
dpcpp -fintelfpga -Xshardware -Xsfpc <source_file>.cpp
The -Xsfpc option directs the Intel® oneAPI DPC++/C++ Compiler to perform the following tasks:
  • Remove floating-point rounding operations and conversions whenever possible. Where it can, the -Xsfpc option directs the Intel® oneAPI DPC++/C++ Compiler to round a floating-point operation only once, at the end of the tree of floating-point operations.
  • Carry additional mantissa bits to maintain precision. The Intel® oneAPI DPC++/C++ Compiler carries additional precision bits through the floating-point calculations and removes these bits at the end of the tree of floating-point operations.
This type of optimization results in hardware that performs a fused floating-point operation, a feature of many newer hardware-processing systems. Fusing multiple floating-point operations minimizes the number of rounding steps, which leads to more accurate results. An example of this optimization is the fused multiply-accumulate (FMAC) instruction available in newer processor architectures. The Intel® oneAPI DPC++/C++ Compiler can provide fused floating-point mathematical capabilities for many combinations of floating-point operators in your kernel.

Product and Performance Information

1. Performance varies by use, configuration, and other factors. Learn more at www.Intel.com/PerformanceIndex.