Developer Guide and Reference

Programming Tradeoffs in Floating-Point Applications

In general, the programming objectives for floating-point applications fall into the following categories:
  • Accuracy:
    The application produces results that are close to the correct result.
  • Reproducibility and portability:
    The application produces consistent results across different runs, different sets of build options, different compilers, different platforms, and different architectures.
  • Performance:
    The application runs as fast and as efficiently as possible.
Based on the goal of an application, you will need to make tradeoffs among these objectives. For example, if you are developing a 3D graphics engine, performance may be the most important factor to consider, with reproducibility and accuracy as secondary concerns.
The Intel® C++ Compiler provides several compiler options that allow you to tune your applications based on specific objectives. Broadly speaking, there are the floating-point specific options, such as the -fp-model (Linux* and macOS*) or /fp (Windows*) option, and the fast-but-low-accuracy options, such as the [Q]imf-max-error option. The compiler optimizes and generates code differently when you specify these different compiler options. Select appropriate compiler options by carefully balancing your programming objectives and making tradeoffs among these objectives. Some of these options may influence the choice of math routines that are invoked.
Many routines in the libirc, libm, and svml libraries are more highly optimized for Intel microprocessors than for non-Intel microprocessors.

Using Floating-Point Options

Take the following code as an example:
Example
    float t0, t1, t2;
    ...
    t0 = t1 + t2 + 4.0f + 0.1f;
If you specify the -fp-model extended (Linux* and macOS*) or /fp:extended (Windows*) option in favor of accuracy, the compiler generates the following assembly code:
    fld       DWORD PTR _t1
    fadd      DWORD PTR _t2
    fadd      DWORD PTR _Cnst4.0
    fadd      DWORD PTR _Cnst0.1
    fstp      DWORD PTR _t0
This code maximizes accuracy because it utilizes the highest mantissa precision available on the target platform. The code performance might suffer when managing the x87 stack, and it might yield results that cannot be reproduced on other platforms that do not have an equivalent extended precision type.
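The effect of wider intermediates can be imitated in portable C. The following is a minimal sketch, not the compiler's actual code generation: the function names sum_single and sum_wide are hypothetical, the volatile qualifier is used only to force each partial sum to be rounded to float, and long double stands in for the x87 extended format (on some platforms long double is no wider than double).

```c
/* Accumulate in single precision. The volatile store forces every
   partial sum to be rounded to float before the next addition. */
float sum_single(float t1, float t2)
{
    volatile float acc = t1;
    acc = acc + t2;
    acc = acc + 4.0f;
    acc = acc + 0.1f;
    return acc;
}

/* Accumulate in long double and round to float once, at the end.
   This only approximates keeping intermediates in x87 registers. */
float sum_wide(float t1, float t2)
{
    long double acc = t1;
    acc += t2;
    acc += 4.0f;
    acc += 0.1f;
    return (float)acc;
}
```

With t1 = 1.0e8f and t2 = 0.5f, adjacent float values near the result are 8 apart. sum_single loses every small addend to rounding and returns 100000000.0f, while sum_wide keeps them until the final rounding and returns 100000008.0f, which is the float closest to the exact sum.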
If you specify the -fp-model source (Linux* and macOS*) or /fp:source (Windows*) option in favor of reproducibility and portability, the compiler generates the following assembly code:
    movss     xmm0, DWORD PTR _t1
    addss     xmm0, DWORD PTR _t2
    addss     xmm0, DWORD PTR _Cnst4.0
    addss     xmm0, DWORD PTR _Cnst0.1
    movss     DWORD PTR _t0, xmm0
This code maximizes portability by preserving the original order of the computation, and by using the IEEE single-precision type for all computations. It is not as accurate as the previous implementation, because the intermediate rounding error is greater compared to extended precision. It is not the highest performance implementation, because it does not take advantage of the opportunity to pre-compute 4.0f + 0.1f.
If you specify the -fp-model fast (Linux* and macOS*) or /fp:fast (Windows*) option in favor of performance, the compiler generates the following assembly code:
    movss     xmm0, DWORD PTR _Cnst4.1
    addss     xmm0, DWORD PTR _t1
    addss     xmm0, DWORD PTR _t2
    movss     DWORD PTR _t0, xmm0
This code maximizes performance by using Intel® Streaming SIMD Extensions (Intel® SSE) instructions and by pre-computing 4.0f + 0.1f. It is not as accurate as the first implementation, due to the greater intermediate rounding error. It does not provide reproducible results like the second implementation, because it must reorder the addition to pre-compute 4.0f + 0.1f. Not all compilers, on all platforms and at all optimization levels, reorder the addition in the same way.
For many other applications, the considerations may be more complicated.
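To see why reordering the addition affects reproducibility, the two evaluation orders from the source and fast examples can be written out by hand. This is an illustrative sketch under stated assumptions: the function names are hypothetical, float is assumed to be IEEE single precision, and volatile is used only to force rounding to float after each addition.

```c
/* Source order: ((t1 + t2) + 4.0f) + 0.1f, rounded after each step. */
float sum_source_order(float t1, float t2)
{
    volatile float acc = t1;
    acc = acc + t2;
    acc = acc + 4.0f;
    acc = acc + 0.1f;
    return acc;
}

/* Reordered: the two constants are combined first (as a compiler
   might do at build time), then t1 and t2 are added. */
float sum_constants_first(float t1, float t2)
{
    volatile float acc = 4.0f + 0.1f;
    acc = acc + t1;
    acc = acc + t2;
    return acc;
}
```

For most inputs the two functions agree or differ only in the last bit. With t1 = 1.0e8f and t2 = -1.0e8f the difference is large: sum_source_order cancels the two large values first and returns 4.1f, whereas sum_constants_first rounds 4.1f + 1.0e8f up to 100000008.0f and returns 8.0f.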

Using Fast-But-Low-Accuracy Options

The fast-but-low-accuracy options provide an easy way to control the accuracy of mathematical functions and utilize performance/accuracy tradeoffs offered by the Intel® Math Kernel Library (Intel® MKL). You can specify accuracy, via a command line interface, for all math functions or for a selected set of math functions, at a finer granularity than the low, medium, or high accuracy levels.
You specify the accuracy requirements as a set of function attributes that the compiler uses for selecting an appropriate function implementation in the math libraries. The examples here use the max-error and accuracy-bits attributes. For example, use the following option to specify a relative error of two ULPs for all single, double, long double, and quad precision functions:
    -fimf-max-error=2
To specify twelve bits of accuracy for a sin function, use:
    -fimf-accuracy-bits=12:sin
To specify a relative error of ten ULPs for a sin function, and four ULPs for other math functions called in the source file you are compiling, use:
    -fimf-max-error=10:sin -fimf-max-error=4
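Because these options express error bounds in ULPs (units in the last place), it can help to measure how many ULPs separate a computed result from a reference value. The sketch below is one common way to do that for single precision. It is not part of the compiler or its libraries: the names ordered_bits and ulp_distance are hypothetical, and the code assumes finite inputs and IEEE 754 binary32 float.

```c
#include <stdint.h>
#include <string.h>

/* Map a finite float to an integer so that adjacent floats map to
   adjacent integers. Negative values are mirrored below zero. */
static int64_t ordered_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);   /* read the bit pattern safely */
    if (u & 0x80000000u)
        return -(int64_t)(u & 0x7FFFFFFFu);
    return (int64_t)u;
}

/* Number of float steps between a and b; 0 means identical values. */
int64_t ulp_distance(float a, float b)
{
    int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}
```

A typical use is to compare the float result of a math routine against a reference computed in higher precision and rounded to float. Note that this integer distance is only a coarse check: the max-error attribute is defined relative to the exact result and can take fractional values.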
On Windows systems, the Intel® C++ Compiler defines the default value for the max-error attribute depending on the /fp option and /Qfast-transcendentals settings. In /fp:fast mode, or if fast but less accurate math functions are explicitly enabled by /Qfast-transcendentals, the compiler sets max-error=4.0 for the call. Otherwise, it sets max-error=0.6.

Dispatching of Math Routines

The Intel® C++ Compiler optimizes calls to routines from the libm and svml libraries into direct CPU-specific calls when the compilation configuration specifies the target CPU for which the code is tuned, and the set of instructions available for code generation is not narrower than the set of instructions available on the tuning target CPU.
For example:
  • Code that contains calls to the exp() library function and is compiled with -mtune=corei7-avx (specifies a tuning target CPU that supports Intel® Advanced Vector Extensions (Intel® AVX)) and -QxCORE-AVX2/-march=core-avx2 (specifies the Intel® Advanced Vector Extensions 2 (Intel® AVX2) instruction set) calls the exp() routine that is optimized for processors with Intel® AVX support. This code provides the best performance for these processors.
  • The same code, compiled with -mtune=core-avx2 and -QxAVX/-march=corei7-avx, calls a library dispatch routine that picks the optimal CPU-specific version of the exp() routine at run time. Dispatching cannot be avoided because the instruction set does not allow the use of Intel® AVX2. Dynamic dispatching provides the best performance on an Intel® AVX2 CPU.
In the second example, if some portions of code extend the available instruction set by means of the _allow_cpu_features() or the _may_i_use_cpu_feature() intrinsic, then the compiler might produce direct calls to Intel® AVX2 specific versions of exp().
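The general shape of a run-time dispatch routine can be sketched as follows. This is an illustration of the idea only and does not reflect how libm or svml are implemented: every name in it (poly_generic, poly_avx2, cpu_has_avx2, poly_dispatch) is hypothetical, the "AVX2" variant is a plain C stand-in that computes the same value, and the CPU check uses the GCC/Clang builtin __builtin_cpu_supports rather than the Intel intrinsics so that the sketch builds with other compilers.

```c
#include <stddef.h>

typedef double (*poly_fn)(double);

/* Hypothetical baseline implementation of some math routine. */
static double poly_generic(double x) { return 1.0 + x + 0.5 * x * x; }

/* Hypothetical stand-in for a version tuned for AVX2 processors.
   A real library would use different instructions here. */
static double poly_avx2(double x) { return 1.0 + x + 0.5 * x * x; }

/* Hypothetical feature probe. On x86 with GCC or Clang it asks the
   CPU; on any other target it conservatively reports no AVX2. */
static int cpu_has_avx2(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0;
#endif
}

static poly_fn resolved = NULL;

/* Dispatch routine: choose an implementation on the first call and
   reuse it afterwards. Not thread-safe; kept simple on purpose. */
double poly_dispatch(double x)
{
    if (resolved == NULL)
        resolved = cpu_has_avx2() ? poly_avx2 : poly_generic;
    return resolved(x);
}
```

A direct call, by contrast, would name poly_avx2 at compile time and skip the check entirely, which is only safe when the compiler already knows that AVX2 instructions are allowed in that portion of the code.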
The dispatching optimization applies to the exp() routine, and to the other math routines with CPU-specific implementations in the libraries. The dispatching optimization can be disabled using the -fimf-force-dynamic-target (or /Qimf-force-dynamic-target) option. This option specifies a list of math routines that are to be called through the dynamic dispatcher. (See the Intel® C++ Compiler documentation for syntax examples.)