The floating-point model used by the Intel® Compiler and its application to Intel® Xeon® processors is described in the paper "Consistency of Floating-Point Results using the Intel® Compiler." For a suitable choice of settings, the compiler generates code that is fully compliant with the ANSI language standards and the IEEE-754 standard for binary floating-point arithmetic. Compiler options give the user control over the tradeoffs between optimizations for performance, accuracy, reproducibility of results and strict conformance with these standards.
The same floating-point model applies to the Intel® Xeon Phi™ coprocessor x100 product family (formerly code-named Knights Corner), but the architectural differences compared to Intel Xeon processors lead to a few small differences in implementation. Those differences are the subject of this paper.
The Intel Xeon Phi coprocessor x100 product family supports the same floating-point data types as the Intel Xeon processor. Single (32-bit) and double (64-bit) precision are supported in hardware; quadruple (128-bit) precision is supported through software. Extended (80-bit) precision is supported through the x87 instruction set. Denormalized numbers and gradual underflow are supported, but abrupt underflow is the default at all optimization levels except -O0. The same set of rounding modes is supported as for Intel Xeon processors.
The biggest differences arise in the treatment of floating-point exceptions. The vector floating-point unit on the Intel Xeon Phi coprocessor flags floating-point exceptions but does not support trapping them. The exception masking bits in the VXCSR register are protected; attempts to modify them result in a segmentation fault. Some compiler options, such as -fp-trap (C/C++) or -fpe0 (Fortran), that would unmask floating-point exceptions on Intel Xeon processors are unsupported on Intel® Many Integrated Core Architecture (Intel® MIC architecture).
The options -fp-model except and -fp-model strict still request strict, standard-conforming semantics for floating-point exceptions. This is achieved by generating x87 code for floating-point operations instead of code that makes use of Intel® Initial Many Core Instructions (Intel® IMCI). Because such code cannot be vectorized, this may have a substantial impact on performance. Nevertheless, these options may be useful for application debugging. For similar reasons, the options -ansi and -fmath-errno may result in calls to math functions that are implemented using x87 rather than Intel IMCI.
The -fp-model switch
The same settings of the -fp-model switch are supported as for Intel Xeon processors. The default setting is -fp-model fast=1 for both. The behavior of the -fp-model precise option is the same, though the consequent reduction in performance may be somewhat greater for Intel Xeon Phi coprocessors, because of the larger vector width on Intel MIC architecture and larger potential performance gain from the vectorization of reduction loops and loops containing transcendental math functions. The impact of -fp-model except and -fp-model strict on performance may be greater, for the reasons noted above.
The setting -fp-model fast=2 sets the -fimf-domain-exclusion switch and enables faster, inlined versions of some math functions for the Intel Xeon Phi coprocessor; see the section "Precision of Math Functions" below. -fp-model fast=2 is also implied by the option -fast.
The -fp-model precise switch enables arithmetic using denormalized numbers and disables abrupt underflow; abrupt underflow can be re-enabled explicitly using -ftz. Even when enabled, e.g. by -ftz, abrupt underflow is allowed but never required. The compiler is still free to generate denormals (gradual underflow), for example by generating x87 code. This behavior is the same as on Intel Xeon processors.
Intel Xeon processors up to and including the Intel Xeon processor v2 family do not have a fused multiply-add (FMA) instruction. Intel Xeon Phi coprocessors support an FMA instruction that only performs a single rounding on the final result, so can yield results that differ very slightly from separate addition and multiplication instructions.
In principle, the -fp-model strict switch would disable fused multiply-add (FMA) instructions. But since, as noted above, -fp-model strict suppresses Intel IMCI in favor of legacy x87 floating-point instructions, this additional behavior is moot. Optimizations resulting in FMA instructions are enabled by default, but may be disabled directly by the switch -no-fma. Optimizations resulting in FMA instructions are not disabled by -fp-model precise.
Precision of Math Functions
In the Intel® Composer XE 2015 product, the compiler by default invokes transcendental functions of at least medium accuracy (< 4 ulp) for both scalar and vector code, on both Intel Xeon Phi coprocessors and Intel Xeon processors. Depending on circumstances, the compiler may call more accurate implementations. For example, scalar code on Intel® 64 architecture systems is likely to call libm (< 0.6 ulp), although vector code usually calls the medium accuracy Short Vector Math Library (libsvml). On Intel Xeon Phi coprocessors, division defaults to medium accuracy inlined code that uses a reciprocal instruction, and square root defaults to medium accuracy inlined code that uses an inverse square root instruction. The -fimf-domain-exclusion switch may be used to obtain inlined versions of certain other math functions; see the compiler documentation for details of the excluded domains.
Corresponding higher accuracy (typically 0.6 ulp or 1.0 ulp), vectorizable versions of divide, square root and transcendental functions may be obtained with -fimf-precision=high. For division and square root, these high accuracy versions are also called with -fp-model precise. For transcendental functions, -fp-model precise results in high accuracy (< 0.6 ulp), scalar calls to libm. The switches -prec-div, -prec-sqrt and -no-fast-transcendentals, respectively, have the same effect. x87 instructions continue to be generated for -fp-model strict.
Medium accuracy, vectorizable transcendental math functions can still be obtained in conjunction with -fp-model precise by specifying -fp-model precise -fast-transcendentals. Higher accuracy, vectorizable versions are obtained by also adding -fimf-precision=high.
Comparing floating-point results between Intel Xeon Phi coprocessors and Intel Xeon processors
In general, floating-point computations on an Intel Xeon Phi coprocessor may not give bit-for-bit identical results to the equivalent computations on an Intel Xeon processor, even though underlying hardware instructions conform to the same standards. Compiler optimizations may be implemented differently, math functions may be implemented differently, etc. The fused multiply-add (FMA) instruction available on the Intel Xeon Phi coprocessor is a common source of differences. Nevertheless, the following guidelines may help to minimize differences between results obtained on different platforms, at a cost in performance:
Build your application on both platforms using -fp-model precise -fp-model source. This also selects high accuracy math functions on both platforms.
If your Intel Xeon processor does not support FMA instructions (e.g., the Intel Xeon processor v2 family and earlier), build your application for the Intel Xeon Phi coprocessor with -no-fma, to disable the use of FMA instructions. If your Intel Xeon processor does support FMA instructions (e.g., the Intel Xeon processor v3 family), you may try building your application for this processor using the -xCORE-AVX2 option. However, FMA generation is an optimization, and since the compiler may sometimes make different FMA optimization decisions for different platforms, you may still be able to reduce differences further with -no-fma. Alternatively, for C or C++ applications, you may disable FMA generation for individual functions by using #pragma fp_contract (off | on), or for Fortran applications, by using the !DIR$ NOFMA directive.
For OpenMP applications that perform reductions in parallel, set the environment variable KMP_DETERMINISTIC_REDUCTION=yes, use static scheduling, and use OMP_NUM_THREADS to set the same number of threads on each platform.
For C++ applications making use of Intel® Threading Building Blocks (Intel® TBB), the parallel_deterministic_reduce() function may help to obtain more consistent results on both platforms, even for differing numbers of threads.
These guidelines are intended to help enhance reproducibility and minimize variations in floating-point results between different platforms. The observed variations do not represent the true numerical uncertainty in your result, which may be much greater.
Intel, the Intel logo, Core, Phi and Xeon are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2014 Intel Corporation. All rights reserved.
For more complete information about performance and benchmark results, visit www.intel.com/benchmarks