Consistency of Floating-Point Results using the Intel® Compiler
or
Why doesn’t my application always give the same answer?


Dr. Martyn J. Corden
David Kreitzer

Software Solutions Group
Intel Corporation


Introduction

Binary floating-point [FP] representations of most real numbers are inexact, and there is an inherent uncertainty in the result of most calculations involving floating-point numbers. Programmers of floating-point applications typically have the following objectives:
• Accuracy
  o Produce results that are “close” to the result of the exact calculation
    - Usually measured in fractional error, or sometimes in “units in the last place” (ulp)
• Reproducibility
  o Produce consistent results:
    - From one run to the next
    - From one set of build options to another
    - From one compiler to another
    - From one processor or operating system to another
• Performance
  o Produce an application that runs as fast as possible

These objectives usually conflict! However, good programming practices and judicious use of compiler options allow you to control the tradeoffs.

For example, it is sometimes useful to have a degree of reproducibility that goes beyond the inherent accuracy of a computation. Some software quality assurance tests may require close, or even bit-for-bit, agreement between results before and after software changes, even though the mathematical uncertainty in the result of the computation may be considerably larger. The right compiler options can deliver consistent, closely reproducible results while preserving good (though not optimal) performance.
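The root of the reproducibility problem is that floating-point addition is not associative, so any optimization that reorders a sum can change the rounded result. The following is a minimal sketch in C, not taken from the article; the function names are hypothetical, and it assumes IEEE 754 double arithmetic with round-to-nearest and no value-unsafe optimizations (for example, no fast-math style options).

```c
#include <stdio.h>

/* Evaluate (a + b) + c, rounding after each addition. */
double sum_left_to_right(double a, double b, double c)
{
    double t = a + b;
    return t + c;
}

/* Evaluate a + (b + c), rounding after each addition. */
double sum_right_to_left(double a, double b, double c)
{
    double t = b + c;
    return a + t;
}

/* Print both groupings so the difference is visible. */
void show_grouping_difference(double a, double b, double c)
{
    printf("(a + b) + c = %.17g\n", sum_left_to_right(a, b, c));
    printf("a + (b + c) = %.17g\n", sum_right_to_left(a, b, c));
}
```

Calling show_grouping_difference(1e16, -1e16, 1.0) shows the effect: near 1e16 the spacing between adjacent doubles is 2, so adding 1.0 to -1e16 is lost to rounding in the second grouping, while the first grouping cancels the large terms before adding 1.0.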

....................<omitted>........................

Bottom Line

Compiler options let you control the tradeoffs between accuracy, reproducibility and performance. Use

/fp:precise /fp:source (Windows*) or
-fp-model precise -fp-model source (Linux* or OS X*)

to improve the consistency and reproducibility of floating-point results while limiting the impact on performance.
If reproducibility between different processor types of the same architecture is important, also use

/Qimf-arch-consistency:true (Windows) or
-fimf-arch-consistency=true (Linux or OS X)

For best reproducibility between processors that support FMA instructions and processors that do not, also use /Qfma- (Windows) or -no-fma (Linux or OS X). In the version 17 compiler, best reproducibility may be obtained with the single switch /fp:consistent (Windows) or -fp-model consistent (Linux or OS X), which sets all of the above options.

For the complete article, updated for version 17 of the Intel® Compiler, please open the attached PDF file.

See here for a comparison to Intel® MIC Architecture.

Attachment: FP_Consistency_070816.pdf (437.52 KB)

8 comments

Olgu T.

Hello:

I am migrating
FROM(OLD)
Intel(R) Fortran Compiler Integration for Microsoft Visual Studio 2005, 10.1.3885.2005
TO(NEW)
Intel(R) Visual Fortran Composer XE 2011 Update 11 Integration for Microsoft Visual Studio* 2005, 12.1.3534.2005.

I built my Fortran projects with the two compilers above on two different Windows 7 computers. I have two run cases, case1 and case2. Both cases run correctly with the old compiler. With the new compiler, case1 runs and matches the output of the build from the old compiler, but case2 first diverges and then crashes (NaN). My guess is that the crash is due to differences in floating-point arithmetic between the two compilers, and that the project property options below do not line up because there is no one-to-one correspondence between the floating-point options of the two compilers.

OLD
Floating-Point Consistency = Improve Consistency (/Op)

NEW
Floating Point Model = Source(/fp:source)
Reliable Floating Point Exceptions Model = Disable (/fp:except-)

Can you please help?

anonymous

We have a single-threaded application that produces results that vary from run to run, using the same input data on the same machine, although we compile everything with -fp-model precise -fp-model source (ifort version 11.1). The only clue so far is that the module that produces the discrepancies uses some math library functions.
Is it possible that the math library that comes with Intel Fortran (libimf.so, libimf.a) is not compiled with the -fp-model precise/source flags?

Tim P.

With icc 12.0 (XE 2011), optimizations that are removed by /fp:source can be restored for an individual loop with #pragma simd, which may in effect set /fp:fast for the immediately following for loop. The pragma also accepts the reduction and private clauses, which may enable optimizations that would otherwise be missed.
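A minimal sketch of the per-loop opt-in described in this comment, for a dot product. The function name dot is hypothetical. #pragma simd is specific to the Intel compiler of that era (OpenMP 4.0 later standardized a similar #pragma omp simd directive); other compilers typically ignore the unknown pragma and compile the loop as ordinary serial code.

```c
#include <stddef.h>

/* Dot product whose reduction the programmer explicitly allows the
 * compiler to vectorize, even when the build uses /fp:source.
 * The reduction clause tells the compiler that sum may be accumulated
 * in several partial sums and combined afterwards, so the order of the
 * additions (and therefore the rounding) may differ from the serial loop. */
float dot(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i;

    #pragma simd reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

The trade-off is the one the article describes: the annotated loop regains performance, but its result is no longer guaranteed to match the strictly ordered sum bit for bit.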

Martyn Corden (Intel)

(1) The only case I know of where rerunning an executable built with /fp:precise on an identical data set and equivalent processor would not produce identical results is if it contains a reduction loop that is threaded using OpenMP. The OpenMP standard states that the partial sums may be added in any order (e.g., as they become available), and that this order may vary from run to run. This can lead to differences in rounding and small differences in the final result of the reduction. The ways to avoid this are
1) don’t code reduction loops using OpenMP and the reduction clause; or
2) accumulate explicit partial sums for each thread and then add them in a separate, unthreaded loop or statement, where the order will be preserved with /fp:precise.
This doesn’t apply to vectorized reduction loops or auto-parallelized (with /Qparallel) reduction loops, since the compiler knows not to vectorize or parallelize such loops if /fp:precise is specified. But if a directive specifies that a reduction loop is to be threaded using OpenMP, the compiler is not at liberty to change that.
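Option 2 can be sketched as follows, in C rather than Fortran. This is a variant, not the exact scheme described above: instead of one partial sum per thread, it uses a fixed number of chunks (NUM_CHUNKS, a hypothetical constant), so that the result also does not depend on how many threads happen to run. The function name reproducible_sum is likewise an assumption. It relies on the compiler not reordering the serial additions, which is what /fp:precise is described as guaranteeing.

```c
#include <stddef.h>

#define NUM_CHUNKS 8  /* hypothetical: fixed, independent of thread count */

/* Sum x[0..n-1] so that the order of additions is the same on every run.
 * Threads only decide WHO sums each chunk, never the order of additions. */
double reproducible_sum(const double *x, size_t n)
{
    double partial[NUM_CHUNKS];
    size_t chunk = (n + NUM_CHUNKS - 1) / NUM_CHUNKS;  /* ceiling division */
    double total = 0.0;
    int c;

    /* Each chunk is summed serially, front to back, by exactly one thread.
     * No OpenMP reduction clause is used. Without OpenMP enabled, the
     * pragma is ignored and the loop simply runs serially. */
    #pragma omp parallel for
    for (c = 0; c < NUM_CHUNKS; c++) {
        size_t begin = (size_t)c * chunk;
        size_t end = (begin + chunk < n) ? begin + chunk : n;
        double s = 0.0;
        size_t i;
        for (i = begin; i < end; i++)
            s += x[i];
        partial[c] = s;
    }

    /* Unthreaded combine: always chunk 0 first, chunk NUM_CHUNKS-1 last. */
    for (c = 0; c < NUM_CHUNKS; c++)
        total += partial[c];

    return total;
}
```

The result generally differs in the last bits from a plain front-to-back serial sum, because the grouping is different, but it is the same grouping on every run.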

Where external libraries are used to perform a reduction in parallel, reproducibility depends on how that reduction is implemented internally. So reductions (such as matrix multiplications) in math libraries such as MKL, reductions such as MPI_Allreduce in some MPI libraries, etc., may not be exactly reproducible from run to run.

(2) Alignment matters because SSE SIMD instructions require 16-byte aligned data, and SSE loads of aligned data are much more efficient than loads of unaligned data. So to vectorize a loop, the compiler “peels off” one or two iterations to get to the point at which the data arrays are 16-byte aligned. These peeled iterations constitute a loop prolog that may be optimized differently from the main, vectorized loop kernel (and likewise for any loop epilog). When the alignment changes, the number of iterations in the prolog and epilog may change. The way an iteration is optimized may change when it moves from the prolog to the kernel. The most obvious (and I think most frequent) case is again a reduction loop, where the order in which the terms are added, and consequently the rounding effects, change when an iteration is moved from the prolog to the kernel.
These alignment considerations apply equally to single and double precision variables. Only variables that appear in vectorized loops need be aligned (primarily variables in vectorized reduction loops; I’m not sure whether other differences might arise between loop prologs and vectorized loop kernels).
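The effect of peeling on a reduction can be modeled in plain C. The sketch below is a simulation under stated assumptions, not the code the compiler actually generates: it assumes a two-lane (SSE, double precision) sum in which the scalar prolog result seeds lane 0, the lanes are added together after the kernel, and leftover elements are added in a scalar epilog. The names peel_count and modeled_vector_sum are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of leading doubles to peel so that the vector kernel starts on
 * a 16-byte boundary. Assumes p is at least 8-byte aligned. */
size_t peel_count(const double *p)
{
    uintptr_t misalign = (uintptr_t)p % 16;
    return misalign ? (size_t)((16 - misalign) / sizeof(double)) : 0;
}

/* Model of a two-lane vectorized sum with a scalar prolog of `peel`
 * iterations and a scalar epilog for any leftover element. */
double modeled_vector_sum(const double *x, size_t n, size_t peel)
{
    double lane0 = 0.0, lane1 = 0.0, sum;
    size_t i = 0;

    if (peel > n)
        peel = n;

    for (; i < peel; i++)          /* prolog: scalar, seeds lane 0 */
        lane0 += x[i];

    for (; i + 1 < n; i += 2) {    /* kernel: two independent partial sums */
        lane0 += x[i];
        lane1 += x[i + 1];
    }

    sum = lane0 + lane1;           /* horizontal add of the lanes */

    for (; i < n; i++)             /* epilog: scalar */
        sum += x[i];

    return sum;
}
```

For the deliberately extreme input {1.0, 1e16, -1e16, 1.0}, and assuming IEEE 754 double arithmetic with round-to-nearest, the model returns 0.0 with peel = 0 and 1.0 with peel = 1, although the data are identical. Real applications see far smaller differences, but the mechanism is the same: a change in alignment changes which additions are grouped together.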

As I understand it, MKL is not built with /fp:precise, which is why you need to pass in consistently aligned arrays to get exactly reproducible results. In Intel Fortran, the simplest way to ensure data alignment is with the !DIR$ ATTRIBUTES ALIGN directive.

Finally, a reminder that the variations we are talking about are very tiny for individual operations, though they can accumulate over the length of an application. Even then, the accumulated variations should be smaller than the numerical uncertainty in the result. I understand that some users like to use bit-to-bit comparisons as a way of checking that no unintended changes have occurred in their application. But reproducibility should not be mistaken for accuracy; a result can be precisely reproducible but wildly inaccurate. There is a real numerical uncertainty in the result of almost all floating-point calculations.
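Where a regression test does not truly need bit-for-bit agreement, one alternative is to compare results against a tolerance chosen from the known numerical uncertainty. The following is a minimal sketch in C; the function name agrees_within and the choice of a relative tolerance are assumptions for illustration, not something prescribed in this discussion.

```c
/* Return 1 if a and b agree to within rel_tol relative to the larger
 * magnitude, else 0. A NaN in either argument yields 0. */
int agrees_within(double a, double b, double rel_tol)
{
    double diff  = (a > b) ? a - b : b - a;
    double mag_a = (a < 0.0) ? -a : a;
    double mag_b = (b < 0.0) ? -b : b;
    double scale = (mag_a > mag_b) ? mag_a : mag_b;

    return diff <= rel_tol * scale;
}
```

A test written this way passes when two builds differ only by rounding effects and still fails when a change shifts the result by more than the stated tolerance. It says nothing about whether either result is accurate.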

Tony Garratt

Very useful! But I have 2 questions:

(1) You imply (by using the statement "may also be used to improve the run-to-run consistency") that even with /fp:precise, the results may still vary from run to run. Is that the case or not, and, if so, what can we do to be sure the results are consistent?

(2) I would like to understand how the environment and memory alignment can affect floating-point operations and optimization. This topic also comes up with MKL (we have ensured that all of our double precision arrays passed to MKL are aligned on 16-byte boundaries), but that is not practical to do for all Fortran ALLOCATE statements that allocate arrays that are NOT passed to MKL. Is it necessary to ensure that ALL double precision arrays are aligned on Windows or Linux platforms, even if you are not using MKL, unless you use /fp:precise?

anonymous

This article was very helpful. Before using /fp:precise I was getting two different results when passing an external subroutine to an external bisection routine and when using the same bisection routine but copied as a local function (this was with OpenMP running on top). After /fp:precise, each method produces identical results (which differ from both of the non-precise results!). Thanks.

Martyn Corden (Intel)

Slight variations in floating-point results have sometimes been seen when the same executable is run from different directories or on different (but equivalent) systems. Some applications may allocate a variable length string containing the directory name, node name, or even the date and time. This may result in variations in the alignment of subsequent memory allocations, which may in turn cause slight variations in rounding effects and in floating point results, as discussed in the article.

The switches -fp-model precise or -fp-model source (/fp:precise or /fp:source on Windows*) may also be used to improve the run-to-run consistency of floating-point results in such cases.

Tim P.

It might be noted that it is practical to re-enable certain performance features that are removed by -fp:source, e.g. -fp:source -Qftz.
Future CPUs are planned to correct many of the performance problems associated with IEEE gradual underflow.
One of the more significant and less risky optimizations removed by -fp:source is "vectorized" sum or dot product reduction.
ifort has compatibility options not available in Intel C/C++:
-assume:protect_parens,minus0
The combination ifort -assume:protect_parens,minus0 -Qprec-div -Qprec-sqrt (with slight changes in spelling for Linux/Mac) removes some of the more standards-noncompliant behavior without disabling important optimizations. protect_parens (like -fp:source) requires that parentheses be honored; minus0 sets compliance with the Fortran 95 and Fortran 2003 standards on the treatment of negative zero.
Setting compatibility options in ifort.cfg, icc.cfg, and icpc.cfg in the compiler installation may help avoid problems without having to remember to set the options every time.
