If I rerun the identical program on the identical input data on an identical processor, will I get an identical result?
The only known source of variability in floating-point results for serial code is variations in data alignment that are triggered by changes in the external environment. An example would be if the user dynamically allocates a new string to contain the date. The size of this string could vary for different days of the week or months of the year, which in turn could affect the alignment of subsequently allocated data. Alignment variations can change which loop iterations are part of the loop prolog or epilog (“remainder loop”), and which are part of the vectorized kernel. This in turn changes the order of operations in a floating-point reduction; the rounding of different intermediate results to the declared precision can lead to small variations in the final result. Other differences in optimization between the prolog, epilog and kernel can cause similar variations. Such variations can be prevented in either of two ways:
1) By explicitly aligning any data that are allocated at run-time. For C/C++ programs, this can be done using aligned versions of malloc/calloc/free etc., such as _mm_malloc() and _mm_free() or the POSIX equivalents. For Fortran programs, !DIR$ ATTRIBUTES ALIGN:64 :: arrayname directives may be used, or the command line switch -align array64byte (recommended). Aligning data to 64-byte boundaries may also have performance benefits.
2) By compiling with -fp-model precise. This disables vectorization of reductions and certain other optimizations that may lead to variations in floating-point results, not just from run to run, but also between different optimization levels. There is a potential impact on performance that should be measured, since it can be very application-dependent. Some optimizations that are not expected to impact run-to-run reproducibility can be re-enabled using options such as -ftz, -fast-transcendentals, -no-prec-div and -no-prec-sqrt.
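The effect of alignment on a reduction can be sketched in C. This is an illustration only, not the code the Intel compiler generates: sum_peeled() and alloc_aligned64() are hypothetical helpers, the vector width of 4 elements is an assumption, and posix_memalign() stands in for the POSIX aligned allocator mentioned in option 1.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* exposes posix_memalign() in <stdlib.h> */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define LANES 4  /* assumed vector width in elements (hypothetical) */

/* Allocate n doubles on a 64-byte boundary; release with free().
   Returns NULL on failure. */
double *alloc_aligned64(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 64, n * sizeof(double)) != 0)
        return NULL;
    return p;
}

/* Sum n values roughly the way a vectorized reduction might:
   `peel` leading elements go through a scalar prolog, full groups of
   LANES elements are accumulated into per-lane partial sums, and the
   leftover elements go through a scalar epilog. */
double sum_peeled(const double *x, size_t n, size_t peel)
{
    double lane[LANES] = {0.0};
    double total = 0.0;
    size_t i = 0;
    size_t k;

    assert(peel <= n);
    for (; i < peel; ++i)               /* prolog */
        total += x[i];
    for (; i + LANES <= n; i += LANES)  /* vectorized kernel */
        for (k = 0; k < LANES; ++k)
            lane[k] += x[i + k];
    for (k = 0; k < LANES; ++k)         /* combine the lane sums */
        total += lane[k];
    for (; i < n; ++i)                  /* epilog (remainder loop) */
        total += x[i];
    return total;
}
```

For the array {1.0, u, u, u, u, u, u, u} with u = 2^-53, sum_peeled(x, 8, 0) returns 1 + 3*2^-52 while sum_peeled(x, 8, 1) returns exactly 1.0 under default round-to-nearest arithmetic. The data are identical; only the split between prolog, kernel and epilog changed. Allocating the array with a helper such as alloc_aligned64() is intended to keep that split the same from run to run.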
Parallel applications are subject to further sources of variation in addition to those discussed above. For parallel code containing reductions, changes in parallel decomposition are likely to produce changes in the order of operations and hence variations in results. For a fixed decomposition, the Intel compiler and run-time provide a mechanism to ensure reproducible results from OpenMP reductions, subject to static scheduling and a fixed number of threads, by setting the environment variable KMP_DETERMINISTIC_REDUCTION to ‘true’ (or ‘yes’ or ‘on’ or ‘1’). This is the default for large numbers of threads, currently >4 on Intel64 processors or >8 on Intel Xeon Phi coprocessors.
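A minimal sketch of an OpenMP reduction that meets the conditions above might look as follows. The function name sum_static() is hypothetical. The sketch assumes that the program is built with the compiler's OpenMP option and that KMP_DETERMINISTIC_REDUCTION=true and a fixed OMP_NUM_THREADS are set in the environment before each run.

```c
/* Sum n values with an OpenMP reduction over a static decomposition.
   schedule(static) assigns the same iterations to the same thread on
   every run, provided the number of threads does not change.
   When built without OpenMP support, the loop simply runs serially. */
double sum_static(const double *x, long n)
{
    double sum = 0.0;
    long i;

#ifdef _OPENMP
    #pragma omp parallel for schedule(static) reduction(+:sum)
#endif
    for (i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}
```

Changing the thread count or switching to a dynamic schedule changes how the partial sums are formed, so results from such runs should not be expected to match bit for bit.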
The Intel® Math Kernel Library supports run-to-run reproducibility under related conditions. These include consistent alignment of input data (64-byte alignment gives best performance); a constant number of threads; static OpenMP scheduling; and KMP_DETERMINISTIC_REDUCTION=true.
Intel® Threading Building Blocks support a new function, parallel_deterministic_reduce(), that can give reproducible results for parallel reductions, even for a variable number of threads, at some potential cost in performance. The Intel® Composer XE 2013 product documentation contains more detail.
Intel® Cilk™ Plus threading keywords do not currently ensure run-to-run reproducibility of parallel reductions.
Applications built using MPI are not discussed here, but similar considerations may apply.
There is no way to ensure bit-for-bit reproducibility between code executed on Intel® Xeon processors and code executed on Intel® Xeon Phi™ coprocessors, even for fixed numbers of threads or for serial code. For a discussion of potential differences and of ways to minimize them, along with other aspects of floating-point calculations on Intel Xeon Phi coprocessors, see the article https://secure-software.intel.com/sites/default/files/article/326703/floating-point-differences-sept11.pdf linked as “Floating-point Differences” under “Programming” at http://software.intel.com/en-us/mic-developer.
A program that runs on an Intel Xeon host and offloads to one or more Intel Xeon Phi coprocessors is expected to be reproducible provided it adheres to the conditions described above, and the same sections of code are offloaded to the same coprocessor from one run to the next. If a section of code is sometimes offloaded and sometimes run on the Intel Xeon host processor, it will most likely not be reproducible.
The run-to-run variations in floating-point results discussed above are generally very small, typically in the last bit of the mantissa for individual operations, so of order 1 part in 10^7 for single precision and 1 part in 10^16 for double precision. These reflect the finite precision of floating-point arithmetic; one result is typically not more “correct” than another. These variations may become significant if there are large cancellations (of which the user is sometimes unaware), or if the user depends on bit-for-bit agreement for reasons of quality control. In either case, the variations in results may be a warning that the true numerical uncertainty in the result may be larger, sometimes a lot larger, than the user realizes.
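To make these magnitudes concrete, the sketch below measures the relative difference between two results that differ only in the last bit, before and after a cancellation. The helper rel_diff() and the values RUN_A and RUN_B are hypothetical, chosen for illustration. FLT_EPSILON and DBL_EPSILON from <float.h> are about 1.2e-7 and 2.2e-16 on IEEE 754 systems.

```c
#include <float.h>

/* Two hypothetical results that differ only in the last bit near 1.0. */
static const double RUN_A = 1.0;
static const double RUN_B = 1.0 + DBL_EPSILON;

/* Relative difference |a - b| / max(|a|, |b|); 0 when both are zero. */
double rel_diff(double a, double b)
{
    double diff  = a > b ? a - b : b - a;
    double mag_a = a < 0.0 ? -a : a;
    double mag_b = b < 0.0 ? -b : b;
    double mag   = mag_a > mag_b ? mag_a : mag_b;

    return mag == 0.0 ? 0.0 : diff / mag;
}
```

Here rel_diff(RUN_A, RUN_B) is just below DBL_EPSILON, but after subtracting a reference value of 1.0 from both results, rel_diff(RUN_A - 1.0, RUN_B - 1.0) is exactly 1.0: the cancellation turns a last-bit variation into a 100% relative difference.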
For many purposes, any result within the expected computational uncertainty may be acceptable. When comparing results, whether between different runs, different optimization levels, different software versions, different processors or different architectures, users are encouraged to test for agreement within some reasonable tolerance, rather than seeking bit-for-bit identical results.
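One way to test for agreement within a tolerance is sketched below. The function agree_within() is a hypothetical helper, and suitable values for rel_tol and abs_tol depend on the application and have to be chosen by the user.

```c
#include <stdbool.h>

/* Return true if a and b agree to within a relative tolerance (scaled by
   the larger magnitude) or an absolute tolerance (useful near zero).
   Comparisons involving NaN return false. */
bool agree_within(double a, double b, double rel_tol, double abs_tol)
{
    double diff, mag_a, mag_b, mag;

    if (a == b)                      /* also covers equal infinities */
        return true;
    diff  = a > b ? a - b : b - a;
    mag_a = a < 0.0 ? -a : a;
    mag_b = b < 0.0 ? -b : b;
    mag   = mag_a > mag_b ? mag_a : mag_b;
    return diff <= abs_tol || diff <= rel_tol * mag;
}
```

For example, agree_within(1.0, 1.0 + 0x1p-50, 1e-12, 0.0) is true, while agree_within(1.0, 1.001, 1e-12, 0.0) is false. The absolute tolerance matters when the expected result is close to zero, where a purely relative test would reject almost any difference.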
For a more detailed discussion of general reproducibility issues, including tradeoffs between reproducibility, accuracy and performance, see the online article attached at http://software.intel.com/en-us/articles/consistency-of-floating-point-results-using-the-intel-compiler/