If I rerun the identical program on the identical input data on an identical processor, will I get an identical result?
The only known source of variability in floating-point results for serial code is variations in data alignment that are triggered by changes in the external environment. For example, a program might dynamically allocate a new string to contain the date. The size of this string could vary for different days of the week or months of the year, which in turn could affect the alignment of subsequently allocated data. Alignment variations can change which loop iterations are part of the “peel loop” (loop prolog) or “remainder loop” (loop epilog), and which are part of the vectorized kernel. (See https://software.intel.com/videos/getting-the-most-out-of-the-intel-compiler-with-new-optimization-reports for an example and optimization report illustrating peel and remainder loops.) This in turn changes the order of operations in a floating-point reduction; the rounding of different intermediate results to the declared precision can lead to small variations in the final result. Other differences in optimization between the peel loop, remainder loop and kernel can cause similar variations. Such variations can be prevented in one of three ways:
1) By explicitly aligning any data that are allocated at run-time. For C/C++ programs, this can be done using aligned versions of malloc/calloc/free etc., such as _mm_malloc() and _mm_free(), or the POSIX equivalents (for example posix_memalign()), or the aligned_new feature of C++11. For Fortran programs, !DIR$ ATTRIBUTES ALIGN:64 :: arrayname directives may be used, or simply the command line switch -align array64byte (Linux* or OS X*) or /align:array64byte (Windows*). Aligning data to 64-byte boundaries may also have performance benefits.
2) By compiling with -fp-model precise (Linux and OS X) or /fp:precise (Windows). This disables vectorization of reductions and certain other optimizations that may lead to variations in floating-point results, not just from run to run, but also between different optimization levels. There is a potential impact on performance that should be measured, since it can be very application-dependent. Some optimizations that are not expected to impact run-to-run reproducibility can be re-enabled using options such as -ftz, -fast-transcendentals, -no-prec-div and -no-prec-sqrt (/Qftz, /Qfast-transcendentals, /Qprec-div- and /Qprec-sqrt- on Windows).
3) By compiling with -qno-opt-dynamic-align (/Qopt-dynamic-align- on Windows) using the Intel® Compiler version 15 or later. This disables only those optimizations for which results are sensitive to run-time variations in data alignment. The impact on performance is typically much less than for -fp-model precise (/fp:precise).
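As a minimal sketch of option 1 in C, assuming a C11 library that provides aligned_alloc(), run-time data can be placed on a 64-byte boundary as shown here. The helper name alloc_aligned_doubles and the ALIGNMENT macro are hypothetical and not part of any Intel API; with the Intel compiler, _mm_malloc() and _mm_free() serve the same purpose.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT 64  /* bytes; the 64-byte boundary suggested in option 1 */

/* Hypothetical helper: allocate room for n doubles on a 64-byte boundary.
   C11 aligned_alloc() expects the size to be a multiple of the alignment,
   so the byte count is rounded up. Release the memory with free().
   Overflow of n * sizeof(double) is not checked in this sketch. */
double *alloc_aligned_doubles(size_t n)
{
    size_t bytes = n * sizeof(double);
    size_t rounded = (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    if (rounded == 0) {
        rounded = ALIGNMENT;  /* avoid a zero-size request */
    }
    double *p = aligned_alloc(ALIGNMENT, rounded);
    /* On success the address must be a multiple of ALIGNMENT. */
    assert(p == NULL || (uintptr_t)p % ALIGNMENT == 0);
    return p;
}
```

Because every array obtained this way starts on the same boundary regardless of earlier allocations, the split between peel loop, kernel and remainder loop should no longer depend on the external environment.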
Parallel applications are subject to further sources of variation in addition to those discussed above. For parallel code containing reductions, changes in parallel decomposition are likely to produce changes in the order of operations and hence variations in results. For a fixed decomposition, the Intel compiler and run-time provide a mechanism to ensure reproducible results from OpenMP reductions, subject to static scheduling and a fixed number of threads, by setting the environment variable KMP_DETERMINISTIC_REDUCTION to ‘true’ (or ‘yes’ or ‘on’ or ‘1’). This is the default for large numbers of threads, currently >4 on Intel64 processors or >8 on Intel Xeon Phi coprocessors.
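The following is a minimal sketch of an OpenMP reduction written to meet the conditions above (static scheduling, with the thread count fixed at launch). The function name sum_static is hypothetical. The code is only an illustration of the loop shape; reproducibility still depends on the run-time settings described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: sum an array with an OpenMP reduction.
   schedule(static) keeps the assignment of iterations to threads fixed
   for a given thread count. If OpenMP is not enabled at compile time,
   the pragma is ignored and the loop runs serially. */
double sum_static(const double *x, long n)
{
    assert(x != NULL || n == 0);
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < n; i++) {
        sum += x[i];
    }
    return sum;
}
```

A run would then fix the number of threads and request deterministic reductions through the environment, for example by setting OMP_NUM_THREADS to a constant value and KMP_DETERMINISTIC_REDUCTION to true before launching the program.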
The Intel® Math Kernel Library supports run-to-run reproducibility under related conditions. These include a constant number of threads, static OpenMP scheduling, and KMP_DETERMINISTIC_REDUCTION=true. The user should either call the function mkl_cbwr_set() or set the environment variable MKL_CBWR. See the section "Obtaining Numerically Reproducible results" in the Intel Math Kernel Library User's Guide.
Intel® Threading Building Blocks support a new function, parallel_deterministic_reduce(), that can give reproducible results for parallel reductions, even for a variable number of threads, at some potential cost in performance. The Intel® Composer XE 2015 product documentation contains more detail.
Intel® Cilk™ Plus threading keywords do not currently ensure run-to-run reproducibility of parallel reductions.
Applications built using MPI are not discussed here, but similar considerations may apply.
Applications that are built with OpenMP but that do not contain parallel reductions may give results that vary when the number of OpenMP threads is changed. This is because the decomposition changes which iterations get assigned to the vectorized kernel and which get assigned to peel and remainder loops, which may be optimized differently. Such variations can be prevented by compiling with -fp-model precise (/fp:precise on Windows). They can also be prevented, at a lesser cost in performance, by compiling instead with -qno-opt-dynamic-align (/Qopt-dynamic-align- on Windows), selecting static OpenMP scheduling and choosing an OpenMP chunk size that is a multiple of the SIMD vector length. For example, when targeting Intel® Streaming SIMD Extensions (Intel® SSE), the chunk size for a loop over doubles should be a multiple of 2 (16 bytes SIMD width / sizeof(double)). When targeting Intel® Advanced Vector Extensions (Intel® AVX), the chunk size for a loop over floats should be a multiple of 8 (32 bytes SIMD width / sizeof(float)). When targeting Intel® Xeon Phi™ coprocessors, the chunk size for a loop over doubles should be a multiple of 8 (64 bytes SIMD width / sizeof(double)). Some SIMD programming extensions, such as Intel® Cilk™ Plus, may allow the SIMD vector length to be specified explicitly. Specifying the chunk size in this way ensures that the same iterations get executed in peel and remainder loops, even when the number of threads is varied.
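The chunk-size arithmetic above can be sketched in C as follows. The helper names simd_lanes, round_up_chunk and scale_in_place are hypothetical, and the SIMD width in bytes is supplied by the caller rather than detected; this is an illustration under those assumptions, not a prescribed implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements that fit in one SIMD vector, e.g.
   16 / sizeof(double) = 2 for Intel SSE. */
size_t simd_lanes(size_t simd_width_bytes, size_t elem_size)
{
    assert(elem_size > 0 && simd_width_bytes % elem_size == 0);
    return simd_width_bytes / elem_size;
}

/* Round a requested chunk size up to the next multiple of the lane count. */
size_t round_up_chunk(size_t requested, size_t lanes)
{
    assert(lanes > 0);
    return (requested + lanes - 1) / lanes * lanes;
}

/* Element-wise loop (no reduction) with static scheduling and an explicit
   chunk size. The caller is expected to pass a chunk that is a multiple of
   the SIMD vector length. Without OpenMP the pragma is ignored. */
void scale_in_place(double *x, long n, double a, int chunk)
{
    assert(chunk > 0);
    #pragma omp parallel for schedule(static, chunk)
    for (long i = 0; i < n; i++) {
        x[i] *= a;
    }
}
```

For instance, for doubles with a 32-byte SIMD width, a requested chunk of 1001 iterations would be rounded up to 1004 by round_up_chunk(1001, simd_lanes(32, sizeof(double))) before being passed to scale_in_place.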
There is no way to ensure bit-for-bit reproducibility between code executed on Intel® Xeon processors and code executed on Intel® Xeon Phi™ coprocessors, even for fixed numbers of threads or for serial code. For a discussion of potential differences and of ways to minimize these, along with other aspects of floating-point calculations on Intel Xeon Phi coprocessors, see the article https://secure-software.intel.com/sites/default/files/article/326703/floating-point-differences-sept11.pdf linked as “Differences in Floating-point Arithmetic” under “Programming” at http://software.intel.com/mic-developer.
A program that runs on an Intel Xeon host and offloads to one or more Intel Xeon Phi coprocessors is expected to be reproducible provided it adheres to the conditions described above, and the same sections of code are offloaded to the same coprocessor from one run to the next. If a section of code is sometimes offloaded and sometimes run on the Intel Xeon host processor, it will most likely not be reproducible.
The run-to-run variations in floating-point results discussed above are in general very tiny, typically in the last bit of the mantissa for individual operations, so of order 1 part in 10^7 for single precision and 1 part in 10^16 for double precision. These reflect the finite precision of floating-point arithmetic; one result is typically not more “correct” than another. These variations may become significant if there are large cancellations (of which the user is sometimes unaware), or if the user depends on bit-for-bit agreement for reasons of quality control. In either case, the variations in results may be a warning that the true numerical uncertainty in the result may be larger, sometimes a lot larger, than the user realizes.
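To illustrate how a change in the order of a reduction interacts with a large cancellation, here is a minimal C sketch. It compares a strictly sequential sum with a sum that keeps two interleaved partial sums, roughly as a 2-lane vectorized kernel would. The function names are hypothetical and the input is deliberately contrived; typical data show far smaller differences.

```c
#include <assert.h>
#include <stddef.h>

/* Strictly left-to-right summation. */
double sum_sequential(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += x[i];
    }
    return sum;
}

/* Two interleaved partial sums, combined at the end. This mimics the
   order of operations of a 2-lane SIMD reduction; it is not vector code. */
double sum_two_lanes(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double lane0 = 0.0, lane1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        lane0 += x[i];
        lane1 += x[i + 1];
    }
    if (i < n) {
        lane0 += x[i];  /* remainder iteration for odd n */
    }
    return lane0 + lane1;
}
```

For the input {1e20, 1.0, -1e20, 1.0} in IEEE 754 double precision, compiled without value-changing floating-point optimizations, sum_sequential returns 1.0 because the first 1.0 is lost when added to 1e20, while sum_two_lanes returns 2.0 because the two large terms cancel within one lane. Neither ordering is wrong by construction; the disagreement signals that the sum is ill-conditioned.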
For many purposes, any result within the expected computational uncertainty may be acceptable. When comparing results, whether between different runs, different optimization levels, different software versions, different processors or different architectures, users are encouraged to test for agreement within some reasonable tolerance, rather than seeking bit-for-bit identical results.
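A tolerance-based comparison of the kind recommended above might look like the following C sketch. The function name approx_equal and the combination of a relative and an absolute tolerance are assumptions made for illustration; suitable tolerance values depend on the application.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>

/* Absolute value without <math.h>, so the sketch links without -lm. */
static double abs_d(double x)
{
    return x < 0.0 ? -x : x;
}

/* Hypothetical helper: true if a and b agree within a relative tolerance
   (scaled by the larger magnitude) or within an absolute tolerance, which
   matters for values near zero. NaN never compares equal; infinities are
   equal only to themselves. */
bool approx_equal(double a, double b, double rel_tol, double abs_tol)
{
    assert(rel_tol >= 0.0 && abs_tol >= 0.0);
    if (a == b) {
        return true;                      /* exact match, incl. equal infinities */
    }
    if (abs_d(a) > DBL_MAX || abs_d(b) > DBL_MAX) {
        return false;                     /* one side infinite, the other not equal */
    }
    double diff = abs_d(a - b);
    double scale = abs_d(a) > abs_d(b) ? abs_d(a) : abs_d(b);
    double tol = rel_tol * scale;
    if (tol < abs_tol) {
        tol = abs_tol;
    }
    return diff <= tol;                   /* false if either input is NaN */
}
```

Two runs of the same program could then be compared with a call such as approx_equal(result_a, result_b, 1e-12, 0.0), where result_a and result_b stand for the values being checked and the tolerance is chosen from the expected computational uncertainty.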
For a more detailed discussion of general reproducibility issues, including tradeoffs between reproducibility, accuracy and performance, see the online article attached at http://software.intel.com/articles/consistency-of-floating-point-results-using-the-intel-compiler/