Different results with 32 bit and 64 bit executables from Fortran compiler


john-l-ditter

Here is my situation:

- ifort version 11.1.048

- All operations are done in double precision in both the 32-bit and 64-bit builds.

- Compiler options for the 32-bit build:

-c /w90 /w95 /cm -DP4 -DWIN32 /Qfpp /MD /4Yportlib /O3 /fp:source /Qip /Qopenmp

- Compiler options for the 64-bit build:

-c /w90 /w95 /cm -DP4 -DWIN32 -DAMD64_WIN /Qfpp /MD /4Yportlib /O3 /fp:source /Qip /Qopenmp

Differences in the 15th place after the decimal point creep in, and over time the results become completely different.

I have been printing numbers in a very long format to see each number as it exists in memory; for example, I get these:

64 bit:
0.93049996167787141221339197727502323690000000000000E+00

32 bit:
0.93049996167787196732490428985329344870000000000000E+00

As you can see, in the 15th digit I end up with a 1 in 64-bit and, after rounding, a 2 in 32-bit. This is despite the fact that I am using /fp:source.
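For reference, here is a minimal sketch of the kind of dump involved; the kind parameters and the transfer-based hex print are illustrative, not my exact code:

program show_bits
  implicit none
  integer, parameter :: dp = selected_real_kind(15, 307)
  integer, parameter :: i8 = selected_int_kind(18)
  real(dp) :: x
  x = 0.93049996167787141221339197727502323690_dp
  write (*, '(E50.40)') x                  ! long decimal form, as printed above
  write (*, '(Z16.16)') transfer(x, 0_i8)  ! raw 64-bit pattern in hex
end program show_bits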

Any help would be appreciated.

Steve Lionel (Intel)

Without seeing an example that reproduces the difference, one can only speculate, but any slight difference in choice of instructions or order of operations can cause such differences, and changing from a 32-bit to 64-bit platform, where there are more registers and more instructions, can be a factor. /fp:source certainly helps but it is not a guarantee of bit-for-bit same results.

Steve
john-l-ditter

Thanks, Steve.

I am building the same source. I have tried compiling on a truly 32-bit platform as well as compiling on a 64-bit platform in the IA32 environment of IVF: identical differences in results.

john-l-ditter

What can I provide to help you make a more educated recommendation?

jimdempseyatthecove

Also, I notice you are using /Qopenmp.
When a program runs multi-threaded, the execution sequence between threads is not necessarily deterministic.
The precise results may vary depending on the number of threads and/or their completion order.
If your code has more than 2 threads and is performing reductions, the results may also vary:
A+(B+(C+D))
where A, B, C, D are inexact partial sums from 4 threads, is not necessarily equal to
((A+B)+C)+D
or
(A+B)+(C+D)
or
(A+C)+(B+D)
...
(all permutations; a runnable sketch follows)
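A minimal sketch of this non-associativity (the four values are contrived so the groupings disagree; any IEEE double-precision compiler that honours parentheses should show the effect):

program assoc_demo
  implicit none
  integer, parameter :: dp = selected_real_kind(15, 307)
  real(dp) :: a, b, c, d
  a = 1.0e16_dp
  b = -1.0e16_dp
  c = 1.0_dp
  d = 1.0e-16_dp
  ! the same four addends, grouped two ways, give different sums
  print *, 'A+(B+(C+D)) =', a + (b + (c + d))   ! typically 0.0
  print *, '((A+B)+C)+D =', ((a + b) + c) + d   ! typically 1.0
end program assoc_demo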

Jim Dempsey

www.quickthreadprogramming.com
john-l-ditter

Good point, Jim. I should have pointed out that although I compile the code for parallel threads, I am not using them in this case. The code used here is completely serial, with no OMP directives. Would /Qopenmp make a difference here?

john-l-ditter

Additionally, I have tried /O0 as well as /fp:strict with no change in the results :-(

jimdempseyatthecove

Look at Steve's last post. x32 and x64 have different numbers of SSE (and/or AVX) registers as well as different numbers of general-purpose registers. These differences may result in different code generation, due to the different numbers of FP temporaries (in registers) and the different numbers of indexing registers. The different code sequence may also alter the order in which the reductions are made (the ABCD thing mentioned above). Though /fp:source will attempt to reduce these differences, the /O3 optimizations may work against it. Should this difference be absolutely critical (e.g. non-convergence, or not matching exact values approved by an external authority), then consider using /O0 for the particular source files exhibiting the differences (or fix your code to converge, or have the "authority" consider a change in the approved results).

As it stands now, you are experiencing

(approximation x32) != (approximation x64)

Or, to say it another way,

(not exact value) != (other not exact value)

If the situation is that you are experiencing a non-convergence issue or a gain explosion/implosion with x64, then your x32 code likely succeeded by chance and not by design.

If the situation is that you are not producing an approved set of results, then the "authority" that defined the approved set of results defined a set of inexact numbers to comply with.

Jim Dempsey

www.quickthreadprogramming.com
john-l-ditter

Hmmm...I tried /O0 with no luck.

About the results: we are simulating physics, and as these differences accumulate, the resulting time history from each executable becomes completely different. Merely summing an array of 100,000 elements is enough to push this 1e-15 difference into the 1e-10 range, and then, over 100,000 time steps, the solution deviates by 1e-5. The picture below shows the evolution of time-step size for the 32-bit and 64-bit executables running the same code on the same simulation case.

Attachment: untitled-1.png (9.78 KB)
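As an aside, a standard way to keep a long sum from amplifying round-off like this is compensated (Kahan) summation; the sketch below is illustrative, not the code under discussion, and it relies on the compiler honouring the parentheses (which /fp:source does):

function kahan_sum (x) result (s)
  implicit none
  integer, parameter :: dp = selected_real_kind(15, 307)
  real(dp), intent(in) :: x(:)
  real(dp) :: s
  real(dp) :: c, y, t
  integer :: i
  s = 0.0_dp
  c = 0.0_dp                ! running compensation for lost low-order bits
  do i = 1, size(x)
     y = x(i) - c           ! re-inject the part lost on the previous step
     t = s + y              ! big + small: low-order bits of y are dropped
     c = (t - s) - y        ! recover the dropped part (algebraically zero)
     s = t
  end do
end function kahan_sum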
mecej4

Is your problem one involving high sensitivity to slight changes in initial conditions or parameter values? Do you have an estimate of the Lyapunov time (if the concept is applicable to your system), and how does the integration time interval compare to the Lyapunov time?

jimdempseyatthecove

Looking at the graph, in particular how the two lines approximate the same values but in an alternating (crossing) pattern, my educated guess is that the code you use to generate the variable used for your integration interval (time-step size) differs by 1 lsb (alternately). You can verify this by printing the HEX value (Z format) of the integration-interval value. Should this prove to be the same, then the next area to look at is the code that applies the integration interval to the state for advancement. IOW, the summation of the 100,000 results may not be the problem.
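A sketch of the hex comparison Jim suggests; the two literals below are stand-ins for the suspect step-size values, and the transfer to a 64-bit integer is one portable way to get at the bits:

program ulp_distance
  implicit none
  integer, parameter :: dp = selected_real_kind(15, 307)
  integer, parameter :: i8 = selected_int_kind(18)
  real(dp) :: v32, v64
  v32 = 0.93049996167787196732_dp   ! stand-in for the x32 value
  v64 = 0.93049996167787141221_dp   ! stand-in for the x64 value
  write (*, '(A,Z16.16)') 'x32 bits: ', transfer(v32, 0_i8)
  write (*, '(A,Z16.16)') 'x64 bits: ', transfer(v64, 0_i8)
  ! for positive doubles the bit patterns order like the values, so the
  ! integer difference is the distance in lsbs (ulps)
  write (*, '(A,I0)') 'lsb distance: ', transfer(v32, 0_i8) - transfer(v64, 0_i8)
end program ulp_distance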

Jim Dempsey

www.quickthreadprogramming.com
mecej4

It is misleading to compare time-step sizes when running an ODE integrator with automatic error control. Depending on the error tolerances supplied, 32-bit and 64-bit codes (presumably, one is referring to address register size rather than floating point register size here) may use different internal precision (32-bit software may use 80-bit reals with 64-bit significand). Therefore, the step-size picked by the automatic error-control algorithm can be different for the two runs.

A more pertinent comparison would involve the dependent variable values from the two runs at matched independent variable values.

john-l-ditter

Time-step size is controlled by other independent variables; their differences feed back into the time-step size. Here is the kinetic energy: the two runs start pretty close and, as the difference accumulates, they separate...

Attachment: untitled-2.png (15.41 KB)
jimdempseyatthecove

There is a subtle difference between:

dT = ...
T0 = ...
T = T0
...
T = T + dT
...
T = T + dT
...
----------- and -------
dT = ....
Tick = 0.0
T = T0 + dT * Tick
...
Tick = Tick + 1.0
T = T0 + dT * Tick
....
Tick = Tick + 1.0
T = T0 + dT * Tick
....

The latter is superior in situations where dT is relatively small (and approximate in base 2).
Depending on the values chosen for dT, it may be better to work with OOdT (One Over dT) and perform

T = T0 + Tick / OOdT
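
A minimal, self-contained illustration of the two patterns (dT = 0.1 is chosen deliberately because it is inexact in base 2; the names are illustrative):

program time_accum
  implicit none
  integer, parameter :: dp = selected_real_kind(15, 307)
  real(dp), parameter :: dT = 0.1_dp   ! inexact in base 2
  real(dp) :: T_run, Tick
  integer :: i
  T_run = 0.0_dp
  do i = 1, 1000000
     T_run = T_run + dT                ! first form: error grows with step count
  end do
  Tick = 1.0e6_dp
  print *, 'running sum T:', T_run
  print *, 'T0 + dT*Tick :', 0.0_dp + dT * Tick   ! second form: one rounding only
end program time_accum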

Jim Dempsey

www.quickthreadprogramming.com
John Campbell

If the chart in untitled-2.png represents the results of the same calculation for a 32-bit and a 64-bit OS, then it shows that round-off in the real*8 calculations is being swamped by a very small time step. I'd assume that the change in values between time steps is small compared to the accuracy of 15 significant figures available to the accumulators.
I'd expect that the time step is too small for the differences to accumulate accurately. You need to change the way the differences are accumulated so that their significance is not lost, or increase the time step, or use real*16 for some of the key accumulators and calculations.

Even though you have printed the result to 40 significant figures, the digits past 15 are merely random. You should expect that 15 to 16 significant figures are all you can get with real*8.

If the differences graphed result from different round-off between OS32 and OS64, then this highlights how unstable the original difference calculation has been all along. If, for each iteration, you have been adding a very small number to a large accumulator, then the increment has had a much larger error than you previously expected. You must check the expected change of the accumulators at each iteration and confirm what accuracy is being retained:

accumulator = big_number
precision_accumulator = precision (accumulator)
do i = 1, huge_number
   increment = small_value
   accumulator = accumulator + increment
   sig_figures = precision_accumulator - log10 (abs (accumulator/increment))
end do

sig_figures is the number of significant figures of the increment that survive its addition to the accumulator, which gives an estimate of the accuracy of the numerical computation.

There are two different types of iterative calculation:
1) where the iteration is a forward extrapolation and an absolute error check is not available (which appears to be your case);
2) where the absolute error is calculated and there is an attempt to arrive at a closer approximation (these are more stable, as the error is always being re-checked and then updated).

Time-history response calculations are a typical example of extrapolation problems, where the accuracy of the increment calculation needs to be balanced against the size of the time step. Increasing the time step may require a higher-order calculation of the increment (such as changing from assuming constant to linear acceleration within the time interval), while a smaller time step leads to significant round-off errors.

If I am understanding your problem, you need to confirm the accuracy being retained at each iteration. With extrapolation, you are only checking a relative accuracy at each iteration, if you do not do any absolute accuracy check.
John

jimdempseyatthecove

John >> You need to change the way the differences are accumulated so that their significance is not lost

The two methods I mentioned in my prior post correct for this.

Jim Dempsey

www.quickthreadprogramming.com
John Campbell

Jim,

I'm not sure that it is that easy: if the magnitude of the increment (calculated at each iteration) to be added to the accumulator is much smaller than the accumulator, then the precision of the accumulation is very poor.

For example: if the accumulator is 1.e6 and the increment is 1.e-6, then only 15-12 = 3 significant figures of the increment are retained at each iteration. If there is no way of testing the absolute error of the calculation (as for forward extrapolation), then the accumulated error cannot be recovered.

The only way is to make the accumulator real(16), so that the number of significant figures retained is increased back to 15.

Your method is good for calculating the tick more accurately, but not the other values associated with each iteration, such as, say, displacement or velocity. Making the time step smaller improves the accuracy of the increment, but not its accumulation. This has been a classic problem with the coding of Newmark's method for dynamic response calculations.
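A small sketch of this 1.e6 / 1.e-6 example, assuming the compiler provides a quad-precision kind (such as ifort's real(16)); the kinds are requested via selected_real_kind rather than hard-coded:

program sig_loss
  implicit none
  integer, parameter :: dp = selected_real_kind(15, 307)
  integer, parameter :: qp = selected_real_kind(30, 300)
  real(dp) :: acc8
  real(qp) :: acc16
  integer :: i
  acc8  = 1.0e6_dp
  acc16 = 1.0e6_qp
  do i = 1, 1000000
     acc8  = acc8  + 1.0e-6_dp    ! only ~3 significant figures of the increment survive
     acc16 = acc16 + 1.0e-6_qp    ! the full increment precision is retained
  end do
  print *, 'real(8)  accumulator:', acc8    ! drifts from 1000001.0 in the low digits
  print *, 'real(16) accumulator:', acc16   ! essentially 1000001.0
end program sig_loss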
John
