-fp-model or -fltconsistency switch necessary to avoid wrong results

Hello,

I've run into a problem. If I don't use the '-fp-model precise' or '-fltconsistency' switch, I get an extremely wrong (not just slightly inaccurate) result from my calculation: I expected a small positive number around 1 and got approximately -3000.
All the calculations are programmed to use DP = kind(1D0) reals.

I've been using the latest ifort 11.0.038 + MKL on an IA-32 Pentium 4 (single-processor) machine, but the same problem exists with ifort 11.0.069 on the same machine, and also with ifort 10.1 on a 4-processor Xeon machine.

My observations (ifort 11.0.038 + MKL on the IA-32 Pentium 4, single processor):
no switch whatsoever --- problem
-fltconsistency --- OK
-fp-model precise --- OK
-IPF-fltacc --- problem
-g --- OK
-align -xHost --- problem
-align -xHost -fltconsistency --- OK
-align -xHost -fp-model precise --- OK
-fast --- doesn't compile
-O0 --- problem

The situation with ifort 11.0.069 on the same machine is the same.

On the 4-processor Xeon with ifort 10.1 the situation is different:
no switch --- OK
whenever -xN is used (-xHost doesn't exist in 10.1) --- problem (regardless of -fp-model and/or -fltconsistency)

Could anybody comment on this behaviour? I'd really appreciate it.
Thanks a lot!
Ruda

PS.:
/proc/cpuinfo from my machine:
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel Pentium 4 CPU 2.40GHz
stepping : 7
cpu MHz : 2399.993
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe up pebs bts
bogomips : 4805.21
clflush size : 64

/proc/cpuinfo from the xeon machine:
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel Xeon CPU 3.20GHz
stepping : 5
cpu MHz : 3189.501
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips : 6381.50

processor : 1
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel Xeon CPU 3.20GHz
stepping : 5
cpu MHz : 3189.501
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips : 6377.81

processor : 2
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel Xeon CPU 3.20GHz
stepping : 5
cpu MHz : 3189.501
cache size : 2048 KB
physical id : 3
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips : 6377.91

processor : 3
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel Xeon CPU 3.20GHz
stepping : 5
cpu MHz : 3189.501
cache size : 2048 KB
physical id : 3
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips : 6377.91

lines from the makefile:
ifort $(options) \
precisiondef.f90 allconsts.f90 itrasoncbl.modules.f90 \
gscr_linked_mod.f90 \
gscr_mod.f90 \
cinv_mod.f90 \
$@.f90 \
-lmkl_lapack95 -lmkl_blas95 \
-lmkl_intel -lmkl_intel_thread -lmkl_core \
-liomp5 -lpthread \
-o $@

Interesting. Try these options:

-O0 -nolib-inline

And if this is OK, step up -O to higher levels, keeping -nolib-inline.

As a general sanity check, try these and look for runtime error messages:

-O2 -xHost -g -traceback -check all -fp-stack-check

ron

Quoting - Ronald Green (Intel)

I did and got:
-O0 -nolib-inline --- OK
-O1 -nolib-inline --- OK
-O2 -nolib-inline --- problem; the interesting number is about -3952 instead of about 1
-O3 -nolib-inline --- problem; the interesting number is about -27 instead of about 1

-O2 -xHost -g -traceback -check all -fp-stack-check --- problem; the number now is about 8

(I also get a lot of warnings like:
forrtl: warning (402): fort: (1): In call to GSCR_LINKED_ALLOC_ATOM, an array temporary was created for argument #4
--- I've seen these warnings before and added noarg_temp_created to -check; I discussed this some time ago on this forum and I think it has nothing to do with the current problem.)

All these are reproducible.
Thanks

Ruda

Quoting - rudolfii

OK, we're making progress. The -nolib-inline result tells us that a vectorized intrinsic was at least partially responsible for the failures at -O0 and -O1. (This option prevents vectorized intrinsics; yes, they are used even at -O0. We are reviewing this decision for a future release.) So keep -nolib-inline for all tests and runs from now on.
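
To make that concrete, a loop like the following (just an illustrative sketch, not your code) is the kind of place where the compiler may substitute a vectorized math-library version of the EXP intrinsic unless -nolib-inline is given:

subroutine apply_exp(n, x, y)
  integer, intent(in)  :: n
  real(8), intent(in)  :: x(n)
  real(8), intent(out) :: y(n)
  integer :: i
  do i = 1, n
     ! the EXP intrinsic here may be expanded into a call to a vectorized library routine
     y(i) = exp(x(i))
  end do
end subroutine apply_exp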

Now we see that at -O2 we start to see issues again. -O2 enables vectorization. We can test this hypothesis with:

-O2 -nolib-inline -no-vec (you must use the 11.0 compiler for this option)

I will bet this gives you the precision you need. However, it has a big impact on performance. You did not say so, but my guess is that you wish to balance performance with precision for this numerically sensitive code, yes? It's obvious your code is sensitive to vectorization, so let's see how we can get the best performance with the precision your algorithm requires. Here are some tests:

-O2 -nolib-inline -fp-model precise
-O2 -nolib-inline -fp-model source
-O2 -nolib-inline -assume protect_parens
-O2 -nolib-inline -prec-div
-O2 -nolib-inline -fp-speculation=off

Next, do you know if you generate denormalized values, and do you consider denormal values to be 'valid'? (See the -ftz option in the documentation.) Try this to find out:

-O2 -nolib-inline -fpe0 -g -traceback

and see if you are getting any FP exceptions in the code. If you really think you need the denormals, you can try

-O2 -nolib-inline -no-ftz ! use with caution: can cause a two-orders-of-magnitude slowdown if denormals are present!
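
If you want to check directly in the code whether intermediate values go denormal, a small test like this works for doubles (just a sketch, not something from your sources): a value is denormal when it is non-zero but smaller in magnitude than tiny(1d0).

logical function is_denormal(x)
  real(8), intent(in) :: x
  ! non-zero but below the smallest normal double => denormal (subnormal)
  is_denormal = (x /= 0d0) .and. (abs(x) < tiny(x))
end function is_denormal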

Finally, if you want to root-cause the numerically sensitive section of code, you can iteratively compile each source file with -O0 -nolib-inline while keeping the other sources at full optimization. This will help identify which source files are numerically sensitive. From there, you could use -vec-report to identify the vectorized sections. Next, start to disable each vectorized loop by putting the !DEC$ NOVECTOR directive in front of it, then remove the directives one by one until you find all the sensitive regions.
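
For example, placed immediately in front of a loop (this is just an illustrative loop, not your code), the directive looks like this:

subroutine sum_diff(n, a, b, s)
  integer, intent(in)  :: n
  real(8), intent(in)  :: a(n), b(n)
  real(8), intent(out) :: s
  integer :: i
  s = 0d0
!DEC$ NOVECTOR
  do i = 1, n
     s = s + (a(i) - b(i))
  end do
end subroutine sum_diff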

It's a bit of an iterative process, but if you want to get to the heart of the numerical sensitivity, this is one way to do so.

ron

I usually set -assume protect_parens -prec-div -prec-sqrt, as those seldom show any reduction in performance on current CPUs, but do avoid many problems such as the one you report. Those options would normally be included in -fp-model source or -fp-model precise, which also set -ftz. Setting or unsetting -ftz has an effect only when used in the compilation of the main program.

Quoting - Ronald Green (Intel)

OK, I followed your suggestions, with these results:

-O2 -nolib-inline -no-vec --- you'd lose your bet: problem: got 1.24713... instead of 1.24711...
-O2 -nolib-inline -fp-model precise --- OK
-O2 -nolib-inline -fp-model source --- OK
-O2 -nolib-inline -assume protect_parens --- problem: -3952
-O2 -nolib-inline -prec-div --- problem: the same -3952
-O2 -nolib-inline -fp-speculation=off --- problem: the same -3952

Next, do you know if you generate denormalized values, and do you consider denormal values to be 'valid'? (See the -ftz option in the documentation.) Try this to find out:

-O2 -nolib-inline -fpe0 -g -traceback --- problem: the same -3952 and no FP exceptions

-O2 -nolib-inline -no-ftz --- problem: the same -3952

I'll try to find out more, but could you briefly explain to me what is probably happening?

Thanks
Ruda

I take it you're telling us that -fp-model source improves accuracy over -no-vec, but that a vectorization step which is omitted with that option breaks the results entirely? What about -assume protect_parens -no-vec?
I think you were given suggestions already about how to turn off individual vectorization steps loop by loop.
If you're not satisfied with simply using -fp-model source, you could append various options to see if any of them break your code or improve performance.

-fp-model source is the one option which as far as possible removes all optimizations which may violate Fortran or IEEE standards, while permitting the others. -fp-model source -ftz is a frequent choice as a base option for all subroutines which don't need additional optimization.

Quoting - tim18
What about -assume protect_parens -no-vec?

-O2 -nolib-inline -assume protect_parens -no-vec --- problem: the same as previously reported for
-O2 -nolib-inline -no-vec --- problem: got 1.24713... instead of 1.24711...

Let me also explain something. I and the people around me are physicists. I am happy if a computer gives a good result. I like optimizations, but only as long as they don't spoil the results. And here I see something bad (and I am actually lucky that I noticed it at all: it was one really bad result in a heap of millions of presumably good numbers, and only because it was so bad could I notice it. If it were just slightly wrong, like 1.24713 instead of 1.24711, I wouldn't notice it at all, and that could lead to ideas that would simply be wrong).

I just want the compiler to do only optimizations that are absolutely SAFE, and never go beyond that. I do not know exactly what those optimizations do, but I see daunting results. It should not be a lottery with options; I want no loss of precision as far as that can be achieved (and here we witness a monstrous loss of precision, if it can still be called precision...).

It is a must for me to understand what's going on here, whether it is something in our code that is, at a particular point, somehow unstable (and how), or something else.
So far I do not even know a reliable way to ensure that the results are the same across several computers and/or versions of ifort (I mentioned that the Xeon with ifort 10.1 behaves differently from my machine).

Further, many people around me simply expect that just running the compiler without options is safe (or that it is safe when optimizations are turned off with -O0). As it seems, it is not.

Thanks a lot!
Ruda

Quoting - rudolfii

I just want the compiler to do only optimizations that are absolutely SAFE, and never go beyond that.

That's the purpose of -fp-model source, which you said was working for you.

Quoting - tim18

That's the purpose of -fp-model source, which you said was working for you.

In my first post I said that '-fp-model precise' produces a good result on my machine (where I combined it with -xHost) BUT has no effect on the Xeon machine (where it was combined with -xN). So it is probably not really safe (or at least wasn't for ifort 10.1).

Ruda

Quoting - rudolfii

For the 10.1 compiler on the older Xeon: you will have to use -fp-model precise and not use the -x vectorization options.

There are certain classes of code where accuracy is far more important than performance; from what you say, this seems to be your case. Orbital mechanics and high-energy physics are classic examples. These codes typically deal with very large and very small floating-point values, and there the order of expression evaluation becomes critical. A somewhat classic example:

do while (residual < tolerance)   ! many, many iterations
   residual = A(i) - B(i) + residual
   ! ... use residual in the calculation of the next timestep
end do

where A(i) and B(i) are approximately equal and very large, and residual is a small value near zero. As residual approaches the tolerance (something around 1), the algorithm exits.

Now, if you compute (A(i) - B(i)) first and then add residual, there are no problems.
But what if you compute B(i) + residual first, where B(i) >> residual? The answer can come out as just B(i) because of round-off. Now subtract that from A(i) and the contribution of residual is lost. You may even have coded (A(i) - B(i)) + residual; higher optimization levels are allowed to ignore the parentheses.

Do you see the difference? Mathematically the reassociation is equivalent, but not in floating-point arithmetic.

This is but one example of MANY cases where floating-point arithmetic is not equivalent to "real" mathematics. The vectorization optimizations can and do change the order of expression evaluation. For some codes this is not a problem. But again, if you have very small numbers and very large numbers, the order can be very important. Not knowing your code, I cannot conjecture what is making it numerically sensitive.
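
Here is a tiny self-contained sketch (made-up numbers, not your code) of what that round-off looks like in double precision; whether the compiler actually honors the parentheses depends on the -fp-model setting, which is the whole point:

program reassoc_demo
  integer, parameter :: dp = kind(1d0)
  real(dp) :: a, b, r
  a = 1.0e16_dp + 2.0_dp   ! large value
  b = 1.0e16_dp            ! nearly equal large value
  r = 0.5_dp               ! small residual
  print *, (a - b) + r     ! 2.5 in IEEE arithmetic: the small difference survives
  print *, a - (b + r)     ! 2.0 in IEEE arithmetic: r is lost when added to the huge b first
end program reassoc_demo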

Thus, it becomes a balance between accuracy and performance. Your balance seems to lean heavily towards accuracy.

ron

Quoting - Ronald Green (Intel)
do while (residual < tolerance)   ! many, many iterations
   residual = A(i) - B(i) + residual
   ! ... use residual in the calculation of the next timestep
end do

Well, I've been trying to find out whether something like that happens in our code. Meanwhile:
what if the code you mention is split across lines? Is it still optimized in a way that lets the problems appear? I mean:
help = A(i) - B(i)
help = help + residual

Thanks
Ruda

Quoting - rudolfii

I did not intend for you to start hunting for computations with this type of expression. This is one of many many possible cases where optimizations and vectorizations can cause differences. And I do not recommend changing your code to accommodate one compiler or another - the best long term strategy is to write straightforward code that is readable and maintainable.

The only way to truly tell where the numeric sensitivity is located is the method I outlined earlier: narrowing down to find the (hopefully) one source file, then diving into that source file and finding the vectorized loops (using -vec-report), and then disabling those loops with the !DEC$ NOVECTOR directive.

There are many, many things that could be the cause: alignment of data, computations using mixes of REAL(4) and REAL(8) in expressions, constants without explicit type declarations (1.234567890123 vs 1.234567890123_8, for example), the use of vectorized intrinsics as we explained earlier, amongst many other possibilities or combinations of these.
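
For instance, here is a small sketch (not from your code) of the default-kind constant pitfall:

program const_kind_demo
  integer, parameter :: dp = kind(1d0)
  real(dp) :: x, y
  x = 1.234567890123       ! default (single-precision) literal: only ~7 digits survive
  y = 1.234567890123_dp    ! double-precision literal
  print *, y - x           ! non-zero difference, roughly of the order of 1e-8
end program const_kind_demo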

ron

Quoting - Ronald Green (Intel)
The only way to truly tell where the numeric sensitivity is located is the method I outlined earlier: narrowing down to find the (hopefully) one source file, then diving into that source file and finding the vectorized loops (using -vec-report), and then disabling those loops with the !DEC$ NOVECTOR directive.

I now have the !DEC$ NOVECTOR directive in front of all do loops.
Yet I still get a wrong result for
-O2 -nolib-inline --- problem: about -22 instead of ~ +1
while a correct one for
-O0 -nolib-inline
or
-O1 -nolib-inline

Could you give me a suggestion on how to proceed?

There are many, many things that could be the cause: alignment of data, computations using mixes of REAL(4) and REAL(8) in expressions, constants without explicit type declarations (1.234567890123 vs 1.234567890123_8, for example)

All reals and complexes are _DP, and all constants are _DP, where
integer, public, parameter :: DP = kind(1D0)

Thanks
Ruda
