Floating Point Operations on Different Hardware

Floating Point Operations on Different Hardware

I'm having trouble comparing 2 floating point numbers on different hardware.

I have a unit vector defined. I want to take the dot product of the vector with itself. The answer should be 1 since the vectors are identical. I have verified that the magnitude of the vector is 1 to 16 decimal places. I can compile the code on my local machine and I see that the dot product is 0.9999999999999999. Then I deploy the same code to a server and run it and the result is 1.0000000000000002. The end goal is to use the intrinsic ACOS function to get and angle, but since the server value is greater than 1, I end up with NaN.

I had optimizations turned on at first, but after turning all of them off (I think) the problem still exists. I am using the strict floating point model, and as I said, the dll is the exact same. Are there any compiler optoins, or anything I can do to help the situation? If not, does anyone understand why this would happen? I understand that different architectures can give different results but it's strange to me the the intrinsic dot product isn't really working. I'm sure this is a somewhat common question, but I've done some digging and I haven't found a satisfactory answer. I appreciate any help and thanks in advance.

16 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.

Have you checked that your unit vector, as represented in machine numbers, is indeed a unit vector? In fact, your finding that the dot product exceeds 1 indicates that there is a problem in that regard.

actually I have had this problem on different hardware for the last 25 years. If the vectors are unitsed to whaever the precison of the machine but the dot product can still be 1 +/- some machine precsison. if (answer >1.0) answer=1.0 ! given you have validated the data previously.

You leave so many questions open it's hard to guess where to begin.  If it's single precision, you might be using dot_product(dble(a),a).   For simd architectures, optimized dot product will use batching of sums if there are enough of them, which should improve accuracy, but results may vary slightly depending on how many batches are chosen and possibly on data alignment (if you don't set !dir$ vector unaligned).  In /arch: IA32 mode you could set extended precision by /Qpc80.  I guess, since you don't mention it, you haven't considered Kahan summation.  In spite of all that, it's possible that the elements are such that they do sum > 1 even if they are correctly rounded and summed with extra precision.

r = sqrt(v(1)**2+v(2)**2+v(3)**2)
u = v / r

Firstly, any number of the components of v might not be exactly represented by floating point numbers. Take 0.1 for example. The binary representation would require an infinite number of bits. i.e. only an approximation can be held with an error of +1/-1 bit (lsb is off by fraction of bit). Many components can be exactly represented in binary (whole numbers, those fractions that are an inverse of the products of powers of 2 (1/2, 1/4, 3/4, 1/8, ...). Taking the square of an inexact component doubles the error, should all three components be approximations, the sum of squares have sixtuple the error, then taking the sqrt of this might reduce the error somewhat, possibly half. The result being r could potentially be off by 3 x fraction of bit (range of 3 bits to 0 bits). Then dividing the original vector v (holding approximate components), yields potentially greater error. The fact that you observed 0.9999999999999999 is quite remarkable as to how close this is to 1.0.

Calculations based on floating point have to be written such that they take into consideration that the numbers are (often) approximations.

Jim Dempsey


1.0000000000000002 - 1.0 ~= 0.0000000000000002 and this is Epsilon for a double-precision ( DP ) data type. Actual value of Epsilon for DP is:


and I would add some logic ( put 1.0 instead ) in that case.

Thanks for all the replies. I'll probably have to settle for some logic to catch the error. In fact, I'm surprised someone didn't have it there already. It's probably true that digits beyond machine precision are causing problems. Again, thanks for the help.

aReal4 = EPSLON(anyReal4) ! smallest real(4) detectible delta relative to 1.0 (2^-23)
aReal8 = EPSILON(anyReal8) ! smallest real(8)  detectible delta relative to 1.0_8 (2^-52)

Normal use is to obtain the epsilon of the type desired, then scale it to within the proximity of the values being compared.

Jim Dempsey


>>aReal4 = EPSLON(anyReal4) ! smallest real(4) detectible delta relative to 1.0 (2^-23)
>>aReal8 = EPSILON(anyReal8) ! smallest real(8) detectible delta relative to 1.0_8 (2^-52)

Here are Epsilon values from:

[ 1. Intel C++ compiler v12.x ]

#define FLT_EPSILON 1.19209290e-07F
#define DBL_EPSILON 2.2204460492503131e-16
#define LDBL_EPSILON 1.0842021724855044340075E-19L

[ 2. Borland C++ compiler v5.x ]

#define DBL_EPSILON 2.2204460492503131E-16
#define FLT_EPSILON 1.19209290E-07F
#define LDBL_EPSILON 1.084202172485504434e-019L

[ 3. Microsoft C++ compiler ( VS 2005 ) ]

#define DBL_EPSILON 2.2204460492503131e-016 /* smallest such that 1.0+DBL_EPSILON != 1.0 */
#define FLT_EPSILON 1.192092896e-07F /* smallest such that 1.0+FLT_EPSILON != 1.0 */

[ 4. IPP v7.x ]

#define IPP_EPS23 ( 1.19209289e-07f )
#define IPP_EPS52 ( 2.2204460492503131e-016 )

And here is a Summary of all Epsilon values for the double-precision floating-point type:

1 2.2204460492503131e-16
2 2.2204460492503131E-16
3 2.2204460492503131e-016
4 2.2204460492503131e-016

They are consistent and the same up to 16 digits.

Thanks for the correction. I was thinking of the binary representation of the mantissa.

The IVF documentation uses this description, my fault, IVF uses:

print *,EPSILON(0.0),EPSILON(0.0_8), EPSILON(0.0_16)

 1.1920929E-07  2.220446049250313E-016  1.925929944387235853055977942584927E-0034

 Keep an eye on me Sergey, sometimes I goof up.

Jim Dempsey



Now here is the corrected version:

print*, 1.0 / 2.0**23, 1.0_8 / 2.0_8**52, 1.0_16 / 2.0_16**113

  1.1920929E-07  2.220446049250313E-016  1.925929944387235853055977942584927E-0034
  1.1920929E-07  2.220446049250313E-016  9.629649721936179265279889712924637E-0035

Close enough. The numbers I gave were right. They just were not expessed familliar terms.

Jim Dempsey


Many years ago, I wrote the following function to give a robust solution to round-off and (possibly) at that time, the lack of an ACOS function:

       REAL*8    ZACOS, X, Y
       real*8, parameter :: one = 1 
       Y = one - X**2
       IF (Y.LT.0) Y = 0.0
       IF (Y.GT.0) Y = SQRT (Y)
       ZACOS = ATAN2 (Y, X) 

I would define (assuming you do all your computations in double precision)


in a module and USE it. I think you should avoid testing against 0. or 1., rather use tests against 0.0d+0 or 1.0d+0, or against ZERO and UNITY if you have defined them as module parameters. It is still wise to to test the magnitude of your ACOS argument against UNITY before using it, or, to cover your particular case, use IF(ABS(ARGUMENT-UNITY).LT.(some multipleof)*EPSILON) ARGUMENT=UNITY
before you use the argument in ACOS (or ASIN for that matter)

Had similar issue when call to one of the trigo functions was throwing Double.NaN.Corrected it by using simple logic exactly as app4619 described in his post.Even usage of java.StrictMath fp strict library version was not helpful. 

>>>1.0000000000000002 - 1.0 ~= 0.0000000000000002>>>

Is it not the classical case of catastrophic cancellation?

>>... I'm surprised someone didn't have it there already. It's probably true that digits beyond machine precision are
>>causing problems...

This is because nobody uses exactly the same processing and the same hardware ( similar to what you have ). You've mentioned that it happens on the Server computer and it doesn't happen on the Local computer. It is also Not clear what CPUs these two computers have.

Leave a Comment

Please sign in to add a comment. Not a member? Join today