# Understanding floating-point values with the Intel® Fortran Compiler

By Chen, Yuan, published on July 22, 2012

**Problem:** When calculating the following expressions with the Intel® Fortran Compiler (assuming A is a single-precision, real(4) variable):

real(4) A

A = 60283000.0

A = A + 3

We always get 6.0283000E+07. Why is the last digit, 3, lost in the result?

On the other hand, if I change the expressions to the following:

A = 10283000.0

A = A + 3

I get the expected value: 1.0283003E+07.

Another problem is that different FORMAT edit descriptors can produce unexpected printed values. For example:

real(4) A

A=1.65

WRITE (*, 9999) A

9999 FORMAT ( F20.10 )

We will get:

1.6499999762

What's the reason?

**Environment:**

All platforms

**Root Cause:**

In a computer, numbers are represented internally in binary rather than decimal, using a scientific notation of the form:

*Significant digits* × *base*^{exponent}

The IEEE has standardized the computer representation for binary floating-point numbers in IEEE 754 (aka. IEC 60559). This standard is followed by almost all modern machines. See http://en.wikipedia.org/wiki/Floating_point.

Under IEEE 754, a 32-bit real number typically has 1 sign bit, 8 bits for the exponent, and 23 bits for the significant digits, the mantissa. (A 64-bit 'double precision' binary floating-point number has a 53-bit significand, one bit of which is implied, an 11-bit exponent, and one sign bit.)

Since the leading digit of the mantissa is always 1 for a normalized number, it need not be stored, which yields one extra bit of precision. Hence the largest integer the effective 24-bit mantissa can hold is 2^24−1 = 16777215, about 7 significant decimal digits. (Similarly, for 64-bit double precision the 53-bit significand holds integers up to 2^53−1 = 9007199254740991, about 15–16 significant decimal digits.)
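The 24-bit limit can be illustrated with a short sketch. This is Python rather than Fortran; the helper `f32` (our name, not from the article) round-trips a value through the `struct` module's single-precision format, which follows IEEE 754:

```python
import struct

def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Integers up to 2**24 are exactly representable in single precision.
print(f32(16777215.0))  # 16777215.0 — 2**24 - 1 fits in 24 bits
print(f32(16777217.0))  # 16777216.0 — 2**24 + 1 needs 25 bits and is rounded
```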

In the first problem, variable **A** initially equals 60283000, which is 11100101111101100001111000 in binary, 26 bits long. Because its lowest bits are zeros, it still fits in the 24-bit mantissa (1 implied + 23 stored) and is represented exactly. When 3 (binary 11) is added, however, the exact sum 60283003 requires 26 significant bits; it cannot be stored exactly and is rounded to the nearest representable value, so the added 3 is not reflected in the printed result.

However, if **A** initially equals 10283000, its binary representation is only 24 bits long, so it fits in the mantissa exactly. When 3 is added, the sum 10283003 still fits in 24 bits, so the last digit is preserved as well.

In the second problem, with FORMAT (F20.10), **A** is assigned the decimal number 1.65. Converted to binary, 1.65 is 1.10100110011001100110011…, an infinitely repeating fraction that cannot be represented exactly. The trailing bits that do not fit into the fixed width are dropped (rounded). The representation of 1.65 in 32-bit single precision is as follows:

0 | 01111111 | 10100110011001100110011

sign (1 bit) | exponent (8 bits) | fraction (23 bits)
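The field values above can be checked with a small sketch (Python; it assumes `struct`'s `'f'` format, which uses IEEE 754 single precision):

```python
import struct

# View the single-precision bit pattern of 1.65 (big-endian pack, then as an int).
bits = struct.unpack('>I', struct.pack('>f', 1.65))[0]
s = format(bits, '032b')
print(s[0], s[1:9], s[9:])  # sign, exponent, fraction fields
# → 0 01111111 10100110011001100110011
```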

Converted back to decimal, this value is 1.64999997615814208984375. As you can see, only the first 8 significant digits, 1.6499999, are close to the original decimal value 1.65; that matches the precision of a 32-bit real number. Although a larger number of fractional digits in the print format gives a more precise decimal approximation of the exact binary number stored in the computer, it still prints some unexpected digits, such as 1.6499999762. That is because the rounding occurred earlier, when the decimal input was converted to binary.

**Resolution:**

For the first problem, since the last digit requires more precision than a 32-bit constant provides, we must extend the precision in one of the following ways:

1. Define your variable as Real*8, and compile with the switch -fpconstant to make sure constants are evaluated in double precision as well.

2. If you define A with the default Real type, use the switch /real-size:64 (Windows) or -real-size 64 (Linux) to change the default Real size, and compile with -fpconstant as well.
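Why the extra precision helps can be seen in a quick sketch: Python's built-in float is an IEEE 754 double with a 53-bit significand, the same representation a Fortran Real*8 variable uses (this is an illustration in Python, not the Fortran fix itself):

```python
# Python floats are IEEE 754 doubles (53-bit significand), matching Real*8.
a = 60283000.0
a = a + 3.0
print(a)  # 60283003.0 — 26 significant bits fit easily within 53
```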

For the second problem, the above solutions also apply. Besides that, you have two other choices:

1. Adjust your print format to a number of fractional digits appropriate to the precision. For example, if A is single precision, change the format to

9999 FORMAT ( F20.7)

We will get the expected result printed:

1.6500000

2. Another solution is to make the literal constants explicitly double precision, e.g.

WRITE(*,9999) 1.65_8

You will get an expected result printed:

1.6500000000
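Both resolutions for the print problem can be reproduced in a sketch (Python; `f32` is our helper emulating real(4) storage via IEEE 754 single precision, and Python format specifiers play the role of the Fortran FORMAT edit descriptors):

```python
import struct

def f32(x):
    # Round-trip through single precision, like storing into a real(4).
    return struct.unpack('f', struct.pack('f', x))[0]

a = f32(1.65)
print(format(a, '.10f'))     # 1.6499999762 — excess digits expose the rounding
print(format(a, '.7f'))      # 1.6500000    — digits matched to ~7-digit precision
print(format(1.65, '.10f'))  # 1.6500000000 — double precision, like 1.65_8
```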