Commutativity of Floating Point Arithmetic

That FP arithmetic can be tricky in multi-threaded environments is well known. However, I am getting strange results depending on whether I use -O2 optimization or not, even though the code itself is serial. In particular, in the enclosed sample code, four complex arrays, z1, z2, z3, and z4, are initialized with random numbers so that their value ranges are pairwise at different scales. Then the operation z1*z2 + z3*z4 is performed elementwise and stored in an array z. Finally, sum(z) is computed. If I compile with the -O2 flag, the result depends on whether the operands are in the above order or reversed, as in z2*z1 + z4*z3 (controlled at runtime by a command-line argument). If I compile with -g, or with -fp-model precise, the two results are identical. Can anyone explain why this could be happening?
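
For reference, here is a minimal sketch of the kind of test I mean (not the attached file itself; the array size and scale factors here are placeholder values):

program test_fp_sketch
    implicit none
    integer, parameter :: dp = kind(1d0)
    integer, parameter :: n = 1000000        ! placeholder size
    complex(dp), allocatable :: z1(:), z2(:), z3(:), z4(:), z(:)
    character(len=8) :: arg

    allocate(z1(n), z2(n), z3(n), z4(n), z(n))
    call fill(z1, 1d-10)   ! pairwise different scales (placeholder values)
    call fill(z2, 1d-3)
    call fill(z3, 1d-10)
    call fill(z4, 1d-3)

    call get_command_argument(1, arg)
    if (arg == '2') then
        z = z2*z1 + z4*z3          ! reversed operand order
    else
        z = z1*z2 + z3*z4          ! original operand order
    end if
    print *, sum(z)

contains
    subroutine fill(zz, scale)     ! uniform complex values in (-scale, scale)
        complex(dp), intent(out) :: zz(:)
        real(dp), intent(in) :: scale
        real(dp), allocatable :: re(:), im(:)
        allocate(re(size(zz)), im(size(zz)))
        call random_number(re)
        call random_number(im)
        zz = cmplx(2*re - 1, 2*im - 1, kind=dp) * scale
    end subroutine fill
end program test_fp_sketch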

Thanks,
George

Attachment: test_FP.f90 (1.17 KB)
mecej4:

A one-liner answer is: "Ahem, random numbers are random!"

According to the Fortran standard, the sequence of random numbers you generate is processor dependent.
The Intel Fortran Reference manual states in the entry for RANDOM_NUMBER:

"If RANDOM_SEED is not used, the processor sets the seed for RANDOM_NUMBER to a processor-dependent value."

Since you are computing SUM(z1*z2 + z3*z4) and SUM(z2*z1 + z4*z3) in separate runs, you may be using random sequences generated from different seeds. Thus, you end up comparing two numbers (the sums) which are themselves random, albeit with much smaller standard deviations.

If you want to make a valid comparison test for commutativity, you should explicitly CALL RANDOM_SEED, with the same initial set of seeds on each run.
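
For example (a sketch; the seed values themselves are arbitrary):

integer :: n
integer, allocatable :: seed(:)

call random_seed(size=n)     ! query the required seed length
allocate(seed(n))
seed = 12345                 ! any fixed values will do
call random_seed(put=seed)   ! same seed on every run => same sequence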

Or, to avoid having to set the seeds, declare za and zb instead of z in your program, and compute and print as follows.

za = z1*z2 + z3*z4
zb = z2*z1 + z4*z3

print *, sum(za)
print *, sum(zb)

I disagree about the random numbers. If you do not call RANDOM_SEED, the seed is set at compile time and the program will always generate the same random numbers in the same order every time you run it. If you recompile, then the random numbers change. For example:

PROGRAM RandTest
   IMPLICIT NONE

   REAL :: randNum
   INTEGER :: I


   DO I = 1, 10
      CALL RANDOM_NUMBER(randNum)
      PRINT *, randNum
   END DO
END PROGRAM RandTest

And then I run it:

|tgallagher@harpy:test|> ./a.out 
  3.9208680E-07
  2.5480442E-02
  0.3525161    
  0.6669145    
  0.9630555    
  0.8382882    
  0.3353550    
  0.9153272    
  0.7958636    
  0.8326931    
|tgallagher@harpy:test|> ./a.out 
  3.9208680E-07
  2.5480442E-02
  0.3525161    
  0.6669145    
  0.9630555    
  0.8382882    
  0.3353550    
  0.9153272    
  0.7958636    
  0.8326931    
|tgallagher@harpy:test|> ./a.out 
  3.9208680E-07
  2.5480442E-02
  0.3525161    
  0.6669145    
  0.9630555    
  0.8382882    
  0.3353550    
  0.9153272    
  0.7958636    
  0.8326931    

As you can see, all the numbers are the same and in the same order because the seed is set at compile time. So, provided the code isn't recompiled, the differences in the sums are due to floating-point math consistency and not due to different random numbers.

This is discussed here:
http://software.intel.com/en-us/articles/consistency-of-floating-point-r...

Tim

mecej4:

Well, I would have to see the magnitudes of the differences that the OP alluded to. I stand by my position that one cannot count on how a program will behave when the Fortran standard says the behavior is processor-dependent.

Steve Lionel (Intel):

The way our implementation works is that if you do not call RANDOM_SEED, the seed chosen is a fixed value. If you call RANDOM_SEED with no arguments, a seed based on the time-of-day clock is used.
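
For instance, a minimal sketch of the two behaviors just described:

program seed_demo
    implicit none
    real :: x
    call random_number(x)        ! no RANDOM_SEED call yet: fixed seed,
    print *, 'fixed:      ', x   ! so this value repeats on every run
    call random_seed()           ! no arguments: reseed from time of day
    call random_number(x)
    print *, 'time-based: ', x   ! this value differs from run to run
end program seed_demo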

That said, the order of operations, and vectorization, can slightly change values. Floating-point arithmetic operations are often not computationally commutative.

Steve

You may be interested in checking whether both loops are vectorized, and whether all arrays are aligned and then treated by the compiler as aligned. In principle, recent compilers should take care to produce the same results on this kind of loop regardless of alignment.

mecej4:

Here are the results from one run, in which the same z1 and z2 arrays were used to compute za and zb using vector assignments rather than loops, as in #1:

za: (1.786555654004733E-007,1.688695763736762E-007)
zb: (1.786555654004733E-007,1.688695763736758E-007)

There is a discrepancy of about 4×10⁻²² (the last printed digit) in the imaginary part, and none in the real part.

Considering that 2.6 million complex numbers were added, the evaluation of each of which involved 2 × (4 multiplications + 3 additions), I think that the discrepancy is negligibly small, and that fact is not surprising either: because the random numbers used are uniformly generated in (-a, a) and (-b, b), one need not fear accumulation of round-off error.

I have deliberately omitted mention of the compiler version, options, etc. I think that I need to be convinced that there is a problem to be solved.

I also ask whether we should not be more worried about the value itself (about 2×10⁻⁷) being unacceptably large, given that the expected value is zero and the computed value is over 1000 times the range of magnitudes of the za, zb elements.

mecej4:

Steve, I have two related comments/questions. One is that the documentation of RANDOM_SEED at

http://software.intel.com/sites/products/documentation/hpc/composerxe/en...

says: "...If no argument is specified, a random number based on the date and time is assigned to the seed." This does not quite agree with the first line of output from the program below:

program getseed
  integer :: n,seed(10)

  call random_seed(size=n)
  if(n.gt.10)stop 'seed array too small'

  call random_seed(get=seed)
  write(*,10)'Before PUT : ',seed(1:n)

  seed(1:n)=seed(1:n)*2
  call random_seed(put=seed)
  call random_seed(get=seed)
  write(*,10)'After  PUT : ',seed(1:n)

10 format(A,10(2x,Z11))

end program getseed

The first line did not change even after I changed the PC clock to a different date.

Before PUT :      7FFFFFAA     7FFFFF06
After  PUT :      7FFFFFAA     7FFFFF06

The second comment is that I did not expect the second line to be the same as the first. With other compilers, I do see that the second line has values twice those of the first (with possible overflow ignored).

So I can confirm that your example gives the same seeds/random numbers. But we cooked up an example that generates a difference:

MODULE Random_m
   IMPLICIT NONE
   PRIVATE
   PUBLIC :: InitializeRandomSeed

CONTAINS
   SUBROUTINE InitializeRandomSeed(seed_input)
      INTEGER, INTENT(IN) :: seed_input
      INTEGER :: i, n, clock
      INTEGER, DIMENSION(:), ALLOCATABLE :: seed
     
      CALL RANDOM_SEED(SIZE=n)
      ALLOCATE(seed(n))

      IF (seed_input > 0) THEN
         seed = seed_input
      ELSE
         CALL SYSTEM_CLOCK(COUNT=clock)
         seed = clock + 37 * (/ (i - 1, i = 1, n) /)   
      END IF

      CALL RANDOM_SEED(PUT=seed)
      CALL RANDOM_SEED(GET=seed)
      PRINT *, seed
      DEALLOCATE(seed)
   END SUBROUTINE InitializeRandomSeed
END MODULE Random_m

PROGRAM TestRandom
   USE Random_m
   INTEGER :: i, n, rsInput
   REAL :: x
   n = 10
   rsInput = -1

   CALL InitializeRandomSeed(rsInput)
   DO i = 1, n
      CALL RANDOM_NUMBER(x)
      WRITE(*,*) 'x = ', x
   END DO

   WRITE(*,*) 'AGAIN'
   CALL InitializeRandomSeed(rsInput)
   DO i = 1, n
      CALL RANDOM_NUMBER(x)
      WRITE(*,*) 'x = ', x
   END DO
END PROGRAM TestRandom

Ironically, the random numbers/seeds are different on Intel, but when we compile and run with gfortran, the seeds and random numbers are always the same before and after the second call to InitializeRandomSeed.

Tim

mecej4:

This is a shot in the dark: perhaps the Intel RNG needs to be primed by having it generate a few output numbers before it actually uses a new seed given to it with a PUT request.

GFortran (I have V 4.5) uses an 8-element seed vector. Your example in #8 sends it a single value, which is spread to the whole vector by the statement

  IF (seed_input > 0) THEN  
          seed = seed_input

It would not be surprising if GFortran ignored a PUT request with an incomplete/degenerate seed vector.

Steve Lionel (Intel):

mecej4, I don't see that your program called RANDOM_SEED with no arguments. But I agree that the behavior is unexpected and I will look into that. I modified the program to call RANDOM_SEED with no arguments and was puzzled to see that I got the same seed back each time - didn't expect that either. I'll have a word or three with the library developers tomorrow about that.

Steve
mecej4:

Steve,

> I don't see that your program called RANDOM_SEED with no arguments.

Quite so. I assumed that if RANDOM_NUMBER was called with no preceding call to RANDOM_SEED in the current process, the state of the RNG would be the same as if an implicit call to RANDOM_SEED had been made with no arguments.

Thanks for looking into this.

So now I'm more confused.

The example calls the Initialize routine with -1 always, so it goes into the ELSE clause that uses SYSTEM_CLOCK.

So I modified it so that the seed is no longer degenerate:

      IF (seed_input > 0) THEN
         seed = seed_input
      ELSE
         DO I = 1, n
            CALL SYSTEM_CLOCK(COUNT=clock)
            seed(I) = clock*(I-1)
         END DO
      END IF

With that change, gfortran generates different random numbers but now this example with Intel generates the same random numbers and the same seed after the PUT.

Tim

jimdempseyatthecove:

George,

The problem you are observing as apparent non-commutativity is that your additions are saturating the mantissa of the floating-point numbers. This means you are experiencing rounding errors at different points in the calculation, depending on the sequence of operations.
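
A two-line illustration of the effect (a sketch; the constants are chosen only to force rounding in single precision):

program round_demo
    implicit none
    real :: big, small
    big   = 1.0e8    ! the single-precision spacing (ULP) at 1.0e8 is 8.0
    small = 1.0
    print *, (big + small) - big   ! prints 0.0: the 1.0 is rounded away first
    print *, (big - big) + small   ! prints 1.0: the cancellation happens first
end program round_demo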

Jim Dempsey

www.quickthreadprogramming.com
jimdempseyatthecove:

>>If I compile with -g, or with -fp-model precise, the two results are identical

With those options, the generated code uses x87 FPU instructions as opposed to SSE instructions. The FPU's internal temporaries (the FPU register stack) are 80-bit, with greater precision than double. Therefore you will see less round-off error (no apparent error in many more cases). Rounding errors are a fact of life with some sequences of arithmetic.

Jim Dempsey

www.quickthreadprogramming.com

The "feature" of changing from SSE to X87 at -O0 (implied by -g, when no -O level is set) is present only in 11.1 and earlier compilers. As I understand it, the 12.0 compilers generate X87 code only when explicitly requested (by the 32-bit -mia32 option).
-fp-model precise (same as -fp-model source, for ifort, but not for icc) by itself doesn't change from SSE to X87 instructions. It does restore gradual underflow and strict compliance in the C as well as Fortran sense with the order of operations implied in your source code.
In the 12.0 compilers, the new option -standard-semantics is available to comply more fully with Fortran standards, apparently without setting gradual underflow. I haven't checked whether it disables optimization of sum and dot product reductions, as -fp-model precise|source do.

Steve Lionel (Intel):

I've got some interesting info regarding RANDOM_SEED. It seems that if you do a PUT with a seed that has negative values, it ignores the PUT. At least this is the way it looks to me. More investigation is needed.

Calling RANDOM_SEED with no arguments does use a time-based seed and PUT does work otherwise. I have attached a program which demonstrates this.

Attachment: ran_test.f90 (1.25 KB)
Steve
Steve Lionel (Intel):

Ok, mystery solved. I was clued in by the following text in the Fortran standard:

The value returned by GET need not be the same as the value specified by PUT in an immediately preceding reference to RANDOM_SEED.

For example, following execution of the statements

CALL RANDOM_SEED (PUT=SEED1)
CALL RANDOM_SEED (GET=SEED2)

SEED2 need not equal SEED1. When the values differ, the use of either value as the PUT argument in a subsequent call to RANDOM_SEED shall result in the same sequence of pseudorandom numbers being generated.

For example, after execution of the statements

CALL RANDOM_SEED (PUT=SEED1)
CALL RANDOM_SEED (GET=SEED2)
CALL RANDOM_NUMBER (X1)
CALL RANDOM_SEED (PUT=SEED2)
CALL RANDOM_NUMBER (X2)

X2 equals X1.

In the test programs here, the static initial seed was retrieved, multiplied by 2, and then PUT, but when GET was done, the "old value" was retrieved. I found that BOTH values in fact delivered the same RANDOM_NUMBER harvest. So RANDOM_SEED is doing some "filtering" of the incoming seed value to make sure it is within proper limits for the algorithm. No bug.
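
The guarantee can be checked directly (a sketch; the seed value is arbitrary):

program put_get_check
    implicit none
    integer :: n
    integer, allocatable :: seed1(:), seed2(:)
    real :: x1, x2
    call random_seed(size=n)
    allocate(seed1(n), seed2(n))
    seed1 = 314159                 ! arbitrary starting seed
    call random_seed(put=seed1)
    call random_seed(get=seed2)    ! may come back "filtered", not equal to seed1
    call random_number(x1)
    call random_seed(put=seed2)    ! putting the retrieved value back...
    call random_number(x2)         ! ...must reproduce the same sequence
    print *, x1, x2, x1 == x2      ! the standard requires x1 == x2
end program put_get_check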

Steve

Thanks to Jim Dempsey and TimP for explaining the effect of the -fp-model and -O2 compiler options on the switch to/from the SSE instruction set. It looks like the only thing that makes a difference in this case is the fact that the FPU uses 80-bit registers for intermediate results. What is still puzzling is why, as it turns out, the difference appears only in the imaginary part.

The discussion of random number generation is very interesting, but irrelevant to this problem (perhaps move it to a new thread?). One can see the result of the non-commutative operations by using, for instance, the following (very deterministic) constants:

a = dcmplx(1.102323407845943d-10, -2.132323407845943d-10)
b = dcmplx(-2.392085985049895d-3, 3.3989085985049895d-3)
c = dcmplx(6.293823948579384d-10, 8.291823948579384d-10)
d = dcmplx(8.129834587968582d-3, 1.1298342879685822d-3)

The code output for a*b + c*d is:
(4.641008061166196E-012,8.336953268281271E-012)
while the output for b*a + d*c is
(4.641008061166196E-012,8.336953268281273E-012)

The reason I used randomly generated arrays was to save myself the trouble of finding specific quadruples of complex numbers with this non-commutative property (it's not that easy to do manually). The sum of all elements is simply used as a scalar diagnostic, although it most likely contributes further to the discrepancy because of differences in the order of addition between SSE and FPU code. In a modified version of the test code, I did a pairwise comparison of the elements of the two z arrays, and it turns out that about 2/3 are different.
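
The pairwise check amounts to something like this (a sketch, reusing the za and zb arrays from #1):

integer :: i, ndiff
ndiff = 0
do i = 1, size(za)
    if (za(i) /= zb(i)) ndiff = ndiff + 1
end do
print *, ndiff, 'of', size(za), 'elements differ'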

--George

mecej4:

> What is still puzzling is why, as it turns out, the difference appears only in the imaginary part.

It is easy to see that the perceived difference is an artifact of binary->decimal conversion, at least for the scalar example that you showed in #18.

If you are going to compare 64-bit reals to the last bit, please print using hexadecimal notation.

      program cm
double complex a,b,c,d,z1,z2
integer iz1(2),iz2(2)    ! error: later corrected to iz1(4),iz2(4)
equivalence (iz1,z1),(iz2,z2)
a = dcmplx(1.102323407845943d-10, -2.132323407845943d-10)
b = dcmplx(-2.392085985049895d-3, 3.3989085985049895d-3)
c = dcmplx(6.293823948579384d-10, 8.291823948579384d-10)
d = dcmplx(8.129834587968582d-3, 1.1298342879685822d-3)
z1=a*b+c*d
z2=b*a+d*c
write(*,*)z1
write(*,*)z2
write(*,20)iz1(2),iz1(1),iz1(4),iz1(3)    ! the iz1(4),iz1(3) items were added later
write(*,20)iz2(2),iz2(1),iz2(4),iz2(3)
20 format(1x,2Z8,2x,2Z8)
end

gives the result

 (4.641008061166196E-012,8.336953268281273E-012)
 (4.641008061166196E-012,8.336953268281271E-012)
 3D94694F7FCB1DC7  (this is the real part; see below)
 3D94694F7FCB1DC7

GFortran seems to do the formatting slightly better:

 ( 4.64100806116619663E-012, 8.33695326828127092E-012)
 ( 4.64100806116619663E-012, 8.33695326828127092E-012)
 3D94694F7FCB1DC8
 3D94694F7FCB1DC8

You may wish to explore whether the same explanation applies to your original test with big arrays of random elements.

Note, added subsequently:

The above program was in error, and I offer my apologies. Here is a corrected version.

      program cm
double complex a,b,c,d,z1,z2
integer iz1(4),iz2(4)
equivalence (iz1,z1),(iz2,z2)
a = dcmplx(1.102323407845943d-10, -2.132323407845943d-10)
b = dcmplx(-2.392085985049895d-3, 3.3989085985049895d-3)
c = dcmplx(6.293823948579384d-10, 8.291823948579384d-10)
d = dcmplx(8.129834587968582d-3, 1.1298342879685822d-3)
z1=a*b+c*d
z2=b*a+d*c
write(*,*)z1
write(*,*)z2
write(*,20)iz1(2),iz1(1),iz1(4),iz1(3)
write(*,20)iz2(2),iz2(1),iz2(4),iz2(3)
20 format(1x,2Z8,2x,2Z8)
end

The results from the corrected program, compiled using the 12.0.2 compiler, with /Od, show that all bits of z1 and z2 agree. With /O2, there is a difference in the least significant bit of the imaginary part.

 (4.641008061166196E-012,8.336953268281273E-012)
 (4.641008061166196E-012,8.336953268281271E-012)
 3D94694F7FCB1DC7  3DA255499696C399
 3D94694F7FCB1DC7  3DA255499696C398



Since a*b is about a tenth of c*d, it is not surprising that adding them causes a loss of 1 bit of precision or that the compiler optimization level has a small but noticeable effect.

The strictest criteria offered by IEEE 754 for WRITE conversions apply only down to 1e-10, but it's a little surprising to see this non-repeatability come in at this point; it appears to indicate an off-by-1-ULP result in the first case produced by ifort.

jimdempseyatthecove:

George,

When accumulating (adding or subtracting) FP numbers whose resultant value's mantissa (the non-exponent part) requires more bits than the system's FP mantissa holds, rounding occurs. If you change the order of the accumulations, then you can also change the points at which the rounding occurs. This can change the result (usually by a small amount).

While arranging the code (through options or statements) so that the same sequence and the same precisions are used will produce reproducibility across your runs, it will not necessarily assure the highest precision. Higher precision (of accumulations) will occur if you accumulate the smallest values first and the largest last.
That is, adding a very small number to a very large number, where the difference in magnitudes exceeds the precision of the mantissa, is equivalent to accumulating 0's; whereas adding all the small numbers together first might produce a result whose difference in magnitude from the larger numbers is less than the precision of the mantissa. Coding this way runs slower, since it may require a sort first.
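
In code, the idea is simply the following (a sketch; sort_ascending stands in for whatever sort routine you have at hand, it is not an intrinsic):

! accumulate the smallest values first so they are not rounded away
! against a large running total
call sort_ascending(array)    ! hypothetical helper: array(1) <= array(2) <= ...
total = 0.0
do i = 1, size(array)
    total = total + array(i)
end do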

Jim Dempsey

www.quickthreadprogramming.com
jimdempseyatthecove:

George,

To illustrate the point:

! rantest.f90
program rantest
    implicit none
    intrinsic random_number
    real, allocatable :: array(:)
    real :: sum1, sum2, clipLow, clipHigh
    real(8) :: dsum
    integer, parameter :: N = 100000
    integer :: I
    allocate(array(N))
    call random_number(array)
    ! sum1: single-precision accumulation in array order
    ! dsum: the same accumulation carried out in double precision
    sum1 = 0.0
    dsum = 0.0_8
    do I=1,N
      sum1 = sum1 + array(I)
      dsum = dsum + dble(array(I))
    end do
    ! sum2: single-precision accumulation in bins of increasing magnitude;
    ! each pass adds the values lying in (clipLow, clipHigh]
    sum2 = 0.0
    clipLow = 0.0
    clipHigh = 1.0E-30
    do while(clipLow .lt. 1.0)
        do I=1,N
          if((array(I) .le. clipHigh) .and. (array(I) .gt. clipLow)) sum2 = sum2 + array(I)
        end do
        clipLow = clipHigh
        clipHigh = clipHigh * 2.0
        if(clipHigh .gt. 1.0) clipHigh = 1.0
    end do
    write(*,100) sum1, sum2, dsum
100 format(1x,Z8,1x,Z8,1x,Z16)
    write(*,200) sum1, sum2, dsum
200 format(1x,E15.10,1x,E15.10,1x,D15.10)
end program rantest

Output:

 4743AF59 4743AF77 40E875F115F2F480
 .5009534766E+05 .5009546484E+05 .5009553393D+05

The second line of output shows a difference that illustrates how the accumulation of disparate numbers behaves like truncation.

The first float is the summation without regard to size. This is less than...
The second float is the summation in approximate order of size. This produces a slightly larger number that is closer to the true result.
The third number (the double) is larger yet and, except for formatting/printing issues, represents the correct (or nearly correct) result.

Jim Dempsey

www.quickthreadprogramming.com
jimdempseyatthecove:

Follow-up

I extended the above program to tally dsum1 and dsum2 alongside sum1 and sum2. The results indicate:

4743AF59 4743AF77 40E875F115F2F480 40E875F115F2F480

.5009534766E+05 .5009546484E+05 .5009553393D+05 .5009553393D+05

Note that the Z formatting of the two doubles (dsum1 and dsum2) is equal, which should be the case (for this program) since array was single precision: conversion from single to double is exact, and, for this program, the summation of the converted reals did not saturate the mantissa of the doubles (dsum1, dsum2).

George has to answer for himself: does he want consistent results, accurate results, or accurate and consistent results?

Jim Dempsey

www.quickthreadprogramming.com

The original examples don't raise the well-known issues with sum accumulation; as no fused multiply-add was under discussion, the expectation of commutativity was reasonable, aside from possibilities related to inconsistent use of x87 extra precision.

jimdempseyatthecove:

>> the expectation of commutativity was reasonable

The expectations are reasonable, but the practicalities are that it is not always possible. Fused multiply-add is not at issue here, but it may be one of the players when used by the compiler. The root cause is the sequence in which the accumulations are performed using values whose summations do not fit within the precision of the calculator (in this case the FPU, the FPU temporaries, and the SSE registers all differ). Additionally, when the precision is insufficient, round-off rules apply, and the sequence of operations will affect the outcome.

This expectation is based on the requirement of evaluations made with exacting precision. Compiler optimizations, in particular common sub-expression elimination, combined with instruction/statement reordering to improve common sub-expression elimination, can affect the sequence in which the accumulations occur. When the numbers cannot maintain exacting precision, commutativity cannot be assured.

Jim Dempsey

www.quickthreadprogramming.com
