x64 object runs much slower than win32 object when do loop has increment parameter

Hi,

I encountered some strange behavior.

When both of the following hold:

1. a DO loop has a variable as its increment parameter,

2. the DO loop index is referenced after the loop,

the object compiled in x64 RELEASE mode runs much slower than the one compiled in win32 RELEASE mode.

Here is a minimal sample program.

PROGRAM x64_Release_runs_slow
  IMPLICIT NONE

  INTEGER :: i, j, k
  REAL :: t0, t1

  CALL CPU_TIME(t0)
!
  k = 1
  DO i = 1, 10**8
   DO j = 1, 10**2, k  ! 1. use variable k as an increment parameter  
    !                  
   END DO
  END DO
  PRINT *, j           ! 2. reference to the loop index j
!
  CALL CPU_TIME(t1)
  PRINT *, t1 - t0 

  STOP
END PROGRAM x64_Release_runs_slow

This sample takes a few seconds in x64 RELEASE mode, but practically 0 seconds in win32 RELEASE mode.

The slowdown disappears when k is replaced by the constant 1 or when the line "PRINT *, j" is commented out.

In DEBUG mode, x64 and win32 run in almost the same CPU time.

I suppose this might be an optimization problem.

I attach the more realistic program in which I encountered this problem. (The option /assume:realloc_lhs is required.) In this case the x64 version is ~40% slower than win32.

Yamajun

Attachment: Shell_Tokuda.f90 (1.25 KB)

Nobody interested?

Here is a simpler example. It contains 4 semantically identical DO loops; only the second one runs slowly.

Result: x64 RELEASE

  2000000001  0.0000000E+00
  2000000001  0.3588023
  2000000001  0.0000000E+00
            2000000001  0.0000000E+00

PROGRAM test
  IMPLICIT NONE
  INTEGER :: i, j, k
  INTEGER(8) :: jj
  REAL :: t0, t1
  
  CALL CPU_TIME(t0) 
  DO j = 1, 2 * 10**9
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, j, t1 - t0
!
! 
  k = 1
  CALL CPU_TIME(t0)
  DO j = 1, 2 * 10**9, k
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, j, t1 - t0
!
!
  CALL CPU_TIME(t0)
  DO j = 1, 2 * 10**9, 1
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, j, t1 - t0
!
! 
  k = 1
  CALL CPU_TIME(t0)
  DO jj = 1, 2 * 10**9, k
  !  
  END DO
  CALL CPU_TIME(t1)
  PRINT *, jj, t1 - t0
!
!
  STOP
END PROGRAM test

None of these loops do anything. The compiler probably optimizes some away and not others - it is not a useful test program.

Try again with a more realistic program - remembering that the optimizer is smarter than you might think.

Steve - Intel Developer Support

Just want to add my experience. Are you using loop unrolling, or have you changed the threshold for auto-parallelization? Loop unrolling might slow down some specific segments of code when used with O3. In other words, the results of aggressive optimization are sometimes code specific.

Steve,

Yes, these loops do nothing and I know the optimizer is quite clever.

I met this phenomenon in the more realistic program which I attached to my first post.

But anyway, I found the reason. It was the vectorizer. (Thanks ragu, you gave me a hint.) In x64 RELEASE, a DO loop whose increment parameter is a variable is not vectorized.

There used to be info messages when loops were vectorized, so I thought that neither vectorization nor parallelization was taking place.

Here is another simple example.

PROGRAM x64_Release_runs_slow
  IMPLICIT NONE

  INTEGER :: i, j, k
  REAL :: s, t0, t1

  k = 1
  CALL CPU_TIME(t0)
  DO i = 1, 10**5
   s = 0.0
   DO j = 0, 10**4, k
     s = s + REAL(j)
   END DO
  END DO
  CALL CPU_TIME(t1)
  PRINT *, 'time =', t1 - t0, 'n(n+1)/2=', 0.5 * 10.0**4 * (10.0**4 + 1.0), 'calc.', s 


  CALL CPU_TIME(t0)
  DO i = 1, 10**5
   s = 0.0
   DO j = 0, 10**4
     s = s + REAL(j)
   END DO
  END DO
  CALL CPU_TIME(t1)
  PRINT *, 'time =', t1 - t0, 'n(n+1)/2=', 0.5 * 10.0**4 * (10.0**4 + 1.0), 'calc.', s 
 

  STOP
END PROGRAM x64_Release_runs_slow

Win32 compiler vectorizer message

1>C:FortransortConsole1Console1.f90(9): (col. 3) remark: loop was not vectorized: not inner loop.
1>C:FortransortConsole1Console1.f90(11): (col. 4) remark: LOOP WAS VECTORIZED.
1>C:FortransortConsole1Console1.f90(20): (col. 3) remark: loop was not vectorized: not inner loop.
1>C:FortransortConsole1Console1.f90(22): (col. 4) remark: LOOP WAS VECTORIZED.

win32 output

 time =  0.2340015     n(n+1)/2=  5.0005000E+07 calc.  5.0005000E+07
 time =  0.1716011     n(n+1)/2=  5.0005000E+07 calc.  5.0005000E+07

x64 compiler vectorizer message

1>C:FortransortConsole1Console1.f90(9): (col. 3) remark: loop was not vectorized: not inner loop.
1>C:FortransortConsole1Console1.f90(11): (col. 4) remark: loop was not vectorized: existence of vector dependence.
1>C:FortransortConsole1Console1.f90(20): (col. 3) remark: loop was not vectorized: not inner loop.
1>C:FortransortConsole1Console1.f90(22): (col. 4) remark: LOOP WAS VECTORIZED.

x64 output

 time =   1.201208     n(n+1)/2=  5.0005000E+07 calc.  5.0002896E+07
 time =  0.1872011     n(n+1)/2=  5.0005000E+07 calc.  5.0005000E+07

The first loop is slow and the sum is not correct.
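
The difference in the sum is probably a single-precision rounding effect and not a wrong-code bug: a strictly sequential REAL accumulation rounds once the running total outgrows the 24-bit mantissa, while a vectorized loop keeps several smaller partial sums. The following is a minimal sketch of that assumption, not taken from the thread. The program name, the four-way split (chosen to mimic a 4-wide SSE vector) and the double-precision reference are illustrative choices.

```fortran
PROGRAM sum_order
  IMPLICIT NONE
  INTEGER, PARAMETER :: n = 10**4
  INTEGER :: j
  REAL :: s_seq, p(4)
  DOUBLE PRECISION :: s_ref

  ! Strictly sequential single-precision sum, as in a non-vectorized loop.
  s_seq = 0.0
  DO j = 0, n
    s_seq = s_seq + REAL(j)
  END DO

  ! Four interleaved partial sums (assumed 4-wide vector), combined at the end.
  p = 0.0
  DO j = 0, n
    p(MOD(j, 4) + 1) = p(MOD(j, 4) + 1) + REAL(j)
  END DO

  ! Double-precision reference value.
  s_ref = 0.0D0
  DO j = 0, n
    s_ref = s_ref + DBLE(j)
  END DO

  PRINT *, 'sequential  :', s_seq
  PRINT *, 'partial sums:', SUM(p)
  PRINT *, 'reference   :', s_ref
END PROGRAM sum_order
```

To see the effect, compile without optimization; otherwise the compiler may vectorize the first loop itself and hide the difference.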

I agree, it looks like x64 compilation does not recognize

k=1
DO i=...
DO j=s,e,k

as having k as "fixed at 1"

It looks like an area where the optimizer missed an opportunity.
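
Until the optimizer handles this case, one possible workaround is to test for the unit stride explicitly and use a loop with a constant increment in that branch. This is a hedged sketch based on the second example above. The program name is made up, and whether a given compiler version actually vectorizes the first branch has not been verified here.

```fortran
PROGRAM unit_stride_workaround
  IMPLICIT NONE
  INTEGER :: i, j, k
  REAL :: s

  k = 1
  DO i = 1, 10**5
    s = 0.0
    IF (k == 1) THEN
      ! Constant unit stride: the form reported as vectorized in this thread.
      DO j = 0, 10**4
        s = s + REAL(j)
      END DO
    ELSE
      ! General case with a run-time stride.
      DO j = 0, 10**4, k
        s = s + REAL(j)
      END DO
    END IF
  END DO
  PRINT *, 'calc.', s
END PROGRAM unit_stride_workaround
```

The vectorizer remarks for the two inner loops should show whether the unit-stride branch is treated differently from the general one.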

Jim

www.quickthreadprogramming.com
