openmp Reduction differences with reduction flag

openmp Reduction differences with reduction flag

Hello,

     I have a code where I have parallelized one of the intensive subroutines using openmp.  I have a reduction variable which, which I I use the optimization flags -O2 or -O3 (but now -O1), the value of the reduction variable that I am outputing is about 1000x to large.  This does not happen when I run the code in serial with or without the -O3 flag, nor does it happen when I compile the code using gfortran and the -O3 flag.  Is there something about the optimization flags that are incompatbile with OMP REDUCTION?

Here is the portion of the code related to the reduction variable:

!$OMP PARALLEL DEFAULT(Shared) PRIVATE(Num,ID,jsr,Thread,CX,CY,CZ,i,j,k,
!$OMP& II,JJ,KK,XX,YY,ZZ,RSQ,Btest,BX,BY,BZ,RR,BondR,FFx,FFy,FFz,TH)
Thread = OMP_GET_THREAD_NUM()
 TH = Thread + 1
!$OMP Do SCHEDULE(Static) REDUCTION( + : Nbond)
 Do Num = 1,N
 jsr = seed(Thread)
XX = DABS(RX(Num) - RX(ID))
 YY = DABS(RY(Num) - RY(ID))
 ZZ = DABS(RZ(Num) - RZ(ID))
 RSQ = XX * XX + YY * YY + ZZ * ZZ
IF (RSQ < RcutB2 .AND. BindOn) 
 if (Bond(Num,ID) == 0) Then
 Btest = UNIF(jsr)
 if (Btest < Bon) Then
 Bond(Num,ID) = 1
 Nbond = Nbond + 1
 End if
 End if
 End if
If (Bond(Num,ID) == 1) Then
 Btest = UNIF(jsr)
 if (Btest < Boff) Then
 Bond(Num,ID) = 0
 Nbond = Nbond - 1
End Do
!$OMP END Do
!$OMP END PARALLEL

This is a code snippet with some of the other computations taken out.  All the variables and whatnot are defines properly, as the code runs fine in serial and in parallel without O2 and O3 optimizations.

12 posts / novo 0
Último post
Para obter mais informações sobre otimizações de compiladores, consulte Aviso sobre otimizações.

ID is declared as private, but is not initilaized before used (defined).

Jim Dempsey

www.quickthreadprogramming.com

All the variables have been defined as evidenced by the program working with the -openmp and -O1 flags but now with the -openmp -O2 or -openmp -O3.  I think that's my part for only posting what I considered the relavent parts of the subroutine.  Here is the complete subroutine in which ID is defined.

SUBROUTINE FORCE (N, Gridon, Cell, seed, jsr, kn, fn, wn,
 & Thread_Num)
 Implicit None
Include 'omp_lib.h'
 Include 'maxarray.inc'
 Include 'system.inc'
C *******************************************************************
INTEGER I, J, K, NUM, II, JJ, KK
 INTEGER ID, CX,CY,CZ, jsr, Thread, Thread_Num, TH
 INTEGER kn(128)
 INTEGER seed(0:Thread_Max-1)
 LOGICAL Gridon, Can_Bind
 REAL*8 RXI, RYI, RZI, FXIJ, FYIJ, FZIJ, FIJ
 REAL*8 FXI, FYI, FZI, RIJ, FFx, FFy, FFz
 REAL*8 XX, YY, ZZ, BX, BY, BZ
 REAL*8 RSQ, RR, BondR, Btest
 REAL*8 FX_Priv(N,Thread_Num)
 REAL*8 FY_Priv(N,Thread_Num)
 REAL*8 FZ_Priv(N,Thread_Num)
 REAL fn(128), wn(128)
 REAL*8 Attract, A, B, C, wtime
C *******************************************************************
If (Gridon) Then
FX(1:N) = 0.0
 FY(1:N) = 0.0
 FZ(1:N) = 0.0
 FX_Priv(1:N,1:Thread_Num) = 0.0
 FY_Priv(1:N,1:Thread_Num) = 0.0
 FZ_Priv(1:N,1:Thread_Num) = 0.0
!$OMP PARALLEL DEFAULT(Shared) PRIVATE(Num,ID,jsr,Thread,CX,CY,CZ,i,j,k,
!$OMP& II,JJ,KK,XX,YY,ZZ,RSQ,Btest,BX,BY,BZ,RR,BondR,FFx,FFy,FFz,TH)
Thread = OMP_GET_THREAD_NUM()
! Thread = 0
 TH = Thread + 1
!$OMP Do SCHEDULE(Static) REDUCTION( + : Nbond)
 Do Num = 1,N
 jsr = seed(Thread)
 CX = INT(RX(Num)/RXcut)+1
 CY = INT(RY(Num)/RYcut)+1
 CZ = INT(RZ(Num)/RZcut)+1
 Do I=CX-1,CX+1
 Do J=CY-1,CY+1
 Do K=CZ-1,CZ+1
 II = i
 JJ = j
 KK = k
C FOR X 
 If (II.LT.1) II=II+NCellx
 If (II.GT.NCellx) II=II-NCellx
C FOR Y
 If (JJ.LT.1) JJ=JJ+NCelly
 If (JJ.GT.NCelly) JJ=JJ-NCelly
C FOR Z
 If (KK.LT.1) KK=KK+NCellz
 If (KK.GT.NCellz) KK=KK-NCellz
ID = Cell(II,JJ,KK)
 Do While (ID.NE.0)
 If (ID <= Num) Then
 ID = Linker(ID) !Don't count same/twice
 Cycle
 End If
 XX = DABS(RX(Num) - RX(ID))
 YY = DABS(RY(Num) - RY(ID))
 ZZ = DABS(RZ(Num) - RZ(ID))
 if (XX.GT.HLx) XX = Lx - XX !P
 if (YY.GT.HLy) YY = Ly - YY !B
 if (ZZ.GT.HLz) ZZ = Lz - ZZ !C
 RSQ = XX * XX + YY * YY + ZZ * ZZ
IF (RSQ < RcutB2 .AND. BindOn) Then
 if (Bond(Num,ID) == 0) Then
 Btest = UNIF(jsr)
 if (Btest < Bon) Then
 Bond(Num,ID) = 1
 Nbond = Nbond + 1
End if
 End if
 End if
If (Bond(Num,ID) == 1) Then
 Btest = UNIF(jsr)
 if (Btest < Boff) Then
 Bond(Num,ID) = 0
 Nbond = Nbond - 1
End if
 End if
If ((Bond(Num,ID) == 1) .OR. (RSQ < BondEQ)) Then
 BX = RX(Num) - RX(ID)
 BY = RY(Num) - RY(ID)
 BZ = RZ(Num) - RZ(ID)
 if (BX.GE.HLx) Then
 BX = BX - Lx
 Elseif (BX.LT.-HLx) Then
 BX = BX + Lx
 End If
 if (BY.GE.HLy) Then
 BY = BY - Ly
 Elseif (BY.LT.-HLy) Then
 BY = BY + Ly
 End If
 if (BZ.GE.HLz) Then
 BZ = BZ - Lz
 Elseif (BZ.LT.-HLz) Then
 BZ = BZ + Lz
 End If
RR = DSQRT(RSQ)
 BondR = RR - BondEQ
 FFx = -Spring*BondR*(BX/RR)
 FFy = -Spring*BondR*(BY/RR)
 if (Zon) FFz = -Spring*BondR*(BZ/RR)
FX_Priv(Num,TH) = FX_Priv(Num,TH) + FFx
 FY_Priv(Num,TH) = FY_Priv(Num,TH) + FFy
 FX_Priv(ID,TH) = FX_Priv(ID,TH) - FFx
 FY_Priv(ID,TH) = FY_Priv(ID,TH) - FFy
 If (Zon) Then
 FZ_Priv(Num,TH) = FZ_Priv(Num,TH)+FFz
 FZ_Priv(ID,TH) = FZ_Priv(ID,TH) - FFz
 End If
End If
 ID = Linker(ID)
 seed(Thread) = jsr
 End Do
 End Do
 End Do
 End Do
FX_Priv(Num,TH) = FX_Priv(Num,TH)+Scale*GAUSS2(jsr,kn,fn,wn)
 FY_Priv(Num,TH) = FY_Priv(Num,TH)+Scale*GAUSS2(jsr,kn,fn,wn)
 IF(Zon) Then
 FZ_Priv(Num,TH)=FZ_Priv(Num,TH)+Scale*GAUSS2(jsr,kn,fn,wn)
 End If
End Do
!$OMP END Do
Do j=1,Thread_Num
!$OMP DO
 Do i=1,N
 FX(i) = FX(i) + FX_Priv(i,j)
 FY(i) = FY(i) + FY_Priv(i,j)
 if (Zon) FZ(i) = FZ(i) + FZ_Priv(i,j)
 End Do
!$OMP END DO
 End Do
!$OMP END PARALLEL
End If
RETURN
 END

Here, UNIF() and GAUSS2() are calls to a random number generators, and some of variables are defined in common statements that are included in the 'include system.inc' statement.

Citação:

Mike escreveu:

All the variables have been defined as evidenced by the program working with the -openmp and -O1 flags

Sorry, but that's not evidence of anything.

Steve - Intel Developer Support

I do not see Nbond declared. Is it in one of your .inc files?

In any event, the REDUCTION( + : Nbond), I believe would belong on the !$OMP PARALLEL, not on the enclosed !$OMP DO

Jim Dempsey

www.quickthreadprogramming.com

Hello,

Yes, Nbond is defined in my system.inc file.  I tired moving the REDUCTION(+:Nbond) into the !$OMP PARALLEL section, but it still produced the same results.  Here is a sample output of my program showing the difference.

Run in serial with -O3:

Time (us)  Nbonds     N

100.0000  13328        5041
200.0000  13474        5041
300.0000  13568        5041
400.0000  13667        5041
500.0000  13707        5041
600.0000  13751        5041
700.0000  13800        5041
800.0000  13844        5041
900.0000  13852        5041
1000.0000 13878       5041

Run using 8 threads and -O1 -openmp

Time (us)  Nbonds    N

100.0000  13348       5041
200.0000  13532       5041
300.0000  13589       5041
400.0000  13701       5041
500.0000  13775       5041
600.0000  13784       5041
700.0000  13814       5041
800.0000  13883       5041
900.0000  13869       5041
1000.0000 13906      5041

Run with 8 Threads and -O3 -openmp

Time (us)  Nbonds     N

100.0000  7499843     5041
200.0000  9781784     5041
300.0000  10867080   5041
400.0000  11468020   5041
500.0000  11824365   5041
600.0000  12042995   5041
700.0000  12178306   5041
800.0000  12265422   5041
900.0000  12319681   5041
1000.0000 12354566  5041

The middle number of the output corresponds to the Nbond variable that I use in the reduction clause.  It is a stochastic variable, so the numbers between the runs should not be the exact same, but they should be close.  When I use the -O3 compiler flag, the output Nbond variable icreases by about 1000?

Save a copy of your code to reference later. Make these edits and give it a try:

integer :: NbondLocal
...
!$OMP PARALLEL DEFAULT(Shared) PRIVATE(Num,ID,jsr,Thread,CX,CY,CZ,i,j,k,  
!$OMP& II,JJ,KK,XX,YY,ZZ,RSQ,Btest,BX,BY,BZ,RR,BondR,FFx,FFy,FFz,TH, NbondLocal) ! the local copy 
!$OMP& REDUCTION( + : Nbond) ! the .inc copy
nBondLocal = 0
...
!$OMP DO SCHEDUAL(Static) ! *** no reduction here
...
NbondLocal = NbondLocal + 1 ! **** replace all nBond with NbondLocal inside the !$OMP DO
!$OMP END DO
Nbond = Nbond + NbondLocal ! accumulate into the reduction variable here
...
!$OMP END PARALLEL ! performs the reduction from inner Nbond to outer Nbond here
 

Jim Dempsey

www.quickthreadprogramming.com

Thank you for the response.  I tried switching to using Nbondlocal within the inner loop.  I am still experiencing the same problems, but this does seem to be a better way of doing things.  I think I have tracked things down to the calls to the random number generator (Btest = UNIF(jsr) in the code).  It seems that when I use the -O1 compiler flag the value of jsr is being changed (as it is changed within the UNIF function) but when I use the -O3 compiler flag, the value of jsr is the same throughout.  Therefore, I am generally not getting into the section of the code where Nbondlocal (or Nbond) is decreased.  Is there a difference in how ifort handles function calls when the -O1 optimization flag is used versus the -O3 optimization flag?

The random number generator function looks like this:

Function UNIF(jsr)
 Implicit None
C ** Fuction to Generate Uniformly Distributed ********************* 
C ** Random Numbers on [0,1] ****************************************
INTEGER jsr, jsr_input
 REAL UNIF
jsr_input = jsr
jsr = ieor (jsr, ishft(jsr, 13))
 jsr = ieor (jsr, ishft(jsr, -17))
 jsr = ieor (jsr, ishft(jsr, 5))
UNIF = 0.5E+00 + 0.2328306E-9 * REAL( jsr_input + jsr )
Return
 End

When I output the value of jsr within the function, it does change from the beginning to the end of the function.  However, when I output the value of jsr before and after the call to the function, the result is jsr is different with the -O1 flag set and jsr is the same when the -O3 flag is set.  I tried declaring jsr within the function with intent(inout), but that did not solve the problem.  I am not sure if this is the exact cause for the value of Nbond to be so greatly increased, but it couldn't help.  Also, this problem (neither the Nbond or the random number generator problem) occur when compiling with gfortran and the -O3 flag.

Thanks,

Mike

You should have disclosed that UNIF were a random number generator.

Try something like this:

!$OMP Do SCHEDULE(Static) REDUCTION( + : Nbond)
Do Num = 1,N
  jsr = seed(0)
  DO I=1,N
    Best = UNIF(jsr)
  END DO

This way, each iteration of the Do Num loop has a unique random seed, that is the same, regardless of which thead or how many threads (including 1 thread)

Jim Dempsey

www.quickthreadprogramming.com

You want reproducability, use a reproducable method.

For performance, you could build (once) an array of N seed values, then re-init using a lookup: jsr=jsr_seed(Num)

Jim Dempsey

www.quickthreadprogramming.com

Thanks for your input, but I am a little confused.  In the code listed above, the array seed is previously initialized with random integers corresponding to the maximum number of threads and then passed to the subroutine.

Thread = OMP_GET_THREAD_NUM()
!$OMP Do SCHEDULE(Static) REDUCTION( + : Nbond)
Do Num = 1, N
jsr = seed(Thread)
...
if (Bond(Num,ID) == 0) Then
Btest = UNIF(jsr)
...
End If
...
End Do

Is this not giving me a unique random seed for each thread?  What seems to be the difference is that with the -O1 flag, when I call UNIF(jsr) with a particular value of jsr, the value of jsr is changed during the call to UNIF(jsr).  However, when I use the -O3 flag, the value of jsr is the same before and after the call to UNIF(jsr) for a given thread (the values of jsr are different among the different threads, but the value of jst for the same thread does not change between successive calls to UNIF(jsr))

The code I provided gives a unique random seed for: Num=1, Num=2, Num=3, ...
and the same random seed for the given Num
no matter which order the Num's are encountered

With 2 threads, the !$OMP DO might have in order of time

Thread(0), Num=1
Thread(1), Num=N/2
Thread(0), Num=2
Thread(0), Num=3 (thread 0 beat thread 1 this time)
Thread(1), Num=N/2+1
...

Additionaly, depending on how the code flows, a differing number of UNIF's may be called by each thread.

The code I provided, gives random sequences, that are consistent with respect of Num, regardless of the order in which the Num is processed.

In -O3, the value of jsr may reside in a register, watching the memory location reserved for jsr will not show you what is in the register (register gets copied only when a save is required). Also, the debugger gets at times gets goofed up with respect to finding variables across a shared/private boundary.

Jim Dempsey

www.quickthreadprogramming.com

Deixar um comentário

Faça login para adicionar um comentário. Não é membro? Inscreva-se hoje mesmo!