Inconsistent reduction of arrays

I have a very large program, parts of which are parallelized using OpenMP, running on a Core 2 Extreme XC8600 and a Core 2 Quad QX6700. The code segment below works most of the time but occasionally gives a wrong result.

============CODE SEGMENT=============
!
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(tnumber,nst,pswk1) REDUCTION(+:rho,tau,currnt,sodens,spinden,kinvden,spincur)
!
tnumber = OMP_GET_THREAD_NUM()
nst = (npmin(iq) - 1) + (tnumber + 1) + (ido - 1)*maxthreads
if (nst > npsi(iq)) then
   stop "processor index not in wavefunction index range"
end if
!
call densit (lpsi, ncolx, ncoly, ncolz, nst, iq, vocc, rho, tau, currnt,&
             sodens, spinden, kinvden, spincur, der1x, der1y, der1z,&
             psi(1,1,1,1,nst,iq), pswk1, itheta, itimrev)
!
!$OMP END PARALLEL
======================================

The arrays in the REDUCTION clause are all calculated in the subroutine from private copies of the psi() array and do not interact with each other.

I have tried ifort 9.1.037, and 9.1.040 on another similarly configured machine. I am using Fedora Core 6 with all the updates.

Thanks


When a sum is large enough to motivate the use of a sum reduction, single precision typically isn't sufficient for satisfactory results, or to avoid visible differences associated with the number of threads. Many large applications take specific actions to maintain consistency; one such action is sketched below.
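One common such action is compensated (Kahan) summation. A minimal self-contained sketch; this is illustrative only, and none of the names come from the program under discussion:

! Compensated (Kahan) summation: carries a running correction term so
! that low-order bits lost in each addition are recovered.
program kahan_demo
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   real(dp) :: x(100000), s, c, y, t
   integer :: i
   x = 1.0e-8_dp
   s = 0.0_dp
   c = 0.0_dp                ! running compensation for lost low-order bits
   do i = 1, size(x)
      y = x(i) - c           ! apply the correction from the previous step
      t = s + y              ! low-order bits of y are lost in this add...
      c = (t - s) - y        ! ...and recovered here
      s = t
   end do
   write(*,*) 'sum =', s
end program kahan_demo

Note that value-unsafe floating-point optimizations can defeat the compensation term, so such code is usually compiled with strict floating-point semantics.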

I forgot to mention that all arrays are double precision. In addition, I am not talking about small deviations but rather large ones. For example, the density rho would integrate to 64.0000 but suddenly becomes 63.0000, and other quantities that integrate to the e-26 level become e-6. Again, some runs which execute the above thousands of times work perfectly; on other occasions (starting the same executable under identical conditions) it suddenly messes up during one iteration. Almost as if it does not do the complete sum.

Thanks

The number of threads available to a parallel region may be less than the maximum number of threads. Use OMP_GET_NUM_THREADS() issued inside the parallel region instead of maxthreads obtained outside of it. This may not be your bug this time, but it may be at a later time. It may, though, goof up your striping of the array being processed by densit, so more work is required to handle the situation where the number of threads available to the parallel region is less than the maximum number of threads.
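A minimal self-contained sketch of querying the team size inside the region (nteam is a hypothetical name, not from the original code):

program team_size_demo
   use omp_lib
   implicit none
   integer :: tnumber, nteam
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(tnumber,nteam)
   tnumber = OMP_GET_THREAD_NUM()
   nteam   = OMP_GET_NUM_THREADS()   ! actual team size for THIS region
   ! stride any per-thread work by nteam, not by a value captured outside
   write(*,*) 'thread', tnumber, 'of', nteam
!$OMP END PARALLEL
end program team_size_demo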

Examine densit to verify the DEFAULT(SHARED) variables lpsi, ncolx, ncoly, ncolz, iq, vocc, der1x, der1y, der1z, psi(array), itheta, and itimrev won't present a problem when called in parallel.

And what bites me now and again is assuming that what looks like stack-local storage is actually in subroutine static storage.

real :: vec(3)

is either on the stack or in static storage depending on compiler options. If densit contains something like that, then consider changing it to

real, automatic :: vec(3)

Then you do not have to worry about the compiler switch that places local arrays on the stack.
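A minimal sketch of the idea (illustrative only, not from densit; the RECURSIVE attribute is a standard-conforming alternative that also forces locals onto the stack):

! Without -auto (or -openmp, which implies it), a plain local array may
! be given static storage and silently shared between threads.
! AUTOMATIC (an Intel extension) or RECURSIVE puts it on the stack.
recursive subroutine scale3(x, y)
   implicit none
   real, intent(in)  :: x
   real, intent(out) :: y
   real :: vec(3)      ! automatic (stack) because the routine is RECURSIVE
   vec = x
   y = sum(vec)
end subroutine scale3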

Jim Dempsey

Thank you very much for your analysis.

I will make a number of changes and post my findings in a couple of days. First, I will define a new private pswk4 array and load it with the segment of the psi() array for each thread. All the other variables are just shared and not changed at all in the subroutine. Other than that, the subroutine has only a few local scalar variables (no arrays). I could use the AUTOMATIC attribute, but to my understanding using -openmp implies -automatic.

Happy New Year!

Well, after many hours of trying to figure out what was wrong, I came back to one of your suggestions. The problem was that the environment variable OMP_DYNAMIC was set to TRUE. Consequently, during the run the system occasionally changed the number of available threads for performance reasons.

Setting OMP_DYNAMIC to FALSE solves the problem.
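The same thing can be done in code, so the fix does not depend on the environment; a minimal sketch using the standard OpenMP routines:

program no_dynamic
   use omp_lib
   implicit none
   ! Disable dynamic adjustment so every parallel region gets the full
   ! requested team, regardless of the OMP_DYNAMIC environment variable.
   call OMP_SET_DYNAMIC(.FALSE.)
   call OMP_SET_NUM_THREADS(OMP_GET_NUM_PROCS())
   write(*,*) 'dynamic adjustment now:', OMP_GET_DYNAMIC()
end program no_dynamic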

Thank you all

Sammy

I caution you against assuming maxthreads are always available to a parallel region. Fix the code now while you are focused on this problem. That way, as you later make improvements to your program, the problem won't resurface and take you or your successor on an unnecessary debugging session.

Jim

For now I put in a check and stop with a comment. The problem is that the logic of dividing up the number of states among threads changes, which means I need to think about this a little more.
Thanks

Look at using SCHEDULE with a chunk size. You can set the chunk size to the total size divided by maxthreads. That way, regardless of the number of available threads, the array is divided up accordingly. If, for example, maxthreads = 4 but you have 3 threads available, then optimally the code still runs 2x faster than serial (3 chunks of 1/4 each in parallel, plus 1 chunk of 1/4 in an additional pass). YMMV.
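A minimal self-contained sketch of this chunking (ntotal stands in for the number of states and is not a name from the original program):

program chunk_demo
   use omp_lib
   implicit none
   integer, parameter :: ntotal = 24      ! hypothetical number of states
   integer :: nst, chunk, maxthreads
   maxthreads = OMP_GET_MAX_THREADS()
   chunk = (ntotal + maxthreads - 1) / maxthreads   ! ceiling division
!$OMP PARALLEL DO SCHEDULE(STATIC,chunk) DEFAULT(SHARED) PRIVATE(nst)
   do nst = 1, ntotal
      write(*,*) 'state', nst, 'on thread', OMP_GET_THREAD_NUM()
   end do
!$OMP END PARALLEL DO
end program chunk_demo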

Your program need not stop if the number of available threads does not equal the number of expected threads.

Jim

I have been implementing your suggestions and ran into some other
problems:

1. The compiler seems to not give an error message if an OMP line is
longer than the maximum line length (say 132). It may be compiling
them correctly, but I believe it should complain.

2. My long loops work fine; however, I have one loop that is just of length
2 and I would like to run it on two processors/threads. The loop is:

!$OMP PARALLEL NUM_THREADS(2) DEFAULT(SHARED) PRIVATE(iq,tpsi)
!$OMP DO SCHEDULE(static)
do iq = 1, 2
call schmid (iq, lpsi, ncolx, ncoly, ncolz, npsi, npmin, spnorm, psi, tpsi, itimrev, wxyz)
end do
!$OMP END DO
!$OMP END PARALLEL

This always segfaults but this one always works:

nthreads = 2
call OMP_SET_NUM_THREADS(nthreads)
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(tnumber,iq,tpsi)
tnumber = OMP_GET_THREAD_NUM()
iq = tnumber + 1
if(iq > 2) stop "Wrong iq for schmid"
call schmid (iq, lpsi, ncolx, ncoly, ncolz, npsi, npmin, spnorm, psi, tpsi, itimrev, wxyz)
!$OMP END PARALLEL

Is the NUM_THREADS() clause implemented? Because this runs correctly if
I use gfortran from gcc-4.1.1.
Thanks

Umar,

1. When I am experiencing too long an !$OMP statement it is usually due to too many variables appearing on SHARED or PRIVATE. When the statement gets very long (e.g. wider than fits in the edit window of the IDE) it is a good time to clean up the code.

If SHARED is causing the problem with too many references to stack-local variables, then it becomes a good time to create a derived type for a container and use that to pass your arguments (see the sketch after these three cases).

If SHARED is causing the problem with too many references to global variables (in modules or COMMON), then it becomes a good time to create a subroutine out of the code enclosed in the parallel section.

If PRIVATE is causing the problem with too many references, then it is a good time to create a subroutine out of the code enclosed in the parallel section.
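A minimal sketch of the derived-type container (all names here are hypothetical, not from the original program):

module densit_args_mod
   implicit none
   ! Hypothetical container bundling the many shared arguments so the
   ! !$OMP directive only needs to name a single variable.
   type :: densit_args
      integer :: lpsi, ncolx, ncoly, ncolz, itheta, itimrev
      real(kind(1.0d0)), allocatable :: der1x(:,:), der1y(:,:), der1z(:,:)
   end type densit_args
end module densit_args_mod

The parallel region can then say SHARED(args) for a single variable of type(densit_args) instead of listing a dozen names.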

Note, on !$OMP DO and !$OMP PARALLEL, the Fortran DO and END DO are not part of the enclosed parallel section; instead they are part of the adjacent (and enclosing) !$OMP statements.

2. Try using the Chunk parameter to SCHEDULE on the !$OMP DO.

Using a chunk of 1 has each thread gulping in a group of 1 index of the do loop.

!$OMP PARALLEL NUM_THREADS(2) DEFAULT(SHARED) PRIVATE(iq,tpsi)
!$OMP DO SCHEDULE(static, 1)
do iq = 1, 2
or
 nthreads = 2
call OMP_SET_NUM_THREADS(nthreads)
!$OMP PARALLEL DO SCHEDULE(static,1) DEFAULT(SHARED) PRIVATE(iq,tpsi)
do iq = 1, 2

Jim Dempsey

Jim,

I may have found a compiler bug, or I am missing something in my understanding.
I can reproduce this with a sample program, which I paste below. When you
compile and run it, the "write(*,*) iq,nst" from the first two parallel sections
shows iq as 0, where it should clearly be 1 or 2.

On the other hand, if you go down to the last parallel section and change
all iq's to ix, then the top two parallel sections give the correct values for
iq. Here is the test program:

program test
implicit none
integer :: OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
integer :: OMP_GET_MAX_THREADS, OMP_GET_NUM_PROCS
integer :: nthreads, tnumber
integer :: nprocs, maxthreads
integer :: iq, nst, ix
!-----------------------------------------------------------------------
! determine/set the number of processors on each node for openmp
!-----------------------------------------------------------------------
call OMP_SET_DYNAMIC(.FALSE.)
nprocs = OMP_GET_NUM_PROCS()
call OMP_SET_NUM_THREADS(nprocs)
maxthreads = OMP_GET_MAX_THREADS()
if (nprocs /= maxthreads) then
write(*,*) "WARNING: Number of processors different than number of threads"
endif
!
iq = 1
!
!$OMP PARALLEL NUM_THREADS(maxthreads) DEFAULT(SHARED) PRIVATE(nst)
!$OMP DO SCHEDULE(static,1)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END DO
!$OMP END PARALLEL
!
iq = 2
!
!$OMP PARALLEL NUM_THREADS(maxthreads) DEFAULT(SHARED) PRIVATE(nst)
!$OMP DO SCHEDULE(static,1)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END DO
!$OMP END PARALLEL
!
!$OMP PARALLEL NUM_THREADS(2) DEFAULT(SHARED) PRIVATE(iq)
!$OMP DO SCHEDULE(static,1)
do iq = 1, 2
write(*,*) iq
end do
!$OMP END DO
!$OMP END PARALLEL
!
end program test

I haven't run your test but I think I see the problem.

The !$OMP PARALLEL...DEFAULT(SHARED) declares a parallel block where iq is shared. However, !$OMP DO creates a new (nested) parallel block where iq becomes whatever the default is. So iq is "shared" for zero statements between the nested parallel sections, and it appears a private copy of iq is created for the DO. The !$OMP DO should have SHARED(iq) or DEFAULT(SHARED).

Jim Dempsey

I ran your test in W_FC_C_9.1.032 in Windows XP x32 and this does seem to be a code generation problem. I rewrote the code as

program test
implicit none
integer :: OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
integer :: OMP_GET_MAX_THREADS, OMP_GET_NUM_PROCS
integer :: nthreads, tnumber
integer :: nprocs, maxthreads
integer :: iq, nst, ix
!-----------------------------------------------------------------------
! determine/set the number of processors on each node for openmp
!-----------------------------------------------------------------------
call OMP_SET_DYNAMIC(.FALSE.)
nprocs = OMP_GET_NUM_PROCS()
call OMP_SET_NUM_THREADS(nprocs)
maxthreads = OMP_GET_MAX_THREADS()
if (nprocs /= maxthreads) then
write(*,*) "WARNING: Number of processors different than number of threads"
endif
!
iq = 1
!
!$OMP PARALLEL DO DEFAULT(SHARED) COPYIN(iq) PRIVATE(nst)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END PARALLEL DO
!
iq = 2
!
!$OMP PARALLEL DO DEFAULT(SHARED) COPYIN(iq) PRIVATE(nst)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END PARALLEL DO
!
!$OMP PARALLEL NUM_THREADS(2) DEFAULT(SHARED) PRIVATE(iq)
!$OMP DO SCHEDULE(static,1)
do iq = 1, 2
write(*,*) iq
end do
!$OMP END DO
!$OMP END PARALLEL
!
end program test
And the disassembly snippets:
iq = 1
0040110A C7 45 D0 01 00 00 00 mov dword ptr [ebp-30h],1 
...
write(*,*) iq,nst
00401281 C7 85 38 FF FF FF 00 00 00 00 mov dword ptr [ebp+FFFFFF38h],0 
0040128B 8B 45 D0 mov eax,dword ptr [ebp-30h] 
The iq in both sections is referenced at [ebp-30h]; ebp is the stack frame pointer.
What is not shown here is that ebp is different for the main program level
and the inner OMP DO level (as it should be). The offset (-30h) shouldn't be the same
unless by chance the variable placements were forced to be at the same offsets.
I reworked the program to the following
program test
use omp_lib
implicit none
integer :: iq, nst, ix
!-----------------------------------------------------------------------
! determine/set the number of processors on each node for openmp
!-----------------------------------------------------------------------
!
iq = 1
!
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(nst)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END PARALLEL DO
!
call do1(iq)
iq = 2
!
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(nst)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END PARALLEL DO
!
!$OMP PARALLEL NUM_THREADS(2) DEFAULT(SHARED) PRIVATE(iq)
!$OMP DO SCHEDULE(static,1)
do iq = 1, 2
write(*,*) iq
end do
!$OMP END DO
!$OMP END PARALLEL
!
end program test

subroutine do1(iq)
integer :: iq, nst
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(nst)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END PARALLEL DO
end subroutine do1

The code of the 1st do loop was copied to a subroutine, which is called
following the 1st loop. The subroutine performs the
same statements, except that iq is a dummy argument and nst
is a stack-local variable of the subroutine.
The subroutine produces the correct results. Therefore it appears
that the compiler is producing bad code.
I performed an additional test changing the !$OMP PARALLEL DO to
!
!$OMP PARALLEL DO SHARED(iq) COPYIN(iq) PRIVATE(nst)
do nst = 1, 24
write(*,*) iq,nst
end do
!$OMP END PARALLEL DO
!
call do1(iq)
Now both produce the correct result.
It seems like there is a problem with DEFAULT, alone or in
combination with COPYIN.
The following produces incorrect results:
!$OMP PARALLEL DO DEFAULT(SHARED) COPYIN(iq) PRIVATE(nst)
whereas the following produces correct results:
!$OMP PARALLEL DO SHARED(iq) COPYIN(iq) PRIVATE(nst)

I suggest filing an issue with premier support.

I haven't tested this on 9.1.033. I have that installed on my notebook; I will see if I can test it tonight.

***

Note the USE OMP_LIB

You really need to use the USE, as the interface blocks it provides are required for some of the functions to work.
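For example, the explicit integer declarations of the OMP_* functions in the test programs above can be replaced by the module; a minimal sketch:

program omp_lib_demo
   use omp_lib        ! supplies proper interfaces for all OMP_* routines
   implicit none
   integer :: nprocs, maxthreads
   call OMP_SET_DYNAMIC(.FALSE.)
   nprocs = OMP_GET_NUM_PROCS()
   call OMP_SET_NUM_THREADS(nprocs)
   maxthreads = OMP_GET_MAX_THREADS()
   write(*,*) 'procs:', nprocs, ' max threads:', maxthreads
end program omp_lib_demo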

Jim Dempsey

I can reproduce these results all the way up to 9.1.040, on both
i386 and em64t versions.

Two more observations for clarification:

1. The culprit is the use of iq as the do-loop variable in the last parallel
section. If we remove the last parallel section, then the first two
sections print the correct value of iq.

2. The bad test codes give different answers for -O0 and -On for n>0.

So in a sense your sample program was using iq as both shared and private within the scope of the !$OMP PARALLEL.

I don't know if this is a bug or a violation of programming rules. I do think the compiler should at least issue a warning "Ambiguous use of SHARED or PRIVATE".

My personal opinion is a private iq should have been created for the last loop.

Jim

I filed this as a bug with premier support. I think PRIVATE and
SHARED should only apply to the parallel sections, and they mostly seem
to. The exception is this context, where the clause applies to a do-loop index and the
index name is used as a regular variable at some other place in the code.
I believe that if it were used as a do-loop index again at the other locations it would
be OK.

Anyway, we'll see what the folks at Intel say.
Thank you for your help.
