Can threadprivate custom type variables be changed from within a parallel region?

Hello world,

the title already states what I would like to find out. I have a global variable of a custom type which is used by OpenMP and declared threadprivate.

Within the parallel loop, the code may find the need to dynamically reload data from disk. In this case the code currently extends the data in an allocatable array of the custom type.

This seemed to have been working for some time. However, after I added some code for extra analysis following the parallel region, I suddenly get access violations from time to time (not every run), and I think this might be due to growing the real array within the custom type.

 

Even if this is non-standard (and I should probably code around it), I am still wondering why this would be a violation: the results from the parallel loop look as expected, and once the omp parallel do is finished, I would have thought that memory belonging to a hibernating thread would be freed.

Thanks for any help you can provide!


Can you show:

What is threadprivate
What you declare as shared or private (or default) when you enter the parallel region
What the code after the parallel region is doing

Taking a flying leap of a guess, assuming no glaring coding problems, an un-glaring problem may be this: the code preceding a loop sets an initial size for the threadprivate allocatable objects (to be allocated in the subsequent parallel region). After the parallel region, the main thread expands the size to use (and possibly expands the main thread's threadprivate allocation); the loop cycles and the main thread has the new size, but the other team members do not.

Jim Dempsey


Hi Jim,

thanks for taking the time to respond. I was trying to avoid a detailed explanation since the overall code is very large, and in essence the question in the title should allow for a clear answer according to the specification (at least I thought so).

The behavior you mention would fit my observations. Nevertheless, I have not been able to find any documentation about this, neither in the specification nor on the Intel pages. I know that typically one would not want to do this, and the way the code is now, the variables aren't changed from within a parallel context anymore, but I was still hoping for some explanation.

 

I can provide some further details though:

In the initialization phase the tool has to read in some files of data (astrodynamics tables). The routines were coded by others, but essentially, since the data is linked to days, an array element of a custom type is created for each line. To access the data via the integer day it belongs to (modified Julian date), the array is allocated with boundaries such as 58000:58900.

These allocatable arrays are part of a larger custom type, which is global in scope and declared threadprivate.

The initialization of the variable is handled via the standard copyin clause in the case of a parallelized loop (a classical omp parallel do).
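In sketch form (a minimal tGlobal with only the EOP component; the names and the val field are stand-ins for illustration), the setup is roughly:

module Mod_TP
    type tEOP
        real(8) :: val                    ! stands in for the real record contents
    end type tEOP
    type tGlobal
        type(tEOP), allocatable :: EOP(:)
    end type tGlobal
    type(tGlobal) :: global
    !$omp threadprivate(global)
end module Mod_TP

! ... in the module containing the loop:
use Mod_TP, only: global
...
allocate(global%EOP(58000:58900))         ! initialization in sequential context
!$omp parallel do default(none) copyin(global)
do i = 1, n
    ! each thread starts from a copy of the master thread's "global"
end do
!$omp end parallel do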

Now what happened and led to the bug mentioned here: https://stackoverflow.com/questions/60799041/fortran-openmp-ghost-thread...

During the loop (which operates on other data), it was found that not enough input lines had been read in at the beginning of the run. In this case the data is automatically reloaded from file. For my specific application, I was the first to call a routine with this functionality in parallel. Hence the threadprivate global variable was modified from within the threads.

This worked perfectly (at least I thought so; in fact I did not even notice that data was dynamically reloaded).

Now I added some further data processing after the parallel loop, and surprisingly my code started to behave weirdly. I received the very strange access violation that I asked about on Stack Overflow and debugged for days. In the end I found that everything was back to normal simply by avoiding any change of the threadprivate variable in the first parallel loop.

-> Hence the hopefully generic question:

Can a global, threadprivate module variable that has a shape like this:

Type tGlobal
   ... other stuff

    Type(tEOP), allocatable, dimension(:)             :: EOP                  
   ... other stuff
End Type tGlobal

where EOP contains only scalar integers and double-precision reals, be modified, i.e. the array dimensions changed, from a function called in a parallel context?


>>the variables aren't changed from within a parallel context,

Then they need not be threadprivate.

>>initialization of the variable is handled via the standard copyin clause

*** copyin is not used with declared threadprivate. Doing so makes a thread-local stack copy of its own threadprivate data. IOW, inside the parallel region the local stack copy of its own threadprivate data would be updated, but not the threadprivate data itself. On exit from the parallel region, the updated local stack copy of its own threadprivate data would be lost.

>>For my specific application I was the first to call a routine that has this functionality in parallel. Hence the threadprivate global variable was modified from within the threads.

SOP is to read the data from the file into a shared array, either from outside the parallel region .OR. from the main thread (or single thread) with a !$OMP BARRIER following the read-in.
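A sketch of that pattern (sharedEOP is an invented name; the allocate/assignment stands in for the OPEN/READ/CLOSE):

module eop_share
    real(8), allocatable :: sharedEOP(:)
end module eop_share

program sop_demo
    use eop_share
    implicit none
    !$omp parallel default(shared)
    !$omp master
    allocate(sharedEOP(58000:58100))   ! stand-in for the file read-in
    sharedEOP = 0d0
    !$omp end master
    !$omp barrier                      ! read-in complete, now visible to all threads
    ! ... all threads may reference sharedEOP here ...
    !$omp end parallel
end program sop_demo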

>>Can a global, threadprivate modle variable that has a shape like this:

An instance of a user type can be threadprivate. The contained allocatable array descriptors would be threadprivate as well.

*** If your master thread (or single thread) is reading the input data into its own threadprivate array, the other threads will not see it.

Jim Dempsey


Additional information.

Likely the main thread's threadprivate data would be copied into the other team members' stack copies of their own threadprivate array.

This data would be used in the visible scope of the parallel region **** but not by same-named threadprivate arrays in called functions/subroutines. Those would reference the declared threadprivate data and not the stack copy of their own threadprivate array at the call level of the !$OMP PARALLEL...

Jim Dempsey


Maybe I abstracted too much. The overall global type contains parts that are modified in parallel regions. Therefore they need to be threadprivate. I am now just avoiding altering the internal EOP array. If I detect the need to reload data, I simply do not write it back to the threadprivate global variable but to a local one and continue the iteration with this value. Also, I increased the timeframe for which data is loaded initially. Hence, for this sub-problem I am not modifying the variables anymore, yet they need to be threadprivate.

The copyin is only used in another module, which has the use clause to load the module variable and then uses the copyin to initialize the sub-threads. The loading itself is handled using a critical section.
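In sketch form, the guarded load looks roughly like this (the critical-section name EOP_reload is invented; the call uses the read_EOP interface shown further down):

!$omp critical (EOP_reload)
call read_EOP(EOPfile=global%EOPfile, start_time=start_time, &
              end_time=end_time, EOP_out=global%EOP)
!$omp end critical (EOP_reload)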

As mentioned earlier, the program did its job as originally coded. Then I added further subroutines for data analysis, and it started to act strangely. Sometimes it worked, sometimes it did not and crashed at a random allocate statement. So I had memory corruption going on, despite the fact that the module variable was threadprivate.

As you note, though, once we go into details, things are easily misunderstood or poorly described. I only noticed that whenever the threadprivate copy of the module variable is modified, some hinky things happen after the parallel loop, even though that part is not within a parallel context anymore!


I ran a test program, and interestingly, the behavior has changed; it may be a bug fix or a feature enhancement. This is what I learned, with tp as a threadprivate variable in a module:

!$omp parallel private(tp)
... A SHARABLE or THREADPRIVATE entity is not permitted in a PRIVATE, FIRSTPRIVATE, LASTPRIVATE, SHARED or REDUCTION clause.   [TP]

!$omp parallel copyin(tp)
This copies into the threadprivate copy of the module variable tp

!$omp parallel private(i) ! i is not threadprivate, IOW local or dummy
{ here i is a private variable to each thread

Jim Dempsey

 

 


Hi Jim,

in my case the variable, call it tp, is defined in module Mod_TP.

The module containing the parallel loop is e.g. Mod_Work, which then has a

use Mod_TP, only: tp

The parallel loop then has a default(none) clause and only the copyin(tp). The error message from your first test is known to me: a threadprivate variable, as the name states, has to be private to the thread. Anything else would not make sense.

This way the program compiled perfectly, and it also worked until the loop was finished. But: any allocate after the loop may have caused a sudden stop from memory corruption caused by changing the array dimensions (growing, in my case) within the custom type tp.

Long story short: there is no compiler warning, and any test program should not stop after the loop but should perform further allocations/deallocations afterwards.

Since my program sometimes worked and sometimes not, I guess that one needs some "luck" for the machine to run into the corrupted area.

 

I guess that the following might be happening:

In my situation some thread may change the memory dimensions of the threadprivate variable. Probably this works fine for any thread except the master thread. If, however, the memory growth takes place in the master thread, then this has an impact on the memory layout after the parallel loop and causes memory corruption. Could this be the case?

Otherwise I do not see why this, even if it is bad style, should not be supported, because if the value of the threadprivate variable after the parallel loop is guaranteed to be equal to the one before entering the parallel loop, then there should not be any memory corruption afterwards.


>>The error message from your first test is known to me: a threadprivate variable, as the name states, has to be private to the thread. Anything else would not make sense.

The distinction I was trying to make is between:

integer :: tp
!$omp threadprivate(tp)

and

integer :: iThread
...
!$omp parallel private(iThread)

Within the parallel region, both tp and iThread are private with respect to each thread. However, only one is declared as threadprivate. The declared threadprivate variables have persistence between parallel regions, whereas private(...) variables do not.
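A minimal example of that difference (assuming an unchanged thread team between the two regions, which is what persistence requires):

program persistence
    use omp_lib
    implicit none
    integer, save :: tp = -1
    integer :: iThread
    !$omp threadprivate(tp)

    !$omp parallel private(iThread)
    iThread = omp_get_thread_num()
    tp = iThread              ! stored in this thread's persistent copy
    !$omp end parallel

    ! iThread's private copies are gone here, but each tp copy persists:
    !$omp parallel
    print *, omp_get_thread_num(), tp
    !$omp end parallel
end program persistence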

Maybe I am too picky about semantics?

Can you show the affected user defined type? (you can change names to obscure the purpose if you wish)

>>In my situation some thread may change the memory dimensions of the threadprivate variable. Probably this works fine for any thread except the master thread. If, however, the memory growth takes place in the master thread, then this has an impact on the memory layout after the parallel loop and causes memory corruption. Could this be the case?

Supposition:

1) initial size determined for UDT enclosed array
2) parallel region to perform threadprivate allocations | end parallel region
3) loop
4) parallel region with copyin(yourTypeVar)
5) ... | master thread modifies size of its yourTypeVar%array | others do not
6) end parallel region
7) end loop

In the above case, the size of the master thread's threadprivate copy of yourTypeVar%array was enlarged, whereas the other threads' threadprivate copies of yourTypeVar%array were not enlarged (or not to the same extent). On the next iteration of the loop, the source and target arrays did not have the same size; however, the copyin(yourTypeVar%array) assumed (required) the arrays to be of the same size (or at least target >= source).

It is not clear whether Reallocate Left Hand Side rules apply to copyin.

The above is supposition.

In the OpenMP 5.0 spec (Fortran-specific part):

"If the list item is a polymorphic variable with the ALLOCATABLE attribute, the behavior is unspecified."

I could not locate a reference to a user defined type containing an allocatable.

Jim Dempsey


Hi Jim,

I like your 7-step example. In fact, it matches my initial question, except that in my case the original allocation (step 2) takes place in a sequential context.

But you mention that on the next iteration the sizes are different, which raises the question:

I thought that OpenMP uses a thread pool, so once the work of one thread is finished and a new iteration is assigned to it, there is no additional copy-in step. If there is, then the question is indeed whether the reallocate-left-hand-side rule applies.

Also I have only been growing the array. Since the data is indexed by date, the growing does not shift any indices, even if the threadprivate globals persisted between thread iterations.

I would like to stress again that I ran massive simulations on 36 threads over a day using this setup (without knowing that the array was sometimes enlarged in some threads). It always worked; however, the code after the massive parallel loop contained no, or maybe only 1-2, allocate statements. Now that I added another post-processing step, I started to observe the memory corruption. So, purely from observation, it seems that it works while the parallel region is active, but once it exits, there is corruption.

Regarding the overall type, this is it, with different names and without comments:

Type tGlobal
    integer(i4)                                       :: a            
    Type(tEOP), allocatable, dimension(:)             :: EOP  !-> this is the allocatable array which got changed                
    Character(len=48)                                 :: b              
    Character(len=48)                                 :: c
    Character(len=48)                                 :: d
    Character(len=48)                                 :: e   
    integer(i4)                                       :: f            
    logical                                           :: g  
    logical                                           :: h 
    integer(i4)                                       :: i       
    logical                                           :: j = .False.
    logical                                           :: k
    logical                                           :: l
    logical                                           :: m
    logical                                           :: n
    logical                                           :: o
    logical                                           :: p
    real(dp)                                          :: q
    real(dp)                                          :: r
    integer(i4)                                       :: s
    integer(i4)                                       :: t
    real(dp)                                          :: u
    real(dp)                                          :: v
    integer(i4)                                       :: w
    integer(i4)                                       :: x
End Type tGlobal

 


I wrote a test program:

!  ThreadPrivateGoodBad.f90 
module sample
type tp_t
    integer, allocatable :: a(:)
end type tp_t

type(tp_t) :: tp
!$omp threadprivate(tp)
end module sample
    
program ThreadPrivateGoodBad
    use sample
    use omp_lib
    
    implicit none

    integer :: initialSize, i
    
    call omp_set_num_threads(2)
    
    initialSize = 10
    
    ! preallocate workspace
    !$omp parallel
    allocate(tp%a(initialSize))
    !$omp end parallel
    
    do i=1,size(tp%a)
        tp%a(i) = i
    end do
    
    print *,"region without expansion"
    !$omp parallel copyin(tp)
    print *, omp_get_thread_num(), tp%a
    !$omp end parallel
    
    do i=1,size(tp%a)
        tp%a(i) = i*10
    end do
    
    print *,"region with expansion by master"
    !$omp parallel copyin(tp)
    print *, omp_get_thread_num(), tp%a
    !$omp master
    call bigger
    !$omp end master
    !$omp end parallel

    print *,"region after expansion by master"
    !$omp parallel copyin(tp)
    print *, omp_get_thread_num(), tp%a
    !$omp end parallel

end program ThreadPrivateGoodBad
    
subroutine bigger
    use sample
    integer, allocatable :: a(:)
    allocate(a(size(tp%a) + 10)) ! local a is larger
    a(1:size(tp%a)) = tp%a ! copy
    a(size(tp%a)+1:size(a)) = -1 ! mark remainder as unused
    ! local a should be automatically deallocated here
end subroutine bigger

Output:
 region without expansion
           0           1           2           3           4           5
           6           7           8           9          10
           1           1           2           3           4           5
           6           7           8           9          10
 region with expansion by master
           1          10          20          30          40          50
          60          70          80          90         100
           0          10          20          30          40          50
          60          70          80          90         100
 region after expansion by master
           1          10          20          30          40          50
          60          70          80          90         100
           0          10          20          30          40          50
          60          70          80          90         100

This illustrates that, for user defined types containing allocatables, the allocatable got enlarged as part of the copyin.

Jim Dempsey


Your edit is interesting, but I am not sure how to relate it exactly. Let's assume that nothing is changed within a parallel region. Then I have no problem at all using the tGlobal-type variable in a copyin clause. Everything works, and the data within the threads is initialized using the master thread's data.

From the way I read the 5.0 spec (I am using OpenMP 4.5, btw), this sounds like we cannot have any type containing an allocatable sub-component in a copyin clause, which would be shocking.

It looks, though, as if changing the array is undefined behavior. Technically speaking, however, is there anything that prohibits this from becoming a feature request?


In your case, you have a nested UDT.

What is your Type(tEOP)?

Jim Dempsey


>>this sounds like we cannot have any type containing an allocatable sub-component in a copyin clause

The spec stated polymorphic; a UDT alone is not polymorphic.

Jim


Now we are typing at the same time. Great test program!

I see that you pre-allocated the space for each thread's array copy. I did not do that (I thought it is only required if the main object is allocatable, and not one of its sub-components). But maybe this was already the issue?

Your code demonstrates that the master thread's changes persist after the parallel region. Interestingly, your code does not result in an access violation. So I wonder if this is because I did not do the allocation of the array component in a parallel section beforehand, or if there is another reason (maybe your code and the number of threads are too small for the issue to show up? Or maybe one should define the array in tp to be of a custom type as well?)

You are right about the UDT (= user defined type) and the polymorphism, obviously!

-> So your example works because it is not a UDT, but an integer array?

tEOP looks like this and contains the astrodynamics data relating to a certain day:

Type tEOP
    integer(i4)                               :: a               
    integer(i4)                               :: b                 
    integer(i4)                               :: c                
    integer(i4)                               :: d                  
    integer(i4)                               :: e                  
    real(dp)                                  :: f                   
    real(dp)                                  :: g                   
    real(dp)                                  :: h                 
    real(dp)                                  :: i                  
    real(dp)                                  :: j                 
    real(dp)                                  :: k             
    real(dp)                                  :: l                  
    real(dp)                                  :: m                  
    integer(i4)                               :: n                  
    real(dp)                                  :: o            
    real(dp)                                  :: p            
    real(dp)                                  :: q          
    real(dp)                                  :: r           
    real(dp)                                  :: s          
    real(dp)                                  :: t      
    real(dp)                                  :: u           
    real(dp)                                  :: v           
End Type tEOP

 


This seems to work:

!  ThreadPrivateGoodBad.f90 
module sample
integer, parameter :: i4 = 4
integer, parameter :: dp = 8

Type tEOP
    integer(i4)                               :: a               
    integer(i4)                               :: b                 
    integer(i4)                               :: c                
    integer(i4)                               :: d                  
    integer(i4)                               :: e                  
    real(dp)                                  :: f                   
    real(dp)                                  :: g                   
    real(dp)                                  :: h                 
    real(dp)                                  :: i                  
    real(dp)                                  :: j                 
    real(dp)                                  :: k             
    real(dp)                                  :: l                  
    real(dp)                                  :: m                  
    integer(i4)                               :: n                  
    real(dp)                                  :: o            
    real(dp)                                  :: p            
    real(dp)                                  :: q          
    real(dp)                                  :: r           
    real(dp)                                  :: s          
    real(dp)                                  :: t      
    real(dp)                                  :: u           
    real(dp)                                  :: v           
End Type tEOP

Type tGlobal
    integer(i4)                                       :: a            
    Type(tEOP), allocatable, dimension(:)             :: EOP  !-> this is the allocatable array which got changed                
    Character(len=48)                                 :: b              
    Character(len=48)                                 :: c
    Character(len=48)                                 :: d
    Character(len=48)                                 :: e   
    integer(i4)                                       :: f            
    logical                                           :: g  
    logical                                           :: h 
    integer(i4)                                       :: i       
    logical                                           :: j = .False.
    logical                                           :: k
    logical                                           :: l
    logical                                           :: m
    logical                                           :: n
    logical                                           :: o
    logical                                           :: p
    real(dp)                                          :: q
    real(dp)                                          :: r
    integer(i4)                                       :: s
    integer(i4)                                       :: t
    real(dp)                                          :: u
    real(dp)                                          :: v
    integer(i4)                                       :: w
    integer(i4)                                       :: x
End Type tGlobal

type(tGlobal) :: tp
!$omp threadprivate(tp)
end module sample
    
program ThreadPrivateGoodBad
    use sample
    use omp_lib
    
    implicit none

    integer :: initialSize, i
    
    call omp_set_num_threads(2)
    
    initialSize = 10
    
    ! preallocate workspace
    !$omp parallel
    allocate(tp%EOP(initialSize))
    !$omp end parallel
    
    do i=1,size(tp%EOP)
        tp%EOP(i)%a = i
    end do
    
    print *,"region without expansion"
    !$omp parallel copyin(tp)
    print *, omp_get_thread_num(), tp%EOP(1)%a, size(tp%EOP)
    !$omp end parallel
    
    do i=1,size(tp%EOP)
        tp%EOP(i)%a = i*10
    end do
    
    print *,"region with expansion by master"
    !$omp parallel copyin(tp)
    print *, omp_get_thread_num(), tp%EOP(1)%a, size(tp%EOP)
    !$omp master
    call bigger
    !$omp end master
    !$omp end parallel

    print *,"region after expansion by master"
    !$omp parallel copyin(tp)
    print *, omp_get_thread_num(), tp%EOP(1)%a, size(tp%EOP)
    !$omp end parallel

end program ThreadPrivateGoodBad
    
subroutine bigger
    use sample
    Type(tEOP), allocatable :: EOP(:)
    allocate(EOP(size(tp%EOP) + 10)) ! local EOP is larger
    EOP(1:size(tp%EOP)) = tp%EOP
    EOP(size(tp%EOP)+1:size(EOP))%a = -1 ! mark remainder as unused
    tp%EOP = EOP ! copy with rlhs
    ! local EOP should be automatically deallocated here
end subroutine bigger

 region without expansion
           0           1          10
           1           1          10
 region with expansion by master
           1          10          10
           0          10          10
 region after expansion by master
           1          10          20
           0          10          20

Jim Dempsey


If your array expansion is due to an array expression that creates a stack temporary array, the crash may be due to stack overflow. You can mitigate this with explicit programming (create your own temporary on the heap).

Jim Dempsey


Jim, thanks a lot for your effort!

So, I think we can agree on:

- Your second test shows that it apparently works, yet it is undefined behavior, right? So I guess one cannot rely on this to work?

In my case this also worked. But when I put lots of data post-processing computations in, any allocatable could suddenly give me an access violation.

If you open up the Stack Overflow thread, you see that the memory corruption resulted in an arbitrary allocate statement of sequential (non-parallel) code attempting to load OMP-based allocation code. The final crashes were always within frontend.cpp of OMP, which apparently uses TBB somehow, as this file belongs to TBB.

The EOP reloading code used to look like this (executed in a critical section):

call read_EOP(EOPfile=global%EOPfile, start_time=start_time, end_time=end_time,&
                      EOP_out=global%EOP)

So it would directly overwrite the array component. Now I use a local copy of the EOP array for this, and only write it back to the global variable if OMP_IN_PARALLEL() returns .false.
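In sketch form (localEOP is the local copy mentioned above; omp_in_parallel comes from omp_lib):

Type(tEOP), allocatable :: localEOP(:)
...
call read_EOP(EOPfile=global%EOPfile, start_time=start_time, &
              end_time=end_time, EOP_out=localEOP)
if (.not. omp_in_parallel()) then
    global%EOP = localEOP   ! write back only in sequential context
end if
! ... continue the current iteration with localEOP ...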


>>yet it is undefined behavior, right?

UDT behavior is defined; polymorphic UDT is undefined.

IOW, because COPYIN does not have an optional ",MOLD=xxxx" argument, it knows not how to perform the copy for a polymorphic type. Should there be such an option, it could select the appropriate (MOLD) type from the polymorphic type. Your tGlobal type is not polymorphic; it should be fine to copyin.

RE: EOP reloading

I was not aware that your reloading came from file I/O.

Is your OPEN, READ(s), CLOSE within read_EOP protected by an !$omp critical section?

If not, then the file position pointer could get bunged up, and/or your OPEN may be using the same unit number for all threads, ...

Jim Dempsey


Hi Jim,

as mentioned above (in parentheses), there is a named omp critical block around the read_EOP routine, so this should be fine.

I am a bit confused now. What do we know?

1) The copyin of the type is fine

2) Your example shows that growing from within is possible and that, if the master thread changes the dimensions of the array, the change persists after the loop

We do not know:

3) Does the change persist if any other thread changes the array dimensions?

4) Even though the loop execution looks fine, there could be memory corruption that has not been revealed?


>>3) Does the change persist if any other thread changes the array dimensions?

No. The copyin will use the master thread's copy and values.

You may be able to use associate(tp0 => tp) outside the parallel region such that all threads have access to the master thread's tp inside the parallel region.
But be careful about how you modify the data; changes could be done between barriers within the parallel region.
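In sketch form (tp0 is an invented associate-name):

associate (tp0 => tp)     ! binds the master thread's tp before the region
    !$omp parallel
    ! every thread can read tp0 (the master's data) here, while tp still
    ! refers to each thread's own threadprivate copy
    !$omp end parallel
end associate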

>>4) Even though the loop execution looks fine, there could be memory corruption that has not been revealed?

Full runtime checks in Debug build indicate no corruption.

Jim Dempsey

 


Thanks for answering 3). Regarding 4): the runtime checks also showed no corruption for me. In fact, I even fired up Intel Inspector, and the most expensive threading check showed no problem for a run where the program went through. In a second attempt it just crashed, and then Inspector reported corrupt data.

The Inspector memory check has been running for 90 hours, but has not shown any progress for a day (it seems stuck at the end of the parallel loop). So I guess that the runtime checks alone are no guarantee that no memory corruption has happened.

Or, asked the other way around: do you think that growing the threadprivate variable is defined behavior? If so, I guess there should have been no memory corruption, and independently of whether the master thread increased the array or not, the indexing would still be valid, as it is via the integer day.

 


I think at this time we are spinning our wheels. A look at the complete code may be in order.

All solutions are obvious once you see the solution.

Jim


It seems so... Great quote, by the way. Do you know the original author?

If it wasn't company-confidential code, I would be the first to share it for learning purposes, but I am obviously not allowed to do so. The entire project has a couple hundred thousand lines of code, and I guess your great effort of coding the examples is all that can be done without the entire project.

From my side, I take away that changing the array dimensions of the threadprivate variable from within the parallel region might work, but if one encounters sudden strange behavior, this would be the first place to look. At least I could only use it up to a certain point. In any case this is certainly not good style, but I am also a bit relieved to see that even the coding experts here cannot directly answer yes/no, but need to test it :)

Once again thanks a lot for the discussion!


Have you looked at your code to see whether you have a global variable, something like:

integer :: nEOPs

where nEOPs is the initial size of the master thread's pre-parallel-region array EOP, and inside the parallel region one of the threads needs to expand its threadprivate EOP array. It may even update nEOPs. Then, on exit of the parallel region, through the code that loops back to the beginning of the loop containing the parallel region, the master thread does not notice the size change. Upon reentry, and copyin, all threads' EOP arrays get set (if necessary) to the smaller size of the main thread's EOP, but they then use the updated nEOPs for accessing their arrays.

I cannot say why the Debug runtime check for accessing an array out of bounds would not catch this.

You may have some statements that require the array size to be correct, for example EOP being passed to a subroutine as

subroutine foo(EOP)
       use yourModuleHere
       Type(tEOP), dimension(1:nEOPs)             :: EOP

IOW, the reference to the threadprivate tp%EOP is passed on the CALL; however, the (now incorrect) size of the array is assumed to be in the global nEOPs.

Something like that could result in the problem you observe.

Jim Dempsey   


Hey Jim. There is no variable tracking the size of the array. The array is allocated using the modified Julian date of the data; hence it has bounds like 58000:58100 or so. When the array is grown, the bounds change to something like 57800:58300. Hence, when data for a day, say 58000, is requested, the index is valid in both cases and I never go out of bounds.

The reloading is only performed if the requested day is prior to the covered data; in that case the threadprivate array dimensions were extended.
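For illustration, a growth that keeps every existing date index valid could look like this (tmp is invented; in my code the whole extended range is simply re-read from file instead of copied over):

Type(tEOP), allocatable :: EOP(:), tmp(:)
allocate(EOP(58000:58100))       ! initial window, indexed by MJD
! ... day 57900 requested -> grow the window ...
allocate(tmp(57800:58300))
tmp(58000:58100) = EOP           ! existing days keep their indices
call move_alloc(tmp, EOP)        ! EOP now has bounds 57800:58300
! EOP(58050) refers to the same day before and after the growth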


Ahhh, LBOUND and UBOUND change as well as the size. I overlooked this aspect of the earlier discussion.

Is this the sketch

program
  read in 101 Julian date records (into main thread's) array dimensioned as (58000:58100)

do
  ... code A
  parallel region copyin(main's threadprivate array)
     ... code B
     ! some thread needs prior data
     if(need more) then
       save my threadprivate array into temp
       deallocate my threadprivate array
       allocate my threadprivate array dimension(57800:58300)
       read in records for 57800:57999
       copy in 58000:58100 from temp
       delete temp
     end if
     ... code C
  end parallel
  ... code D
end do

Now, you want the new prior records available to the main thread after the parallel region (and thus available for the next iteration of the outer loop).

Is that what you expect?

Jim Dempsey


Hi Jim. Yes, LBOUND and UBOUND change as well, but your sketch does not capture it fully.

It is more like this:

program

use global variable from other module

initialize and allocate global variable
!-> should this already happen in a parallel region??

... code A

OMP parallel do 
copyin(main's threadprivate global variable containing the EOP data in the private array)

   ... Calculations requiring EOP data

   ! some thread needs prior data

   if(current timestamp not in EOP array) then

	!This is what the code looked like before:
	Read in an extended range of data into a local array of type tEOP

	!The read-in array is larger than the original one, as the bounds are now
	!say 57800:58300
	Overwrite the array component of the threadprivate global variable with the
	array created from the file
	!yes, the data that was already available is read in again, not just the missing data

   end if

   work with the enlarged array

omp end parallel do

... postprocessing

Note I never intended to use the data which has been reloaded from file in another thread or after the parallel region. These out-of-bound times that require reloading are very rare. Hence, if any other part of the code (sequential after the omp parallel do, or any other thread within the omp parallel do) encounters the need to reload data, it is simply reloaded again.

Consequently, I basically do not care whether the reloaded data persists after the parallel region or not. Since the indexing works via the day and arrays can start and stop at any index (which is a great feature of Fortran!!), I will always get the EOP data I need: either it is already in the array component of the global variable, or it is reloaded from disk.

The only issue was that in the post-processing step I started to encounter access violations from time to time whenever a reload took place (by setting breakpoints in release mode with debugging, I was able to correlate the access violations with those iterations where a reload of data from file took place).


You are aware that copyin only occurs on entry into the parallel region and not on each DO iteration.

From the above description

1) main thread reads in an anticipated range of dates
2) parallel DO loop copyin(anticipated range of dates)
   3) do work on local threads EOP data (assuming date data available)
   4) thread determines EOP data required is not available
   5) new LBOUND or UBOUND is determined
   6) EOP array for this thread is reallocated to new bounds (old data discarded)
   7) file EOP data file read for records from LBOUND to UBOUND
8) end parallel DO
9) post processing not referencing any of the EOP data

IOW

All threads use the same EOP data file, starting with the same initial window into the EOP data file (copied in from the master thread).
During the parallel DO loop, each thread may expand this window, as opposed to moving the window.

Note: READ will not reallocate an array to fit the size required for the data read (I assume you know this).
IIF indexing is now out of bounds, and the coding visually looks good, then this is indicative of the code in the parallel region using a shared variable (maybe in the module) as opposed to an assumed-private variable,
.OR. what visually looks good is actually in error (uninitialized, initialized but not reset, etc.).

I suggest you insert some debug code that explicitly checks the bounds prior to access (i.e. do not rely on runtime checks to catch the error):

index = ...
call DebugEOP(index)
... code that uses tc%EOP(index)
index = ... ! new index
call DebugEOP(index)
... code that uses tc%EOP(index)



subroutine DebugEOP(index)
  use moduleGlobal
  implicit none
  integer :: index
  ! lbound/ubound need dim=1 to return scalars for the comparison
  if( (index < lbound(tc%EOP,1)) .or. (index > ubound(tc%EOP,1)) ) then
     print *,"bug" ! place break point here
  end if
end subroutine DebugEOP

Note, the subroutine DebugEOP can be in a separate source file, compiled as Debug (full debug symbols, no optimizations),
while the main code can be compiled and linked as a Release build, but also with full debug symbols.

You may find it handy to compile using FPP (the Fortran preprocessor), which then provides the FPP macro __LINE__:

index = ...
call DebugEOP(__LINE__,index) ! compile with FPP
... code that uses tc%EOP(index)
index = ... ! new index
call DebugEOP(__LINE__,index)
... code that uses tc%EOP(index)



subroutine DebugEOP(line, index)
  use moduleGlobal
  implicit none
  integer :: line, index
  ! lbound/ubound need dim=1 to return scalars for the comparison
  if( (index < lbound(tc%EOP,1)) .or. (index > ubound(tc%EOP,1)) ) then
     print *,"bug", line, index ! place break point here
  end if
end subroutine DebugEOP

Jim Dempsey


Also, after you exit the parallel region (your sketch in #28), do you loop back up to the code preceding the parallel region...
... then reenter the parallel region, with a copyin of the master thread's old/new range of EOP records
... .AND. assume your threads' prior bounds or remembered work-point indexes (held in threadprivate variables) are the ones to use for the current iteration?

Then these remembered values are ONLY good should the master thread's range of EOP records include all threads' remembered index values.

This can be corrected by examination of the LBOUND and UBOUND at entry to the parallel region, for example:
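A sketch of that check (mySavedIndex is an invented stand-in for a remembered threadprivate work point):

!$omp parallel copyin(tp)
if( (mySavedIndex < lbound(tp%EOP,1)) .or. &
    (mySavedIndex > ubound(tp%EOP,1)) ) then
    ! remembered index is outside the copied-in window: recompute or reload
end if
! ... work ...
!$omp end parallel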

Jim Dempsey

 


Hi Jim,

I take care of the allocation bounds as part of the reloading routine. There is no out-of-bounds memory access within the loop.

Also, the results of the final computations match those of a commercial tool, which would not be the case if random data were accessed.

In the overall code I do not explicitly go back to the same loop again; however, it can happen that a recursive call of the function is made, so that the loop is entered a second time. I assume that, since the recursive call is made after the first call has taken place, the value which is then stored in the master thread is copied in once, which is fine for me. Again, if data is not available, it is dynamically reloaded.

What might be confusing:

the writing back of the data into the global variable was coded by someone else and was never intended to take place from within a parallel context. I solved my problems simply by using the read-in data but not writing it back to the threadprivate variable.

The only bad thing here is that the data may be part of a numerical integration function, so it slows down performance a lot if every step requires reading in the data again, because I cannot modify the threadprivate variable and hence cannot avoid multiple reloads. Thus I extended the data range which is originally read in. For the time being this works.

I posted the question out of curiosity as to whether changing the data from within the thread is generally a valid operation.
