large data sets with OMP

Hi,

I have a long-running program (many days) that uses many (20-40) OMP threads.  It uses a large amount of input data, which is fixed during execution.  I've run into the 2 GB stack limit issue and I'm now somewhat confused about the exact definitions of the various compiler/linker options and the Windows Resource Monitor reports.  I'm using the 2013 1.1.139 Fortran compiler on 64-bit Windows 7.  The data arrays are allocatable but, at the moment, live in modules or COMMON, which I understand makes them static.  I'm thinking of moving the largest one into the main program and passing it to subroutines as needed.

When the stack and heap commit sizes are set to 2 GB and the reserves are set to 0, the program dies when it invokes the OMP section.  Setting /heap-arrays:0 does not help.  There are no diagnostics, but I assume I'm overrunning the stack or heap.  Adding 2 GB to the stack and heap reserve sizes works; however, when the OMP section is invoked, the committed value in Windows Resource Monitor zooms to tens of GB, while the percentage of used memory and the working set it reports don't change much.

Questions:

One of Steve Lionel's suggestions (5/16/2011) is to use allocatable arrays in modules rather than in COMMON.  I did that, to no effect, with the reserves set to 0.  It is during the module allocations that the committed and working-set values reported by MS Resource Monitor get much larger.  This is in a serial part of the code, but I'll soon be working with data sets that will exceed the 2 GB limit in the module as well.  Don't I have to get them out of the module?  Even if I do make everything allocatable in the main program, don't I still have the same problem when my resident data set exceeds 2 GB?  If that makes them dynamic (8 TB address space), I assume that passing the arrays to the OMP threads in subroutine calls will leave the data in one memory location (with a Fortran pointer) and not fill memory with multiple copies.

Are the reserves placed in memory, if available, or, as described, in virtual memory (= disk?)?  How can the MS committed value be much larger than the linker maximums?

What is the difference, if any, between 'commit' memory according to the Intel compiler and the MS Resource Monitor report?

Why does the MS Monitor report a very large committed size while reporting almost no change in Used Physical Memory when the OMP section starts?  Does each thread get the committed or reserve memory allocation?  If it is committed, doesn't that mean it is used?  The large data sets are shared as the Used Physical Memory seems to indicate.  It seems that the invocation of the OMP, even though it does not consume much more memory, requires a much larger committed space, which somehow causes overflows.

When the MS committed exceeds the actual physical memory (while the used physical memory is well below it), does this affect execution, virtual memory, or ...?

The MS 'working set' has stayed under 2 GB.  Is the working set the amount of memory actually used?  If so, why am I blowing up if the reserves are not set?

Is Windows 8 also subject to these 2 GB limits?

I gather that 'large address aware' has nothing to do with this issue? (I've tried it but not in all possible combinations.)  Is there some other method folks are using to get around this issue with large static data sets?

Once I more fully understand what these choices and reports mean, I'll be able to properly change the code to accommodate the forthcoming much larger data sets without sacrificing speed, as the programs already take a lot of time.

--Thanks,  Bruce


The limitation of 2GB stack is fixed in the design of the executable file format, it is not OS version dependent. Setting stack reserve to 2GB is a recipe for failure as the static code and data size share the same part of the virtual address space.

My suggestion to move away from COMMON is that COMMON is limited to that same 2GB space as the stack. ALLOCATABLE memory is not (on 64-bit Windows.) You cannot have an ALLOCATABLE array in COMMON. If it is a module variable, the data itself is not static - just the small (typically under 100 bytes) descriptor.
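For concreteness, a minimal sketch of this pattern (module and array names are invented for illustration): only the descriptor in the module is static; the array data itself is heap-allocated at ALLOCATE time, so it is not counted against the 2 GB static limit.

```fortran
! Sketch only - module/array names are illustrative.
module big_data
  implicit none
  real, allocatable :: grid(:,:)   ! only this small descriptor is static
end module big_data

program main
  use big_data
  implicit none
  integer :: istat
  ! Data goes on the heap, outside the 2 GB static code/data region.
  allocate(grid(50000, 10000), stat=istat)
  if (istat /= 0) stop 'allocation of grid failed'
  grid = 0.0
  ! ... any procedure that USEs big_data sees the same single copy ...
end program main
```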

"commit size" reserves virtual address space that is backed by RAM but not pagefile. You almost certainly do NOT want to use this. "reserve size" reserves that much space but doesn't commit it to RAM (or pagefile) unless you actually use it.

Steve - Intel Developer Support

Hi Steve,

I appreciate your suggestions and guidance; I only wish there were more information on why these recommendations make sense and how these settings actually function.  From your response, I took it that I should set the heap reserve to 2 GB and everything else to 0.  That doesn't work, with or without /heap-arrays:0.  If both the heap and stack reserves are set to 2 GB, with or without /heap-arrays, it does work.

Most of the data is allocated in a module, which I took from your 5/16/2011 post to mean that it would be static, but from these results I think you're right that it is dynamic.  I do allocate several smaller arrays that are kept in COMMON; I guess I should move them to a module.

Can you point me to a discussion of the background for some of your guidance (e.g., "Setting stack reserve to 2GB is a recipe for failure as the static code and data size share the same part of the virtual address space.")?

So, the only way the program runs, with the current code and data set size, is with both reserves set to 2 GB.  It will also work if the reserves and the stack and heap commits are all set to 2 GB, but that causes, for the current data set, 40 GB to be committed.

Any references are greatly appreciated.

--Bruce

Bruce,

Make sure your paging space and hopefully memory support the data size you want to allocate.

If you declare your arrays as allocatable in a module and then use the module, you should find this a workable (and flexible) solution.

Put a stat= on the allocate statements, to make sure they are working.

Also, make sure your data is being shared and not private in the threads.
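A minimal sketch combining these two points (names are illustrative): STAT= on the ALLOCATE so a failure is reported rather than silently aborting, and module data left SHARED in the parallel region.

```fortran
! Sketch only - names are illustrative.
module work_mod
  implicit none
  real, allocatable :: work(:)
end module work_mod

subroutine fill(n)
  use work_mod
  implicit none
  integer, intent(in) :: n
  integer :: i, istat
  if (.not. allocated(work)) then
     allocate(work(n), stat=istat)       ! stat= reports failure cleanly
     if (istat /= 0) stop 'ALLOCATE of work failed'
  end if
  ! Module data is shared by default; only the loop index is private.
!$omp parallel do default(shared) private(i)
  do i = 1, n
     work(i) = real(i)
  end do
!$omp end parallel do
end subroutine fill
```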

John

You don't need to do anything with the heap reserve value.

You have not said how the program "dies". What is the exact error message? With OMP, you have thread stack sizes to deal with as well. There is an environment variable KMP_STACKSIZE which you can set before running the program to set the size of each thread's stack. You will need to set stack reserve to an adequate amount, but not any higher than required to allow the program to run.
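For example (Windows command prompt; the value and program name below are illustrative only, and the right size depends on each thread's actual needs):

```
rem give each OpenMP thread a 200 MB stack before launching
set KMP_STACKSIZE=200m
myprog.exe
```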

Steve - Intel Developer Support

Hi,

John: thanks for the thoughts.  The computers were bought with this work in mind, so they have 64 and 128 GB memories, which is plenty for these problems of less than 10 GB.  I'll try the stat=, although, from the function of the program and the MS reporting, it is pretty clear that the allocations are working and the data are shared, not private.

Steve: The program just dies.  Even with all the reporting turned on in debug mode, there is no error message, just a statement from the OS that the program stopped running. 

You're right, it works with just the stack reserve set to 2 GB. 

I had guessed, when I had set the stack size to 2 GB and the reported committed value went to 40 GB as the OMP section kicked in, that the thread stacks were inheriting the stack/heap commit/reserve values of the serial part of the program.  I suppose that using KMP_STACKSIZE would reduce that; I assume it can be set with an environment variable (if I can figure out how to compute the value).

My worst data set will be about 3.5 times the size of the current test set, so I suspect the current state of affairs will not work in that case.

thanks all,

Bruce

 

Steve:

The very large array is already allocated in a module.  I'll move the smaller arrays, currently allocated but kept in common, into a module as well to see if that helps.

--Bruce

Quote:

Steve Lionel (Intel) wrote:

The limitation of 2GB stack is fixed in the design of the executable file format, it is not OS version dependent. Setting stack reserve to 2GB is a recipe for failure as the static code and data size share the same part of the virtual address space.

What's the link between the stack reserve size and the "static code and data size" on a 64 bit machine?  Are you just talking about virtual address space exhaustion if you have lots of threads and hence lots of stacks?

I always thought commit mapped through to MEM_COMMIT | MEM_RESERVE, while reserve was just MEM_RESERVE, in the equivalent VirtualAlloc call that sets the stack up.  If so (note the if!), you'd still have pagefile backing if required: with commit you get it all up front; with reserve you get it as required (and if you don't actually use all of the reservation, you won't incur the same total commit).  If so, there's not much point to commit for most applications.  I had a look at MSDN to see if my thoughts were right, but it just confused me.

To the OP - do you expect that you need a lot of stack usage?  Do you have big private arrays?

Ian, the link is that the stack and static code/data share the same 2GB address space on both 32-bit and 64-bit Windows. If one gets large, it restricts the size of the other.

I too looked at MSDN and agree it was none too clear. But I have learned that stack reserve is the only setting that is worth diddling with.

Steve - Intel Developer Support

Are you sure they are shared on 64 bit and not independently limited to 2GB?  For the test program:

PROGRAM static
  COMMON /a/ array
  INTEGER :: array(490000000)

  array = 0
  READ *
END PROGRAM static

compiled with ifort /Od static.f90 /link /stack:1990000000 can be run.  When analysed using vmmap, it gives the following:

I've marked what I think are the stack and static data reservations, and they aren't being put in the same chunk of address space (but static data is in the same chunk as code) and collectively they exceed 2GB (though it took me about five goes to get the number of zeros right).

 

Attachments:
lots_of_stack.png (56.87 KB)

Am I sure? No. I always thought this was the case, but I just did a test and indeed it seems that the stack is in the lowest 2GB but the static data is not. I do know that you are still limited to 2GB of static code/data but maybe they don't share the space with the stack after all.

Steve - Intel Developer Support

Some more vmmap spelunking, which absolutely supports the good doctor's advice that setting stack commit is pointless and reserve is the key (apologies if that wasn't clear)... compiling a test program (it doesn't matter what) with the following link options:

ifort /Od static.f90 /Fe:LackingCommitment.exe /link /stack:1990000000
ifort /Od static.f90 /Fe:OnlyWillingToGoHalfway.exe /link /stack:1990000000,1000000000
ifort /Od static.f90 /Fe:BootsAndAll.exe /link /stack:1990000000,1990000000

then the stack-related memory allocations look like this on Windows 7 x64 (in the same order):

What you see in all three cases, working from the top of the stack down (the direction it grows), is a zone of committed (read/write) memory (either committed by linker specification, or committed automatically as the program uses more stack), then some guard pages (part of the automatic stack-growing mechanism: memory operations on the guard page prompt the memory manager to commit the next page of memory and make it the new guard page), and then a reserved space for future stack growth.

All the commit specification appears to be doing is changing the initial stack commitment (and hence the location of the guard pages and the boundary between "available for use" and "can be made available for use in future").  Given that movement of that boundary happens automatically unless you are doing something exotic (like accesses skipping over the guard page, which the compiler explicitly defends against), there's no point forcing it to happen early: all you are typically doing is forcing the memory manager to set aside physical storage(*) earlier, and in excess of what it might otherwise do.

Throw some threads into the mix, which by default pick up the stack settings of the main executable, and you are hitting the memory manager with the need to commit a huge amount of physical storage that you might never need.

(* To confuse the issue further, I have a recollection that commit doesn't actually commit physical storage until the page is touched (I might be getting my operating systems, and/or a conversation I had with my then fiancée a few years back, confused here), which makes pre-committing even more pointless.)


My experience, which is not specifically in this area, is to limit the stack (and heap) usage as much as possible. I try to eliminate any local variables that could require stack allocation.

This can be achieved by using ALLOCATE for all local arrays.  One complication with this approach is that the compiler differentiates between "small" and "large" arrays, with large being about 20 KB in size.  I think Steve has indicated there is an option to force most arrays to be treated as "large".  In the past this was achieved by placing most arrays in COMMON, but now, with the 2 GB limit on COMMON, allocating them in a module is an effective alternative.
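The option John is presumably referring to is /heap-arrays, mentioned earlier in the thread; with a size argument (in KB) it moves temporaries and automatic arrays at or above that size to the heap, and with 0 it moves essentially all of them (source file name illustrative):

```
ifort /heap-arrays:0  myprog.f90
ifort /heap-arrays:20 myprog.f90
```

The first form sends all temporaries and automatic arrays to the heap; the second, only those of 20 KB or more.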

Use of array sections is another problem source, such as call subxx ( array(3,:) ). This will create a temporary array for the array section and should be avoided.

You mentioned you can have 20 to 40 threads.  Doesn't each thread get allocated its own stack?  So the approach of minimising the stack requirement of each thread, and minimising the stack allocation to each thread, might solve the problem.

You also indicate that the program stops without any error report.  Could it be that you have an error test in the program which determines that the program should stop, but in haste you have not provided an adequate report to the log file you are checking?

John

 

Quote:

John Campbell wrote:
Use of array sections is another problem source, such as call subxx ( array(3,:) ). This will create a temporary array for the array section and should be avoided.

"may", not "will" for that example.  Typically, if the called procedure has a dummy argument that is assumed shape (declared with (:) ), then you won't get a temporary. 

(Because the array inside the procedure won't be contiguous, you may then incur a performance penalty inside the procedure.  But fixing that performance penalty (by flipping the dimensions of the array everywhere, or by manually creating a temporary copy) might end up being worse medicine than the disease; so, after considering your options, if you still have to take a section, then just take a section.)
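A sketch of that distinction (procedure names invented; the assumed-shape routine needs an explicit interface, e.g. by living in a module):

```fortran
! Sketch only - names are illustrative.
subroutine by_shape(col)          ! assumed-shape dummy: the section is
  real, intent(inout) :: col(:)   ! passed by descriptor, normally no copy
  col = 0.0
end subroutine by_shape

subroutine by_size(col, n)        ! explicit-shape dummy: a non-contiguous
  integer, intent(in) :: n        ! actual argument forces a contiguous
  real, intent(inout) :: col(n)   ! temporary (copy-in/copy-out)
  col = 0.0
end subroutine by_size

! At the call site, for a 2-D array "array(m,nx)":
!   call by_shape(array(3,:))     ! no temporary expected
!   call by_size(array(3,:), nx)  ! temporary copy of the section
```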

You pretty much will get a temporary if you use vector subscripts - e.g. call subyy(array([1,2,3,5,7,11])).  But the temporary is only going to be as big as the number of elements in the vector.

You also get temporaries if you use sections (or vector subscripts) in WRITE or PRINT statements.  Some of these seem a bit unnecessary on the part of the implementation, and it would be nice to see them reduced in future.

A very common cause of temporaries is the use of pointers.  Perhaps also the use of array functions and expressions operating on derived types.

Ian,

You are right: using : in the first dimension gives a contiguous section and should not create a temporary array.  However, using : in the second or higher dimensions is asking for trouble ( call subxx ( array(3,:) ) ), especially when you are scaling up the size of the problem.
While it may be better to say array(3,1:nx), the problem is that this array section will go on the stack, while explicitly using an allocated variable will not.

The point I am making is that it is best to locate all uses of the stack by these constructs and replace them with ALLOCATE'd arrays, which do not use it.
I have never found a stack problem that I could fix by changing the stack size.  The most robust solution is to move all these stack arrays into ALLOCATE'd variables; otherwise you are repeatedly patching the same problem.  Even some of the Polyhedron benchmark programs have this problem, and applying this change certainly changes the run-time performance.  Look for the problems that require a stack-size adjustment.
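The replacement John advocates can be sketched as follows (names illustrative): the automatic array's storage comes from the stack and scales with the problem size, while the ALLOCATABLE local's storage comes from the heap.

```fortran
! Sketch only - names are illustrative.
subroutine on_stack(n)
  integer, intent(in) :: n
  real :: tmp(n)              ! automatic array: lives on the (limited) stack
  tmp = 0.0
end subroutine on_stack

subroutine on_heap(n)
  integer, intent(in) :: n
  real, allocatable :: tmp(:)
  allocate(tmp(n))            ! heap allocation: independent of stack size
  tmp = 0.0
end subroutine on_heap        ! allocatable locals are freed automatically on exit
```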

This "stack overflow" problem has been around for all the years I have used Fortran.  Perhaps a solution would be to direct the compiler to supply all these temporary arrays from malloc and not the stack, but again the real solution is to remove the problem.  Often the cause will be a lazy array section of a large array.

John

 

Hi all,

This discussion has been very helpful.  VMMAP lets me look into how much is going where, and that has let me tune the stack reserve.  It seems that by switching most of my arrays to allocatable a few weeks ago, the requirement for large stacks was reduced by an order of magnitude, and the data moved into private data.  I had forgotten that we had moved the COMMON into a module some months ago, so there was nothing to be gained there.

The failure of the code without an error message was probably due to setting both the reserve and the commit of the stack and the heap to 2 GB each.  When the threads kicked in, perhaps they grabbed another 100 MB of stack, which put it over the limit.  There was no unannounced stop in the code.

I made a larger data set (about 2.5x) but, with the current stack setting, it still runs.  I don't yet understand why I'm seeing page tables of 8 MB and what that means, but that is minor in the scale of things.  I've attached a PDF of the latest VMMAP display for the current data set with a small and a large number of threads.  I'll generate a larger data set in the next day or two and run with more threads but, as it takes more than an hour for the program to reach the OMP section, it'll be a day or two before I get there.  It looks like the thread stacks add to the total committed stack space.

--Bruce

Attachments:
two o30_50s.pdf (1.74 MB)

Bruce, et al.

The following is snipped from a simulation program I use.  I will post it here; if, after digesting it, you have questions, then please ask.

RECURSIVE SUBROUTINE TNSPRG(pTether)
  ...
  USE MOD_SCRATCH
  implicit none
  ...
  type(TypeTether) :: pTether
  ...
  type(TypeTNSPRG), pointer :: pTNSPRG
  ...
  pTNSPRG => ScratchTNSPRG(pTether)
-------------------
! MOD_SCRATCH.F90
! interfaces module (no code)
module MOD_SCRATCH
  ...
  interface
    recursive function ScratchTNSPRG(pTether)
      USE MOD_UTIL
      use MOD_ALL
      use MOD_TOSS
      type(TypeTNSPRG), pointer :: ScratchTNSPRG
      type(TypeTether) :: pTether
    end function ScratchTNSPRG
  end interface
  ...
----------------------------
! MOD_SCRATCHcode.f90
! code module
recursive function ScratchTNSPRG(pTether)
#ifdef _OPENMP
  use omp_lib
#endif
  use GlobalData
  use MOD_TOSS
  implicit none
  type(TypeTNSPRG), pointer :: ScratchTNSPRG
  type(TypeTether) :: pTether
  integer :: iStat, iThread, i, NBEAD, nb, ni

  ! sanity check
  if (.not. associated(pTether.pFiniteSolution)) call DOSTOP('Finite Solution not allocated')

#ifdef _OPENMP
  iThread = OMP_GET_THREAD_NUM()
#else
  iThread = 0
#endif

  if (ALL.allocate_tnsprg) then
    ! once only code
!$OMP CRITICAL(CRITICALScratchTNSPRG)
    if (ALL.allocate_tnsprg) then
      ! Here only on first time call for any thread (and while in critical section)
      allocate( &
        & ALL.tnsprg(1:iMaxThreads), &
        & STAT=iStat)
      if (iStat .ne. 0) call DOSTOP('TNSPRG - Memory Allocation problem')
      ! Wipe array pointers
      ! Do not assume allocate will always wipe for us
      do i = 1, iMaxThreads
        nb = loc(ALL.tnsprg(i).LastInteger) - loc(ALL.tnsprg(i).FirstInteger)
        ni = (nb / sizeof(ALL.tnsprg(i).FirstInteger)) + 1
        call WipeIntegers(ALL.tnsprg(i).FirstInteger, ni)
      end do
      ! wipe done, now indicate allocation complete
      ALL.allocate_tnsprg = .false.
    endif
!$OMP END CRITICAL(CRITICALScratchTNSPRG)
  endif

  ! iThread is 0 based, ALL.tnsprg is 1 based
  ! Obtain pointer to this thread's ALL.tnsprg data area
  ScratchTNSPRG => ALL.tnsprg(iThread+1)

  ! See if 1st call
  if (ScratchTNSPRG.MAXBBS .eq. 0) then
    ! 1st call for this thread, must allocate initial arrays
    ! sanity check
    if (iMaxBeadMax .eq. 0) call DOSTOP('TNSPRG - iMaxBeadMax problem')
    ! Must equate prior to allocation
    ScratchTNSPRG.MAXBBS = iMaxBeadMax
    allocate( &
      & ScratchTNSPRG.TDUM(ScratchTNSPRG.MAXBBS), &
#ifdef _Use_AVX
      & ScratchTNSPRG.TDUMymm(ScratchTNSPRG.MAXBBS), &
      & ScratchTNSPRG.TDUMymm02(ScratchTNSPRG.MAXBBS), &
      & ScratchTNSPRG.TDUMxmm(ScratchTNSPRG.MAXBBS), &
#endif
      & STAT=iStat)
    if (iStat .ne. 0) call DOSTOP('TNSPRG - Memory Allocation problem')
  endif

  ! sanity check
  NBEAD = pTether.pFiniteSolution.iNBEAD
  if (NBEAD .gt. ScratchTNSPRG.MAXBBS) call DOSTOP('ScratchTNSPRG.MAXBBS')

end function ScratchTNSPRG

Jim Dempsey

www.quickthreadprogramming.com

Forum gripe:

Fix the CRLF-shows-as-empty-line problem.  The above was pasted from Notepad on a Windows system.  It would be nice if we could paste directly from a copy made in Visual Studio (inclusive of highlights).

Jim Dempsey

www.quickthreadprogramming.com

Some commentary on prior post.

The application partitions the workspace by outermost-level OpenMP thread number.  Some of the threads, but not all, will call the TNSPRG subroutine.  It was (is) desirable for each thread that calls TNSPRG to have a private copy of the temporary arrays.  It is also desirable to avoid ALLOCATE and DEALLOCATE on each call to the subroutine (doing so would pass through a critical section).  The "trick" was to create an array of pointers to the scratch data structures.  The first caller for a given type of scratch data is thread-safely responsible for allocating the array of per-thread entries and nullifying it (and anything else in the type).  Subsequent to allocating (or waiting for the allocation of) the array of scratch type (ALL.tnsprg) of type TypeTNSPRG, the array element containing the TypeTNSPRG for the specific thread is then tested to see if its allocations have been made (this occurs on the first call).

The object holding the arrays is as follows:


type TypeTNSPRG
  integer :: FirstInteger
  integer :: MAXBBS
  real, pointer :: TDUM(:)
#ifdef _Use_AVX
  type(TypeYMM), pointer :: TDUMymm(:)
  type(TypeYMM02), pointer :: TDUMymm02(:)
  type(TypeXMM), pointer :: TDUMxmm(:)
#endif
  integer :: LastInteger
end type TypeTNSPRG

*** Note:

Pointers are used above instead of allocatable arrays because the code is derived from IVF 8.1, where there used to be issues with allocatable arrays within allocatable user-defined types.  Reading some of the other threads, there still seem to be some lingering issues (at least with respect to deallocation of arrays of user-defined types containing arrays of user-defined types).

Also, excuse the use of "." where "%" is the formal member separator. For me % visually creates run-on text.

Jim Dempsey

www.quickthreadprogramming.com
