Dynamic Array Memory Problem

I have defined a type called A_Type which has a dynamic array inside it,

then I construct a dynamic array of A_Type.

When I test how much memory I can still use,

I find that the available memory has decreased even after I deallocate all the dynamic arrays.

Why? The behaviour looks like a memory leak.

Are there any suggestions?
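
For reference, a minimal sketch of the pattern being described (a derived type with an allocatable component, and a large array of that type); the sizes here are only illustrative, not taken from the attached example:

program dyn_array_demo
  implicit none
  type :: A_Type
     integer :: n = 0
     real, allocatable :: c(:)
  end type A_Type
  type(A_Type), allocatable :: b(:)
  integer :: k, m, ierr

  m = 2000000
  allocate (b(m), stat=ierr)
  do k = 1, m                              ! many small allocations
     b(k)%n = 150
     allocate (b(k)%c(b(k)%n), stat=ierr)
     if (ierr /= 0) exit
  end do

  do k = 1, m                              ! release everything again
     if (allocated(b(k)%c)) deallocate (b(k)%c)
  end do
  if (allocated(b)) deallocate (b)
  ! the largest block that can be ALLOCATEd afterwards is reported
  ! to be smaller than before the allocate/deallocate cycle
end program dyn_array_demo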

 


The test example

Attachment: dynamicarraymemoryproblem.zip (12.93 KB)

I don't see evidence of a leak here, and Intel Inspector XE's memory analysis does not indicate a leak. What may be happening is that memory is fragmented by the repeated allocation and deallocation of smaller objects. It may be that the memory pool is not coalescing adjacent free blocks perfectly.

Steve - Intel Developer Support

Frank,

I tested your example and changed it to recycle 20 times. This identifies that memory is being lost, probably due to the fragmentation Steve has indicated.
I changed the deallocate loop from "    DO K=1,M" to "    DO K=M,1,-1" and this reduced the amount of memory loss taking place. This would indicate that fragmentation of the available memory table is part of the problem, but changing to releasing in the opposite order to allocating did not eliminate it. I have found problems related to this in the past.

The test I carried out was on another 32-bit compiler, which demonstrates the problem. I suspect ifort will perform in a similar way. This example also demonstrates that with the Windows 7 64-bit OS, 32-bit programs can get up to 3.8 GB of memory, although the maximum array size is limited to 2 GB.

John


John,

I too reversed the deallocation loop but saw no change in "fragmentation". I also added calls to explicitly enable the Low Fragmentation Heap (for the Release build), but this too made no change. I am running on a Windows 7 x64 platform.


Run as x32:
 Position =           0 Memory that could be used(M):        1750
           0
           0
 Position =           1 Memory that could be used(M):         460
 Position =           2 Memory that could be used(M):         460
Fortran Pause - Enter command<CR> or <CR> to continue.

Run as x64
 Position =           0 Memory that could be used(M):                 31890
           0
           0
 Position =           1 Memory that could be used(M):                 35850
 Position =           2 Memory that could be used(M):                 37640
Fortran Pause - Enter command<CR> or <CR> to continue.

It appears that there is a node consolidation issue with the heap when running x32 on an x64 platform.

Jim Dempsey

www.quickthreadprogramming.com

Jim,

Try the two versions I attached, which repeat the process 20 times. I found a difference in available memory using another compiler. I'd be interested to know whether ifort has a similar problem, as I think the problem lies in the memory management system (I assume this is malloc?).

I have found other examples where the memory availability estimate of malloc gets confused. Perhaps it is because the non-contiguous chunks of memory are being poorly managed, rather than the memory being leaked.
I think there needs to be a better strategy for cleaning up the available memory at each DEALLOCATE, to improve the availability of big chunks of contiguous memory. It would be interesting if we could get access to the table of available memory that MALLOC is using, to see what is going on.

I think the problem is not memory leakage but the available pool of memory being split up by small "allocations". It is hard to identify the source of these in the program attached, unless MALLOC is not merging adjacent free memory areas, which I'd expect MALLOC to be able to overcome.

I have written my own F77 memory manager, using a large array, and this clean-out is a basic strategy. My routine, called MEM_ALLOC, has two features that ALLOCATE does not provide: give me the biggest available array, and a re-size option for reducing the size of a previously allocated array. I am not sure why these were not provided in ALLOCATE; there are a few useful features omitted from the standard!
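
In standard Fortran (2003 onwards) the shrink can at least be emulated with MOVE_ALLOC, although it still goes through a fresh allocation and a copy rather than trimming in place; a minimal sketch (the routine name is invented):

subroutine shrink(a, new_size)
  ! reduce an allocated array to its first NEW_SIZE elements
  real, allocatable, intent(inout) :: a(:)
  integer, intent(in) :: new_size
  real, allocatable :: tmp(:)
  allocate (tmp(new_size))
  tmp = a(:new_size)            ! keep the leading part
  call move_alloc(tmp, a)       ! the old A is deallocated, A takes over TMP's storage
end subroutine shrink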

John

John,

Made minor modifications to each program: exit the allocation loop on the first failure, and exit the deallocation loop on the first .not. allocated. This removed a large number of messages.

Both report similar results:

 TESTING PASS 1
 Position = 0 : Memory that could be used 1396 mb  (28)
 allocate error for B(K)%C(B(K)%N)     1811830         250          41
     1811830  allocated    1907.349      mb
 Position = 1 : Memory that could be used 0 mb  (28)
     1811830  deallocated
 Position = 2 : Memory that could be used 76 mb  (28)
 
 TESTING PASS 2
 Position = 0 : Memory that could be used 76 mb  (28)
 allocate error for B(K)%C(B(K)%N)     1813748         250          41
     1813748  allocated    1907.349      mb
 Position = 1 : Memory that could be used 0 mb  (28)
     1813748  deallocated
 Position = 2 : Memory that could be used 76 mb  (28)
 
 TESTING PASS 3
 Position = 0 : Memory that could be used 76 mb  (28)
 allocate error for B(K)%C(B(K)%N)     1813619         250          41
     1813619  allocated    1907.349      mb
 Position = 1 : Memory that could be used 0 mb  (28)
     1813619  deallocated
 Position = 2 : Memory that could be used 76 mb  (28)

After the 1st allocation the heap appears fragmented. This is an x32 Debug build.

Note, IVF is calling the MS CRTL runtime routines. However, the IVF support team should use your example to determine whether the observed fragmentation is a result of the MS CRTL alone, or a result of how they use the MS CRTL routines.

Jim Dempsey

www.quickthreadprogramming.com

Many thanks to Steve.

Many thanks to John.

Many thanks to Jim.

I agree that the heap memory is fragmented. Maybe the next step is for ifort to improve this; for now I have to change my data structure.

Meanwhile, I find that CVF 6.6 has the same problem too.

 

 

Jim,

I have returned to work and run the programs I posted on the weekend. I have identified the problem, using ifort IA-32. There are a few errors being reported !!

I modified the program to address the problem you identified and also better report testing options.

1)    Include an exit on the first ALLOCATE error,
2)    Correctly report the memory allocated,
3)    Report the DO loop order for deallocate, and
4)    Report the value of %n being used for memory size test.

There are two key options that expose the problems in this test.

When %N = 150, B is fully allocated, but when B is deallocated, the amount of free memory available to ALLOCATE is reduced with each cycle of the test. The amount of memory identified changes with the order of the DEALLOCATE loop, being 336 MB after PASS 20 with DO K=1,M, but 411 MB with DO K=M,1,-1. This shows that the order of deallocation can change the estimate of free memory available to subsequent ALLOCATE requests.

When %N = 250, B is only partially allocated, but when B is deallocated, the amount of free memory available is identified as only 76 MB by ALLOCATE, yet on a subsequent ALLOCATE of B%C, 1,872 MB of memory is available for B to be allocated. I have seen this instability in the amount of free memory reported to ALLOCATE in other programs; it shows that the assessment of the available memory pool is changing.

The example I posted on the weekend also showed that a 32-bit application running on a 64-bit OS can access up to 3.8 GB of memory, using the equivalent of the /3GB memory option. It would be good to enable this for 32-bit ifort applications running on a 64-bit OS, for those applications that cannot easily be changed to 64-bit. (LOC needs to report > 2 GB.)

I have attached the two DO loop versions from the weekend, with the corrections noted above, for the two cases of %N=150 and %N=250. X90-bat.txt is the batch file I use to test each option. These identify the problems I have reported.

There is a definite problem in the way the pool of free memory is being assessed. It would be good if this problem could be better identified and hopefully addressed. (Could a routine such as defragment_memory_pool be provided, although DEALLOCATE should really do this?)

John

PS: I had problems attaching files, as the site gives an error report.

I have attached an updated version that takes an argument of N, with a negative value for backwards deallocation. This gives flexibility to see the change in memory loss for different values of N. The pattern of loss, say for N = 100 or -100, is interesting.

Attachments: 

John,

Thank you very much. I hope this problem can be solved by ifort as soon as possible.

Frank

Quote:

John Campbell wrote: (the post above, quoted in full)

Frank,

From John's and my testing, it appears that there is indeed a heap issue that results in fragmentation, whereby adjacent nodes of free memory are not consolidated into larger nodes (at least as exhibited by the example programs posted). This leaves you in the position of a) waiting for a fix, or b) producing a workaround. I suggest a workaround (sketched below) where the large buffers are handled as follows:
a) allocation of a large buffer first checks a private list of buffers; if one is available, take it from the private list; if none is available, ALLOCATE;
b) freeing a large buffer returns it to the private list (i.e. it does not DEALLOCATE).
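
A rough sketch of that idea in Fortran (the module and routine names are mine, not from the attached code): buffers come back out of a private stack when one is available, and are only handed to ALLOCATE / DEALLOCATE when the stack cannot help.

module buffer_pool
  implicit none
  type :: slot
     real, allocatable :: data(:)
  end type slot
  type(slot), allocatable :: pool(:)     ! private stack of retired buffers
  integer :: top = 0
contains
  subroutine get_buffer(a, n)
    real, allocatable, intent(out) :: a(:)
    integer, intent(in) :: n
    if (top > 0) then
       if (size(pool(top)%data) >= n) then
          call move_alloc(pool(top)%data, a)   ! reuse: no heap traffic
          top = top - 1                        ! (note: may be larger than N)
          return
       end if
    end if
    allocate (a(n))                            ! nothing suitable: ALLOCATE
  end subroutine get_buffer

  subroutine put_buffer(a)
    real, allocatable, intent(inout) :: a(:)
    if (.not. allocated(pool)) allocate (pool(100))
    if (top < size(pool)) then
       top = top + 1
       call move_alloc(a, pool(top)%data)      ! keep it instead of DEALLOCATE
    else
       deallocate (a)                          ! stack full: give it back
    end if
  end subroutine put_buffer
end module buffer_pool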

Jim Dempsey

www.quickthreadprogramming.com

I did an experiment replacing the Fortran ALLOCATE calls with calls to MALLOC (which is a wrapper to the C malloc routine). I saw identical behavior, so the fragmentation is happening within the C heap management. There's nothing Fortran can do about this.

Steve - Intel Developer Support

Steve,

Thank you.

But I don't agree with you.

This bug is caused by the double layer of dynamic arrays in a custom type.

If there is only one layer of dynamic array, the fragmentation does not happen.

It should be Fortran's problem, and it could be corrected.

Frank

Quote:

Steve Lionel (Intel) wrote: (the post above, quoted in full)

You have not shown evidence that it's Fortran's problem. What you showed instead is that the pattern of allocations and deallocations creates the fragmentation. As I said, I replaced the Fortran ALLOCATEs with C mallocs and saw the same behavior.

Steve - Intel Developer Support

Steve,

We have shown that it is a problem of using ALLOCATE ( B(K)%C(B(K)%N) )

You have been able to reproduce the problem, using MALLOC.

I don’t think that qualifies as not being a Fortran problem. It is certainly a problem for Fortran users.

Do we have any way of identifying the cause of the fragmentation of available memory, it being either:

  • small remnants of the allocation process, although they are difficult to identify in the code, or
  • the free memory table in MALLOC becoming large and unordered, due to the allocation of 2,000,000 B(K)%C(N)'s.

If malloc uses a table of free memory, is it possible to get access to this list? It would certainly be interesting to analyse it and identify the reason for the fragmentation, being either an inability to recognise and merge adjacent free memory blocks, or small remnants left over from the B%C(:) structure.

Is there a MALLOC function to sort the free memory table and merge adjacent free blocks? That could be a useful function to call after many DEALLOCATE operations.

I do note that the worst case occurs when the allocation of B%C is not complete. Could it be that there is not a suitably clean recovery from an ALLOCATE error in this case?

It would certainly be good to know a bit more about what causes the problem, rather than just that it is not Fortran's problem.

It is clear that it is a Fortran user's problem, and it is a problem with ifort's implementation of DEALLOCATE (B(K)%C).

John

Random thoughts from previous explorations, the passage of time may have corrupted things somewhat:

- The C runtime's malloc is a thin wrapper on top of a base operating system library (HeapAlloc and friends) - which is what it looks like ALLOCATE hits directly for small allocations anyway (I previously thought it was malloc based and can find some ifort compiler test stuff based on that (I've gone to the trouble of writing a module that provides interfaces for all the stuff in crtdbg.h!) - has this changed with ifort version?)  (Edit to note - /libs:static is the key!)

- That heap manager works (as you would expect) by reserving largish chunks of address space and then divvying it up for the individual allocations. I recall reading that it is limited in how many chunks it can reserve (perhaps that's an OS specific thing and perhaps I am completely off base and this should be ignored - I can't find whatever documentation made me "recall" this).  Once it has reserved a chunk of address space I don't know whether the heap manager ever gives it back to the OS once all the little allocations have been freed - I'd expect not (and simple tests confirm my expectations), but I might be wrong. 

(You can have multiple heaps extant at once in a Windows program - one way of forcing the heap manager to give the virtual address space back to the operating system is to destroy a particular heap, but I suspect that's well outside the domain of what you could do considering Fortran allocatable variables.)

- For large allocations ALLOCATE goes via VirtualAlloc and friends - i.e. its strategy changes depending on the size of the allocation (which is sensible given normal use cases).

- Given the above, I wonder whether the method of testing for available memory could potentially be confusing the issue or be part of the problem itself.  If your heap manager has reserved lots of address space to cover lots of little allocations, then when VirtualAlloc is called for a big allocation there won't be anything to give out.

- You can walk an OS provided heap using HeapWalk and query the maximum size using HeapCompact (though I'm not sure whether that is the maximum-maximum size, or just the maximum size given the address space that the heap manager has already acquired).  It would be easy enough to write a little diagnostic routine using HeapWalk.  If you use a tool like vmmap you can see where the address space reserved for heap allocations versus other address space uses sit relative to each other.  Some of the other windows debugging tools probably give you further insight (whenever I've had to do heap stuff I've been working via the C runtime's diagnostic support, which might or might not be applicable here).
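
For anyone who wants to poke at this, a rough, untested sketch of querying the largest free block via HeapCompact (hand-written interfaces; GetProcessHeap gives the process default heap rather than the C runtime's own heap, and on IA-32 the STDCALL convention would also need handling, so treat this as an outline only):

module heap_query
  use, intrinsic :: iso_c_binding
  implicit none
  interface
     function GetProcessHeap() bind(C, name='GetProcessHeap') result(h)
       import :: c_ptr
       type(c_ptr) :: h
     end function
     function HeapCompact(hHeap, dwFlags) bind(C, name='HeapCompact') result(n)
       import :: c_ptr, c_long, c_size_t
       type(c_ptr), value :: hHeap
       integer(c_long), value :: dwFlags
       integer(c_size_t) :: n
     end function
  end interface
contains
  subroutine report_largest_free_block
    integer(c_size_t) :: nbytes
    ! HeapCompact coalesces adjacent free blocks and returns the size of
    ! the largest committed free block in that heap
    nbytes = HeapCompact(GetProcessHeap(), 0_c_long)
    write (*,*) 'largest committed free block (bytes):', nbytes
  end subroutine report_largest_free_block
end module heap_query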

- The OS heap allocation routines can change their behaviour depending on whether a debugger is attached!  Current Windows versions may use a so-called low fragmentation heap by default, but if a debugger is sensed it will switch to the previous "compatible" style of heap allocation.

Steve,

As John has said, it is a problem of using ALLOCATE ( B(K)%C(B(K)%N) ), which I have called "the double-layered dynamic array with a user-defined type".

I do not have enough evidence to show that it is a problem of C's malloc.

But why can the memory be recovered to the original amount when we deallocate dynamic arrays of intrinsic types like integer/real/double? There is no fragmentation at all.

If ifort can deallocate A(N) cleanly with no fragmentation, why does ifort deallocate A(N)%B(M) with fragmentation (A and B are both dynamic arrays)? There is no reasonable explanation other than that ifort has some bug in deallocating the custom type.

The data structure A(N)%B(M) is very ordinary and popular. If there is some inconvenience, it should be improved, even if the problem lies in malloc.

I do hope the deallocation can be improved.

Frank

 

 

 

Perhaps I have not made myself clear. When you do a Fortran ALLOCATE, it calls C malloc to allocate the storage. (Though for large requests it uses the Windows API routine VirtualAlloc - that's not happening here.) For DEALLOCATE, it calls C free. It is the C library that actually manages the storage. Any overhead Fortran adds is included in the malloc request and freed appropriately.

My experiment replacing the ALLOCATE/DEALLOCATE with malloc and free, using the identical amounts of storage, demonstrates the exact same fragmentation issue, which shows that Fortran is not doing something extra that results in the fragmentation. It is the Microsoft C library that is not properly consolidating the freed blocks.

I will do some more tests to see if I can prove this further, but I see no evidence that the Fortran library is responsible.

Steve - Intel Developer Support

The fragmentation is happening at an even lower level - at the level of the Windows heap manager and even at the level of the process' address space.

The only way I can think of to "fix" it (assuming this is actually a problem and not just an observation by the OP) is to switch the smaller allocations over to being pointers set up using C_F_POINTER, where the backing memory is taken from a heap dedicated to those allocations.  When you have finished with those smaller allocations you then destroy the entire heap.  (Or, because of the fixed size and consistent lifetime of the allocations it would be trivial to roll your own allocator that didn't fragment.) 
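
To illustrate the dedicated-heap idea, a rough, untested sketch using hand-written interfaces to the Win32 heap routines follows; the argument kinds follow the documented C prototypes (DWORD -> c_long, SIZE_T -> c_size_t), and on IA-32 the STDCALL calling convention and decorated names would also have to be dealt with (e.g. via the kernel32 module supplied with ifort), so treat this as an outline only:

module private_heap_demo
  use, intrinsic :: iso_c_binding
  implicit none
  interface
     function HeapCreate(flOptions, dwInitialSize, dwMaximumSize) &
           bind(C, name='HeapCreate') result(h)
       import :: c_ptr, c_long, c_size_t
       integer(c_long), value :: flOptions
       integer(c_size_t), value :: dwInitialSize, dwMaximumSize
       type(c_ptr) :: h
     end function
     function HeapAlloc(hHeap, dwFlags, dwBytes) &
           bind(C, name='HeapAlloc') result(p)
       import :: c_ptr, c_long, c_size_t
       type(c_ptr), value :: hHeap
       integer(c_long), value :: dwFlags
       integer(c_size_t), value :: dwBytes
       type(c_ptr) :: p
     end function
     function HeapDestroy(hHeap) bind(C, name='HeapDestroy') result(ok)
       import :: c_ptr, c_int
       type(c_ptr), value :: hHeap
       integer(c_int) :: ok
     end function
  end interface
contains
  subroutine demo
    type(c_ptr) :: heap, raw
    real(c_float), pointer :: c(:)
    integer(c_int) :: ok
    heap = HeapCreate(0_c_long, 0_c_size_t, 0_c_size_t)  ! growable private heap
    raw  = HeapAlloc(heap, 0_c_long, 600_c_size_t)        ! room for 150 reals
    call c_f_pointer(raw, c, [150])                       ! use C(:) like a small array
    c = 0.0
    ! ... many more HeapAlloc / C_F_POINTER pairs for the other small arrays ...
    ok = HeapDestroy(heap)            ! releases every small block in one call
  end subroutine demo
end module private_heap_demo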

The behaviour of the heap manager will have been "tuned" for typical application usage.  The OP's example isn't typical by a long shot, particularly when you consider you have millions of allocations all of exactly the same size, in conjunction with the "search" for a large chunk of contiguous available address space.

Steve, IanH,

Thank you very much.

I have made a similar data structure with C++.

No fragmentation at all.

Does this mean ifort could be improved?

Frank

 

Attachment: cppdynamicarray.zip (9.64 KB)

Position = 1   Memory that could be used (M): 1800
Position = 2   Memory that could be used (M): 275
Position = 3   Memory that could be used (M): 1800

Your C++ Memory_Total_Test is testing a different thing from your Fortran case.  The C++ test ultimately allocates its blocks from a heap managed by an operating system library using HeapAlloc.  The Fortran case supplies the allocation (because it is large) directly from the process' address space using VirtualAlloc.  This is the aspect I was referring to above, when I said that the Fortran program's way of testing for available contiguous memory might be confusing things.

If you rewrote your memory_total_test to use VirtualAlloc I expect that you would see very similar results.

Your Fortran approach implies that you want both one large contiguous chunk and many, many smaller identical-size chunks.  Is that really your use case?  If you are worried about fragmentation preventing the availability of the large chunk, an easy solution is to get (and retain) that allocation early - before you do the smaller allocations.  That's easy enough to do in standard Fortran.
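
A minimal sketch of that ordering (all sizes are illustrative only):

program reserve_first
  implicit none
  type :: a_type
     real, allocatable :: c(:)
  end type a_type
  real, allocatable :: big(:)           ! the one large contiguous chunk
  type(a_type), allocatable :: b(:)
  integer :: k, ierr

  allocate (big(250000000), stat=ierr)  ! ~1 GB claimed before anything else
  allocate (b(2000000), stat=ierr)
  do k = 1, size(b)                     ! the many small allocations follow
     allocate (b(k)%c(150), stat=ierr)
     if (ierr /= 0) exit
  end do
  ! BIG stays allocated for the life of the run, so its address range
  ! cannot be lost to later fragmentation.
end program reserve_first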

Steve,

Could you explain the difference between using C malloc to allocate the storage and using the Windows API routine VirtualAlloc? At what allocation size does the switch take place?

It is my understanding from your earlier explanation that smaller amounts of memory are allocated off the heap. However, Frank's example allocates 2,000,000 small blocks of 600 bytes, which is 1.2 GB in total. At some point, does the heap overflow? If so, is a new heap created, or does it revert to VirtualAlloc?

There appears to be some memory management structure left over after the DEALLOCATE. The situation appears to be worse when N=250 and the ALLOCATE fails before 2,000,000 are allocated.

Given that there are two types of allocation methods, could the routine "Memory_Total_Test" be getting different answers depending on whether it uses MALLOC or VirtualAlloc?

John

Perhaps I should modify the test program to report the address and size of each successful ALLOCATE. This would then provide a memory map of where all the memory is being used, and identify what memory is not being used. I could then analyse for contiguous blocks and report them.
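
A rough sketch of how the gap list could be produced (it assumes the B and M from the attached test program; LOC is an extension, but the report further below already relies on it, and the sort routine name is hypothetical):

integer(8), allocatable :: addr(:)
integer :: k
allocate (addr(m))
addr = 0
do k = 1, m
   if (allocated(b(k)%c)) addr(k) = loc(b(k)%c)       ! record each C(:) address
end do
call sort_ascending(addr)                             ! any integer sort; not shown
do k = 2, m
   if (addr(k) > 0 .and. addr(k-1) > 0) &
      write (20,*) addr(k) - addr(k-1) - 600          ! gap beyond the 600 data bytes
end do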

Gap size (bytes)        Count
              40           14
              56    1,994,948
              88            1
             456        4,726
             584            7
             648            4
             680            7
             696            9
           1,480          198
           1,672            1
          35,128            1
          62,920           52
          63,944           14
          64,968           12
         196,040            1
         262,600            2
       4,758,248            1
     160,088,856            1

The table represents the count of the gaps between each B(i)%C allocation address, after allowing for the 600 bytes of data. The minimum gap is 40 bytes and the typical gap is 56 bytes, so typically each allocation steps 656 bytes through memory. Is this an overhead of the %C index or a memory gap from MALLOC?
There are 5,037 gaps larger than 56 bytes.

The allocation ALLOCATE(B(M), STAT=Error) appears to allocate 2,000,000 x 80 bytes of memory, based on the report from
    write (*,*) 'mem_index', m, loc(B), loc(B(m)), B(m)%N, error
which gives
    mem_index     2000000     6094920     166094840     150
This would represent the 160,000,000-byte gap identified above.

Why are there 80 bytes per entry of B(:)?

 

Edit: 06/11/2013
The test I reported above was done using Intel(R) Visual Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1.5.344 Build 20120612.
I have now re-tested the example using Intel(R) Visual Fortran Compiler XE for applications running on IA-32, Version 12.1.5.344 Build 20120612.

The changes are:
There are now 40 bytes per B(:) entry. With the address information changing from 8 bytes to 4 bytes, is there potentially room for 10 addresses per B(:) entry, or is the default integer in 64-bit ifort 8 bytes?

The table of gaps between each B(i)%C now has a mixture of 40 bytes and 56 bytes as the main gap, plus a few of 24 bytes (?).

Gap size (bytes)        Count
              24            7
              40      999,982
              56      995,037
             136            4
             216            1
             232           14
             392            1
             408        4,792
           1,432           78
          35,064            1
          63,896           25
          64,920           28
          65,944           24
         262,552            2
         634,056            1
       2,492,888            1
      81,658,264            1
Grand total         1,999,999

These large gaps between entries still indicate that there may be extra information associated with each B(k)%C(150) allocation.

Attachment: dynamicarraymemoryleak-1a.f90 (2.88 KB)

IanH,

Thank you for your help.

But I don't care about, and have no way of knowing, what is done at the lower level by Fortran or C++.

As a programmer, I focus on the algorithm written with the programming tool; I do not want to have to think about what happens at the deeper levels and how the tool works, or I would be a compiler writer instead of a coder.

In my opinion, when I implement the same data structure and algorithm in different programming languages, I should get the same or similar results.

You explained to me that my C++ Memory_Total_Test is different from my Fortran case.

But I can't tell the difference between them from the algorithm; they are twins, just in different programming languages.

As Steve has said,

"My experiment replacing the ALLOCATE/DEALLOCATE with malloc and free, using the identical amounts of storage, demonstrates the exact same fragmentation issue, which shows that Fortran is not doing something extra resulting in the fragmentation. It is the Microsoft C library that is not properly consolidating the freed blocks."

According to Steve's opinion, malloc and free may have some problem in such a situation, if I have not misunderstood him.

But in my test with C++ this doesn't happen, so I feel puzzled.

Now I need such a data structure to program some complex algorithms. What should I do? Do you have any suggestions?

Change to C++ because there is no fragmentation? I do hope not.

Frank

 

IanH,

As you have explained,

"The C++ test ultimately allocates its blocks from a heap managed by an operating system library using HeapAlloc. The Fortran case supplies the allocation (because it is large) directly from the process' address space using VirtualAlloc."

Which is larger? Both are large. The algorithm of Memory_Total_Test is to find the biggest contiguous memory; the array Temp, in both the C++ and Fortran versions, will be about 2 GB at the final successful allocation.

In your opinion, HeapAlloc or VirtualAlloc may have some problem; if so, why not change to the better one?

"Your Fortran approach implies that you want both one large contiguous chunk and many, many smaller identical-size chunks. Is that really your use case? If you are worried about fragmentation preventing the availability of the large chunk, an easy solution is to get (and retain) that allocation early - before you do the smaller allocations. That's easy enough to do in standard Fortran."

I don't need one large contiguous chunk. The function Memory_Total_Test, as I have said above, is just there to test the biggest contiguous memory.

While my code runs, the dimensions of the dynamic arrays are always changing; they are identical only at the beginning. After many cycles of deallocate/allocate, about 300 MB of memory is lost. The Fortran case is just an abstract case to show this problem.

Hang a tick - the requested allocations are wildly different between the Fortran examples and the C++ example!

When I adjust the C++ program to have the same characteristics as one of the Fortran programs (2 million allocations of about 600 bytes or so) - I see the same fragmentation behaviour. 

Alternatively, if I change the Fortran code to have the same characteristics as the latter C code (500 small allocations of 3MB each) - I see no fragmentation.

With the search-for-maximum-allocation strategy used (based off John's code) the size of allocation used by new in the C++ code is always bigger than what the underlying heap manager appears to accommodate - it just ends up setting aside memory separate to the "normal" heap memory using VirtualAlloc anyway - so the results from Fortran "directly" calling VirtualAlloc and the C runtime calling HeapAlloc become consistent for the largest contiguous free block search.

I can reproduce the behaviours using language agnostic calls direct to the Win32 api.  When fragmentation occurs it has nothing to do with the Fortran language runtime.  I guess fragmentation is an occasional threat for any dynamic memory system where allocated memory blocks cannot be moved around.  If fragmentation is causing you issues there are some workarounds - one example mentioned previously - in the Win32 API code you can see how obliterating a fragmented heap and then creating a new heap lets you "reset" things.  But implementing this wouldn't be pretty.

VirtualAlloc and HeapAlloc have different roles to play.  VirtualAlloc is not appropriate to use for small allocations - the minimum it can allocate is a page of memory and it requires a kernel call.  HeapAlloc can efficiently dish out smaller allocations and, unless the heap is full and more address space needs to be added to the heap (in which case HeapAlloc calls VirtualAlloc to do that reservation), doesn't necessarily have to call into the kernel.  There is a small overhead associated with HeapAlloc for large allocations that VirtualAlloc doesn't incur, and VirtualAlloc gives you more control over the access settings, etc. for the allocated memory.  But anyway - that's all detail - let the language runtimes worry about what's best.


Two points I don't understand from my previous post:
1) Why are there these 56-byte gaps between the C(100) allocations? Are they an indication of something else being allocated, or just a lot of wastage? I tried another compiler and it allocated 16-byte gaps. The next question is whether this full 600 + 56 bytes is released and then merged with the adjacent memory pool. The symptoms of the problem suggest this is not taking place.
2) The total allocation is over 1.2 GB. I didn't think the heap was that large, so this "heap" must be a more general structure, rather than the bottom of the stack.

Maybe this is what we have, in regard to the capability of ALLOCATE / DEALLOCATE, although I'm surprised that more clean-out cannot be achieved when removing the Type structure.

John

IanH,
Thank you very much for your help.
I am very sorry that I made an unfair test comparison between C++ and Fortran. Without your detailed explanation, I would not have found such a bad test design. Steve also pointed it out, but I didn't catch it. I should make an apology.

It is really complicated. Is this a bug or a bad design of the heap management? Maybe, or maybe not!
But now I do know that some algorithms should be changed in such a situation.
Some data structures should be optimized too: millions of dynamic arrays with random sizes should be avoided, and cycles of deallocate/allocate should be limited to some level.

Now I begin to worry about data structures such as lists and trees, which have a dynamic number of nodes. Once there is a very big list or a huge tree of small data nodes, with cycles of construction and reconstruction, will the available memory become smaller and smaller? If so, it is really horrible.

By the way, will this happen under an x64 OS? Do you have any idea?
I have done some tests, but they are hard to explain.
Some evidence shows it will not happen under x64, but I am not sure.
 
Thank you !

Frank

A two-dimensional dynamic array has the same problem too.

/* Excerpt from a larger test (needs <stdlib.h> for malloc/free);
   Memory_Total_Test() is my own routine that reports the largest
   contiguous block that can still be allocated at that point. */

int m, n;
int row, col;
int **p;

/* allocate */
Memory_Total_Test(1);
m = 1000000;
p = (int **)malloc(m * sizeof(int *));
for (row = 0; row < m; row++)
{
    n = 200;
    p[row] = (int *)malloc(n * sizeof(int));
    for (col = 0; col < n; col++)
        p[row][col] = 0;
}
Memory_Total_Test(2);

/* deallocate */
for (row = 0; row < m; row++)
{
    free(p[row]);
}
free(p);

Memory_Total_Test(3);

Position = 1   Memory that could be used (M): 1728
Position = 2   Memory that could be used (M): 941
Position = 3   Memory that could be used (M): 941

(This can all vary a bit depending on your runtime library options (static vs DLL, debug vs release, etc.).  Because the underlying heap manager is an operating system service, it will also vary with operating system version and perhaps system settings.  As mentioned before, the mere presence of a debugger can also change behaviour.)

The Fortran programmer executes an ALLOCATE statement - as part of that the Fortran runtime [may] call the malloc function in the C runtime.  The C runtime then asks the operating system provided heap manager to allocate some memory using HeapAlloc (with earlier versions of Microsoft C++ the C runtime provided its own heap manager.  I'm pretty sure earlier versions of Windows (pre W2K?) didn't provide a heap manager for general use either.)  The heap manager that sits behind HeapAlloc will have some overhead associated with the allocation (things like the size of the block, pointer to the next block, status of the block, etc.).  You can query that overhead using the HeapWalk API - it appears to be about 24 bytes.  I expect that 24 bytes is probably located immediately before the allocation, though it could be before and after, or part could even be located remotely.

The debug variant of the C runtime then has further overhead located immediately before and after the allocated block.  If you have a professional edition of VS you can actually inspect the source code for this in crtdbg.h, but it is also well documented on MSDN.  Effectively the allocation returned by HeapAlloc is filled with a structure that, in Fortran, looks like:

  INTEGER, PARAMETER :: nNoMansLandSize = 4   ! per crtdbg.h

  TYPE, PUBLIC, BIND(C) :: CrtMemBlockHeader
    TYPE(C_PTR) :: pBlockHeaderNext         ! struct _CrtMemBlockHeader*
    TYPE(C_PTR) :: pBlockHeaderPrev         ! struct _CrtMemBlockHeader*
    TYPE(C_PTR) :: szFileName               ! char *
    INTEGER(C_INT) :: nLine
    ! On x64 the order of the next two members is reversed from the 32-bit 
    ! variant to keep the structure packed and aligned all nicely.  Academic 
    ! for me until the CFO ponies up more cash, but hopefully this 
    ! directive covers it.
!DEC$ IF DEFINED(_M_IA64) .OR. DEFINED(_M_AMD64) .OR. DEFINED(__x86_64__)
    INTEGER(C_INT) :: nBlockUse
    INTEGER(C_SIZE_T) :: nDataSize
!DEC$ ELSE
    INTEGER(C_SIZE_T) :: nDataSize
    INTEGER(C_INT) :: nBlockUse
!DEC$ END IF
    INTEGER(C_LONG) :: lRequest
    CHARACTER(C_CHAR) :: gap(nNoMansLandSize)
    ! Pseudo fields that follow:
    !CHARACTER(C_CHAR) :: data(nDataSize)
    !CHARACTER(C_CHAR) :: anotherGap(nNoMansLandSize)
  END TYPE CrtMemBlockHeader

where the data field is the actual memory that holds the data of the Fortran allocatable variable.  (The allocatable variable will itself have its own descriptor, but that is located elsewhere, in the storage for the derived type itself.)  The no man's land gaps are used to detect heap overruns and are four bytes long - so the C runtime overhead here on 32-bit systems is 36 bytes.

(36 + 24 is more than your 56, but we're within engineering tolerance...)

Then you might also have a small amount of overhead associated with various runtimes ensuring that the memory is appropriately aligned.

The underlying heaps (there can be more than one - one is created for the process when the process is created, but other heaps are created by various subsystems.  The C runtime creates its own heap separate to the process default heap - and that is typically the fifth heap whenever I've gone process memory spelunking) are grown as required - see the discussion for HeapAlloc etc.  This growth happens by the heap manager calling VirtualAlloc and asking for another chunk (referred to as a region) of the process' address space - it looks like it does this in a doubling sense (first it might request 4K, then 8K, 16K, etc., up to a maximum of a 16 MB increment).  That region may not be contiguous with the previous regions.  Typically once the heap manager has added a region it doesn't give it back to the operating system until the heap is destroyed (effectively when the program terminates, unless the programmer explicitly destroys the heap).  From experimentation yesterday I see exceptions - "super" regions are requested for each large allocation (more than 16 MB) and are effectively managed separately to the regions that make up the heap proper.  There is a small amount of overhead for each region.

From what I can tell (not categoric), allocations within a region are coalesced when they are released, but blocks in separate regions that happen to be adjacent are not.

Importantly - when you ALLOCATE a  large block of memory (16 MB or more - or something like that) - the Fortran runtime doesn't go via the C runtime - it directly asks the operating system via VirtualAlloc for a chunk of address space.  This request cannot be satisfied by address space that the heap manager has already requested, even if the heap is completely empty.  In extreme cases ALLOCATE might fail to find a single 16MB spot for its data, but could quite happily allocate two hundred or so 10 MB blocks.  The Fortran test cases in this thread show that problem - they make a whole heap of small allocations, which are fulfilled from a heap - so the heap grows, then they free those small allocations (the heap is mostly empty), then they try and find a single large allocation.

If you want to go investigating further:

- install the vmmap application that Microsoft put out through their sysinternals "site".

- I translated bits of crtdbg.h to Fortran when debugging a memory leak associated with polymorphic components a while back.  It provides various routines that can be useful for analysing the debug heap (there are more still that I didn't translate across and I see I've been hacking without testing - hopefully it still works). 

- The AllocateTest routine shows use of HeapWalk and friends - you could use this to build a memory map of heap allocations.  vmmap has similar capability.

To the OP - are you sure this is really a problem for you?  It is one thing to observe symptoms with test cases, it is another to experience those in production code.  If you do run into problems I think it is more due to the simple fact that you are trying to use the majority of your process' address space, and not so much the performance of the heap manager.  A consequence of this is that moving to 64 bit would help massively here.  With dynamic allocation you are always going to have some overhead/loss associated with things like fragmentation.  I don't see this as a heap manager bug or as a heap manager design issue - it's simply a question of whether your program is making reasonable use of the resources of the machine.  The heap manager performs acceptably for the vast majority of Windows applications, and most of those have far more complicated memory allocation and deallocation patterns than your typical Fortran program, but perhaps less total memory demand.

Attachment: crtmemoryutilities.f90 (8.66 KB)

If this is a pressing problem, you can replace the MS VS CRTL malloc and free with a different thread-safe malloc and free that has the required behavior. Just link it in before the CRTL.

Jim Dempsey

www.quickthreadprogramming.com

IanH,

Many thanks for your help.

The problem did happen in production code, due to the algorithm I designed: millions of dynamic arrays were used to collect and group some result data, with inserting and sorting operations.

But once I found out that the problem is related to the data structure, I changed the data structure to avoid this problem, at some cost in efficiency.

In the area of computational mechanics, with the development of programming languages and the flexible use of some new features, millions of dynamic arrays may indeed occur. In the past, all the computation might be done in one very big array, which is hard to manage and read, and we hoped to improve on that. But given the fragmentation problem, the big array looks like a good approach after all.

Because it is a commercial code, we have to consider the needs of Win32 users, so just updating it to Win64 is not a perfect solution. We will change the algorithm to avoid this problem.

Thank you for the great help.

Thanks to Steve.

Thanks to John.

Thanks to Jim.

Frank

Sorry I missed the earlier question about VirtualAlloc - it is used for allocations 16MB and larger, otherwise it calls aligned_malloc.

Steve - Intel Developer Support
