segfault (ulimit -s unlimited)

Hello,

I have run into the common segfault error message (again), but the best
known remedies seem not to work.

System info: Intel Q9650, 8GB ram, OpenSuse 11.1, ifort 11.0
ulimit set to unlimited (also tried a large number.)

The code is a multithreaded Fortran 3D fluid simulation code; the problem
remains with one or four threads. The memory required to run the code is
below the system capacity. The program fails to load:

~/duct3d >./duct3d
Segmentation fault

Compiler options

ifort -c -O3 -openmp -fpp -parallel

tried also

ifort -c -O3 -openmp -fpp -parallel -heap-arrays

Any ideas what may be happening?
--


If you haven't already done this, add a simple print statement as the first statement in the program:
write(*,*) "started OK"

Assuming you never get to this - that is, the program faults immediately as it tries to load - you probably have too much data allocated in static arrays (COMMON blocks, for instance). To check this, run:

size ./duct3d

Look at the size of bss; this is where your static arrays are allocated. Does it fit in your physical memory?

Typically, when executables die immediately it is because the BSS will not fit in memory.

ron

Thanks Ron.

The code does not execute its first (write) statement.

The mystery remains:

This version (requiring more memory) segfaults:

~/duct3d >size duct3d
   text    data       bss       dec      hex filename
8202967   50328 763420256 771673551 2dfecdcf duct3d

This one loads and runs:

~/duct3d >size duct3d
   text    data       bss       dec      hex filename
8202879   50328 763420256 771673463 2dfecd77 duct3d

The bss seems the same in both - I have not increased the common dimensions yet.


What is the difference in the 2 duct3d binaries? Same compiler option?

And what version of 11.0 are you using? I would hope 11.0.081 or .083.

The binaries are built on the same server on which you run them?

Since OpenSUSE 11.1 is not officially supported, I do not have a comparable system to run testing. Hopefully the answers you provide above will help.

ron


The difference is that the first one, which fails, uses 50% more memory.
I am using compiler version 11.0.074.
UPDATE: NOW INSTALLED 11.0.083. Problems persist.
The binaries are created on the machine which is to run them.
I may have to change linux distro. A bit of a nuisance.

I have carried out the following test: the small program (see below)
uses about the same memory as the simulation code.

ifort -O3 main.f

main.f(17): (col. 8) remark: LOOP WAS VECTORIZED.
main.f(17): (col. 8) remark: LOOP WAS VECTORIZED.
main.f(17): (col. 8) remark: LOOP WAS VECTORIZED.
main.f(17): (col. 8) remark: LOOP WAS VECTORIZED.
/tmp/iforti9wySb.o: In function `MAIN__':
main.f:(.text+0x1b6): relocation truncated to fit: R_X86_64_32S against `.bss'
main.f:(.text+0x1e3): relocation truncated to fit: R_X86_64_32S against `.bss'
/opt/intel/Compiler/11.0/074/lib/intel64/libifcore.a(for_init.o): In function `for_get_fpe_counts_':
for_init.c:(.text+0x2a): relocation truncated to fit: R_X86_64_PC32 against symbol `for__l_undcnt' defined in .bss section in /opt/intel/Compiler/11.0/074/lib/intel64/libifcore.a(for_init.o)
...

The test program main.f

      Program TEST

      Parameter( NX=771, NY=11, NZ=11, NLY=24, NLZ=24 )

      IMPLICIT REAL*8 (A-H,O-Z)

      Real*8 U(NX*NY*NZ*NLY*NLZ), V(NX*NY*NZ*NLY*NLZ),
     .       W(NX*NY*NZ*NLY*NLZ), P(NX*NY*NZ*NLY*NLZ)

      Real*8 H1(NX*NY*NZ*NLY*NLZ), H2(NX*NY*NZ*NLY*NLZ),
     .       H3(NX*NY*NZ*NLY*NLZ)


       write(6,*) ' -- STARTS:'
       pi = 4.0D0*DATAN(1.0D0)

       Do I = 1, NX*NY*NZ*NLY*NLZ
          U(I) = 0.434*DSIN( 0.001*I*PI)
          V(I) = 0.003*DCOS( 0.001*I*PI)
          W(I) = 0.002*U(I)**2 - DSQRT( DABS(V(I)) )
       End do

       Do I = 1, NX*NY*NZ*NLY*NLZ
          H1(I) = 0.343* U(I) * ( V(I) + 0.001 )
          H2(I) =  W(I) * DLOG( DABS( U(I)  ) )
          H3(I) =  W(I) * DLOG( DABS( H2(I)  ) )
       End do

       Summ =  0.0D0
       Do I = 1, NX*NY*NZ*NLY*NLZ
         summ = summ + 0.003*DSQRT( DABS( H1(i) + H2(I) + H3(I) ) )
       end do
       write(6,*) ' Summ = ', summ

       Stop
       End


I am not sure why your previous 'size' command on the binaries did not reveal a change. When I change NX, NY, NZ, NLY, NLZ I see dramatic changes in my BSS size.

Now, you probably will note that your memory requirements are:

NX*NY*NZ*NLY*NLZ * 8 bytes per element * 7 arrays = 3,009,194,496 bytes (771*11*11*24*24 = 53,735,616 elements per array). Note that this is greater than 2GB and hence will require 64-bit data pointers. Thus you will need to compile with:

-mcmodel=medium -i-dynamic

in order to get this to compile and to set up the binary to run properly using the correct 64-bit pointers. See the -mcmodel compiler option for more details.

The 'relocation truncated to fit' errors above were trying to warn you that your BSS requirements exceed 2GB (which is the largest positive offset a 32-bit signed integer can hold).
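Applied to the test program above, for example, the compile line would become something like:

ifort -O3 -mcmodel=medium -i-dynamic main.f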

ron


With

ifort -O3 -mcmodel=medium -i-dynamic

everything seems OK. Both the test code I posted and the proper simulation code
with the large memory requirement load and run. No problem thus far.

But, with

ifort -c -O3 -openmp -fpp -parallel -mcmodel=medium -i-dynamic (-heap-arrays)

it still segfaults.

Would it be likely that things will be better with Ubuntu 9.04?


No, the version of Ubuntu will not change the behavior. I see the same on 9.04.
I am investigating.

The issue here is that the -openmp option also sets -automatic. -automatic will cause your large arrays to be allocated on the stack. Now, IN THEORY you have unlimited stack; however, most linux distros have some fixed limit on the amount of stack a process can allocate - on some distros this is around 1GB. I'll have to research this a bit for Ubuntu and see what its internal limit is for stack. It may be possible to create a custom kernel. However, if this code is used outside of your own private server, your users will undoubtedly have issues.

The simple example you have shown does not have openmp directives, so I can only assume that perhaps your real code does include openmp directives.

For the example you have shown, one can do this workaround:

      Real*8, save :: U(NX*NY*NZ*NLY*NLZ), V(NX*NY*NZ*NLY*NLZ),
     .                W(NX*NY*NZ*NLY*NLZ), P(NX*NY*NZ*NLY*NLZ)

      Real*8, save :: H1(NX*NY*NZ*NLY*NLZ), H2(NX*NY*NZ*NLY*NLZ),
     .                H3(NX*NY*NZ*NLY*NLZ)

and then you can use -openmp. The 'save' attribute will override the implicit -automatic inherent in -openmp.

With these large arrays you may still have trouble in OMP parallel regions if you try to make them all PRIVATE in the same region. Again, the stack requirements may exceed what your OS is willing to provide. You may have to work on subsets of the data within the parallel regions.
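As a rough sketch (my own illustration, reusing the arrays and one of the loops from the test program above), a parallel region that keeps the big arrays SHARED and makes only the loop index PRIVATE avoids putting a private copy of any large array on a thread's stack:

c     U, V and H1 stay SHARED (the default here); only the loop index
c     is PRIVATE, so no thread puts a ~430MB array copy on its stack.
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(I)
      Do I = 1, NX*NY*NZ*NLY*NLZ
         H1(I) = 0.343* U(I) * ( V(I) + 0.001 )
      End do
!$OMP END PARALLEL DO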

ron


The actual program is a "legacy" Fortran 77 code that I wrote about 15 years ago for
a Cray YMP.

It passes the big field arrays as arguments, and they are always shared in the OpenMP sections.

The main calling program looks like this:

      Program SEDUCT
c     --------------

c        Spectral Element DUCT Code
c        --------------------------


      Parameter( NX=771, NY=11, NZ=11, NLY=24, NLZ=24 )

      IMPLICIT REAL*8 (A-H,O-Z)
      include 'blkio.h'
      include 'blkdata.h'
      include 'blkwork.h'
      include 'blkblkbc.h'
      include 'blktable1.h'
      include 'blktable2.h'
      include 'blktable3.h'
      include 'blkstring.h'
      include 'blkpressure.h'
      include 'blkfft.h'
      include 'blkstats.h'
c
      Real*8 U(NX*NY*NZ*NLY*NLZ), V(NX*NY*NZ*NLY*NLZ),
     .       W(NX*NY*NZ*NLY*NLZ), P(NX*NY*NZ*NLY*NLZ)
      Real*8 H1(NX*NY*NZ*NLY*NLZ), H2(NX*NY*NZ*NLY*NLZ),
     .       H3(NX*NY*NZ*NLY*NLZ)


       write(6,*)'  **  STARTED '

c -  IO channel numbers
c
       NDIN  = 5
       NDOUT = 6
       NDUMPER  = 7
       NRESTART = 8
       NSTOP = 10

       Nyvmin = 1
       Nyvmax = 2

       Nzvmin = 2
       Nzvmax = 2


       Undefined = -9.9D+30
       Pi = 4.0D0*DATAN(1.0D0)
       TwoPi = 2.0D0*Pi


      Call Initarrays(Nx,NY*NLY,NZ*NLZ,U,V,W,P,H1,H2,H3)

      Call Setup(Nx,Ny,Nz,NLY,NLZ,U,V,W,P,H1,H2,H3)

      IF( INITFIELD .eq. 0 ) then
          Call RESTART(Nx,Ny,Nz,NLY,NLZ,NM,ND,U,V,W,H1,H2,H3)
      ELSE IF( INITFIELD .eq. 1 ) then
        STOP
      ELSE IF( INITFIELD .eq. 2 ) then
        STOP
      ELSE IF( INITFIELD .eq. 3 ) then
         Call Init(Nx,Ny,Nz,NLY,NLZ,U,V,W,P)
      ELSE
      write(ndout,'(a,i4)')' *** Error : unknown start option. Abort.'
      STOP
      END IF

      Call Go(Nx,Ny,Nz,NLY,NLZ,U,V,W,P,H1,H2,H3)

      Call Finish

      Stop
      End

Now we are looking at another problem - your subroutine calls in which you pass those big arrays may be creating temporary copies of the arguments. This is because you are using old F77 passing syntax. Ideally, in F90+ you should have INTERFACE blocks for those external subroutines or put those arrays in a MODULE and USE the module. This will avoid array temporaries.

Try this: in a version of the code in which your arrays are smaller and which runs - compile with this option and run:

-check arg_temp_created

I do not see how your subroutines declare the dummy arguments so it's uncertain whether or not you are getting array temporaries.

But from a convenience point of view, why not:

module mydata
   Parameter( NX=771, NY=11, NZ=11, NLY=24, NLZ=24 )
   real(kind=8), save :: U(NX*NY*NZ*NLY*NLZ), V(NX*NY*NZ*NLY*NLZ), &
                         W(NX*NY*NZ*NLY*NLZ), P(NX*NY*NZ*NLY*NLZ)
   real(kind=8), save :: H1(NX*NY*NZ*NLY*NLZ), H2(NX*NY*NZ*NLY*NLZ), &
                         H3(NX*NY*NZ*NLY*NLZ)
end module mydata

Then your program and subroutines just need to have

USE mydata

Then your calls and subroutines are much more elegant

call initarrays( )

...
subroutine initarrays( )
use mydata

... rest of code
end

This will do several things: you avoid array temps for the arguments (less pressure on the stack), you greatly improve performance (no copying of data on subroutine entry and exit), and it simplifies the code.

ron


Thanks Ron.

The -check arg_temp_created option did not throw up any messages during execution. Is this
test a foolproof way of checking for the creation of temp arrays?

The module suggestion is elegant but it may be somewhat complicated to apply.
The arrays concerned are "reshaped" when passed on to some subroutines.


-check arg_temp_created only applies to array temps created for arguments to subroutines. Yes, this means that you are avoiding temps on the subroutine calls.

Module solution - understood; yes, there is probably too much code to rework, as you describe it. Then I think just adding the 'save' attribute to the array declarations should do the trick. If not, would it be possible to tar up the code and attach it to this issue?

Ron,

Your suggestion seems to have been in the right direction.

I have created and used a module into which I have crammed most of the
common blocks and 3 (out of the 7) big field arrays. This has allowed me
to increase the memory by 30% - to a total of ~3.3GB. Attempts to increase
memory use further have led to segfaulting - as before.

I must now find a way of putting the remaining 4 arrays into the module to see
if I can use more of the 7.7GB memory available on my system.

I will be back as soon as I have more on this.
--

A few other suggestions which may help:

The Fortran runtime library will need some memory for buffer space. If you are reading and writing these large arrays with very large record sizes, this will require huge buffers. You may try breaking up the reads/writes into smaller record sizes (hence smaller buffer requirements).

You may wish to explore the -heap-arrays option, although for OpenMP code I hesitate to recommend this. The reason is that if you call procedures from within parallel regions, and -heap-arrays is used, then local variables within those procedures will no longer be thread-private (stack allocated) as one would normally expect. If your parallel regions are self-contained loop nests without procedure calls and you explicitly declare all loop data with PRIVATE and SHARED you should be OK to use -heap-arrays. So do try -heap-arrays with the non-openmp and non-parallel test of the code. And try again with -openmp and compare the results to make sure no anomalies were introduced by changing allocation schemes.
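For instance (my own sketch of that comparison, built from the option sets used earlier in this thread), the two builds could be:

ifort -c -O3 -fpp -mcmodel=medium -i-dynamic -heap-arrays

and then, with OpenMP and auto-parallelization back on, compare the results against:

ifort -c -O3 -openmp -fpp -parallel -mcmodel=medium -i-dynamic -heap-arrays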

Using the options -g -traceback may help identify where the segfault occurs. This is not foolproof, as allocations call MALLOC, which then transitions to a kernel call. If the segfault occurs in the kernel, unwinding the stack is sometimes impossible. We're looking at improvements in stack unwinding for future products.


Ron,

I have now put all arrays into the module. I have been able to go up to 5.3GB.

When I increase the memory requirements further (but within the system limits) I cannot
run, but now the message has changed:

~/d3d_p/test >./a.out
Killed

The first statement - a write - is not executed.
It seems now that the system kills it for some other reason. No idea what that
might be.

You have again guessed right. I use huge record sizes like this:

WRITE/READ(IO,Err=100,END=200)
(((((U(I,J,K,JL,KL),I=1,N1OLD),J=1,M2),K=1,M3),JL=1,N2),KL=1,N3)

I will break it up. But are these buffers created when the read/write call is made,
or does the compiler anticipate the need and create the buffers right at the start?

Incidentally, the use of modules seems to have increased the ratio of
user CPU time to elapsed time from 3.4 to 3.65 - this is for four threads. Good thing.
--


Now we are getting into an interesting topic with this write statement. I had to think about this one a bit.

The way you have this coded, with the implied do loops: since the implied-do bounds are not constants (are they?), the compiler has to assume that you are potentially writing non-contiguous data (are you writing/reading non-contiguous blocks?). In cases like this, the runtime will create an array temporary to stage the data into contiguous memory before initiating the I/O from that temporary. Think of it as a 'gather' operation, if you are familiar with that terminology.

If these arrays were not so large, you could do:

write(IO,err=100,end=200) U

HOWEVER, many OSes limit records to 2GB or less. Since you are scaling up, this may be a constraint for you; if you don't anticipate reaching 2GB per array, though, this would be a fine solution. In the syntax above, the compiler knows that the memory is contiguous, hence no staging buffer.

OR, you could write/read contiguous sections of the array. Something like this, say:

do i=1,NLZ

write(IO,err=100,end=200) U( :, :, :, :, i )

end do

Here the F90 array syntax makes it clear that we're writing contiguous sections of the array. Note that I used an explicit do loop, which guarantees that NLZ records are written. If you used an implied do loop, we would be back to the large-record situation.

I hope this makes a little sense. The runtime buffering and use of temporaries vary from vendor to vendor, but I believe the syntax above gives every compiler the best opportunity of avoiding array temporaries.
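The corresponding read would follow the same pattern (my sketch, assuming the file is read back with the same NLZ-record structure it was written with):

      do i = 1, NLZ
c        each record holds one contiguous U(:,:,:,:,i) section
         read(IO, err=100, end=200) U( :, :, :, :, i )
      end do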

ron

My original question has been answered (and more), but before flagging the thread as answered,
I would like to know if there is a way of finding out how the total memory required by an executable
arises, i.e. some sort of list where the various arrays are printed with their respective sizes. The linux/unix
command size is not very informative. I find that sometimes the result from the size
command and my tally of the various arrays can differ by 1GB.

And a related question: Are allocatable arrays accounted for in the results of the size command? And,
are the multiple copies of the private arrays used in the threads accounted for in it too?

Thanks
--


Unfortunately, there is no way to account for the dynamic data prior to runtime. And during runtime, the linux tools such as 'top' and 'vmstat' are your best bet. But keep in mind that you will see memory usage stay high for a while after your process exits, as linux is slow to show those pages as freed. I'll usually keep a second monitoring window open with 'vmstat 2', watching the 'free' column and looking for any activity under 'swap', along with the cpu columns.

I do have a recommendation, however. A tool named 'ganglia' is quite good for monitoring clusters. I am not sure if it is applicable to a single server, but I do not see why it would not work. This tool can keep records of your server's memory usage. I've used it to monitor my jobs (and those of others) to watch the memory high-water marks and to see if/when paging occurred. It's quite a good tool, and it can save historical data so you can go back and get reports on older runs as well. It's all html and graphical, quite slick.

ron

Many, many thanks Ron.
