System running out of memory when using asynchronous I/O

System running out of memory when using asynchronous I/O

PaulB's picture

Hi,

I have recently replaced synchronous read and write operations by asynchronous operations (unformatted, sequential access) in a FORTRAN application. The application is faster and output the correct results with small datasets but it crashes with large ones (it has to read and write 20GB temporary files a couple of times).

I suppose that it happens because the system is running out of memory. Indeed, I realized that the data+stack usage (DATA column of the top command) keeps increasing from the launch of the application and reaches 60GB when it crashes (48GB of RAM on the machine). To compare, only 300MB are used when running the application with synchronous calls.

I dont think that there is any logic mistake in the source code as I observe the same behavior when doing a WAIT immediately after an asynchronous READ or WRITE. Calling FLUSH right after the I/O operations has no effect on the memory use either.

I suspect that it may be due to the system buffer cache filling up but I am not sure.

Does anybody have any idea about the issue? Is there any way to limit the amount of buffer cahceused by the asynchronous I/O operations in FORTRAN?

Thanks a lot,

Paul

19 posts / 0 new
Last post
For more complete information about compiler optimizations, see our Optimization Notice.
Ronald W Green (Intel)'s picture

Paul,

There are environment variables and compiler options to control buffer - documented in the User Guide.

How big are the individual records (recordsize)? You aren't trying the read/write an entire huge array with one IO operation, are you?

Second thing I'd check: your WAIT hopefully is using ERR, IOMSG, and IOSTAT to make sure everything is a-ok on the wait. also if you are using IDs on the transfer and WAIT, make sure they are scalar integer variables.

PaulB's picture

Thank you for your answer Ronald.

I have tried to use the BUFFERED, BUFFERCOUNT and BLOCKSIZE specifiers in the OPEN statement with different values but the issue with the memory is still there.

I am not writing an entire huge array in one operation. Each record that is written is around 13MB.

I am using ERR and IOSTAT with the WAIT statement but no error in reported. The WAIT operations are successful.

And regarding the ID specifier used with the WAIT statement, it is an integer. To make sure that the WAIT operations were successful, I printed the value of the ID right after each WRITE operation and right before a WAIT. The ID is always 0 which indicates that the IDs are correctly reused and that the WAIT are successful.

The only way I found to reduce the memory usage is to limit the stacksize (limit command) but by doing that the performance of the program with asynchronous I/O is worse than with synchronous I/O.

PaulB's picture

In order to identify the problem I wrote a small test subroutine. When running it the VIRT, SWAP and DATA columns of top keep increasing and reach 2.7g. If I replace the asynchronous write and waitby WRITE(UNIT=30,ERR=999,IOSTAT=IOS) VAL, the memory usage stays very low.

Did I make a mistake in my code?

SUBROUTINE TESTIO()
INTEGER IDTAG,I,VAL,IOS
VAL=55
OPEN(30,FILE='testio.tmp',FORM='UNFORMATTED',ASYNCHRONOUS='YES',ERR=777,IOSTAT=IOS)
DO I=1,100
WRITE(UNIT=30,ASYNCHRONOUS='YES',ID=IDTAG,ERR=999,IOSTAT=IOS) VAL
WAIT(UNIT=30,ID=IDTAG,ERR=888,IOSTAT=IOS)
ENDDO
CLOSE(30,ERR=666,IOSTAT=IOS)
RETURN
666 WRITE(*,*) 'Close Error: ',IOS
STOP
777 WRITE(*,*) 'Open Error: ',IOS
STOP
888 WRITE(*,*) 'Wait Error: ',IOS
STOP
999 WRITE(*,*) 'Write Error: ',IOS
STOP
END

Ronald W Green (Intel)'s picture

cool, a good simple reproducer. One question while I start some tests - what version of the compiler do you use?

ron

PaulB's picture

I use the Intel Fortran 64 Compiler 11.1 (Build 20100806)

Paul

jimdempseyatthecove's picture

Ron,

RE: Paul's problem

I seem to recall a similar problem report (quite a while ago) and I will try to paraphrase the issue.

The WRITE with ASYCHRONOUS would (in the affected version) allocate a temporary object on the stack for each write. These temporary objects would be returned at stack cleanup time on return from the subroutine containing the asychronous write. (not returned at competion of the WRITE statement)

Note, "on return from the subroutine containing the asychronous write".

Therefore, should your asychronous writes be contained within the same subroutien as your control loop, the local stack could be consumed. The work-around was to move the asychronous write to a different subroutine, which returns the ID for later use in the WAIT

Jim Dempsey

www.quickthreadprogramming.com
PaulB's picture

Thank you Jim for your comment. I tried to put the WRITE in a separate subroutine but it does not solve the memory problem.

I manage to limit the memory usage by limiting the stack size but the program then runsslower than with synchronous writes.

jimdempseyatthecove's picture

Paul,

Can you compile your "little" subroutine with WRITE as with assembler listing, then post the listing file here?

Something may show up that isn't obvious.

Jim

www.quickthreadprogramming.com
PaulB's picture

Hi Jim,

I attached the assembler listing. By comparing this listing with the one generated when using synchronous writes, the main difference is the block from line 140to 208 that does not exist in the synchronous version. However, I don't understand the different operations performed in it.

I tried the Valgrind memcheck tool on the test program to detect memory issues and got the following result. It saysthat there is a memory leakon the line where the asynchronous WRITE is performed.

Memory leaks are also reported by the Intel Inspector XE tool:
[Unknown](45476): Error X2003: : Memory leak: Allocation site: Function pthread_cond_wait: Module libpthread.so.0
[Unknown](45476): Error X2004: : Memory leak: Allocation site: Function pthread_cond_wait: Module libpthread.so.0

And Intel Inspector XE also finds deadlocks:

[Unknown](138806): Warning X1: : Lock hierarchy violation: Lock owned: Function for_asynchronous: Module testiopg
[Unknown](139726): Warning X2: : Lock hierarchy violation: Lock owned: Function for_asynchronous: Module testiopg
[Unknown](139726): Warning X3: : Lock hierarchy violation: Allocation site: Function for_asynchronous: Module testiopg
[Unknown](140624): Warning X4: : Lock hierarchy violation: Lock owned: Function for__aio_acquire_lun: Module testiopg
[Unknown](140706): Warning X5: : Lock hierarchy violation: Lock owned: Function for__aio_acquire_lun: Module testiopg
[Unknown](141276): Warning X6: : Lock hierarchy violation: Allocation site: Function for__aio_acquire_lun: Module testiopg
[Unknown](133689): Error X7: : Deadlock: Lock owned: Function for__aio_release_lun: Module testiopg
[Unknown](133689): Error X8: : Deadlock: Lock wanted: Function for__aio_release_lun: Module testiopg
[Unknown](133879): Error X9: : Deadlock: Lock wanted: Function for__aio_release_lun: Module testiopg
[Unknown](134052): Error X10: : Deadlock: Lock wanted: Function for__aio_release_lun: Module testiopg
[Unknown](135946): Error X11: : Deadlock: Lock owned: Function for_waitid: Module testiopg
[Unknown](135946): Error X12: : Deadlock: Lock wanted: Function for_waitid: Module testiopg
[Unknown](136144): Error X13: : Deadlock: Lock wanted: Function for_waitid: Module testiopg
[Unknown](136157): Error X14: : Deadlock: Lock wanted: Function for_waitid: Module testiopg
[Unknown](136529): Error X15: : Deadlock: Lock wanted: Function for_waitid: Module testiopg
[Unknown](138806): Error X16: : Deadlock: Lock owned: Function for_asynchronous: Module testiopg
[Unknown](138806): Error X17: : Deadlock: Lock wanted: Function for_asynchronous: Module testiopg
[Unknown](140841): Error X18: : Deadlock: Allocation site: Function for__aio_acquire_lun: Module testiopg
[Unknown](141276): Error X19: : Deadlock: Allocation site: Function for__aio_acquire_lun: Module testiopg
[Unknown](141371): Error X20: : Deadlock: Lock wanted: Function for__aio_acquire_lun: Module testiopg
[Unknown](138806): Error X21: : Deadlock: Lock owned: Function for_asynchronous: Module testiopg
[Unknown](141276): Error X22: : Deadlock: Allocation site: Function for__aio_acquire_lun: Module testiopg

Paul

Memcheck output:
==28573== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 1)
==28573== malloc/free: in use at exit: 368,032 bytes in 3,001 blocks.
==28573== malloc/free: 3,020 allocs, 19 frees, 385,756 bytes allocated.
==28573== For counts of detected errors, rerun with: -v
==28573== searching for pointers to 3,001 not-freed blocks.
==28573== checked 2,095,904,472 bytes.
==28573==
==28573== 96,000 bytes in 2,000 blocks are definitely lost in loss record 2 of 3
==28573== at 0x4A04B32: calloc (vg_replace_malloc.c:279)
==28573== by 0x30C9C0B1A8: pthread_cond_wait@GLIBC_2.2.5 (in /lib64/libpthread-2.5.so)
==28573== by 0x425571: for_asynchronous (in testiopg)
==28573== by 0x402ECE: MAIN__ (testiopg2.f:18)
==28573== by 0x402CE1: main (in testiopg)
==28573==
==28573==
==28573== 272,000 bytes in 1,000 blocks are possibly lost in loss record 3 of 3
==28573== at 0x4A04B32: calloc (vg_replace_malloc.c:279)
==28573== by 0x30C8C0FEF2: _dl_allocate_tls (in /lib64/ld-2.5.so)
==28573== by 0x30C9C06C29: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.5.so)
==28573== by 0x4254D7: for_asynchronous (in testiopg)
==28573== by 0x402ECE: MAIN__ (testiopg2.f:18)
==28573== by 0x402CE1: main (in testiopg)
==28573==
==28573== LEAK SUMMARY:
==28573== definitely lost: 96,000 bytes in 2,000 blocks.
==28573== possibly lost: 272,000 bytes in 1,000 blocks.
==28573== still reachable: 32 bytes in 1 blocks.
==28573== suppressed: 0 bytes in 0 blocks.

Attachments: 

AttachmentSize
Download assembler_listing.11.88 KB
Download testiopg.f826 bytes
jimdempseyatthecove's picture

Paul,

In examining the assembler output, it appears that the optimizer removed the code generated for the

WAIT(30,ID=IDTAG)

On a Win32 x32 Debug build the WAIT generates a

call _for_waitid

But in your assembler listing there is no call to this subroutine. I think the compiler generated broken code.
Without the WAIT you might end up with 1000 in-flight WRITES (which I think exceeds an internal limitation).

Try placing theWAIT in an external subroutine

[fxfortran]      program testiopg     
      INTEGER IDTAG,I
	  INTEGER, PARAMETER :: N=1000
	  REAL, ALLOCATABLE ::  VALUES(:)
	  ALLOCATE(VALUES(N))
	  DO I=1,N
		VALUES(I)=I
	  ENDDO
      OPEN(30,FILE='testio.tmp',FORM='UNFORMATTED',ASYNCHRONOUS='YES')
      DO I=1,1000
		WRITE(UNIT=30,ASYNCHRONOUS='YES',ID=IDTAG) VALUES
        call doWAIT(30,IDTAG)
      ENDDO  
      CLOSE(30)   
	  DEALLOCATE(VALUES)
      end program testiopg
      
CDEC$ ATTRIBUTES NOINLINE :: doWAIT
      subroutine doWAIT(IOU,IOID)
        INTEGER IOU,IOID
        WAIT(IOU,ID=IDTAG)
      end subroutine doWAIT[/fxfortran]

If that doesn't work,make doWAIT an external subroutine and turn off InterProcedural Optimizations (IPP), at least for that file.

Jim Dempsey

www.quickthreadprogramming.com
Ronald W Green (Intel)'s picture

Paul,

This is odd - I am using the same compiler, 11.1.073 build 20100806, on Intel 64 for 64 bit applications. My command line:

ifort -O0 -S testiopg.f

but my assembly file is much bigger and has the call to for_waitid:

-rw-r--r-- 1 rwgreen support 12162 Feb 24 13:50 testiopg_bad.s
-rw-rw-r-- 1 rwgreen support 17294 Feb 24 13:51 testiopg.s

$ grep wait *.s
testiopg.s: call for_waitid #18.9

Let's get some more information from you. What OS are you using, and version? I've tested on Ubuntu 10.10 and on RHEL 5.2.

Try this command and send the output:

ifort -O0 -dryrun

my output is attached

ron

Attachments: 

AttachmentSize
Download dryrun_ron.txt2.4 KB
PaulB's picture

Hi,

I tried to generate assembler listings again and this time the call for_waitid appears. I may not have attached the correct listing in my previous post, sorry for that.

Anyway, I have managed to have the test program attached (testiopg.f90) working by using ifort 10.1 instead of ifort 11.1. I have attached the listing and dryrun output generated with both versions. The test program was compiled with the command ifort o testiopg testiopg.f90.

While the test program compiled with ifort 10.1 runs correctly with VIRT and SWAP being stable at 9.9m, the same program compiled with ifort 11.1 uses more and more memory (VIRT and SWAP increase to around 60g) and crashes with the error forrtl: severe(12): message not found on the line where WRITE is performed.

There is also another thing. When I compile the test program with ifort 10.1 but by adding the threads flag, the same behavior than with ifort 11.1 occurs: It uses more and more memory and crashes. I attached the listing and dryrun output for that version too (ifort10_threads).

Any idea about what is going on? I saw in the ifort documentation that threads, openmp or reentrancy threads must be specified on Linux to use asynchronous IO. Does the program use synchronous IO instead of asynchronous IO when one of these flags is not specified?

Paul

Attachments: 

jimdempseyatthecove's picture

Paul,

Using IVF 11.0.066 [IA-32], Windows XP x32,memory footprint remains flat at around 1GB.
and
Using IVF 11.0.066 [Intel 64], Windows XP x64,memory footprint remains flat at around 2.2GB. (other stuff loaded/running).

Something may have happened between 11.0.066 and 11.1, or a difference in the runtime libraries (me Windows XP, you Linux??)

Jim Dempsey

www.quickthreadprogramming.com
PaulB's picture

Thank you very much Jim for trying to help me!

I finally found outthat the root of the problem is the libpthread library. Apparently I am not the only one to observe memory leaks with that library on Linux.

When I execute pmap (memory map of a process)while running my program I get the following:
...

00002aaabdbfd000 2048K rwx-- [ anon ]
00002aaabdcfd000 4K ----- [ anon ]
00002aaabdcfe000 2048K rwx-- [ anon ]
00002aaabddfe000 4K ----- [ anon ]
00002aaabddff000 2048K rwx-- [ anon ]
00002aaabdeff000 4K ----- [ anon ]
00002aaabdf00000 2048K rwx-- [ anon ]
00002aaabe000000 4K ----- [ anon ]
00002aaabe001000 2048K rwx-- [ anon ]
00002aaabe101000 4K ----- [ anon ]
00002aaabe102000 2048K rwx-- [ anon ]
00002aaabe202000 4K ----- [ anon ]
(hundreds more lines like that)
...

I suspect that each 2048k block corresponds to a thread stack space. It seems that the libpthread library is creating a new thread for each asynchronous I/O operation but that it never releases the stack space of those threads.

The test program runs correctly if I link the libpthread library statically. However, the program on which I am working also uses the Intel MKL library and I get errors when trying to compile it with libpthread linked statically:

/usr/lib64/libpthread.a(ptw-nanosleep.o)(.text+0x1a): In function `nanosleep':
: undefined reference to `__syscall_error'
/usr/lib64/libpthread.a(ptw-nanosleep.o)(.text+0x68): In function `nanosleep':
: undefined reference to `__syscall_error'
/usr/lib64/libpthread.a(pthread.o)(.text+0x7f): In function `__pthread_initialize_minimal':
: undefined reference to `__libc_setup_tls'/usr/lib64/libpthread.a(pthread.o)(.text+0x86): In function `__pthread_initialize_minimal':
: undefined reference to `_dl_cpuclock_offset'/usr/lib64/libpthread.a(pthread.o)(.text+0xc0): In function `__pthread_initialize_minimal':
: undefined reference to `_errno'/usr/lib64/libpthread.a(pthread.o)(.text+0xcb): In function `__pthread_initialize_minimal':
: undefined reference to `_h_errno'/usr/lib64/libpthread.a(pthread.o)(.text+0x289): In function `pthread_initialize':
: undefined reference to `_res'/usr/lib64/libpthread.a(pthread.o)(.text+0x3cc): In function `pthread_initialize':
: undefined reference to `_dl_init_static_tls'
/usr/lib64/libpthread.a(pthread.o)(.text+0x947): In function `__pthread_reset_main_thread':
: undefined reference to `_errno'
/usr/lib64/libpthread.a(pthread.o)(.text+0x955): In function `__pthread_reset_ma in_thread':
: undefined reference to `_h_errno'
/usr/lib64/libpthread.a(pthread.o)(.text+0x963): In function `__pthread_reset_ma in_thread':
: undefined reference to `_res'
/usr/lib64/libpthread.a(ptw-read.o)(.text+0x1a): In function `read':
: undefined reference to `__syscall_error'
/usr/lib64/libpthread.a(ptw-read.o)(.text+0x72): In function `read':
: undefined reference to `__syscall_error'
/usr/lib64/libpthread.a(ptw-close.o)(.text+0x1a): In function `close':
: undefined reference to `__syscall_error'
/usr/lib64/libpthread.a(ptw-close.o)(.text+0x5e): In function `close':
: undefined reference to `__syscall_error'
/usr/lib64/libpthread.a(ptw-open.o)(.text+0x1a): In function `open':
: undefined reference to `__syscall_error'
/usr/lib64/libpthread.a(ptw-open.o)(.text+0x72): more undefined references to `__syscall_error' follow

Does anybody know if the MKL library can be used when linking libpthread statically? Is there any alternative to that threading library to do asynchronous I/O?

If I cannot manage to have asynchronous I/O working, I thought about using openmp to create separate threads and performing synchronous I/O in these threads. Any experience with that approach?

Paul

jimdempseyatthecove's picture

Paul,

When you use OpenMP to start a worker thread (e.g. for your I/O) .AND. if your app or a library (MKL) uses OpenMP, then you will need to enable the OpenMP nested parallel regions feature (many systems default to disabled nested regions). An environment variable can be used, or a runtime library call can be made. You could also use a pthread thread.

Because your I/O thread will likely require a synchronization method, e.g. wait for data to write, and/or block read till buffer empty, you might find itbetter to use a pthread thread.

In your mocked up code you were not performing work between the WRITE and WAIT. What happens between the WRITE and WAIT on your production code?

If you are doing nothing, then consider using synchronous I/O and consider increasing the I/O buffer size.

If you have code, regardless of using an OpenMP thread, pthread thread, or asynchronous thread, your code must be made aware that the data list on the write is a list of data pending to be written. And this data should not be modified until after the WAIT completes (completion of I/O).

Jim Dempsey

www.quickthreadprogramming.com
PaulB's picture

Jim,

I just implemented synchronous I/O in separate threads with OpenMPand I am very happy of the result. The performance gain is the same than when using asynchronous I/O and there is no issue with the amount of memory used.

The program contains an iteration loop that sequentially reads data from a couple oftemporary files, performs a complex computation and finally writes the results of the computationto a temporary file. The amount of data read and written from files being pretty large (up to 50MB per read/write), the program can really benefit from asynchronous I/O. My implementation uses one OpenMPsection for the computation and one for each of the read/write operations.Using a circular buffer, Iread the input of iteration i+1 and write the output of iteration i-1 while computing iteration i.

The computation stage contains FFT calculations done with the MKL library. I currently use the sequential version of that library (the FFTs are too small to benefit from the multithreadedlibrary)so I don't have nested parallel regions. However,I'll be careful if I decide to go with the multithreaded version one day. Thanks for the tip and for your help.

Paul

jimdempseyatthecove's picture

Paul,

You will likely find the two thread synchronous I/O just as efficient as asynchronous I/O, and it may actually end up being more efficient. A circular buffer (or a circular n-buffers) is a good way to handle it. If you feel adventuresome, add additional computational threads. This should be relatively easy for you now that you have the framework layed out. Currently you likely have a flag or indicator to signal how far the read has completed, and possibly a second flag (or means) to indicate when the computation is finished with the buffer. Enlarging your circular buffer by a factor of n, and compute threads by a factor of n should be relatively trivial (use arrays of flags). Dedicate each compute thread to each portion of the circular buffer. This coding technique is called a parallel pipeline. Give it a try, you may be pleased by the results.

RE: the FFTs are too small to benefit from the multithreadedlibrary

If you use the multi-threaded MKL library and set MKL up to use 1 thread, then you will have a thread safe version of MKL. Read the MKL docs, as the single threaded library may be thread safe. With a thread safe MKL library you can process multiple FFT's concurrently. That may be your next optimization project.

Jim Dempsey

www.quickthreadprogramming.com
Ronald W Green (Intel)'s picture

Paul,

My oversight - you were right a few posts ago - you DO need to use -threads or -openmp when using asynchronous IO. I should have caught that.

If we link with -shared-intel, we see that

ifort -o iotest iotest.f90 -shared-intel
ldd iotest
...
libifcore.so.5 => /opt/spdtools/compiler/cpro/Compiler/11.1/073/lib/intel64/libifcore.so.5

is linked against the single-threaded NON-REENTRANT Fortran runtime library. For asynchronous IO, this is unsafe (as evidenced by the memory leak) because each IO operation spawns a thread to do the IO in background.

Now if we compile with -threads:
ldd iotest
libifcoremt.so.5 => /opt/spdtools/compiler/cpro/Compiler/11.1/073/lib/intel64/libifcoremt.so.5

we bring in the multithreaded, thread-safe Fortran runtime. Also, -openmp also selects this runtime.

I'm a little surprised that ASYNCHRONOUS would actually be allowed without -threads OR, as you conjectured, perhaps the runtime should detect that it's not the multithreaded runtime and do the IO operations synchronously. I'll check into this.

ron

Login to leave a comment.