I wrote a fairly straightforward test program to begin examining the benefits of performing asynchronous I/O while post-processing CFD flow fields, or even while performing the actual (DNS) CFD computations. The test program times some I/O operations because I wanted to explore two potential techniques for overlapping I/O with computation. The data is separated into different volumes for each time realization, and each volume is split into wall-normal (k) planes, each stored in its own unformatted file. My two strategies were either to read in an entire volume ahead of the one I am currently processing, or to read in a single k plane in advance of the data currently being processed. I have attached the three files needed to compile the test program. If you like, I can even send you the data (possibly > 1 GB, though) if you want to test it.

The problem: the serial/non-parallel case works fine, and each of the two parallel strategies works fine as long as I read no more than 13 files (k planes). If I read 14 or more (out of 106 per volume), the code simply hangs. No error messages, no crashes. If I attach strace -p to the process while it is hung, I see no activity (whereas I see plenty when I look during the serial run without asynchronous I/O). I have tried this on two different machines (although the data is in the same place, mounted over NFS, I believe) and the same thing happens. I have not yet opened it in a debugger, since TotalView doesn't forward X windows to my home machine properly, but I plan to do this when I get back to work on Monday.

Below is some optional motivational information for the two strategies, then some contextual information about my background, and finally some notes on each of the three files I have attached.
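For concreteness, the asynchronous read pattern in question is the Fortran 2003 one: open with asynchronous='yes', start a read that returns before the data arrives, and later block on a WAIT. This is only a minimal sketch; the unit number, file name, and array shape here are placeholders, not the ones from my actual program:

```fortran
! Minimal sketch of a Fortran 2003 asynchronous unformatted read.
! Unit number, file name, and array shape are placeholders.
program async_read_sketch
   implicit none
   integer, parameter :: ni = 64, nj = 64
   real    :: plane(ni, nj)
   integer :: aid, ios

   open(unit=10, file='kplane_001.dat', form='unformatted', &
        access='sequential', asynchronous='yes', status='old', &
        action='read', iostat=ios)
   if (ios /= 0) stop 'open failed'

   ! Start the read; control returns before the transfer completes.
   read(unit=10, asynchronous='yes', id=aid, iostat=ios) plane

   ! ... do unrelated computation here while the read is pending ...

   ! Block until this particular pending transfer has finished.
   wait(unit=10, id=aid, iostat=ios)
   if (ios /= 0) stop 'wait failed'

   close(10)
end program async_read_sketch
```

Note that the ID= value from the READ is what ties the later WAIT to that specific pending transfer; a WAIT (or CLOSE) on the unit without an ID waits for all pending transfers on that unit.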
The first strategy requires reading multiple unformatted files asynchronously and concurrently. In the past I have encountered embarrassingly parallel problems whose work is stored in different files but on the same device, and attempted a similar technique: I had n sets of files, each of which could be processed independently. Option 1: write a script that processes them in series. Option 2: write a script that, using PBS, runs each case on an individual node concurrently. The problem was that the parallel case took longer than the serial case. I suspect this was because the processing was almost entirely I/O and all the data was stored on the same device, so each node would ask the same device/data server for a different file all at once. The server would find file 1, start reading it and sending it to the node running my program, then before it could finish would find file 2, start reading it, and so on, and then have to seek back to where it was in file 1, in effect causing more disk-head seeks than the serial case. By my logic, depending on how asynchronous I/O has been implemented, I could encounter a similar situation using strategy 1 of my test program.
As an alternative to this I devised strategy 2: only one file is asynchronously opened and read at a time, but I can still perform calculations lagging behind the data currently being read.
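The usual way to structure strategy 2 is double buffering: while the computation works on plane k in one buffer, the read of plane k+1 is in flight into the other. A rough sketch of that loop, with placeholder file names, unit numbers, and plane counts (not my actual code), might look like:

```fortran
! Sketch of strategy 2: double-buffered read-ahead over k planes.
! File names, unit numbers, and the plane count are placeholders.
program kplane_pipeline_sketch
   implicit none
   integer, parameter :: ni = 64, nj = 64, nk = 14
   real    :: buf(ni, nj, 2)       ! two buffers: one in flight, one in use
   integer :: k, cur, nxt, aid(2), ios
   character(len=32) :: fname

   ! Prime the pipeline: start reading plane 1 into buffer 1.
   write(fname, '(a,i3.3,a)') 'kplane_', 1, '.dat'
   open(11, file=fname, form='unformatted', asynchronous='yes', &
        status='old', action='read')
   read(11, asynchronous='yes', id=aid(1)) buf(:, :, 1)

   do k = 1, nk
      cur = mod(k - 1, 2) + 1
      nxt = mod(k, 2) + 1

      ! Kick off the read of plane k+1 into the other buffer
      ! before touching plane k.
      if (k < nk) then
         write(fname, '(a,i3.3,a)') 'kplane_', k + 1, '.dat'
         open(10 + nxt, file=fname, form='unformatted', &
              asynchronous='yes', status='old', action='read')
         read(10 + nxt, asynchronous='yes', id=aid(nxt)) buf(:, :, nxt)
      end if

      ! Wait for plane k, then process it while plane k+1 streams in.
      wait(10 + cur, id=aid(cur), iostat=ios)
      if (ios /= 0) stop 'wait failed'
      close(10 + cur)
      ! ... process buf(:, :, cur) here ...
   end do
end program kplane_pipeline_sketch
```

At any moment at most one read is pending, so this version avoids the many-concurrent-requests seek pattern described above while still hiding the read latency behind computation.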
Where I am coming from: contextual background information about my programming abilities
I am a graduate student researching supersonic and hypersonic compressible turbulence, which can require the use of some very large data sets (sometimes ~1 TB). I do not necessarily have the best background in computer architecture/science. Since I am not a professional developer, I write code mainly for readability, portability, generality (where possible), and speed (when necessary), so please keep all of this in mind when replying.
testasynchIO.f90: the main test program. Takes 4 command-line arguments: case, i-dimension, j-dimension, k-dimension.
case is s|p|b, corresponding to the serial, parallel, and block (or stencil) strategies. The i-dimension and j-dimension are integers giving the size of the data arrays in the first and second dimensions, and are fixed by the contents of each k-plane file. The k-dimension may vary from 1 to 106; it determines the number of k planes to read in and hence the extent of the variables in the third dimension.
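For reference, picking up these four arguments needs nothing beyond the standard intrinsics. A minimal sketch (names and the usage message are placeholders; my utility module does the real error handling):

```fortran
! Sketch of parsing the four arguments: case, ni, nj, nk.
program args_sketch
   implicit none
   character(len=8) :: case_flag, buf
   integer :: ni, nj, nk, ios

   if (command_argument_count() /= 4) stop 'usage: prog s|p|b ni nj nk'
   call get_command_argument(1, case_flag)
   call get_command_argument(2, buf); read(buf, *, iostat=ios) ni
   call get_command_argument(3, buf); read(buf, *, iostat=ios) nj
   call get_command_argument(4, buf); read(buf, *, iostat=ios) nk
   if (nk < 1 .or. nk > 106) stop 'k-dimension must be 1..106'
   print *, trim(case_flag), ni, nj, nk
end program args_sketch
```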
A module with some very basic utilities for getting command-line arguments, handling errors related to them, and printing progress bars, spinners, etc.
A module containing derived types/data structures, as well as various scientific constants and integer constants corresponding to various data types.
A binary executable that I compiled, which may or may not work on your machine.
Thanks in advance for any help. If you think there is a better way to do what I am trying to do, I am happy to discuss it with you.