Read data from file of unknown size

I need to write a general routine to read data from files of unknown size. The files consist of an unknown number of records of comma-delimited values. The number of values in a record is not known, but all records are similar. All of the results need to be stored in an array say A(NVALS, NRECS). The number of records may be quite large (thousands), so A needs to be dynamically allocated. How can I do this?

Part 1: How to read the first record and determine the number of values in it, NVALS?

Part 2: How to read the records, each time adding the values to A? I can use the crude method of reading all the records in a loop without storing the values until EOF, just to count NRECS; then rewind, allocate A (assuming NVALS has been determined), and reread. But surely there is a more elegant or faster way?


I usually use the steps you suggested to get the exact values for NVALS and NRECS. For the second part it is sufficient to read each record without interpreting the values in it, by using a read(x,y) with no variable name(s). The interpretation of ASCII characters is what consumes processor time. If the file is stored on an SSD, a RAM disk or a RAID volume the processing time could be reduced further.

You have a couple of options - one strategy below.  Your idea to read and count rows, then rewind and read data isn't necessarily all that bad - but the following presents an alternative using a single read pass.

Reading a record (or line, depending on the file's access mode) of arbitrary length is easy enough to do with non-advancing input.  Deferred-length character then lets you do that read into a variable-length string.  Too lazy to describe it in words, so something rather like:

  !> Reads a complete line (end-of-record terminated) from a file.
  !! @param[in]     unit              Logical unit connected for formatted 
  !! input to the file.
  !! @param[out]    line              The line read.
  !! @param[out]    stat              Error code, positive on error, 
  !! IOSTAT_END (which is negative) on end of file.
  !! @param[out]    iomsg             Error message - only defined if 
  !! stat is non-zero.  
  SUBROUTINE GetLine(unit, line, stat, iomsg)    
    USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: IOSTAT_EOR
    ! Arguments    
    INTEGER, INTENT(IN) :: unit
    CHARACTER(:), INTENT(OUT), ALLOCATABLE :: line
    INTEGER, INTENT(OUT) :: stat
    CHARACTER(*), INTENT(OUT) :: iomsg    
    ! Local variables    
    ! Buffer to read the line (or partial line).
    CHARACTER(256) :: buffer    
    INTEGER :: size           ! Number of characters read from the file.    
    line = ''
    DO
      READ (unit, "(A)", ADVANCE='NO', IOSTAT=stat, IOMSG=iomsg, SIZE=size)  &
          buffer
      IF (stat > 0) RETURN      ! Some sort of error.
      line = line // buffer(:size)
      IF (stat < 0) THEN
        ! End of record is the normal way a complete line terminates.
        IF (stat == IOSTAT_EOR) stat = 0
        RETURN
      END IF
    END DO    
  END SUBROUTINE GetLine

Once you have that line it is then easy enough to count delimiters (commas) for simple input (if you want to be more robust you can select a particular variety of CSV and consider things such as quoting and escaping).  Perhaps one procedure that simply counts the fields, then another procedure that actually returns the fields in a row into a one dimensional array that is of the right length (rows and columns allocated as part of the process described below).  Use both procedures for the first row, use the latter procedure only for the second and subsequent rows.  Unless your line lengths are also completely regular you would still use the GetLine thing above to actually retrieve each subsequent row.
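A minimal sketch of the field-counting procedure just described, assuming plain comma-separated fields with no quoting or escaping (the function name is hypothetical):

```fortran
! Count the fields in one line of simple CSV (no quoting/escaping).
! A line with no commas is treated as containing one field.
PURE FUNCTION field_count(line) RESULT(n)
  CHARACTER(*), INTENT(IN) :: line
  INTEGER :: n
  INTEGER :: i
  n = 1                              ! First field needs no delimiter.
  DO i = 1, LEN(line)
    IF (line(i:i) == ',') n = n + 1
  END DO
END FUNCTION field_count
```

The companion procedure that returns the fields would walk the line the same way, but record the start and end of each field instead of merely counting commas.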

For handling a variable number of rows I'd be inclined to go for an exponentially growing allocatable buffer.  Pick some reasonable initial guess for the number of rows.  Allocate the 2D buffer to that size (number of columns from the field count above).  Start reading data in, row by row, tracking how many rows of the buffer are used.  When the buffer is full, allocate a temporary to twice the size, copy the buffer's data across, then move_alloc the temporary across to the buffer.  Repeat.  When end of file is hit, chop the buffer down to the actual size in use and you're done. 

There are tradeoffs in selection of the initial buffer size between the maximum memory footprint of your program and the number of times/amount of time spent copying the buffer to the temporary when the buffer size needs to change.  (Similar tradeoffs exist with the GetLine thing - and once you've read one line you could adapt the size of its internal buffer, given latter lines are likely to be of similar length to the first.)
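The grow step described above might be sketched like this (names hypothetical; REAL data assumed):

```fortran
! Double the first dimension of a 2D allocatable buffer, preserving
! the rows already stored in it.
SUBROUTINE grow_rows(a)
  REAL, ALLOCATABLE, INTENT(INOUT) :: a(:,:)
  REAL, ALLOCATABLE :: tmp(:,:)
  ALLOCATE (tmp(2*SIZE(a,1), SIZE(a,2)))
  tmp(1:SIZE(a,1), :) = a          ! copy existing rows into the temporary
  CALL MOVE_ALLOC (tmp, a)         ! a takes over tmp's storage; tmp deallocated
END SUBROUTINE grow_rows
```

At end of file the trim is just `a = a(1:nused, :)` (relying on F2003 reallocation on assignment), or a final MOVE_ALLOC through a right-sized temporary.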

Interesting, IanH. I never knew of advance='no'; very useful.

Perhaps after reading the first record the number of records can be estimated from the size of the file, so only one or zero buffer swaps are needed, depending on whether it is an over- or under-estimate:

1) Get the file's size
2) read the 1st record, note its length
3) filesize / recordlength = number of records
4) parse record 1 for number of fields (comma delimiters)
5) record count x number of fields = size of array
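Step 1 can also be done in standard Fortran (F2003) with INQUIRE; a sketch, treating the result only as an estimate since later records may differ in length (filename and unit number are placeholders):

```fortran
PROGRAM estimate_records
  IMPLICIT NONE
  CHARACTER(512) :: line          ! assumes the first record fits in 512 chars
  INTEGER :: fsize, reclen, nrec_est, ios
  ! SIZE= returns the file size in file storage units (normally bytes),
  ! or -1 if the size cannot be determined.
  INQUIRE (FILE='DataFile.txt', SIZE=fsize)
  OPEN (10, FILE='DataFile.txt', STATUS='OLD')
  READ (10, '(A)', IOSTAT=ios) line
  IF (ios == 0 .AND. fsize > 0) THEN
    reclen = LEN_TRIM(line) + 1   ! +1 for the record terminator (+2 for CR/LF)
    nrec_est = fsize / MAX(reclen, 1)
    PRINT *, 'Estimated records:', nrec_est
  END IF
  CLOSE (10)
END PROGRAM estimate_records
```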

RECURSIVE FUNCTION file_exists (fname) RESULT (size)

	USE IFWIN                       ! Intel extension: Win32 API declarations

	CHARACTER(LEN=*),INTENT(IN) 	:: fname  ! full pathname to file

	TYPE(T_WIN32_FIND_DATA)			:: fdata

	INTEGER(HANDLE)                 :: ihandl

	INTEGER							:: size, rval

	size = -1

	ihandl = FindFirstFile (fname, fdata)

	IF (ihandl /= INVALID_HANDLE_VALUE) THEN

		size = fdata%nFileSizeLow

		rval = FindClose (ihandl)   ! needed to avoid memory leak

	ELSE

	    ! file not readable error; size stays -1

	END IF

END FUNCTION file_exists
INTEGER FUNCTION arg_count (string, nc)

	CHARACTER(LEN=*),INTENT(IN) 	:: string

	INTEGER,INTENT(IN)				:: nc     ! number of characters to scan

	CHARACTER(1),PARAMETER			:: comma = ','

	INTEGER						:: j

	arg_count = 0

	IF (nc > 0) THEN

		arg_count = 1               ! first value needs no delimiter

		DO j = 1, nc

			IF (string(j:j) == comma) arg_count = arg_count + 1

		END DO

	END IF

END FUNCTION arg_count

Thanks Ian and Paul for this help. I was hoping that there might be some kind of special dynamic allocation for arrays that allowed them to grow as new values are assigned, somewhat like character strings, but I guess not.

Ian's solution is much like I was thinking, but I don't know why "buffer" is used in addition to "line", with the concatenation then required inside the read loop.

In Paul's method: I know I wasn't clear about this, but while I can assume that all records have the same number of values, the values may be in different formats (E or F) or have different numbers of significant digits, so I cannot assume that all records have the same number of characters. So steps 1 - 3 to get the number of records are too risky in this case. The rest of the method is intriguing but rather complex for me (and my coworkers; better to stick to more basic Fortran).

Here is the backbone of my eventual solution (details omitted for clarity):

CHARACTER(1)   :: $DELIM = CHAR(9) ! 9 if tab, or ',' ' ' etc
CHARACTER(1)   :: $CH
REAL, ALLOCATABLE :: A(:,:)
OPEN(5, FILE='DataFile.txt')
!  Read first record one character at a time and count delimiters.
NVAL = 1 ! First value requires no delimiter
DO
     read (5,'(A)', advance='no', eor=10) $ch
     if ($ch == $delim) nval = nval + 1
END DO
10 CONTINUE
! According to delimiters the first line contains NVAL values.
! Read first record to verify (not shown)
! Read rest of file (without saving values) to get a record count.
NREC = 1 ! First record already consumed by the count above
DO
     READ (5, *, END=20)
     NREC = NREC + 1
END DO
20 REWIND (5)
! Read the data for real now that size requirements are known.
ALLOCATE (A(NREC, NVAL))
DO IREC = 1, NREC
     READ (5, *, IOSTAT = IERR) (A(IREC, IVAL), IVAL = 1, NVAL)
     ! ierr trap not shown for clarity.
END DO
DEALLOCATE(A)! Not needed but good practice

(Please excuse the use of END= and EOR= error exits to statement labels. I know this is outdated and not recommended but here it really does result in the simplest and most readable code (at the risk of starting a heated debate!))

In real application there would be a couple of read err traps.

Any discussion of the merits of this approach vs. the other two suggestions?



Paul's method can be used to estimate the size. Enlarge this size by your uncertainty, e.g. x1.2 for a 20% margin of error. Allocate larger than what you need, then count as you fill. Use the filled portion of the array, A(1:nRead), for the remainder of the program.

Jim Dempsey

In GetLine, `buffer` is a fixed-length buffer that reduces the number of READs (and hence maybe the amount of underlying runtime activity) and the number of reallocate-and-copy actions that are required in order to progressively assemble `line`.  You are reading the record in buffer-sized chunks and appending to `line`.  `buffer` could be one character long, but then the overhead associated with the other activity in the loop might become material. 

You can append a row at a time to an allocatable array; but the problem is efficiency - particularly if you are talking large data sets.  The 2D case is a little messy relative to the trivial 1D case, but you can squeeze it all into a single statement if you are particularly pathological...

array = RESHAPE( [RESHAPE(TRANSPOSE(array), [SIZE(array)]), row_to_append],  &
[SIZE(array,1)+1, SIZE(array,2)], ORDER=[2,1] )

(The /standard-semantics option is required.  I'm not recommending the above at all... just saying it can be done.  It took me longer to write the above single statement snippet than the entire rest of this post - which is hardly a promising sign for code clarity...)
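For contrast, the trivial 1D case that the above generalises is just allocatable assignment with an array constructor (F2003 semantics), which is exactly where the efficiency problem comes from:

```fortran
PROGRAM append_demo
  IMPLICIT NONE
  REAL, ALLOCATABLE :: v(:)
  v = [1.0, 2.0]
  v = [v, 3.0]     ! v is reallocated and fully copied on every append: O(n) each time
  PRINT *, SIZE(v)
END PROGRAM append_demo
```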

My personal policy is not to use list-directed input to process data in production code, due to there being a number of "surprising" features for novice users and the difficulty of validating input (e.g. a stray comma on a line would be silently swallowed, a bare "\" (or is it "/" - can never remember - which is telling...) in a field causes all sorts of hidden fun, etc.).  Perhaps this is paranoia, but I have a little mirror attached to my monitor to remind me of one dumb user of my code that is particularly notorious for making stupid mistakes.  Consequently I would rather tokenise the line and internal-read each value separately.  Perhaps a bit of coding pain, but you only have to write your robust "read REAL array from CSV file" procedure once, ever.
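The per-field internal read just described might look something like this (a sketch; the procedure name is hypothetical and error reporting is abbreviated):

```fortran
! Convert one already-extracted field to a REAL via an internal read,
! so a bad field is detected for that field alone rather than being
! silently swallowed by a list-directed read of the whole line.
SUBROUTINE parse_real(field, value, stat)
  CHARACTER(*), INTENT(IN) :: field
  REAL, INTENT(OUT) :: value
  INTEGER, INTENT(OUT) :: stat   ! non-zero if the field is not a valid REAL
  READ (field, *, IOSTAT=stat) value
END SUBROUTINE parse_real
```

Because the tokeniser has already isolated the field, a failure here can be reported with the exact row and column of the offending value.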

Starting an identifier with $ is not standard conforming and I don't see any benefit from that practice, only portability downside.  Given F95 I don't bother deallocating my local allocatables.  F90 only compilers would be pretty rare these days, plus invariably I require bits of F2003.

If you did have fixed length records and wanted to go down the size of file approach I'd consider using the standard INQUIRE statement to get the file size, but there can still be trickiness going from record length as seen by program to record length in the file (consider record delimiters in the file).

(One downside to the read and count, allocate, rewind, read and parse approach is if your input isn't coming from a real file, but a pipe (redirected standard input or similar) - in that case the REWIND will (probably) have conniptions.  Perhaps not applicable here, but I have written the odd Fortran utility that expects data to be shovelled into it via INPUT_UNIT.)

Based on your method, rather than count the number of items in the first record only, you could count the number of items in each record, looking for the maximum number of items in a record while also counting the number of records. This would allow for variable record lengths, if that is a possibility.
This requires two passes of the file, but is a simpler approach.
Tab-delimited items would be preferred.


Regarding "starting an identifier with $": Yes, I know it is not standard, but it is an extension supported by Intel. Why do I do it? Because I find it very convenient to name variables with a little implicit pop-up flag that identifies the type, without having to refer back to the declarations block. Especially when the length of the procedure exceeds one screenful or one page. Thus, even though I am a firm believer in IMPLICIT NONE, I still (almost always) begin integers with I through N and reals with everything else. When the $ became available I started using it for character strings, and it has simplified my life. I believe it also helps others who are reading my code (because, when I try to read code from others who don't follow this practice--namely Windows people, not math people--it drives me bonkers). Now if only there were a way to signify logicals...

Why not use a first-letter prefix?

dFEE ! REAL(8)
cTEXT ! character
bLOGICAL ! while you could use l, it is hard to tell l and I apart

Jim Dempsey

This is a pretty good suggestion, Jim. Especially where names are normally in all caps (I use both UC and lc systems, but that's another story). I will experiment with it.
