How to read a general csv file

How to read a general csv file

I am given a number of very large csv files. A representative line from these files is

4/8/2013,12:54:55.242500,8.986848E-3,-1.775244E-2,...

This sample contains 4 values, delimited by commas:
1.  a date string "4/8/2013"
2. a time string "12:54:55.242500"
3. a data value, 8.986848E-3
4. a data value, -1.775244E-2

And it goes on, but this is enough to illustrate the problem. How to read this data? Note that the text items are not delimited by quotes; also the numerical values are expressed in exponential format but with different field widths, and in general they might be in F fields again of varying widths. Ideally I would use a list-directed (*) format to get the numerical values, but my effort stumbles and results in nonsensical results. It is not necessary that I actually read the first two character values. But how can I skip them and still use (*) format?

Data files of this nature are easily read by Excel, for example, and I am hard-pressed to convince others that Fortran is appropriate for this sort of application. Can somebody help me? (The files typically contain 24 values per line, and 90,000 lines, and I have to read hundreds of them and process the data they contain. Hence I want something more powerful than Excel). 

publicaciones de 9 / 0 nuevos
Último envío
Para obtener más información sobre las optimizaciones del compilador, consulte el aviso sobre la optimización.

I have done this many times. The best way I have found is to approach it by writing a solution that addresses the specific formats being read.

I approach this in 2 parts:
first identify the 4 fields, seperated by a comma.
Store them as either character fields(4)*30 or as an index to the first character of each field.
do i = 1,len_trim(line)
   if (line(i:i)/=',') cycle
   nc = nc+1
   eof(n) = i
end do
eof(n+1) = i

second, for each field, parse the field to calculate the number.

For your example:
field 1 : date field: so identify 3 integers seperated by /; eg day/month/year. Convert this into days since reference, eg 1-Jan-2000, JDATE is needed here.
field 2 : time field: so identify 2 integers seperated by :. Convert this to day
fields 3 & 4 : floating point numbers, so use an internal read.

Make sure you do an error test for all numbers, using both iostat and also range sensibility tests. Hopefilly you will get no errors reported.

Don't solve the world; just the fields you are expecting.  Do what excel does when doing "Text to Columns"

John

Don't use *.  You will find that the slash character used in the date does some very interesting things to the input processing.

Read the line into a character buffer (if you are writing to F2003 - a deferred length allocatable variable that can accomodate arbitrary line lengths).  Tokenise that buffer on the basis of the location of the commas (for extra points ignore commas that appear in quoted strings - you will need to select a particular flavour of "CSV" that specifies how quotes are handled/escaped, etc).  Again, if using F2003 you might tokenise to a[n array of derived type with a component that is a] deferred length character.  Write the module that has the necessary procedures and types to do this once.

Then process each field generated by the tokenisation as appropriate.  When reading a real field, prepare a format (E, ES, F... it doesn't matter...) using an internal write to a character buffer that has the field width in the format set to match the actual field width and that has ".0" for the decimal part of the format. That is, if `field` is a character variable that holds the text of a particular field...

character(10) :: fmt
write (fmt, "('F',I0,'.0')") LEN(field)
READ (field, fmt) some_real_variable

You will need to write your own date and time parsers, chopping the field up using whatever delimiters are appropriate.  While writing your parser, write a rude letter to the person that generated the CSV data, telling them to repetitively write out the text of ISO8601 until they promise to be good data exchange citizen in future.

You will make life easier if you read it into excel, redefine the column formats and then export as CSV. You want text in quotes otherwsie you will be stuffed parsing text that contains a comma for example. do you need the data as xx/xx/xxx format? You can get excel to ouput as text record to save having to parse it.

If you have consistent (and known) formats than reading is trivial needed only a few lines of code. You can even add a header string to each column with text for the fortran format for that column....

You read it into a text string long enough to contain all the data and then break it out by fields delimited by commas.

Linda

Thanks for these quick tips and suggestions. I will investigate and adapt.

For now, I have found through trial and error that the following simple READ is sufficient to get by:

CHARACTER(8) cDUM8
CHARACTER(16) cDUM16
REAL A(:,:)
READ (5,'(A8, 1X, A15, 1X, 20G15.1)') cDUM8, cDUM15, (A(IREC, IVAL), IVAL = 1, NVALS)

The main thing I have learned is that a general number edit descriptor can be used, G15.1 in this case, and if the real field is shorter, it will be adapted because of the comma delimiters. So 15 is simply "big enough" in my case (risky). Also, if a decimal is present in the input it will override the edit descriptor.  So the numbers are read just fine. The weakness is with the text strings.

This temp solution doesn't parse out the contents, but that could be added. The risk is that string fields may be different from the 8 and 15 sizes I have assumed. If they are shorter, perhaps the PAD= specifier can solve that and pad the strings with blanks; then the strings could be declared "just big enough". More trial and error needed.

Here's what surprises me: despite being somewhat anal about reading books and manuals (and the IVF documentation, I can't find anything that describes the conditions I have indicated above using reverse engineering).

Your format string in the READ statement will not process the date field correctly if the day-of-the-month and/or the month are expressed with two digits. Similar problems may occur with the time string, depending on what the input data contains.

I personally wouldn't use that approach.  I think it is just too error prone.

- I'm a bit surprised (disappointed) that the runtime doesn't caterwaul when it cops things other than the permitted (given a certain comma mode) characters for a numeric input field.  Other fortran compilers may not be so forgiving.

- If your fields don't line up as you expect, then you run the risk of significant digits from one field being considered in another.  That's bad.

- For input don't use a non zero decimal digits specifier (i.e. make the ".1" to be ".0" instead), otherwise if the user types in "1" rather than "1.0" then the compiler will take the value to be "0.1" - which would make most users not accustomed to providing input on punched cards rather cross.

From my personal experience with .csv files there's no golden rule to read such files in a Fortran program. It starts with the field separator. The defaults for separators differ among language versions which might be overridden by options the user can set for his individual MS-Excel or even an individual file. The separator used mostly is the comma, but semicolon, colon, point and others are possible. I recommend to read each record of the csv file a text string, identify the fields by the separator used, including a check for separators within quoted text strings, as suggested by IanH. To my mind it is inevitable to know the type of each field, as John Campbell noticed. The field can be read internally according to its type, e. g. if it is integer, float, date, character etc. I thinks that's the most safe and general procedure. 

Inicie sesión para dejar un comentario.