Solved

Fastest CSV parser?

Posted on 2014-04-07
976 Views
Last Modified: 2014-05-15
Hello Experts, I need to find a very fast way to parse very large CSV flat files. Now, I don't intend to read the file data and "keep" it in memory, nor load it into a database - I need to read and parse the CSV data as quickly as possible, "look" for certain aberrations in the data, fix them up, then immediately write the record right back out to a file. So it's basically just read, fix data, write back out... over and over again.

Thank you all
    Shawn
Question by:shawn857
96 Comments
 
LVL 36

Expert Comment

by:Geert Gruwez
ID: 39985249
Personally, I'd fix it at the source, where the file is written, but I'm guessing that's out of your hands, right? If it isn't, then place it in the hands of the person responsible; after all, what is the point in writing wrong information? You could point out that people have gone to prison for giving misleading information. That would solve your problem instantly.

But if you really want to go the long way to Rome (because eventually everyone does go to Rome... or was it Israel, near that small black cubicle and that wall where everyone cries? I forget which it was), then:

You need to consider that file operations are among the slowest operations there are.
Will the replacement text be the same size as the original?
For example, if the text "to be replaced" becomes "was replaced 2", then technically you don't need to rewrite the complete file: just put the write file pointer at the correct place in the file and write the new text.
That will save you masses of time.
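Geert's same-size-replacement idea might be sketched in Delphi like this (a minimal illustration; OverwriteAt, the offset, and the replacement text are hypothetical, and the replacement must be exactly as long as the bytes it overwrites):

```pascal
uses Classes, SysUtils;

// Overwrite bytes in place without rewriting the rest of the file.
procedure OverwriteAt(const FileName: string; Offset: Int64;
  const NewText: AnsiString);
var
  fs: TFileStream;
begin
  // open read/write without truncating the file
  fs := TFileStream.Create(FileName, fmOpenReadWrite or fmShareDenyWrite);
  try
    fs.Position := Offset;
    fs.WriteBuffer(NewText[1], Length(NewText));
  finally
    fs.Free;
  end;
end;
```

For instance, OverwriteAt('data.csv', 120, 'was replaced 2') would replace the 14 bytes starting at offset 120; a shorter or longer replacement would corrupt the record layout.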

Secondly, since you are writing a very big file, it will probably be faster to cut the file up into segments and process them in simultaneous threads. You'll have to start reading a little before the start of each segment (except the first) to fix any overlap at the segment ends and beginnings. With some advanced low-level coding it's possible to concatenate multiple files into one:
http://msdn.microsoft.com/en-us/library/windows/desktop/aa363778%28v=vs.85%29.aspx
Some sites suggest resetting the end-of-file pointer to the beginning of the next file, but that requires the files to be equally aligned. Technically it should be possible, as download managers do this all the time.

By the way, there is also a comment from Russel on regular expressions for replacing text:
http://www.experts-exchange.com/Programming/Languages/Pascal/Delphi/Q_21228228.html
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 39985362
Can you post a small example of your CSV? Also, can you give more info on the parsing you need to do, e.g. is it just a case of replacing certain characters with others, or something more complicated?
 
LVL 25

Expert Comment

by:Sinisa Vuk
ID: 39985404
Hmmm, will your new fixed data expand or shrink fields in the CSV?
(e.g. ABCDEFGH -> ACBD)
Do you know what you're looking for in the CSV file, or do you need proper/real CSV parsing?
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 39985422
Here is a small console demo on how you can parse your csv file.

My demo simply deletes $ signs from the sample (which I attached), but of course you could change the ParseLine function to do whatever parsing you like.

Can you give an indication of the size of your CSV file?

program csv_parse;

{$APPTYPE CONSOLE}

uses
  SysUtils;


procedure DoParse;
var FileIn, FileOut: text;
    buf, sFileIn, sFileOut, overwrite: string;

  function ParseLine:string;
  var i: integer;
  begin
    result:= '';
    if (buf <> '') then
    begin
      //for demo, we replace all $ with blanks
      for i:= 1 to length(buf) do
        if buf[i] <> '$'
          then result:= result+buf[i];
    end;
  end;

begin
  sFileIn:= ParamStr(1);
  sFileOut:= ParamStr(2);

  //ensure file exists
  if not FileExists(sFileIn) then
    raise exception.Create('ERROR-> File not found: '+sFileIn);

  //ensure output not same as source
  if (sFileIn=sFileOut) then
    raise exception.Create('ERROR-> input_file and output_file may not be the same file: '+sFileIn);

  //prompt user for overwrite if file already exists
  if FileExists(sFileOut) then
  begin
    Writeln('output_file already exists. Overwrite(Y/N)?');
    ReadLn(overwrite);
    if LowerCase(overwrite) <> 'y' then
      raise exception.Create('ABORT-> Aborted by user');
  end;


  Writeln('Parsing file...');
  AssignFile(FileIn, sFileIn);
  AssignFile(FileOut, sFileOut);
  reset(FileIn);
  Rewrite(FileOut);
  try
    while not eof(FileIn) do
    begin
      //SetConsoleCursorPosition
      ReadLn(FileIn, buf);
      WriteLn(FileOut, ParseLine);
    end;
    Writeln('Parsing complete.');
  finally
    Close(FileIn);
    Close(FileOut);
  end;
end;

procedure OutputUsage;
begin
  writeln('csv_parse v1.0');
  writeln('-------------------------------------------');
  writeln('USAGE: csv_parse input_file output_file');
  writeln('');
  writeln('input_file: expects a csv filename and path');
  writeln('output_file: new file to be created');
end;

begin
  try
    if (ParamCount <> 2)
      then OutputUsage
      else DoParse;
  except on E:Exception
    do Writeln(E.Message);
  end;
end.


sample.csv
 
LVL 45

Assisted Solution

by:aikimark
aikimark earned 50 total points
ID: 39985808
In your (related) string concatenation question, http:Q_28406019.html ,
you asked about the JEDI project.  It contains a TJvCsvDataSet class.

Also, there is a delimited text option for ADODB.

=================
What kind of 'cleaning up' activity do you need to do?

If you're looking for the best raw speed, you might need to use buffered I/O. You read in some number of bytes, split the string by CrLf, and then process the items in the resulting array/list. The last (possibly partial) 'chunk' is prepended to the next chunk you read. I usually recommend reading several tens of thousands of bytes at a time (~50k) and varying that buffer size until you find the performance sweet spot. Also, it helps if your buffer is a multiple of your hard drive's segment size.

I would have suggested the TStringList.LoadFromFile() method, but it sounds like your file is too big for memory. The CommaText property is a pretty handy parser.
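For files that do fit in memory, a quick illustration of the CommaText parser (a sketch; note that in Delphi 7 an unquoted space is treated as a separator just like a comma):

```pascal
uses Classes;

var
  sl: TStringList;
begin
  sl := TStringList.Create;
  try
    // quoted items survive intact, but the unquoted space in
    // 'two three' also splits, so this yields 4 items:
    // one / two / three / four, five
    sl.CommaText := 'one,two three,"four, five"';
    WriteLn(sl.Count);
  finally
    sl.Free;
  end;
end.
```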
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 39985898
My previous example can be optimised quite a bit by doing block reads and working with buffers (as suggested by aikimark).

I tested on a sample of 1 million records (10 columns per record); on my machine, my initial demo parses it in about 7.5 seconds. After changing to the block-read approach, it came down to about 2.3 seconds. Would this be fast enough? How many records are we talking about? How fast do you need it to be?
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 39986144
OK, I managed to get it down to 0.6 seconds on my machine using my sample of 1 million records (10 columns). Here's how (enjoy):
program csv_parse2;

{$APPTYPE CONSOLE}

uses
  SysUtils, windows;

const buf_size = 65535;
var buf_read : array[0..buf_size-1] of AnsiChar;
    buf_write: array[0..buf_size-1] of AnsiChar; //used to hold data for writing to output_file
    unparsed: array of AnsiChar;  //used to hold unparsed excess i.e. between end of block read and last #13

procedure DoParse;
var sFileIn, sFileOut, overwrite: string;
    len, len_unparsed, len_bufwrite, bytes_read, hFileIn, hFileOut: integer;
    StartTick, EndTick: cardinal;
    p_start, p_end, p_last: integer;


  procedure FlushOutput;
  begin
    if (len_bufwrite > 0)
      then FileWrite(hFileOut, buf_write[0], len_bufwrite);
    len_bufwrite:= 0;
  end;

  procedure ParseLine;
  var i: integer;
  begin
    //process unparsed from previous block read, if any
    for i := 0 to len_unparsed - 1 do
    begin
      if unparsed[i] <> '$' then //demo simply deletes all '$' from csv
      begin
        buf_write[len_bufwrite]:= unparsed[i];
        Inc(len_bufwrite);
      end;
      if (len_bufwrite = buf_size) then FlushOutput;
    end;
    len_unparsed:= 0;

    //process from current block read
    for i := p_start to p_end do
    begin
      if buf_read[i] <> '$' then //demo simply deletes all '$' from csv
      begin
        buf_write[len_bufwrite]:= buf_read[i];
        Inc(len_bufwrite);
      end;
      if (len_bufwrite = buf_size) then FlushOutput;
    end;
  end;

begin
  sFileIn:= ParamStr(1);
  sFileOut:= ParamStr(2);

  len_bufWrite:= 0;
  len_unparsed:= 0;
  Setlength(unparsed, 0); //by default, will grow as needed

  //ensure file exists
  if not FileExists(sFileIn) then
    raise exception.Create('ERROR-> File not found: '+sFileIn);

  //ensure output not same as source
  if (sFileIn=sFileOut) then
    raise exception.Create('ERROR-> input_file and output_file may not be the same file: '+sFileIn);

  //prompt user for overwrite if file already exists
  if FileExists(sFileOut) then
  begin
    Writeln('output_file already exists. Overwrite(Y/N)?');
    ReadLn(overwrite);
    if LowerCase(overwrite) <> 'y' then
      raise exception.Create('ABORT-> Aborted by user');
  end;


  Writeln('Parsing file...');

  hFileIn:= FileOpen(sFileIn, fmOpenRead or fmShareDenyWrite); //open file, allow shared access but don't allow writes
  hFileOut:= FileCreate(sFileOut);
  try
    StartTick:= GetTickCount;

    repeat
      bytes_read:= FileRead(hFileIn, buf_read[0], buf_size);

      //work through buffer
      p_start:= 0;
      p_end:= 0;
      p_last:= bytes_read-1;
      while (p_end <= p_last) do
      begin
        if (buf_read[p_end] = #13) then
        begin
          ParseLine;
          p_start:= p_end+1; //set start and end markers
        end;
        Inc(p_end);
      end;

      //store unparsed
      len:= p_end - p_start;
      if (len>0) then
      begin
        if (len > length(unparsed))
          then Setlength(unparsed, len); //grow unparsed if needed
        len_unparsed:= len;
        move(buf_read[p_start], unparsed[0], len_unparsed);
      end
      else len_unparsed:= 0;

    until (bytes_read < buf_size);
    //parse any trailing data after the last line terminator
    //(a final line without CRLF would otherwise be lost)
    if (len_unparsed > 0) then
    begin
      p_start:= 0;
      p_end:= -1; //buf_read is exhausted; ParseLine drains 'unparsed' only
      ParseLine;
    end;
    //write last block if any
    if (len_bufwrite > 0) then FlushOutput;

    EndTick:= GetTickCount;
    Writeln('Parsing completed in '+IntToStr((EndTick-StartTick) div 1000)+'.'+IntToStr((EndTick-StartTick) mod 1000)+' seconds');

  finally
    FileClose(hFileIn);
    FileClose(hFileOut);
  end;
end;

procedure OutputUsage;
begin
  writeln('csv_parse v1.0');
  writeln('-------------------------------------------');
  writeln('USAGE: csv_parse input_file output_file');
  writeln('');
  writeln('input_file: expects a csv filename and path');
  writeln('output_file: new file to be created');
end;

begin
  try
    if (ParamCount <> 2)
      then OutputUsage
      else DoParse;
  except on E:Exception
    do Writeln(E.Message);
  end;
end.


csv-parse2.dpr
 
LVL 45

Expert Comment

by:aikimark
ID: 39986382
@Pierre

Are you using SSD?
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 39986457
SSD? What is that? You mean Solid State Drive? No I am using a hybrid. Seagate Hybrid SSHD ST1000DX001 1TB 3.5" SATA Desktop Hard Drive (HD-S1000H64)
 
LVL 45

Expert Comment

by:aikimark
ID: 39986524
The hybrid has a bunch of flash memory to buffer the data coming off the platters; they are almost as fast as SSDs. It is likely that your 0.6 sec result came from a second execution of the test (multiple tests are advised) and that your data was cached in the drive's flash memory.

I'm not discounting your code's performance.  I was attributing some of the performance gains to your hardware configuration.
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 39986640
True that. Out of the office for now; I will test tomorrow with a fresh file. It would be interesting to see results from you or the asker. Did I miss any areas for further optimisation?
 
LVL 45

Expert Comment

by:aikimark
ID: 39986820
"Did i miss any areas for further optimisation?"
Nothing that I noticed.
 

Author Comment

by:shawn857
ID: 39987768
Thank you all who answered.

Geert: You are right, fixing the data at its source is "out of my hands" - I have no control. I have to be ready for any kind of crappy data that comes along.




Replaced text will be *almost* the same size as original, but not quite. Here's exactly what I'm doing currently:

- Reading a CSV record from a text file
- parsing it character by character looking for delimiters and putting each data field into a stringlist
- I then inspect each data field in the stringlist doing 3 things:
  (1) deleting all double-quotes (") from the datafield... unless there is a delimiter character "embedded" within the quotes, then I don't do it.
  (2) Trimming leading/trailing spaces from the data field
  (3) If I find TEXT data in what should be a NUMERIC datafield, I write a warning to a log file.
- I then write back out the record to a text file
- Repeat until end of CSV input file.


I would *like* to keep this as a single-threaded operation if I can, because I really don't know much about multi-threaded programming. That would open up a whole new can of worms for me, and I really don't want to add to the complexity of my program.


Pierre asked for a small sample of my CSV... it really could be any kind of CSV data, but here is one small sample:

10548509,9.71502E+11,4.43452E+11,routing,"""+971502347269"" <0971502347269>",SIP/217.168.44.132-000907e4,Local/10548509@voice-6ed4;1,Dial,"Local/10548509@voice,45,gL(3599000)",10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313<CRLF>
10548509,9.71502E+11,10548509,voice,"""+971502347269"" <0971502347269>",Local/10548509@voice-6ed4;2,,AGI,STREAM,10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313<CRLF>
10548513,2089591955,4.43452E+11,routing,"""002089591955"" <002089591955>",SIP/217.168.44.132-000907e8,Local/10548513@voice-b956;1,Dial,"Local/10548513@voice,45,gL(3599000)",10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313<CRLF>


Sinisa: You asked - "Hmmm, will your new fixed data expand or shrink fields in csv? (ex:  ABCDEFGH   ->   ACBD)
Do you know what you looking for (in csv file)  or you need proper/real csv parsing?"

The new re-written data will never be bigger, only smaller. And at that, not by too much - just a few chars here and there. I could be dealing with any kind of CSV data Sinisa, so my program needs to be robust enough to handle anything.


Aikimark: you give good suggestions about reading in a large chunk, processing that, and moving on to the next, to minimize file I/O operations. Does the TJvCsvDataSet class you mentioned do this? Yes, my datafiles could potentially be far too big to load into memory using TStringList.LoadFromFile(). Also, I played with CommaText before and it didn't suit all my needs as it had some limitations, namely treating spaces like field delimiters, which was not good.


Also, I don't know if I mentioned it before, but I am using Delphi 7.


Thanks All !
  Shawn
 
LVL 36

Expert Comment

by:Geert Gruwez
ID: 39987995
I've seen your description before... the lines after
"Here's exactly what I'm doing currently:"

Ah yes... SQL*Loader from Oracle. It's an app designed for just this: loading files of data into a database, and it's very fast.
You wouldn't have an Oracle database handy, I imagine?

but if you do ...
Currently there are a few processes here where CSVs get produced from an ERP system called SAP. One of its nasty habits is putting the minus sign after the number instead of before it to indicate negatives.
Bad SAP example:
> A,456-,B,C,123,D

What we do here is save the files to a shared directory and then use the Oracle database to do any magic.
We create a table (organization external), which is a physical table in the database describing the CSV file and the fields in it.

The table is treated as a normal table, and running selects on it is like on any other table.
The downside is that the database needs to do file I/O, which shows up in the performance monitor we have, but that's it.

If you don't have an Oracle DB then I won't make a sample, as you probably wouldn't be using it.

Just indicating a possibly blazing-fast alternative... :)

Here is the link to the Oracle 11 utilities SQL*Loader concepts:
http://docs.oracle.com/cd/E11882_01/server.112/e22490/ldr_concepts.htm#SUTIL3312
 

Author Comment

by:shawn857
ID: 39988002
No I don't Geert. That would put me in completely uncharted waters - I know nothing about SQL and databases...

Thanks
   Shawn
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 39988022
Hi Shawn.

Based on aikimark's comments, I reran using my sample of 1 million records. I created 3 new files and ran the parser on each of them, to prevent disk caching from influencing the results. Though not a thorough benchmark, it gives a good indication of speed. Here are the results:
File1: 0.639 seconds
File2: 0.640 seconds
File3: 0.639 seconds

In my demo, you should implement your code in the ParseLine function. Since speed is critical, I suggest you don't use the stringlist approach to processing the fields; process the data in the buffers directly. The idea is to avoid unnecessary assignments and memory operations, i.e. read once, process in place.
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 39988034
It would probably be easier if you replaced the following with a function such as ParseChar:

buf_write[len_bufwrite]:= unparsed[i];
Inc(len_bufwrite);



e.g.
procedure ParseChar(AChar: AnsiChar);
  begin
    //do your parsing here. ParseLine will feed this function with characters
    // add parsed data to buf_write 
    // Inc(len_bufwrite) if you add to the buffer
    if (len_bufwrite = buf_size) then FlushOutput;
  end;

  procedure ParseLine;
  var i: integer;
  begin
    //process unparsed from previous block read, if any
    for i := 0 to len_unparsed - 1 
      do ParseChar(unparsed[i]);     
    len_unparsed:= 0;

    //process from current block read
    for i := p_start to p_end 
      do ParseChar(buf_read[i]);
  end;


 
LVL 25

Expert Comment

by:Sinisa Vuk
ID: 39988044
Although a CSV parser is a good approach and more understandable, it is possible to build a
"state machine" with character flow which does on-the-fly parsing and writes to a parallel stream.
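A minimal sketch of that state-machine idea (hypothetical names; it only tracks whether the scanner is inside a quoted field while streaming characters from one stream to another):

```pascal
uses Classes;

type
  TCsvState = (csField, csQuoted);

// Copies InStream to OutStream one character at a time while
// tracking quote state; transformations would go where noted.
procedure StreamParse(InStream, OutStream: TStream);
var
  State: TCsvState;
  c: AnsiChar;
begin
  State := csField;
  while InStream.Read(c, SizeOf(c)) = SizeOf(c) do
  begin
    case State of
      csField:  if c = '"' then State := csQuoted;
      csQuoted: if c = '"' then State := csField;
    end;
    // a comma is only a field boundary when State = csField;
    // apply any per-character transformation here before writing
    OutStream.Write(c, SizeOf(c));
  end;
end;
```

One-byte stream reads are slow on their own; in practice this state tracking would sit inside a block-read loop like Pierre's rather than reading the stream a byte at a time.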

Do you need to overwrite source file with newer one?
 
LVL 45

Expert Comment

by:aikimark
ID: 39988426
If you use one of the string libraries already mentioned above (FastStrings, TStringBuilder, etc.), you can clean your lines with three replace (function) invocations. With your posted data sample, these three replacements:
Replace ,""" with ,"
Replace """, with ",
Replace "" with (nothing)

produced the following
10548509,9.71502E+11,4.43452E+11,routing,"+971502347269 <0971502347269>",SIP/217.168.44.132-000907e4,Local/10548509@voice-6ed4;1,Dial,"Local/10548509@voice,45,gL(3599000)",10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313<CRLF>
10548509,9.71502E+11,10548509,voice,"+971502347269 <0971502347269>",Local/10548509@voice-6ed4;2,,AGI,STREAM,10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313<CRLF>
10548513,2089591955,4.43452E+11,routing,"002089591955 <002089591955>",SIP/217.168.44.132-000907e8,Local/10548513@voice-b956;1,Dial,"Local/10548513@voice,45,gL(3599000)",10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313<CRLF>


Note: if you aren't using one of those external libraries, you should try AnsiReplaceStr() to get a baseline performance time.
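With the stock RTL, the three replacements above might sketch out as follows (AnsiReplaceStr lives in StrUtils; CleanLine is an illustrative name, not from the thread):

```pascal
uses SysUtils, StrUtils;

// Apply the three replacements from the comment above, in order.
function CleanLine(const Line: string): string;
begin
  Result := AnsiReplaceStr(Line, ',"""', ',"');   // ,""" -> ,"
  Result := AnsiReplaceStr(Result, '""",', '",'); // """, -> ",
  Result := AnsiReplaceStr(Result, '""', '');     // drop remaining ""
end;
```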

I don't see an instance of leading or trailing spaces in a field in your data, so I can't offer a cleanup hint. However, these could probably be cleaned up with a couple of replace statements if the leading/trailing spaces were limited to text fields.

How do we know what fields are numeric for your third check?

I tried to avoid the parsing of the lines, but your third check requirement killed that idea.
 

Author Comment

by:shawn857
ID: 39990109
Sinisa, you wrote:

"Although csv parser is a good approach and more understandable, it is possible to build
"state machine" with character flow which do on the fly parsing and write to parallel stream."

>> I've never heard of a "state machine" before so I googled it and found this:

http://sourcemaking.com/design_patterns/state/delphi

Coincidentally enough, here it discusses in detail how to implement this to parse CSV data - giving full code for a CSV "State Machine"! Would using this as the "parsing engine", dovetailed into Pierre's general framework for reading and processing data from a large buffer be an ideal solution? (Pierre: thanks for your detailed framework - it could be way to go)


"Do you need to overwrite source file with newer one? "

>> Sinisa - I don't overwrite the original one, I just write the output to a new file.


Aikimark: yes, it WOULD be more efficient to delete the quotes all in one shot for the whole line, instead of field by field... good point there. But I *still* would need to trim the leading/trailing blanks on a field-by-field basis.
   I then need to check each field to see if it's numeric or not, and I use TryStrToFloat for that.


Thank you All
   Shawn
 
LVL 45

Expert Comment

by:aikimark
ID: 39990574
"I *still* would need to trim the leading/trailing blanks on a field-by-field basis"
Maybe not. If you were only trimming leading and trailing blanks within text fields, then we would not necessarily need to parse the line to do that. My earlier question was whether you only need to trim the leading/trailing blanks within text fields; I do not see your answer to that question.

"I then need to check each field to see if it's numeric or not"
That differs from your prior description, which indicated that you knew which fields were supposed to be numeric and you validated that they were. This description seems to indicate that you don't know which fields are numeric.
 

Author Comment

by:shawn857
ID: 39990755
Aikimark: Sorry for the misunderstandings - yes, I *do* need to trim the leading/trailing blanks from each field, even if it's numeric.

Also you said:
"That differs from your prior description that indicated that you knew which fields were supposed to be numeric and you validated that they were.  This description seems to indicate that you don't know which fields are numeric. "

Sorry, I *do* know what the fields are "supposed" to be (either Text or Numeric), but still I must check each field value with TryStrToFloat because sometimes there are text values where Numeric values ought to be and vice versa... and I need to report those instances to my error log file.

I think there's no way to avoid it - I need to parse for each and every field in the record.
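The per-field check described above could look roughly like this (a sketch; the procedure name, parameters, and warning format are illustrative, not from the thread):

```pascal
uses SysUtils;

// Flags a field that should be numeric but is not; mirrors the
// trim-then-TryStrToFloat check described above.
procedure CheckField(const Value: string; LineNum, FieldNum: Integer;
  ExpectNumeric: Boolean; var Log: TextFile);
var
  v: Extended;
begin
  if ExpectNumeric and not TryStrToFloat(Trim(Value), v) then
    WriteLn(Log, Format('Line %d, field %d: expected numeric, got "%s"',
      [LineNum, FieldNum, Value]));
end;
```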

Thanks!
    Shawn
 
LVL 25

Expert Comment

by:Sinisa Vuk
ID: 39996728
I think you can "look" whole file as one single csv row. If you know that field 3 and 4 must be numeric - then if you count this single row files:

is numeric := (FieldCount div RealCSVRowFields) in (3,4)   .. you'll get information when to check for numerics.

- create two char streams - one for source and one for destination.
- read field by field (iterate first stream for delimiters (comma or CRLF))
- do transformations and write onto second stream.
- this way you will have single pass (not many when use some other method, stringreplace,...)
 

Author Comment

by:shawn857
ID: 39996950
Thanks Sinisa, but I'm sorry I didn't quite understand the wording of your proposal:

"I think you can "look" whole file as one single csv row. If you know that field 3 and 4 must be numeric - then if you count this single row files:

is numeric := (FieldCount div RealCSVRowFields) in (3,4)   .. you'll get information when to check for numerics. "


I couldn't quite follow what you meant. Could you re-explain, please?


Also, what do you think of the CSV parsing "state machine"  I found at:
http://sourcemaking.com/design_patterns/state/delphi

No good...?

Thanks!
   Shawn
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40003332
Shawn, I believe your question has been answered. Can we get closure on this? Are you still missing something?
 

Author Comment

by:shawn857
ID: 40010003
Hi Pierre, I was waiting for Sinisa to respond... I did not quite follow what he was explaining in his last message. And I also wanted to know his thoughts about the CSV "state machine" I found while googling:

http://sourcemaking.com/design_patterns/state/delphi

What do you think about that Pierre? Do you think that calling this CSV-parsing state machine from the framework you gave, would be a very good approach?

Thanks!
    Shawn
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40016913
Sorry for late response, been on holiday :)

Regarding the state machine example you found: it seems like a methodical approach to parsing your input, so yes, you should be able to implement it as part of my example. Personally, I would just parse it directly in my ParseChar function, but that is just what I'm comfortable with.
 

Author Comment

by:shawn857
ID: 40018394
Thanks Pierre. Well, the way I see it overall, there are 3 main areas to "optimize":

(1) File input and reading
(2) The actual CSV parsing of characters
(3) File output and writing

For area # (1), your sample project seems to solve that very well. For (2), I'm going to have to make a timing comparison between my existing CSV parsing engine and an implementation of the "CSV State Machine" I mentioned I found while googling. Also Sinisa's msg on April 12 suggested another method, but I didn't quite follow what he meant where he said:

"I think you can "look" whole file as one single csv row. If you know that field 3 and 4 must be numeric - then if you count this single row files:
is numeric := (FieldCount div RealCSVRowFields) in (3,4)   .. you'll get information when to check for numerics. "

This could give me a 3rd option for finding the ultimate CSV parser...


For area # (3), I have a question please Pierre - is it faster (and feasible) to write each "parsed" line to a big output "buffer", and THEN only do a file write when that buffer is full... or to do a file write of *each* parsed line?

Thanks!
   Shawn
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40019560
Re Sinisa's comment, I also don't know what he means.

For area # (3), I have a question please Pierre - is it faster (and feasible) to write each "parsed" line to a big output "buffer", and THEN only do a file write when that buffer is full... or to do a file write of *each* parsed line?
Performance-wise, it would definitely be better to use the buffered output as per my example, as opposed to writing each parsed line, since it would result in fewer write operations, and that is one of the bottlenecks in this process. Even in my example there may be some optimisation possible by playing with the size of the buffer, but you would have to do some testing to make a call on that.
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40019570
Apologies, I just looked at my previous comments and realised I made a mistake in the declaration:
procedure ParseChar(AChar: AnsiChar);



It should of course be PAnsiChar (to pass by reference, not by value), and the feeding of it should be like this:
ParseChar(@buf_read[i]);



Regarding the state machine, just keep in mind that the fewer memory allocations you have, the faster your code will be. So you should aim to read only once into memory; that is why you should do the parsing directly on the buf_read variable in my example. The way I have put it together, ParseLine works on references to that variable, so no new memory allocations take place, resulting in speed-optimised code. Take that into account if you want to implement that state-machine example. Personally, I would just use variables to track where we are and what is being parsed.
 

Author Comment

by:shawn857
ID: 40021460
"So you should aim to read only once into memory, that is why you should do the parsing directly on the buf_read variable in my example."

Thank you Pierre, I will absolutely keep that in mind - good point... and thanks for your answer to my last question regarding writing the output to file.

I am hoping for Sinisa to re-appear on here and that he will explain his last post a bit further! It appears I have a good handle on the File I/O of my task... the last hurdle being to find the best and fastest way to do the CSV parsing.
   Pierre, while googling I found this:

http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader?msg=4795605#xx4795605xx

Full code for a complete CSV parser - supposedly very fast. The problem is, it appears to be written in .NET, which I have never worked with before, and there appears to be nowhere on that site to contact the author, Mr. Sebastien Lorion.
    I'm wondering how difficult it would be to translate this from .NET to Delphi...

Thanks!
    Shawn
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40022188
I have reworked my example to parse the fields for you based on your requirements i.e.

deleting all double-quotes (") from the datafield... unless there is a delimiter character "embedded" within the quotes, then I don't do it.
Trimming leading/trailing spaces from the data field
If I find TEXT data in what should be a NUMERIC datafield, I write a warning to a log file.


Before I post the code, please confirm if this is the expected output you want:

Your sample lines
10548509,9.71502E+11,4.43452E+11,routing,"""+971502347269"" <0971502347269>",SIP/217.168.44.132-000907e4,Local/10548509@voice-6ed4;1,Dial,"Local/10548509@voice,45,gL(3599000)",10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313
10548509,9.71502E+11,10548509,voice,"""+971502347269"" <0971502347269>",Local/10548509@voice-6ed4;2,,AGI,STREAM,10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313
10548513,2089591955,4.43452E+11,routing,"""002089591955"" <002089591955>",SIP/217.168.44.132-000907e8,Local/10548513@voice-b956;1,Dial,"Local/10548513@voice,45,gL(3599000)",10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313


Output
10548509,9.71502E+11,4.43452E+11,routing,"+971502347269" <0971502347269>,SIP/217.168.44.132-000907e4,Local/10548509@voice-6ed4;1,Dial,"Local/10548509@voice,45,gL(3599000)",10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313
10548509,9.71502E+11,10548509,voice,"+971502347269" <0971502347269>,Local/10548509@voice-6ed4;2,",",AGI,STREAM,10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313
10548513,2089591955,4.43452E+11,routing,"002089591955" <002089591955>,SIP/217.168.44.132-000907e8,Local/10548513@voice-b956;1,Dial,"Local/10548513@voice,45,gL(3599000)",10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313


 
LVL 45

Expert Comment

by:aikimark
ID: 40022326
@Pierre

How do you know which fields should be text and which should be numeric?

If I've read Shawn's comments correctly, we have to make sure that the text fields either do not contain numbers or are not proper numbers (int, float). I'm not exactly sure how stringent the requirement is, but Shawn asserted that he needs to apply TryStrToFloat() to every field.
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40022377
@aikimark

From the example he gave, one could guess. As for using TryStrToFloat, I don't believe it would give the best results; it is better to check for allowed characters.
 
LVL 45

Expert Comment

by:aikimark
ID: 40022537
@Pierre

Since the three sample lines do not include leading or trailing spaces in fields, I assert that the posted lines are not a representative sample of the data.
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40022964
@aikimark
I included leading/trailing spaces in my testing.

Here is my reworked solution:
program csv_parse2;

{$APPTYPE CONSOLE}

uses
  SysUtils, windows;

const buf_size = 65535;
var buf_read : array[0..buf_size-1] of AnsiChar;
    buf_write: array[0..buf_size-1] of AnsiChar; //used to hold data for writing to output_file

procedure DoParse;
var sWarnings, sFileIn, sFileOut, overwrite: string;
    len, len_unparsed, bytes_read, hFileWarnings, hFileIn, hFileOut: integer;
    StartTick, EndTick: cardinal;
    line_start, line_end, line_last, len_bufwrite: integer;
    ptrLineStart, ptrFieldStart, ptrFieldEnd: PAnsiChar;
    ReadCompleted: boolean;
      isQuoted: boolean;
      LineNum, FieldNum: integer;

  procedure LogWarning(warning: AnsiString);
  begin
    FileWrite(hFileWarnings, warning[1], length(warning));
  end;

  procedure FlushOutput;
  begin
    if (len_bufwrite > 0)
      then FileWrite(hFileOut, buf_write[0], len_bufwrite);
    len_bufwrite:= 0;
  end;


  procedure WriteChar(AChar: AnsiChar);
  begin
    buf_write[len_bufwrite]:= AChar;     // add parsed data to buf_write
    inc(len_bufwrite);                // Inc(len_bufwrite) if you add to the buffer
    if (len_bufwrite = buf_size) then FlushOutput;
  end;

  procedure WritePChar(AChar: PAnsiChar);
  begin
    buf_write[len_bufwrite]:= AChar^;     // add parsed data to buf_write
    inc(len_bufwrite);                // Inc(len_bufwrite) if you add to the buffer
    if (len_bufwrite = buf_size) then FlushOutput;
  end;

  procedure ParseField;
  var p,last: PAnsiChar;
      contains_delim: boolean;
  begin
    //quick err check
    if (ptrFieldStart = nil) or (ptrFieldEnd=nil) or (ptrFieldStart>ptrFieldEnd) then
      raise exception.Create('Error: input not in expected format');

    //if field is just one char and ptrFieldStart^ is a comma, the field is empty - exit
    if (ptrFieldStart = ptrFieldEnd) and (ptrFieldStart^=',') then
    begin
      ptrFieldStart:= nil;
      ptrFieldEnd:= nil;
      exit;
    end;

    //if field is quoted, skip first and last
    if ptrFieldStart^ = '"' then inc(ptrFieldStart);
    if ptrFieldEnd^   = '"' then dec(ptrFieldEnd);

    //remove leading spaces
    while (ptrFieldStart^ = ' ') and (ptrFieldStart < ptrFieldEnd)
      do Inc(ptrFieldStart);
    //need error check here, if ptrFieldStart =ptrFieldEnd here, we don't have a value

    //remove trailing spaces
    while (ptrFieldEnd^ = ' ') and (ptrFieldEnd > ptrFieldStart)
      do Dec(ptrFieldEnd);

    //scan for delim
    contains_delim:= false;
    p:= ptrFieldStart;
    while (p<=ptrFieldEnd) do
    begin
      if (p^=',') then
      begin
        contains_delim:= true;
        break;
      end;
      inc(p);
    end;

    //enclose with " if field value contains delim
    if (contains_delim) then WriteChar('"');

    //parse field data
    last:= nil;
    p:= ptrFieldStart;
    while (p<=ptrFieldEnd) do
    begin
      {here you can check for text values where you expected numbers only
       from your example it looks like field one should only allow numbers
       so you could do something like:
       }
      if (FieldNum = 1) and (not(p^ in ['0'..'9']))
        then LogWarning(AnsiString(format('Error on line %d char %d. Expected numbers only, found %s'#13#10,[LineNum, (p-ptrLineStart), p^])));

      if not((p^ = '"') and ((last <> nil) and (last^ = '"'))) //removing double quotes
         or ((p^='"') and contains_delim) //don't remove " if field contains delim
        then WritePChar(p);

      last:= p;
      inc(p);
    end;

    //enclose with " if field value contains delim
    if (contains_delim) then WriteChar('"');

    //reset
    ptrFieldStart:= nil;
    ptrFieldEnd:= nil;
  end;

  procedure ParseLine;
  var i: integer;
  begin
    Inc(LineNum);
    isQuoted:= false;
    FieldNum:= 0;
    ptrFieldStart:= nil;
    ptrFieldEnd:= nil;

    //save LineStart - used to calculate character position for log file entry
    ptrLineStart:= @buf_read[line_start];

    //process from current block read
    for i := line_start to line_end do
    begin
      //do ParseChar(@buf_read[i]);
      //ignore LF
      if (buf_read[i] = #10) then Continue;

      //set start of field value
      if not Assigned(ptrFieldStart) then ptrFieldStart:= @buf_read[i];

      if (buf_read[i] = '"') then isQuoted:= not isQuoted;

      //check for field delimiter
      if (buf_read[i] in [',',#13]) and (not isQuoted) then
      begin
        if (ptrFieldEnd=nil) then ptrFieldEnd:= @buf_read[i]; //in case we reached the end (there won't be a comma)
        Inc(FieldNum);
        ParseField;
        if (buf_read[i] = ',') then WritePChar(@buf_read[i]); //add field delimiter

        //if end of record, reset FieldNum and write CRLF
        if (buf_read[i] = #13) then
        begin
          FieldNum:= 0;
          WriteChar(#13);
          WriteChar(#10);
        end;
      end
      else ptrFieldEnd:= @buf_read[i];
    end;
  end;


begin
  sFileIn:= ParamStr(1);
  sFileOut:= ParamStr(2);

  len_bufWrite:= 0;
  len_unparsed:= 0;

  //ensure file exists
  if not FileExists(sFileIn) then
    raise exception.Create('ERROR-> File not found: '+sFileIn);

  //ensure output is not the same as source
  if (sFileIn=sFileOut) then
    raise exception.Create('ERROR-> input_file and output_file may not be the same file: '+sFileIn);

  //prompt user for overwrite if file already exists
  if FileExists(sFileOut) then
  begin
    Writeln('output_file already exists. Overwrite(Y/N)?');
    ReadLn(overwrite);
    if LowerCase(overwrite) <> 'y' then
      raise exception.Create('ABORT-> Aborted by user');
  end;

  sWarnings:= 'warnings_'+ChangeFileExt(sFileOut, '.log');

  Writeln('Parsing file...');

  hFileIn:= FileOpen(sFileIn, fmOpenRead or fmShareDenyWrite); //open file, allow shared access but don't allow writes
  hFileOut:= FileCreate(sFileOut);
  hFileWarnings:= FileCreate(sWarnings);
  try
    StartTick:= GetTickCount;

    LineNum:= 0;
    repeat

      bytes_read:= FileRead(hFileIn, buf_read[len_unparsed], buf_size - len_unparsed);
      ReadCompleted:= (bytes_read < buf_size - len_unparsed);
      line_last:= len_unparsed + bytes_read-1;
      //len_unparsed:= 0; not needed here, setting it below

      //work through buffer
      if (buf_read[0] = #10) then line_start:= 1 else line_start:= 0; //skip first char if #10
      line_end:= 0;
      while (line_end <= line_last) do
      begin
        if (buf_read[line_end] = #13) then
        begin
          ParseLine;
          line_start:= line_end+1; //set start and end markers
        end;

        Inc(line_end);
      end;

      //move unparsed to start of buf_read for next pass
      len:= line_end - line_start;
      if (len>0) then
      begin
        len_unparsed:= len;
        move(buf_read[line_start], buf_read[0], len_unparsed);
      end
      else len_unparsed:= 0;

      //error check
      if (len_unparsed = buf_size) then //nothing was parsed!
        raise exception.Create('Error: Read buffer too small or no end of line detected');


    until ReadCompleted;
    //write last block if any
    if (len_bufwrite > 0) then FlushOutput;

    EndTick:= GetTickCount;
    Writeln('Parsing completed in '+IntToStr((EndTick-StartTick) div 1000)+'.'+IntToStr((EndTick-StartTick) mod 1000)+' seconds');

  finally
    FileClose(hFileIn);
    FileClose(hFileOut);
    FileClose(hFileWarnings);
  end;
end;

procedure OutputUsage;
begin
  writeln('csv_parse v1.0');
  writeln('-------------------------------------------');
  writeln('USAGE: csv_parse input_file output_file');
  writeln('');
  writeln('input_file: expects a csv filename and path');
  writeln('output_file: new file to be created');
end;

begin
  try
    if (ParamCount <> 2)
      then OutputUsage
      else DoParse;
  except on E:Exception
    do Writeln(E.Message);
  end;
end.



Compiled size: 94kb
Test file lines: 502500
Test file size: 108.6 Mb
Test duration: 2 seconds

Extract from input sample
10548509,9.71502E+11,4.43452E+11,routing,"""+971502347269"" <0971502347269>",SIP/217.168.44.132-000907e4,Local/10548509@voice-6ed4;1,Dial,"Local/10548509@voice,45,gL(3599000)",10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313
1x0548509,9.71502E+11,10548509,voice,"""+971502347269"" <0971502347269>",Local/10548509@voice-6ed4;2,,AGI,STREAM,10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313
10548513,2089591955,4.43452E+11,routing,"""002089591955"" <002089591955>",SIP/217.168.44.132-000907e8,Local/10548513@voice-b956;1,Dial,"Local/10548513@voice,45,gL(3599000)",10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313
105@48509,9.71502E+11,4.43452E+11,routing,"""+971502347269"" <0971502347269>",SIP/217.168.44.132-000907e4,Local/10548509@voice-6ed4;1,Dial,"Local/10548509@voice,45,gL(3599000)",10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313
10548509,9.71502E+11,10548509, voice       ,"""+971502347269"" <0971502347269>",Local/10548509@voice-6ed4;2,,AGI,STREAM,10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313


Extract from sample output
10548509,9.71502E+11,4.43452E+11,routing,"+971502347269" <0971502347269>,SIP/217.168.44.132-000907e4,Local/10548509@voice-6ed4;1,Dial,"Local/10548509@voice,45,gL(3599000)",10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313
1x0548509,9.71502E+11,10548509,voice,"+971502347269" <0971502347269>,Local/10548509@voice-6ed4;2,,AGI,STREAM,10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313
10548513,2089591955,4.43452E+11,routing,"002089591955" <002089591955>,SIP/217.168.44.132-000907e8,Local/10548513@voice-b956;1,Dial,"Local/10548513@voice,45,gL(3599000)",10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313
105@48509,9.71502E+11,4.43452E+11,routing,"+971502347269" <0971502347269>,SIP/217.168.44.132-000907e4,Local/10548509@voice-6ed4;1,Dial,"Local/10548509@voice,45,gL(3599000)",10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313
10548509,9.71502E+11,10548509,voice,"+971502347269" <0971502347269>,Local/10548509@voice-6ed4;2,,AGI,STREAM,10/5/2012 11:25,10/5/2012 11:25,10/5/2012 11:25,5,5,ANSWERED,1349436313


Sample warning log
Error on line 2 char 2. Expected numbers only, found x
Error on line 4 char 4. Expected numbers only, found @
Error on line 502466 char 4. Expected numbers only, found @
Error on line 502467 char 5. Expected numbers only, found X



You would need to fine tune the numeric field error checking - use the example on line 103 in conjunction with field numbers to fine tune to your needs.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40022979
also, test for leading/trailing spaces in a numeric field.
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40022987
I did, made no difference.
0
 

Author Comment

by:shawn857
ID: 40028124
Thank you very much for the reworked code with the CSV parsing Pierre...  you really went over and above the call of duty! :-)

Aikimark asked a few days ago how do I know which fields are numeric and which are text. Well, in my app I present the user a small "preview" of his CSV file divided up by his desired field delimiter (ie. similar to what MS Excel does when you import a text file). There, he specifies for each field, whether it's Text or Numeric. Then when I parse the whole datafile, I know what to look for in each field, and if I don't find it, I write to the error log.
    Also, when I'm trying to confirm numeric values in numeric fields, it's not quite as simple as just checking for only valid numeric characters. For my app, it's still valid to have a number in scientific notation - i.e. 9.23038E+11. This is still a valid numeric value, but it contains some non-numeric characters. This is why TryStrToFloat is such a convenient solution for me. But could there be a faster way?

Also Pierre, in my current CSV parser, I store all the parsed fields to a TStringList for convenient processing once a line is parsed. In your code, do you store each parsed field anywhere, or is it simply "marked" by ptrFieldStart and ptrFieldEnd? I *do* need to store these parsed fields somewhere while I do processing on that parsed line... is a TStringList an inefficient means to do this?

Thanks!
   Shawn
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40028925
Yes, TryStrToFloat is convenient but not faster than checking for allowed characters. The question is do you want to sacrifice convenience for speed? My example was only for integers but you could expand on it to also allow for thousand and decimal separators and even scientific format.

Take the scientific format, for example. You could scan from the end, expecting:
the last 2 chars to be in ['0'..'9'];
the 3rd last in ['+','-'];
the 4th last to be 'E';
only one decimal separator in the rest, with all other chars in ['0'..'9'];
and lastly, the first char in ['-','+','0'..'9'].
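A minimal sketch of that end-to-start character scan (IsSciNumber is a hypothetical helper name; it assumes a two-digit exponent, upper-case 'E', and '.' as the decimal separator, matching the sample data - note TryStrToFloat accepts more forms than this, e.g. plain decimals and lower-case 'e'):

```pascal
// Sketch only: validates strings like '9.71502E+11' by scanning characters,
// following the rules listed above. Not a full replacement for TryStrToFloat.
function IsSciNumber(const s: AnsiString): boolean;
var i, dots: integer;
begin
  result:= false;
  if length(s) < 5 then exit;                        // shortest form, e.g. '1E+05'
  //last two chars must be digits
  if not (s[length(s)]   in ['0'..'9']) then exit;
  if not (s[length(s)-1] in ['0'..'9']) then exit;
  //third last must be the exponent sign, fourth last the 'E'
  if not (s[length(s)-2] in ['+','-']) then exit;
  if s[length(s)-3] <> 'E' then exit;
  //first char may be a sign or a digit
  if not (s[1] in ['-','+','0'..'9']) then exit;
  //the mantissa may contain at most one decimal separator, rest digits
  dots:= 0;
  for i:= 2 to length(s)-4 do
    if s[i] = '.' then inc(dots)
    else if not (s[i] in ['0'..'9']) then exit;
  result:= (dots <= 1);
end;
```

A check like this avoids the exception machinery and string conversions inside the RTL routines, which is where the speed difference comes from.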

Why do you need to store in TStringList? For speed optimisation we want to read once write once which is why we work on the bufRead and write to bufWrite (and flush to file when full) instead of reading to a Stringlist and parsing that.

To answer your question about the field values, they are stored in the bufRead variable and parsed to the bufWrite buffer for speed optimisation. Yes, between the ptrFieldStart and ptrFieldEnd pointers. One could modify the code to use a Stringlist instead of the bufWrite buffer (thus still only parsing/writing in memory once) but I'm not sure if this would be faster than using the bufWrite buffer (testing needed to determine).

P.S. Yes, I have gone "over and above the call of duty" here :) Wishing we could have questions with more than 500 points max hehe.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40029423
I don't think it is enough to expand the allowed characters for numeric fields.  You need to allow these characters in a proper context of a valid floating point number
0
 

Author Comment

by:shawn857
ID: 40033048
Thanks Pierre... actually I find the TryStrToFloat command quite fast (as compared to using before StrToFloat in an 'if' statement). For now, I think I'll leave it like that, but if it becomes a bottleneck, I'll manually check for scientific notation!

Regarding the TStringList... maybe you misunderstand me a little bit - I don't want to replace the bufRead/bufWrite with StringLists, no no of course not... there is no need to do that. I only use StringList to store each parsed field of the record, because you see, sometimes I have to do comparisons between fields in the same record. Things like "if Field 1> Field 5 then blah blah blah. So for each record I parse, I need to put all the parsed fields in a temporary structure, which I then re-initialize and re-use as the next record gets parsed. In my code now I use a TStringList, yes... but maybe just a regular dynamic Array of String is more efficient? So as a field gets parsed by your routine, I could just save it to this array, then increment the length of the array by 1 using Setlength. What do you think?

Thanks!
    Shawn
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40036569
I understood you correctly about the stringlist but my point was that it creates additional overhead (thus slower code) as you are then allocating memory for the same thing twice (once for write buffer and once for the stringlist).

I have reworked my example to facilitate field access directly from the write buffer. Use the FieldAccessor variable to access field values in between line parses. See line 221 for an example of using it. Fields are zero-based, so the first field value can be accessed like this: FieldAccessor.AsString(0).

I also added TryAsInteger and TryAsFloat to the FieldAccessor for your convenience.

program csv_parse2;

{$APPTYPE CONSOLE}

uses
  SysUtils, windows;

const buf_size = 65535;
      buf_grow = 1024;
//const buf_size = 524280;

var buf_read : array[0..buf_size-1] of AnsiChar;
    buf_write: array of AnsiChar; //used to hold data for writing to output_file
    buf_write_size: integer = buf_size;


type

  TFieldAccessInfo = record
    FieldStart: Pointer;
    FieldEnd: Pointer;
  end;

  TFieldAccessor = class
    FFieldAccessInfo: array of TFieldAccessInfo;
    FFieldCount: integer;
    private
      procedure AddFieldValue(FieldStart, FieldEnd: PAnsiChar);
    public
      constructor Create;
      destructor Destroy; override;

      procedure Clear;
      function TryAsFloat(FieldNum: integer; var Value: double): boolean;
      function TryAsInteger(FieldNum: integer; var Value: integer): boolean;
      function AsString(FieldNum: integer): AnsiString;


    property FieldCount: integer read FFieldCount;
  end;

procedure DoParse;
var sWarnings, sFileIn, sFileOut, overwrite: string;
    len, len_unparsed, bytes_read, hFileWarnings, hFileIn, hFileOut: integer;
    StartTick, EndTick: cardinal;
    line_start, line_end, line_last, len_bufwrite: integer;
    ptrLineStart, ptrFieldStart, ptrFieldEnd: PAnsiChar;
    ReadCompleted: boolean;
      isQuoted: boolean;
      LineNum, FieldNum: integer;
    FieldAccessor: TFieldAccessor;

  procedure LogWarning(warning: AnsiString);
  begin
    FileWrite(hFileWarnings, warning[1], length(warning));
  end;

  procedure FlushOutput;
  begin
    if (len_bufwrite > 0)
      then FileWrite(hFileOut, buf_write[0], len_bufwrite);
    len_bufwrite:= 0;
  end;


  procedure WriteChar(AChar: AnsiChar);
  begin
    //grow write buffer if needed
    if (len_bufwrite >= buf_write_size) then
    begin
      inc(buf_write_size, buf_grow);
      SetLength(buf_write, buf_write_size);
    end;

    buf_write[len_bufwrite]:= AChar;     // add parsed data to buf_write
    inc(len_bufwrite);                // Inc(len_bufwrite) if you add to the buffer

  end;

  procedure WritePChar(AChar: PAnsiChar);
  begin
    //grow write buffer if needed
    if (len_bufwrite >= buf_write_size) then
    begin
      inc(buf_write_size, buf_grow);
      SetLength(buf_write, buf_write_size);
    end;

    buf_write[len_bufwrite]:= AChar^;     // add parsed data to buf_write
    inc(len_bufwrite);                // Inc(len_bufwrite) if you add to the buffer

  end;

  procedure ParseField;
  var p,last, bwStart, bwEnd: PAnsiChar;
      contains_delim: boolean;
  begin
    //quick err check
    if (ptrFieldStart = nil) or (ptrFieldEnd=nil) or (ptrFieldStart>ptrFieldEnd) then
      raise exception.Create('Error: input not in expected format');

    //if field is just one char and ptrFieldStart^ is a comma, the field is empty - exit
    if (ptrFieldStart = ptrFieldEnd) and (ptrFieldStart^=',') then
    begin
      ptrFieldStart:= nil;
      ptrFieldEnd:= nil;
      exit;
    end;

    //if field is quoted, skip first and last
    if ptrFieldStart^ = '"' then inc(ptrFieldStart);
    if ptrFieldEnd^   = '"' then dec(ptrFieldEnd);

    //remove leading spaces
    while (ptrFieldStart^ = ' ') and (ptrFieldStart < ptrFieldEnd)
      do Inc(ptrFieldStart);
    //need error check here, if ptrFieldStart =ptrFieldEnd here, we don't have a value

    //remove trailing spaces
    while (ptrFieldEnd^ = ' ') and (ptrFieldEnd > ptrFieldStart)
      do Dec(ptrFieldEnd);

    //scan for delim
    contains_delim:= false;
    p:= ptrFieldStart;
    while (p<=ptrFieldEnd) do
    begin
      if (p^=',') then
      begin
        contains_delim:= true;
        break;
      end;
      inc(p);
    end;

    //enclose with " if field value contains delim
    if (contains_delim) then WriteChar('"');

    //parse field data
    last:= nil;
    p:= ptrFieldStart;
    bwStart:= @buf_write[len_bufwrite];
    while (p<=ptrFieldEnd) do
    begin
      (*
        REMOVED THE FOLLOWING BECAUSE NOW PROVIDING FIELD LEVEL ACCESS PER PARSED LINE:
      {here you can check for text values where you expected numbers only
       from your example it looks like field one should only allow numbers
       so you could do something like:
       }
      if (FieldNum = 1) and (not(p^ in ['0'..'9']))
        then LogWarning(AnsiString(format('Error on line %d char %d. Expected numbers only, found %s'#13#10,[LineNum, (p-ptrLineStart), p^])));
      *)

      if not((p^ = '"') and ((last <> nil) and (last^ = '"'))) //removing double quotes
         or ((p^='"') and contains_delim) //don't remove " if field contains delim
        then WritePChar(p);

      last:= p;
      inc(p);
    end;
    bwEnd:= @buf_write[len_bufwrite-1];
    FieldAccessor.AddFieldValue(bwStart, bwEnd);

    //enclose with " if field value contains delim
    if (contains_delim) then WriteChar('"');

    //reset
    ptrFieldStart:= nil;
    ptrFieldEnd:= nil;
  end;

  procedure ParseLine;
  var i,v: integer;
      d: double;

  begin
    Inc(LineNum);
    isQuoted:= false;
    FieldNum:= 0;
    ptrFieldStart:= nil;
    ptrFieldEnd:= nil;
    FieldAccessor.Clear;

    //save LineStart - used to calculate character position for log file entry
    ptrLineStart:= @buf_read[line_start];

    //process from current block read
    for i := line_start to line_end do
    begin
      //do ParseChar(@buf_read[i]);
      //ignore LF
      if (buf_read[i] = #10) then Continue;

      //set start of field value
      if not Assigned(ptrFieldStart) then ptrFieldStart:= @buf_read[i];

      if (buf_read[i] = '"') then isQuoted:= not isQuoted;

      //check for field delimiter
      if (buf_read[i] in [',',#13]) and (not isQuoted) then
      begin
        if (ptrFieldEnd=nil) then ptrFieldEnd:= @buf_read[i]; //in case we reached the end (there won't be a comma)
        Inc(FieldNum);
        ParseField;
        if (buf_read[i] = ',') then WritePChar(@buf_read[i]); //add field delimiter

        //if end of record, reset FieldNum and write CRLF
        if (buf_read[i] = #13) then
        begin
          FieldNum:= 0;
          WriteChar(#13);
          WriteChar(#10);
        end;
      end
      else ptrFieldEnd:= @buf_read[i];

    end;

    //do any field specific checks here e.g.
    if not FieldAccessor.TryAsInteger(0, v)
      then LogWarning(AnsiString(format('Error on line %d field 1. Expected integer value, found %s'#13#10,[LineNum, FieldAccessor.AsString(0)])));
    // you could do range checks if needed like this : if (v < 0) or (v>= SomeMaxNumber) then LogWarning
    if not FieldAccessor.TryAsFloat(1, d)
      then LogWarning(AnsiString(format('Error on line %d field 2. Expected float value, found %s'#13#10,[LineNum, FieldAccessor.AsString(1)])));

    if len_bufwrite >= buf_size then FlushOutput;
  end;


begin
  sFileIn:= ParamStr(1);
  sFileOut:= ParamStr(2);

  len_bufWrite:= 0;
  len_unparsed:= 0;
  SetLength(buf_write, buf_size); //initial size; will grow as needed

  //ensure file exists
  if not FileExists(sFileIn) then
    raise exception.Create('ERROR-> File not found: '+sFileIn);

  //ensure output is not the same as source
  if (sFileIn=sFileOut) then
    raise exception.Create('ERROR-> input_file and output_file may not be the same file: '+sFileIn);

  //prompt user for overwrite if file already exists
  if FileExists(sFileOut) then
  begin
    Writeln('output_file already exists. Overwrite(Y/N)?');
    ReadLn(overwrite);
    if LowerCase(overwrite) <> 'y' then
      raise exception.Create('ABORT-> Aborted by user');
  end;

  sWarnings:= 'warnings_'+ChangeFileExt(sFileOut, '.log');

  Writeln('Parsing file...');

  hFileIn:= FileOpen(sFileIn, fmOpenRead or fmShareDenyWrite); //open file, allow shared access but don't allow writes
  hFileOut:= FileCreate(sFileOut);
  hFileWarnings:= FileCreate(sWarnings);
  FieldAccessor:= TFieldAccessor.Create;
  try
    StartTick:= GetTickCount;

    LineNum:= 0;
    repeat

      bytes_read:= FileRead(hFileIn, buf_read[len_unparsed], buf_size - len_unparsed);
      ReadCompleted:= (bytes_read < buf_size - len_unparsed);
      line_last:= len_unparsed + bytes_read-1;
      //len_unparsed:= 0; not needed here, setting it below

      //work through buffer
      if (buf_read[0] = #10) then line_start:= 1 else line_start:= 0; //skip first char if #10
      line_end:= 0;
      while (line_end <= line_last) do
      begin
        if (buf_read[line_end] = #13) then
        begin
          ParseLine;
          line_start:= line_end+1; //set start and end markers
        end;

        Inc(line_end);
      end;

      //move unparsed to start of buf_read for next pass
      len:= line_end - line_start;
      if (len>0) then
      begin
        len_unparsed:= len;
        move(buf_read[line_start], buf_read[0], len_unparsed);
      end
      else len_unparsed:= 0;

      //error check
      if (len_unparsed = buf_size) then //nothing was parsed!
        raise exception.Create('Error: Read buffer too small or no end of line detected');


    until ReadCompleted;
    //write last block if any
    if (len_bufwrite > 0) then FlushOutput;

    EndTick:= GetTickCount;
    Writeln('Parsing completed in '+IntToStr((EndTick-StartTick) div 1000)+'.'+IntToStr((EndTick-StartTick) mod 1000)+' seconds');

  finally
    FileClose(hFileIn);
    FileClose(hFileOut);
    FileClose(hFileWarnings);
    FieldAccessor.Free;
  end;
end;

procedure OutputUsage;
begin
  writeln('csv_parse v1.0');
  writeln('-------------------------------------------');
  writeln('USAGE: csv_parse input_file output_file');
  writeln('');
  writeln('input_file: expects a csv filename and path');
  writeln('output_file: new file to be created');
end;

{ TFieldAccessor }

procedure TFieldAccessor.AddFieldValue(FieldStart, FieldEnd: PAnsiChar);
begin
  Inc(FFieldCount);
  SetLength(FFieldAccessInfo, FFieldCount);
  FFieldAccessInfo[FFieldCount-1].FieldStart:= FieldStart;
  FFieldAccessInfo[FFieldCount-1].FieldEnd:= FieldEnd;
end;

function TFieldAccessor.AsString(FieldNum: integer): AnsiString;
var len: integer;
    test: array[0..99] of AnsiChar;
begin
  result:= '';
  len:= integer(FFieldAccessInfo[FieldNum].FieldEnd) - integer(FFieldAccessInfo[FieldNum].FieldStart);
  if (len > 0) then
  begin
    SetLength(result, len+1);
//    move(FFieldAccessInfo[FieldNum].FieldStart, test[0], len);
    move(FFieldAccessInfo[FieldNum].FieldStart^, result[1], len+1);
  end;
end;

procedure TFieldAccessor.Clear;
begin
  FFieldCount:= 0;
  SetLength(FFieldAccessInfo, 0);
end;

constructor TFieldAccessor.Create;
begin
  FFieldCount:= 0;
end;

destructor TFieldAccessor.Destroy;
begin
  Clear;
  inherited;
end;

function TFieldAccessor.TryAsFloat(FieldNum: integer;
  var Value: double): boolean;
begin
  result:= TryStrToFloat(String(AsString(FieldNum)), Value);
end;

function TFieldAccessor.TryAsInteger(FieldNum: integer;
  var Value: integer): boolean;
begin
  result:= TryStrToInt(String(AsString(FieldNum)), Value);
end;

begin
  try
    if (ParamCount <> 2)
      then OutputUsage
      else DoParse;
  except on E:Exception
    do Writeln(E.Message);
  end;
end.




Compiled size: 96kb
Test file lines: 502500
Test file size: 108.6 Mb
Test duration: 2.4 seconds

The impact is about 0.4 seconds more than my previous test run, i.e. the previous average was 2 seconds, now it's 2.4 seconds.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40036871
@Shawn
then increment the length of the array by 1 using Setlength
No.  Dynamic resizing of arrays is usually a performance killer.
0
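The cost aikimark describes comes from calling SetLength once per element, which can reallocate and copy the whole array each time. The usual workaround is amortized growth: expand capacity in chunks (or double it) and track the used count separately. A minimal sketch, with GROW_BY and AddField as hypothetical names:

```pascal
// Sketch only: grow the array's capacity in blocks so SetLength is
// called rarely, instead of once per field added.
const GROW_BY = 64;

var Fields: array of AnsiString;
    FieldCount: integer = 0;

procedure AddField(const Value: AnsiString);
begin
  //reallocate only when the used count reaches the current capacity
  if FieldCount = length(Fields) then
    SetLength(Fields, length(Fields) + GROW_BY);
  Fields[FieldCount]:= Value;
  inc(FieldCount);
end;
```

Between records you then reset FieldCount to 0 rather than shrinking the array, so the same allocation is reused for every line.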
 

Author Comment

by:shawn857
ID: 40041336
Thank you so much for the updated code Pierre! I've been testing it and stepping through it in debug mode... it does seem to have a little problem storing the field value when the field is either only 1 character long, or an "empty" field (ie. two consecutive delimiters in a row). There doesn't seem to be any distinction between these two cases because in either case, ptrFieldStart=ptrFieldEnd. What do you think?

Thanks!
   Shawn
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40041497
The problem lies from line 103. When I encountered consecutive field delims, I just skipped them instead of adding a field with a blank value. Try changing it as follows and see if it works for you:

    if (ptrFieldStart = ptrFieldEnd) and (ptrFieldStart^=',') then
    begin
      FieldAccessor.AddFieldValue(0, 0); //just add blank field
      ...


0
 

Author Comment

by:shawn857
ID: 40042613
Thank you Pierre... still though, fields only one character long (like ,4, or ,"7",) get treated as if they were empty fields. I am double-checking the created fields by writing each of them to the "log file" after each line is parsed in ParseLine. As I step through function TFieldAccessor.AsString, I think I found the problem: if the field is a valid one-character (or one-digit) field (like ,4, or something), then the variable "len" will still get the value of 0 (because FieldEnd will be equal to FieldStart), and therefore the function result will be assigned '' (null).
   I find using pointers very very hard to debug and step through as the "Tooltip Expression Evaluation" in the IDE doesn't always reveal the true values of these variables. I know it's not good programming practice, but I honestly never use pointers! :-(

Thanks
   Shawn
0
 

Author Comment

by:shawn857
ID: 40042654
I *think* I have corrected the problem Pierre - I modified the TFieldAccessor.AsString function thusly:


function TFieldAccessor.AsString(FieldNum: integer): AnsiString;
var len: integer;
begin
  result:= '';
  if (FFieldAccessInfo[FieldNum].FieldEnd = nil) and (FFieldAccessInfo[FieldNum].FieldStart = nil) then
     EXIT;

  len:= integer(FFieldAccessInfo[FieldNum].FieldEnd) - integer(FFieldAccessInfo[FieldNum].FieldStart);
  if (len >= 0) then
  begin
    SetLength(result, len+1);
    move(FFieldAccessInfo[FieldNum].FieldStart^, result[1], len+1);
  end;
end;



A true "null" field should have FFieldAccessInfo[FieldNum].FieldStart and FFieldAccessInfo[FieldNum].FieldEnd as both NIL, so I do this check first. What do you think?

Thanks!
    Shawn
0
 

Author Comment

by:shawn857
ID: 40043593
Just one other thing please Pierre - I see in your newer version you are "growing" the write buffer size if needed. I noticed that "FlushOutput" only gets executed once during the program - after the main Repeat|Until loop is finished. Is this intentional? If I am processing a very large file (some gigabytes big), does this mean that the whole file is written to the buf_write buffer before it is finally flushed? If so, I think a very large file could crash a computer. Was there a problem with the way you had it before - where FlushOutput was called within the WriteChar and WritePChar procedures, once buf_write reached its allotted size?

Thanks!
    Shawn
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40043749
I think I solved the issue now. Have a look:

Regarding the FlushOutput:
We start with the same size as buf_write (65535 in my example) and only increase if and when needed. The reason being that we want field access. Without doing this, we could end up with one line overlapping in 2 buf_read cycles and thus not being in a contiguous block.

program csv_parse2;

{$APPTYPE CONSOLE}

uses
  SysUtils, windows;

const buf_size = 65535;
      buf_grow = 1024;
//const buf_size = 524280;

var buf_read : array[0..buf_size-1] of AnsiChar;
    buf_write: array of AnsiChar; //used to hold data for writing to output_file
    buf_write_size: integer = buf_size;


type

  TFieldAccessInfo = record
    FieldStart, FieldEnd: integer;
  end;

  TFieldAccessor = class
    FFieldAccessInfo: array of TFieldAccessInfo;
    FFieldCount: integer;
    private
      procedure AddFieldValue(FieldStart, FieldEnd: integer);
    public
      constructor Create;
      destructor Destroy; override;

      procedure Clear;
      function TryAsFloat(FieldNum: integer; var Value: double): boolean;
      function TryAsInteger(FieldNum: integer; var Value: integer): boolean;
      function AsString(FieldNum: integer): AnsiString;


    property FieldCount: integer read FFieldCount;
  end;

procedure DoParse;
var sWarnings, sFileIn, sFileOut, overwrite: string;
    len, len_unparsed, bytes_read, hFileWarnings, hFileIn, hFileOut: integer;
    StartTick, EndTick: cardinal;
    line_start, line_end, line_last, len_bufwrite: integer;
    ptrLineStart, ptrFieldStart, ptrFieldEnd: PAnsiChar;
    ReadCompleted: boolean;
      isQuoted: boolean;
      LineNum, FieldNum: integer;
    FieldAccessor: TFieldAccessor;

  procedure LogWarning(warning: AnsiString);
  begin
    FileWrite(hFileWarnings, warning[1], length(warning));
  end;

  procedure FlushOutput;
  begin
    if (len_bufwrite > 0)
      then FileWrite(hFileOut, buf_write[0], len_bufwrite);
    len_bufwrite:= 0;
  end;


  procedure WriteChar(AChar: AnsiChar);
  begin
    //grow write buffer if needed
    if (len_bufwrite > buf_write_size) then
    begin
      inc(buf_write_size, buf_grow);
      SetLength(buf_write, buf_write_size);
    end;

    buf_write[len_bufwrite]:= AChar;     // add parsed data to buf_write
    inc(len_bufwrite);                // Inc(len_bufwrite) if you add to the buffer

  end;

  procedure WritePChar(AChar: PAnsiChar);
  begin
    //grow write buffer if needed
    if (len_bufwrite > buf_write_size) then
    begin
      inc(buf_write_size, buf_grow);
      SetLength(buf_write, buf_write_size);
    end;

    buf_write[len_bufwrite]:= AChar^;     // add parsed data to buf_write
    inc(len_bufwrite);                // Inc(len_bufwrite) if you add to the buffer

  end;

  procedure ParseField;
  var p,last: PAnsiChar;
      bwStart, bwEnd: integer;
      contains_delim: boolean;
  begin
    //quick err check
    if (ptrFieldStart = nil) or (ptrFieldEnd=nil) or (ptrFieldStart>ptrFieldEnd) then
      raise exception.Create('Error: input not in expected format');

    //if field is just one char and ptrFieldStart^ is a comma, the field is empty - exit
    if (ptrFieldStart = ptrFieldEnd) and (ptrFieldStart^=',') then
    begin
      FieldAccessor.AddFieldValue(0, 0); //just add blank field
      ptrFieldStart:= nil;
      ptrFieldEnd:= nil;
      exit;
    end;

    //if field is quoted, skip first and last
    if ptrFieldStart^ = '"' then inc(ptrFieldStart);
    if ptrFieldEnd^   = '"' then dec(ptrFieldEnd);

    //remove leading spaces
    while (ptrFieldStart^ = ' ') and (ptrFieldStart < ptrFieldEnd)
      do Inc(ptrFieldStart);
    //need error check here, if ptrFieldStart =ptrFieldEnd here, we don't have a value

    //remove trailing spaces
    while (ptrFieldEnd^ = ' ') and (ptrFieldEnd > ptrFieldStart)
      do Dec(ptrFieldEnd);

    //scan for delim
    contains_delim:= false;
    p:= ptrFieldStart;
    while (p<=ptrFieldEnd) do
    begin
      if (p^=',') then
      begin
        contains_delim:= true;
        break;
      end;
      inc(p);
    end;

    //enclose with " if field value contains delim
    if (contains_delim) then WriteChar('"');

    //parse field data
    last:= nil;
    p:= ptrFieldStart;
    bwStart:= len_bufwrite;
    while (p<=ptrFieldEnd) do
    begin
      (*
        REMOVED THE FOLLOWING BECAUSE NOW PROVIDING FIELD LEVEL ACCESS PER PARSED LINE:
      {here you can check for text values where you expected numbers only
       from your example it looks like field one should only allow numbers
       so you could do something like:
       }
      if (FieldNum = 1) and (not(p^ in ['0'..'9']))
        then LogWarning(AnsiString(format('Error on line %d char %d. Expected numbers only, found %s'#13#10,[LineNum, (p-ptrLineStart), p^])));
      *)

      if not((p^ = '"') and ((last <> nil) and (last^ = '"'))) //removing double quotes
         or ((p^='"') and contains_delim) //don't remove " if field contains delim
        then WritePChar(p);

      last:= p;
      inc(p);
    end;
    bwEnd:= len_bufwrite;
    FieldAccessor.AddFieldValue(bwStart, bwEnd);


    //enclose with " if field value contains delim
    if (contains_delim) then WriteChar('"');

    //reset
    ptrFieldStart:= nil;
    ptrFieldEnd:= nil;
  end;

  procedure ParseLine;
  var i,v: integer;
      d: double;

  begin
    Inc(LineNum);
    isQuoted:= false;
    FieldNum:= 0;
    ptrFieldStart:= nil;
    ptrFieldEnd:= nil;
    FieldAccessor.Clear;

    //save LineStart - used to calculate character position for log file entry
    ptrLineStart:= @buf_read[line_start];

    //process from current block read
    for i := line_start to line_end do
    begin
      //do ParseChar(@buf_read[i]);
      //ignore LF
      if (buf_read[i] = #10) then Continue;

      //set start of field value
      if not Assigned(ptrFieldStart) then ptrFieldStart:= @buf_read[i];

      if (buf_read[i] = '"') then isQuoted:= not isQuoted;

      //check for field delimiter
      if (buf_read[i] in [',',#13]) and (not isQuoted) then
      begin
        if (ptrFieldEnd=nil) then ptrFieldEnd:= @buf_read[i]; //in case we reached the end (there won't be a comma)
        Inc(FieldNum);
        ParseField;
        if (buf_read[i] = ',') then WritePChar(@buf_read[i]); //add field delimiter

        //if end of record, reset FieldNum and write CRLF
        if (buf_read[i] = #13) then
        begin
          FieldNum:= 0;
          WriteChar(#13);
          WriteChar(#10);
        end;
      end
      else ptrFieldEnd:= @buf_read[i];
    end;

    //do any field specific checks here e.g.
    if not FieldAccessor.TryAsInteger(0, v)
      then LogWarning(AnsiString(format('Error on line %d field 1. Expected integer value, found %s'#13#10,[LineNum, FieldAccessor.AsString(0)])));
    // you could do range checks if needed like this : if (v < 0) or (v>= SomeMaxNumber) then LogWarning
    if not FieldAccessor.TryAsFloat(1, d)
      then LogWarning(AnsiString(format('Error on line %d field 2. Expected float value, found %s'#13#10,[LineNum, FieldAccessor.AsString(1)])));

    if len_bufwrite >= buf_size then FlushOutput;
  end;


begin
  sFileIn:= ParamStr(1);
  sFileOut:= ParamStr(2);

  len_bufWrite:= 0;
  len_unparsed:= 0;
  SetLength(buf_write, buf_size); //initial size; will grow as needed

  //ensure file exists
  if not FileExists(sFileIn) then
    raise exception.Create('ERROR-> File not found: '+sFileIn);

  //ensure output is not the same as source
  if (sFileIn=sFileOut) then
    raise exception.Create('ERROR-> input_file and output_file may not be the same file: '+sFileIn);

  //prompt user for overwrite if file already exists
  if FileExists(sFileOut) then
  begin
    Writeln('output_file already exists. Overwrite(Y/N)?');
    ReadLn(overwrite);
    if LowerCase(overwrite) <> 'y' then
      raise exception.Create('ABORT-> Aborted by user');
  end;

  sWarnings:= 'warnings_'+ChangeFileExt(sFileOut, '.log');

  Writeln('Parsing file...');

  hFileIn:= FileOpen(sFileIn, fmOpenRead or fmShareDenyWrite); //open file, allow shared access but don't allow writes
  hFileOut:= FileCreate(sFileOut);
  hFileWarnings:= FileCreate(sWarnings);
  FieldAccessor:= TFieldAccessor.Create;
  try
    StartTick:= GetTickCount;

    LineNum:= 0;
    repeat

      bytes_read:= FileRead(hFileIn, buf_read[len_unparsed], buf_size - len_unparsed);
      ReadCompleted:= (bytes_read < buf_size - len_unparsed);
      line_last:= len_unparsed + bytes_read-1;
      //len_unparsed:= 0; not needed here, setting it below

      //work through buffer
      if (buf_read[0] = #10) then line_start:= 1 else line_start:= 0; //skip first char if #10
      line_end:= 0;
      while (line_end <= line_last) do
      begin
        if (buf_read[line_end] = #13) then
        begin
          ParseLine;
          line_start:= line_end+1; //set start and end markers
        end;

        Inc(line_end);
      end;

      //move unparsed to start of buf_read for next pass
      len:= line_end - line_start;
      if (len>0) then
      begin
        len_unparsed:= len;
        move(buf_read[line_start], buf_read[0], len_unparsed);
      end
      else len_unparsed:= 0;

      //error check
      if (len_unparsed = buf_size) then //nothing was parsed!
        raise exception.Create('Error: Read buffer too small or no end of line detected');


    until ReadCompleted;
    //write last block if any
    if (len_bufwrite > 0) then FlushOutput;

    EndTick:= GetTickCount;
    Writeln('Parsing completed in '+IntToStr((EndTick-StartTick) div 1000)+'.'+IntToStr((EndTick-StartTick) mod 1000)+' seconds');

  finally
    FileClose(hFileIn);
    FileClose(hFileOut);
    FileClose(hFileWarnings);
    FieldAccessor.Free;
  end;
end;

procedure OutputUsage;
begin
  writeln('csv_parse v1.0');
  writeln('-------------------------------------------');
  writeln('USAGE: csv_parse input_file output_file');
  writeln('');
  writeln('input_file: expects a csv filename and path');
  writeln('output_file: new file to be created');
end;

{ TFieldAccessor }

procedure TFieldAccessor.AddFieldValue(FieldStart, FieldEnd: integer);
begin
  Inc(FFieldCount);
  SetLength(FFieldAccessInfo, FFieldCount);
  FFieldAccessInfo[FFieldCount-1].FieldStart:= FieldStart;
  FFieldAccessInfo[FFieldCount-1].FieldEnd:= FieldEnd;
end;

function TFieldAccessor.AsString(FieldNum: integer): AnsiString;
var len: integer;
begin
  result:= '';
  len:= integer(FFieldAccessInfo[FieldNum].FieldEnd) - integer(FFieldAccessInfo[FieldNum].FieldStart);
  if (len > 0) then
  begin
    SetLength(result, len);
    move(buf_write[FFieldAccessInfo[FieldNum].FieldStart], result[1], len);
  end;
end;

procedure TFieldAccessor.Clear;
begin
  FFieldCount:= 0;
  SetLength(FFieldAccessInfo, 0);
end;

constructor TFieldAccessor.Create;
begin
  FFieldCount:= 0;
end;

destructor TFieldAccessor.Destroy;
begin
  Clear;
  inherited;
end;

function TFieldAccessor.TryAsFloat(FieldNum: integer;
  var Value: double): boolean;
begin
  result:= TryStrToFloat(String(AsString(FieldNum)), Value);
end;

function TFieldAccessor.TryAsInteger(FieldNum: integer;
  var Value: integer): boolean;
begin
  result:= TryStrToInt(String(AsString(FieldNum)), Value);
end;

begin
  try
    if (ParamCount <> 2)
      then OutputUsage
      else DoParse;
  except on E:Exception
    do Writeln(E.Message);
  end;
end.



And it does not just get called once - it gets called at the end of each line parse. The FlushOutput you noticed after the repeat loop is left over from the previous version, in which we did not need field access and it simply flushed anything not yet flushed. Theoretically, in the new version len_bufwrite should always be zero by that point, but I don't see any harm in leaving it there, as it could catch the case where the last few bytes read do not end with a CRLF (an error).
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40043773
regarding big files, how big are you talking about?

I used your sample data and repeated it to:
Test file lines: 5 025 000
Test file size: 1.06 GB (1,139,000,320 bytes)
Test duration: Parsing completed in 25.678 seconds (i only ran it once)

That is 5.1 seconds per million lines! Seems pretty good to me.
0
 

Author Comment

by:shawn857
ID: 40045140
ahhh you're right Pierre - FlushOutput *IS* called at the end of ParseLine.... sorry!!! :-(

Thank you for the modifications - do you think switching FieldStart/FieldEnd/bwStart/bwEnd from pointers to integers sacrifices any overall speed?

For the FieldAccessor.AsString routine, has your update rendered my modification below unnecessary:

function TFieldAccessor.AsString(FieldNum: integer): AnsiString;
var len: integer;
    test: array[0..99] of AnsiChar;
begin
  result:= '';
  if (FFieldAccessInfo[FieldNum].FieldEnd = nil) and (FFieldAccessInfo[FieldNum].FieldStart = nil) then
     EXIT;

  len:= integer(FFieldAccessInfo[FieldNum].FieldEnd) - integer(FFieldAccessInfo[FieldNum].FieldStart);
  if (len >= 0) then
  begin
    SetLength(result, len+1);
//    move(FFieldAccessInfo[FieldNum].FieldStart, test[0], len);
    move(FFieldAccessInfo[FieldNum].FieldStart^, result[1], len+1);
  end;
end;




I think what I need to do there considering your mods, is change the line:

if (FFieldAccessInfo[FieldNum].FieldEnd = nil) and (FFieldAccessInfo[FieldNum].FieldStart = nil) then



to this:

if (FFieldAccessInfo[FieldNum].FieldEnd = 0) and (FFieldAccessInfo[FieldNum].FieldStart = 0) then




Yes?

Thanks!
    Shawn

P.S: Pierre, the data files could be very very big, I would have no control over that. They could be 20-30 gigs or more....
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40046582
Thank you for the modifications - do you think switching FieldStart/FieldEnd/bwStart/bwEnd from pointers to integers sacrifices any overall speed?
No. It is just used to store integers instead of pointers, both are same size and they are just used to provide indexes into the bufWrite.

Regarding your second question, I doubt that code is needed at all. If start = end then Len will be zero and AsString will return an empty string.
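One more point in favour of the integer version, by the way: offsets stay valid when buf_write is reallocated. SetLength on a dynamic array may move the data to a new address, so any PAnsiChar saved into the old block would be left dangling, while an offset from the start of the array still points at the right character. A minimal sketch (illustrative only, same idea as the posted code, not the full parser):

```pascal
program OffsetSketch;
{$APPTYPE CONSOLE}

// Sketch: an integer offset into buf_write survives a SetLength
// reallocation, because it is relative to the (possibly moved) start
// of the array. A raw pointer saved before the SetLength could dangle.
var
  buf_write: array of AnsiChar;
  FieldStart: integer;  // offset of a field inside buf_write
begin
  SetLength(buf_write, 16);
  buf_write[3] := 'X';
  FieldStart := 3;                      // remember by offset, not address
  SetLength(buf_write, 1024);           // may reallocate and move the data
  Assert(buf_write[FieldStart] = 'X');  // the offset is still valid
end.
```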
0
 

Author Comment

by:shawn857
ID: 40056600
OK thanks Pierre. I did do some comparisons between the new "integer indexed" version and the previous "pointer indexed" version, and the pointer-indexed version looks to be about 10% faster in my tests - I guess I'll stick with that one!
   Pierre, I noticed that after it parses a complete block of text, buf_read seems to lose track of its indexing when moving the remaining unparsed chars back to the beginning of buf_read... just before reading in another block of chars.

      //move unparsed to start of buf_read for next pass
      len:= line_end - line_start;
      if (len>0) then
      begin
        len_unparsed:= len;
        move(buf_read[line_start], buf_read[0], len_unparsed);
      end
      else len_unparsed:= 0;



To trace it, I put in some code to output each parsed field to the logfile. I added this code to the bottom of procedure ParseLine, just before the FlushOutput line.

   for x:=0 to FieldAccessor.FFieldCount - 1 do
    begin
      LogWarning(AnsiString(FieldAccessor.AsString(x)+#13#10));
    end;




Here is what my tracing output looked like when things started to get out of whack:

JANET
GHILAIN
404 RIVERVIEW CT

MINOOKA
IL
60447
9325
mghilain@comcast.net
MICHAEL
STEIN
556 BLOOMFIELD CT

BIRMINGHAM
MI
48009
1303
peachstocks@aol.com
LAVINIA
SCHERBAN
3018 STONEHEDGE CT

MATTHEWS
NC
2810,
     øQE     DTƒ    š   QE          ¦             Åÿ     Åÿ             Pû            €ÿÿ      €       À                         ¼K ü$‚                         øQE     ðTƒ    š   QE          §                                     ÿÿÿÿÿÿÿÿ       @ÿÿ      €       Ä                         „õ Uƒ                             
   result.txt     øQE     ´Uƒ    š   QE          ¨             -     -             ÿÿÿÿÿÿÿÿ       @ÿÿ      €       0                         ü˜ (N‚                          8«E ´Z„	   XVƒXVƒD   óSƒùSƒûSƒTƒTƒTƒ        TƒTƒ!Tƒ"Tƒ$TƒD       ÿ FIRST,LAST,ADDRESS,APT,CITY,ST,INCOME,Z4,EMAIL
JAMES,ROBARGE,128 ADAMS RD,,AUBURNDALE,FL,33823,9411,jrobarge@gmail.com
LINDA,MOEHLMAN,1055 OAK POINTE DR,APT 16,WATERFORD,MI,48327,1628,lmoehlman@gmail.com
JAMES,HAZEN,3294 ROCHESTER RD,,DRYDEN,MI,48428,9724,jimsbait@charter.net
MICHAEL,PARISE,847 COUNTRY CREEK DR,,SALINE,MI,48176,9473,blsgard57@aol.com
MATTHEW,SLAGLE,1103 W 43RD AVE,,KENNEWICK,WA,99337,4518,marlodaily@hotmail.com
MICHAEL,HUGGINS,2110 OPELOUSAS TRL,,SAN ANTONIO,TX,78245,3667,tigger51979@hotmail.com
CARL,LOOMIS,1541 NEWPORT CREEK DR,,ANN ARBOR,MI,48103,2200,claire.loomis1@gmail.com




As you see, it was perfect up until:

MATTHEWS
NC
2810

But then the stored field values apparently started to get corrupted. It appears that when it goes to read another block of chars, the indexing gets messed up somewhere...

Thanks!
    Shawn
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40057296
That looks like a broken CSV, certainly not well formed.  It might be prudent to have two output streams, one (good) for parsed data and another (bad) for broken records.
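
Something along these lines, sketched against Pierre's FileCreate/FileWrite style - hFileBad and WriteBadRecord are names I'm making up for illustration:

```pascal
// Sketch only: a second output handle for records that fail validation.
// hFileBad is a hypothetical handle, created next to hFileOut, e.g. with
// FileCreate('bad_' + sFileOut). On a validation failure, write the raw
// line here instead of (or as well as) logging a warning, so broken
// records can be inspected and reprocessed later.
var hFileBad: integer;

procedure WriteBadRecord(const RawLine: AnsiString);
begin
  if Length(RawLine) > 0 then
    FileWrite(hFileBad, RawLine[1], Length(RawLine));
end;
```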
0
 

Author Comment

by:shawn857
ID: 40057611
I thought that could be it too Aikimark, so I took just the "good" records from my datafile - the ones that processed cleanly up to where it messed up - and replicated them to create a new datafile. I ran that and it still produced the same problem.
   I looked at it some more and I'm hazarding a guess that it has something to do with the "buf_write" variable when it gets grown in the WriteChar and WritePChar routines.

  procedure WriteChar(AChar: AnsiChar);
  begin
    //grow write buffer if needed
    if (len_bufwrite > buf_write_size) then
    begin
      inc(buf_write_size, buf_grow);
      SetLength(buf_write, buf_write_size);
    end;

    buf_write[len_bufwrite]:= AChar;     // add parsed data to buf_write
    inc(len_bufwrite);                // Inc(len_bufwrite) if you add to the buffer
  end; // WriteChar

  procedure WritePChar(AChar: PAnsiChar);
  begin
    //grow write buffer if needed
    if (len_bufwrite > buf_write_size) then
    begin
      inc(buf_write_size, buf_grow);
      SetLength(buf_write, buf_write_size);
    end;

    buf_write[len_bufwrite]:= AChar^;     // add parsed data to buf_write
    inc(len_bufwrite);                // Inc(len_bufwrite) if you add to the buffer
  end; // WritePChar




The variables that determine the start and end of each field when they get stored are bwStart and bwEnd, and they get their values from buf_write. So it makes sense that a corrupted buf_write may be the culprit here.
   I did some debugging a little while ago and thought I might have found the source of the problem while stepping through routine WritePChar *just* as the write buffer was approaching its 65535 upper capacity.

  procedure WritePChar(AChar: PAnsiChar);
  begin
    //grow write buffer if needed
    if (len_bufwrite > buf_write_size) then
    begin
      inc(buf_write_size, buf_grow);
      SetLength(buf_write, buf_write_size);
    end;

    buf_write[len_bufwrite]:= AChar^;     // add parsed data to buf_write
    inc(len_bufwrite);                // Inc(len_bufwrite) if you add to the buffer
  end; // WritePChar



Debug showed that len_bufwrite was equal to 65535 (as was buf_write_size), so the test "if (len_bufwrite > buf_write_size) then" evaluated false, which meant this statement executed immediately: "buf_write[len_bufwrite]:= AChar^;" with len_bufwrite = 65535 - exceeding the 0..65534 indexing limits of buf_write. Thinking I had found the bug, I changed the
"if (len_bufwrite > buf_write_size) then" statement to "if (len_bufwrite >= buf_write_size) then", but to my surprise it didn't make one bit of difference - the same problem still occurred. Strange!

Thanks!
    Shawn
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40057636
One more thing I just noticed is that it looks like the valid content following the invalid content starts with column/field headers.  Is it possible that your file contains multiple datasets?
0
 

Author Comment

by:shawn857
ID: 40057647
Don't think so Aikimark... it's just a straight file of records in this manner:

JAMES,ROBARGE,128 ADAMS RD,,AUBURNDALE,FL,33823,9411,jrobarge@gmail.com
LINDA,MOEHLMAN,1055 OAK POINTE DR,APT 16,WATERFORD,MI,48327,1628,lmoehlman@gmail.com
JAMES,HAZEN,3294 ROCHESTER RD,,DRYDEN,MI,48428,9724,jimsbait@charter.net
MICHAEL,PARISE,847 COUNTRY CREEK DR,,SALINE,MI,48176,9473,blsgard57@aol.com
MATTHEW,SLAGLE,1103 W 43RD AVE,,KENNEWICK,WA,99337,4518,marlodaily@hotmail.com

I've attached the file to this message.

Thanks!
   Shawn
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40057782
I do not see any attached file.  Please retry.
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40058074
I noted a problem with the parsing code, i.e. when it stores the unparsed portion: it should ignore the EOL marker #10. See below for the changes I made

Line 279: line_end:= 0;
should be line_end:= line_start;

Also changed code from line 291 to:
      //move unparsed to start of buf_read for next pass
      if (line_end <> line_start) then
      begin
        if (buf_read[line_start] = #10) then Inc(line_start); //skip first char if #10
        len_unparsed:= line_end - line_start;
        move(buf_read[line_start], buf_read[0], len_unparsed);
      end
      else len_unparsed:= 0;
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40058075
Hmm, we can no longer edit comments?

Anyway, re my last post, there still seems to be something wrong, working on it...
0
 

Author Comment

by:shawn857
ID: 40058079
Sorry, let me try that again...
Optins.txt
0
 

Author Comment

by:shawn857
ID: 40058083
Thank you Pierre, I appreciate your diligence.

Thanks!
   Shawn
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40058099
Change line 15 to :
buf_write_size: integer = buf_size-1;
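
The reason, as a sketch (same names as the posted code): buf_write initially has Length buf_size, so its highest valid index is buf_size-1, and the grow test runs before the write. With the old value, len_bufwrite = buf_size slipped through the guard ('buf_size > buf_size' is false) and the write landed one past the end of the array. Comparing against Length(buf_write) directly is an equivalent way to make the bound explicit:

```pascal
// Boundary sketch: buf_write has Length = buf_size, so valid indexes
// are 0..buf_size-1. The guard must fire BEFORE a write at an index
// equal to the current length, i.e. grow first, then write - that way
// the write below is always in range, even after repeated growth.
procedure WriteChar(AChar: AnsiChar);
begin
  if len_bufwrite >= Length(buf_write) then               // about to overrun?
    SetLength(buf_write, Length(buf_write) + buf_grow);   // grow first
  buf_write[len_bufwrite] := AChar;                       // always in range
  Inc(len_bufwrite);
end;
```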
0
 

Author Comment

by:shawn857
ID: 40058109
Thanks Pierre. And still make the changes you outlined in your previous post?

Shawn
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40058114
yes, but I think the main problem was that initial buf_write_size. If it does not solve the problem, please post your sample file so I can have a look. I just need to get the kids to school and get to work, so I can look at it in about an hour.
0
 

Author Comment

by:shawn857
ID: 40058121
Sure, yes of course Pierre. OK, I posted my sample file a few posts ago... here it is attached again.

Thanks
    Shawn
Optins.txt
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40058181
@Shawn

That file doesn't look anything like the sample data you posted in your earlier comment.
http:#a39987768
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40058184
Thanks aiki. Also, that attachment seems to contain personal info. Shawn should state whether it is real data or just sample data; if it is real, the attachment should be removed from this thread. Looking into the issue now...
0
 
LVL 14

Accepted Solution

by:
Pierre Cornelius earned 450 total points
ID: 40058204
Shawn, after changing the initial write buffer length, I ran your sample file and it seems to parse correctly for me. What am I missing? I don't get the garbled output that you get.

here is my latest code:
program csv_parse2;

{$APPTYPE CONSOLE}

uses
  SysUtils, windows;

const buf_size = 65535;
      buf_grow = 1024;
//const buf_size = 524280;

var buf_read : array[0..buf_size-1] of AnsiChar;
    buf_write: array of AnsiChar; //used to hold data for writing to output_file
    buf_write_size: integer = buf_size-1;


type

  TFieldAccessInfo = record
    FieldStart, FieldEnd: integer;
  end;

  TFieldAccessor = class
    FFieldAccessInfo: array of TFieldAccessInfo;
    FFieldCount: integer;
    private
      procedure AddFieldValue(FieldStart, FieldEnd: integer);
    public
      constructor Create;
      destructor Destroy; override;

      procedure Clear;
      function TryAsFloat(FieldNum: integer; var Value: double): boolean;
      function TryAsInteger(FieldNum: integer; var Value: integer): boolean;
      function AsString(FieldNum: integer): AnsiString;


    property FieldCount: integer read FFieldCount;
  end;

procedure DoParse;
var sWarnings, sFileIn, sFileOut, overwrite: string;
    len, len_unparsed, bytes_read, hFileWarnings, hFileIn, hFileOut: integer;
    StartTick, EndTick: cardinal;
    line_start, line_end, line_last, len_bufwrite: integer;
    ptrLineStart, ptrFieldStart, ptrFieldEnd: PAnsiChar;
    ReadCompleted: boolean;
      isQuoted: boolean;
      LineNum, FieldNum: integer;
    FieldAccessor: TFieldAccessor;

  procedure LogWarning(warning: AnsiString);
  begin
    FileWrite(hFileWarnings, warning[1], length(warning));
  end;

  procedure FlushOutput;
  begin
    if (len_bufwrite > 0)
      then FileWrite(hFileOut, buf_write[0], len_bufwrite);
    len_bufwrite:= 0;
  end;


  procedure WriteChar(AChar: AnsiChar);
  begin
    //grow write buffer if needed
    if (len_bufwrite > buf_write_size) then
    begin
      inc(buf_write_size, buf_grow);
      SetLength(buf_write, buf_write_size);
    end;

    buf_write[len_bufwrite]:= AChar;     // add parsed data to buf_write
    inc(len_bufwrite);                // Inc(len_bufwrite) if you add to the buffer

  end;

  procedure WritePChar(AChar: PAnsiChar);
  begin
    //grow write buffer if needed
    if (len_bufwrite > buf_write_size) then
    begin
      inc(buf_write_size, buf_grow);
      SetLength(buf_write, buf_write_size);
    end;

    buf_write[len_bufwrite]:= AChar^;     // add parsed data to buf_write
    inc(len_bufwrite);                // Inc(len_bufwrite) if you add to the buffer

  end;

  procedure ParseField;
  var p,last: PAnsiChar;
      bwStart, bwEnd: integer;
      contains_delim: boolean;
  begin
    //quick err check
    if (ptrFieldStart = nil) or (ptrFieldEnd=nil) or (ptrFieldStart>ptrFieldEnd) then
      raise exception.Create('Error: input not in expected format');

    //if field is just one char and ptrFieldStart^ is a comma, the field is empty - exit
    if (ptrFieldStart = ptrFieldEnd) and (ptrFieldStart^=',') then
    begin
      FieldAccessor.AddFieldValue(0, 0); //just add blank field
      ptrFieldStart:= nil;
      ptrFieldEnd:= nil;
      exit;
    end;

    //if field is quoted, skip first and last
    if ptrFieldStart^ = '"' then inc(ptrFieldStart);
    if ptrFieldEnd^   = '"' then dec(ptrFieldEnd);

    //remove leading spaces
    while (ptrFieldStart^ = ' ') and (ptrFieldStart < ptrFieldEnd)
      do Inc(ptrFieldStart);
    //need error check here, if ptrFieldStart =ptrFieldEnd here, we don't have a value

    //remove trailing spaces
    while (ptrFieldEnd^ = ' ') and (ptrFieldEnd > ptrFieldStart)
      do Dec(ptrFieldEnd);

    //scan for delim
    contains_delim:= false;
    p:= ptrFieldStart;
    while (p<=ptrFieldEnd) do
    begin
      if (p^=',') then
      begin
        contains_delim:= true;
        break;
      end;
      inc(p);
    end;

    //enclose with " if field value contains delim
    if (contains_delim) then WriteChar('"');

    //parse field data
    last:= nil;
    p:= ptrFieldStart;
    bwStart:= len_bufwrite;
    while (p<=ptrFieldEnd) do
    begin
      (*
        REMOVED THE FOLLOWING BECAUSE NOW PROVIDING FIELD LEVEL ACCESS PER PARSED LINE:
      {here you can check for text values where you expected numbers only
       from your example it looks like field one should only allow numbers
       so you could do something like:
       }
      if (FieldNum = 1) and (not(p^ in ['0'..'9']))
        then LogWarning(AnsiString(format('Error on line %d char %d. Expected numbers only, found %s'#13#10,[LineNum, (p-ptrLineStart), p^])));
      *)

      if not((p^ = '"') and ((last <> nil) and (last^ = '"'))) //removing double quotes
         or ((p^='"') and contains_delim) //don't remove " if field contains delim
        then WritePChar(p);

      last:= p;
      inc(p);
    end;
    bwEnd:= len_bufwrite;
    FieldAccessor.AddFieldValue(bwStart, bwEnd);


    //enclose with " if field value contains delim
    if (contains_delim) then WriteChar('"');

    //reset
    ptrFieldStart:= nil;
    ptrFieldEnd:= nil;
  end;

  procedure ParseLine;
  var i,v: integer;
      d: double;

  begin
    Inc(LineNum);
    isQuoted:= false;
    FieldNum:= 0;
    ptrFieldStart:= nil;
    ptrFieldEnd:= nil;
    FieldAccessor.Clear;

    //save LineStart - used to calculate character position for log file entry
    ptrLineStart:= @buf_read[line_start];

    //process from current block read
    for i := line_start to line_end do
    begin
      //do ParseChar(@buf_read[i]);
      //ignore LF
      if (buf_read[i] = #10) then Continue;

      //set start of field value
      if not Assigned(ptrFieldStart) then ptrFieldStart:= @buf_read[i];

      if (buf_read[i] = '"') then isQuoted:= not isQuoted;

      //check for field delimiter
      if (buf_read[i] in [',',#13]) and (not isQuoted) then
      begin
        if (ptrFieldEnd=nil) then ptrFieldEnd:= @buf_read[i]; //in case we reached the end (there won't be a comma)
        Inc(FieldNum);
        ParseField;
        if (buf_read[i] = ',') then WritePChar(@buf_read[i]); //add field delimiter

        //if end of record, reset FieldNum and write CRLF
        if (buf_read[i] = #13) then
        begin
          FieldNum:= 0;
          WriteChar(#13);
          WriteChar(#10);
        end;
      end
      else ptrFieldEnd:= @buf_read[i];
    end;

    //do any field specific checks here e.g.
    if not FieldAccessor.TryAsInteger(0, v)
      then LogWarning(AnsiString(format('Error on line %d field 1. Expected integer value, found %s'#13#10,[LineNum, FieldAccessor.AsString(0)])));
    // you could do range checks if needed like this : if (v < 0) or (v>= SomeMaxNumber) then LogWarning
    if not FieldAccessor.TryAsFloat(1, d)
      then LogWarning(AnsiString(format('Error on line %d field 2. Expected float value, found %s'#13#10,[LineNum, FieldAccessor.AsString(1)])));

    if len_bufwrite >= buf_size then FlushOutput;
  end;


begin
  sFileIn:= ParamStr(1);
  sFileOut:= ParamStr(2);

  len_bufWrite:= 0;
  len_unparsed:= 0;
  SetLength(buf_write, buf_size); //initial size; will grow as needed

  //ensure file exists
  if not FileExists(sFileIn) then
    raise exception.Create('ERROR-> File not found: '+sFileIn);

  //ensure output is not the same as source
  if (sFileIn=sFileOut) then
    raise exception.Create('ERROR-> input_file and output_file may not be the same file: '+sFileIn);

  //prompt user for overwrite if file already exists
  if FileExists(sFileOut) then
  begin
    Writeln('output_file already exists. Overwrite(Y/N)?');
    ReadLn(overwrite);
    if LowerCase(overwrite) <> 'y' then
      raise exception.Create('ABORT-> Aborted by user');
  end;

  sWarnings:= 'warnings_'+ChangeFileExt(sFileOut, '.log');

  Writeln('Parsing file...');

  hFileIn:= FileOpen(sFileIn, fmOpenRead or fmShareDenyWrite); //open file, allow shared access but don't allow writes
  hFileOut:= FileCreate(sFileOut);
  hFileWarnings:= FileCreate(sWarnings);
  FieldAccessor:= TFieldAccessor.Create;
  try
    StartTick:= GetTickCount;

    LineNum:= 0;
    repeat

      bytes_read:= FileRead(hFileIn, buf_read[len_unparsed], buf_size - len_unparsed);
      ReadCompleted:= (bytes_read < buf_size - len_unparsed);
      line_last:= len_unparsed + bytes_read-1;
      //len_unparsed:= 0; not needed here, setting it below

      //work through buffer
      if (buf_read[0] = #10) then line_start:= 1 else line_start:= 0; //skip first char if #10
      line_end:= line_start;
      while (line_end <= line_last) do
      begin
        if (buf_read[line_end] = #13) then
        begin
          ParseLine;
          line_start:= line_end+1; //set start and end markers
        end;

        Inc(line_end);
      end;

      //move unparsed to start of buf_read for next pass
      if (line_end <> line_start) then
      begin
        if (buf_read[line_start] = #10) then Inc(line_start); //skip first char if #10
        len_unparsed:= line_end - line_start;
        move(buf_read[line_start], buf_read[0], len_unparsed);
      end
      else len_unparsed:= 0;

      //error check
      if (len_unparsed = buf_size) then //nothing was parsed!
        raise exception.Create('Error: Read buffer too small or no end of line detected');


    until ReadCompleted;
    //write last block if any
    if (len_bufwrite > 0) then FlushOutput;

    EndTick:= GetTickCount;
    Writeln('Parsing completed in '+IntToStr((EndTick-StartTick) div 1000)+'.'+IntToStr((EndTick-StartTick) mod 1000)+' seconds');

  finally
    FileClose(hFileIn);
    FileClose(hFileOut);
    FileClose(hFileWarnings);
    FieldAccessor.Free;
  end;
end;

procedure OutputUsage;
begin
  writeln('csv_parse v1.0');
  writeln('-------------------------------------------');
  writeln('USAGE: csv_parse input_file output_file');
  writeln('');
  writeln('input_file: expects a csv filename and path');
  writeln('output_file: new file to be created');
end;

{ TFieldAccessor }

procedure TFieldAccessor.AddFieldValue(FieldStart, FieldEnd: integer);
begin
  Inc(FFieldCount);
  SetLength(FFieldAccessInfo, FFieldCount);
  FFieldAccessInfo[FFieldCount-1].FieldStart:= FieldStart;
  FFieldAccessInfo[FFieldCount-1].FieldEnd:= FieldEnd;
end;

function TFieldAccessor.AsString(FieldNum: integer): AnsiString;
var len: integer;
begin
  result:= '';
  len:= integer(FFieldAccessInfo[FieldNum].FieldEnd) - integer(FFieldAccessInfo[FieldNum].FieldStart);
  if (len > 0) then
  begin
    SetLength(result, len);
    move(buf_write[FFieldAccessInfo[FieldNum].FieldStart], result[1], len);
  end;
end;

procedure TFieldAccessor.Clear;
begin
  FFieldCount:= 0;
  SetLength(FFieldAccessInfo, 0);
end;

constructor TFieldAccessor.Create;
begin
  FFieldCount:= 0;
end;

destructor TFieldAccessor.Destroy;
begin
  Clear;
  inherited;
end;

function TFieldAccessor.TryAsFloat(FieldNum: integer;
  var Value: double): boolean;
begin
  result:= TryStrToFloat(String(AsString(FieldNum)), Value);
end;

function TFieldAccessor.TryAsInteger(FieldNum: integer;
  var Value: integer): boolean;
begin
  result:= TryStrToInt(String(AsString(FieldNum)), Value);
end;

begin
  try
    if (ParamCount <> 2)
      then OutputUsage
      else DoParse;
  except on E:Exception
    do Writeln(E.Message);
  end;
end.


0
 
LVL 45

Expert Comment

by:aikimark
ID: 40058205
@Pierre

At least the INCOME column data has been deleted. We are seeing the five-digit Zip Code in that column.
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40058208
@aiki
The old EE showed only the first few lines of code and let you view the rest by scrolling. After the facelift it doesn't, did you notice? In future I think I'll rather attach the code as a file, otherwise the thread gets gobbled up by it.
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40058229
noted, and thanks.
0
 

Author Comment

by:shawn857
ID: 40058274
Hi guys... Aikimark, you wrote:

"That file doesn't look anything like the sample data you posted in your earlier comment.
http:#a39987768".

Correct... not so many examples of "bad data" in this most recent sample file. It's just another kind of CSV data that I might be encountering. It could be any kind.

And it is just sample data... nothing relevant in there.

Pierre, you're right - when you look at the pure output file your code produces, it appears fine. The garbled output was obtained when I added code to write out each parsed field to the warning log file ... one field per line. There it revealed that the data getting stored in the FFieldAccessInfo fields was getting corrupted after the first full "block" of data had been read/written.
    I had added this code, just before the "FlushOutput" line at the very end of the Parseline routine:

    for x:=0 to FieldAccessor.FFieldCount - 1 do
    begin
      LogWarning(AnsiString(FieldAccessor.AsString(x)+#13#10));
    end;



Thanks!
    Shawn
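[Editor's note: the symptom described here (fields look fine in the output file but come back garbled when read through FFieldAccessInfo after the first full block) is consistent with stale pointers. The pointer-indexed version stores raw PAnsiChar addresses into buf_write, and both SetLength (when the write buffer grows) and FlushOutput (which rewinds len_bufwrite so later lines overwrite earlier bytes) can leave those stored addresses pointing at moved or recycled memory. A minimal sketch of the hazard, not taken from the posted program:]

```pascal
program StalePointerDemo;
{$APPTYPE CONSOLE}

var
  buf: array of AnsiChar;
  p: PAnsiChar;
begin
  SetLength(buf, 4);
  buf[0] := 'A';
  p := @buf[0];                 // p points into the array's current memory block
  SetLength(buf, 1024 * 1024);  // growing may reallocate: the block can move
  // p may now dangle; reading p^ is undefined behaviour.
  // An integer index (buf[0]) survives the reallocation; a raw pointer does not,
  // which would explain why the integer-indexed version behaves and the
  // pointer-indexed one intermittently doesn't.
  Writeln(buf[0]);              // safe: existing data is preserved by SetLength
end.
```

This also fits the buffer-size sensitivity reported later in the thread: whether the bug bites depends on whether a grow or flush happens to occur while stored field pointers are still in use.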
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40058350
Shawn, I did the same and still don't see any errors. All looks fine to me. Can you be more specific?

I did this (at end of ParseLine function):
    tmp:= '';
    for i:= 0 to FieldAccessor.FieldCount - 1
      do tmp:= tmp + FieldAccessor.AsString(i)+#9;
    LogWarning(tmp+#13#10);



and here are the results:
results-parsed.csv
0
 

Author Comment

by:shawn857
ID: 40059835
Here's something odd Pierre - I implemented your suggested changes and still got the same garbled results when writing individual fields, so I had the idea to change the buf_size to something really small so I'd be able to view the whole buffer in debug, and hopefully track the problem better. I changed buf_size to just 200, and instantly it fixed everything! Now it doesn't produce that garbled output! Isn't that weird? Why would the older buf_size of 65535 cause me such problems?

Thanks
    Shawn
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40060043
Strange indeed. Did you try it using my latest posted code, exactly as posted? I'm still confused, because when I run it against your sample I don't get any problems; it all appears to be correct. What version of Delphi are you using?
0
 

Author Comment

by:shawn857
ID: 40060225
I'm going to try with your exact program Pierre - instead of taking your routines and putting them in my program. Meanwhile, more strangeness. I had the buf_size set at 3170 and everything worked. Then I set it at 3171 - it gave the errors. BUT, when I ran it again (without exiting the program or re-compiling it), it worked! I ran it again - it worked. I exited the program, re-compiled it (same buf_size setting: 3171), then ran it - it gave the garbled output. I remained in the program, ran it again... it worked. Ran it again, it worked. For some reason, the 1st run after a fresh compile fails, but successive runs work. This is so strange!!! I'm going to reboot my computer, maybe I stomped on some memory or something...

Thanks
   Shawn
0
 

Author Comment

by:shawn857
ID: 40060414
Pierre, I just ran your *exact* console application on my data file and yes, it runs perfectly. I must have done something wrong with the few seemingly innocent mods I made. I have to do some backtracking now to see what code I added caused all this... dang. Sorry for the bother, you guys!

Thanks
    Shawn
0
 

Author Comment

by:shawn857
ID: 40060462
OK... did some digging and I took your latest code Pierre, and converted it from "integer indexing" back to your original "pointer indexing". Ran that on my CSV file and this resulted in garbled output! Give it a try on your end please.  :-)

Thanks
   Shawn
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40060831
ok, back to pointers. Ran it and all seems fine.

Here it is:
program csv_parser3;

{$APPTYPE CONSOLE}

uses
  SysUtils,
  windows;

const buf_size = 65535;
      buf_grow = 1024;
//const buf_size = 524280;

var buf_read : array[0..buf_size-1] of AnsiChar;
    buf_write: array of AnsiChar; //used to hold data for writing to output_file
    buf_write_size: integer = buf_size; //tracks Length(buf_write); must stay in sync with SetLength calls


type

  TFieldAccessInfo = record
    FieldStart, FieldEnd: PAnsiChar;
  end;

  TFieldAccessor = class
    FFieldAccessInfo: array of TFieldAccessInfo;
    FFieldCount: integer;
    private
      procedure AddFieldValue(FieldStart, FieldEnd: PAnsiChar);
    public
      constructor Create;
      destructor Destroy; override;

      procedure Clear;
      function TryAsFloat(FieldNum: integer; var Value: double): boolean;
      function TryAsInteger(FieldNum: integer; var Value: integer): boolean;
      function AsString(FieldNum: integer): AnsiString;


    property FieldCount: integer read FFieldCount;
  end;

procedure DoParse;
var sWarnings, sFileIn, sFileOut, overwrite: string;
    len_unparsed, bytes_read, hFileWarnings, hFileIn, hFileOut: integer;
    StartTick, EndTick: cardinal;
    line_start, line_end, line_last, len_bufwrite: integer;
    ptrLineStart, ptrFieldStart, ptrFieldEnd: PAnsiChar;
    ReadCompleted: boolean;
      isQuoted: boolean;
      LineNum, FieldNum: integer;
    FieldAccessor: TFieldAccessor;

  procedure LogWarning(warning: AnsiString);
  begin
    FileWrite(hFileWarnings, warning[1], length(warning));
  end;

  procedure FlushOutput;
  begin
    if (len_bufwrite > 0)
      then FileWrite(hFileOut, buf_write[0], len_bufwrite);
    len_bufwrite:= 0;
  end;


  procedure WriteChar(AChar: AnsiChar);
  begin
    //grow write buffer if needed (>= because valid indices run 0..buf_write_size-1)
    if (len_bufwrite >= buf_write_size) then
    begin
      inc(buf_write_size, buf_grow);
      SetLength(buf_write, buf_write_size);
    end;

    buf_write[len_bufwrite]:= AChar; //append parsed char to buf_write
    inc(len_bufwrite);
  end;

  procedure WritePChar(AChar: PAnsiChar);
  begin
    //grow write buffer if needed (>= because valid indices run 0..buf_write_size-1)
    //NOTE: SetLength may reallocate buf_write, invalidating any PAnsiChar
    //pointers previously stored via FieldAccessor.AddFieldValue
    if (len_bufwrite >= buf_write_size) then
    begin
      inc(buf_write_size, buf_grow);
      SetLength(buf_write, buf_write_size);
    end;

    buf_write[len_bufwrite]:= AChar^; //append parsed char to buf_write
    inc(len_bufwrite);
  end;

  procedure ParseField;
  var p,last: PAnsiChar;
      bwStart, bwEnd: pAnsiChar;
      contains_delim: boolean;
  begin
    //quick err check
    if (ptrFieldStart = nil) or (ptrFieldEnd=nil) or (ptrFieldStart>ptrFieldEnd) then
      raise exception.Create('Error: input not in expected format');

    //if field is just one char and ptrFieldStart^ is a comma, the field is empty - exit
    if (ptrFieldStart = ptrFieldEnd) and (ptrFieldStart^=',') then
    begin
      FieldAccessor.AddFieldValue(nil, nil); //just add blank field
      ptrFieldStart:= nil;
      ptrFieldEnd:= nil;
      exit;
    end;

    //if field is quoted, skip first and last
    if ptrFieldStart^ = '"' then inc(ptrFieldStart);
    if ptrFieldEnd^   = '"' then dec(ptrFieldEnd);

    //remove leading spaces
    while (ptrFieldStart^ = ' ') and (ptrFieldStart < ptrFieldEnd)
      do Inc(ptrFieldStart);
    //need error check here, if ptrFieldStart =ptrFieldEnd here, we don't have a value

    //remove trailing spaces
    while (ptrFieldEnd^ = ' ') and (ptrFieldEnd > ptrFieldStart)
      do Dec(ptrFieldEnd);

    //scan for delim
    contains_delim:= false;
    p:= ptrFieldStart;
    while (p<=ptrFieldEnd) do
    begin
      if (p^=',') then
      begin
        contains_delim:= true;
        break;
      end;
      inc(p);
    end;

    //enclose with " if field value contains delim
    if (contains_delim) then WriteChar('"');

    //parse field data
    last:= nil;
    p:= ptrFieldStart;
    bwStart:= @buf_write[len_bufwrite];
    while (p<=ptrFieldEnd) do
    begin
      (*
        REMOVED THE FOLLOWING BECAUSE NOW PROVIDING FIELD LEVEL ACCESS PER PARSED LINE:
      {here you can check for text values where you expected numbers only
       from your example it looks like field one should only allow numbers
       so you could do something like:
       }
      if (FieldNum = 1) and (not(p^ in ['0'..'9']))
        then LogWarning(AnsiString(format('Error on line %d char %d. Expected numbers only, found %s'#13#10,[LineNum, (p-ptrLineStart), p^])));
      *)

      if not((p^ = '"') and ((last <> nil) and (last^ = '"'))) //removing double quotes
         or ((p^='"') and contains_delim) //don't remove " if field contains delim
        then WritePChar(p);

      last:= p;
      inc(p);
    end;
    bwEnd:= @buf_write[len_bufwrite];
    FieldAccessor.AddFieldValue(bwStart, bwEnd);


    //enclose with " if field value contains delim
    if (contains_delim) then WriteChar('"');

    //reset
    ptrFieldStart:= nil;
    ptrFieldEnd:= nil;
  end;

  procedure ParseLine;
  var i,v: integer;
      d: double;
      tmp: AnsiString; //needed by the field-dump loop at the end of this procedure

  begin
    Inc(LineNum);
    isQuoted:= false;
    FieldNum:= 0;
    ptrFieldStart:= nil;
    ptrFieldEnd:= nil;
    FieldAccessor.Clear;

    //save LineStart - used to calculate character position for log file entry
    ptrLineStart:= @buf_read[line_start];

    //process from current block read
    for i := line_start to line_end do
    begin
      //do ParseChar(@buf_read[i]);
      //ignore LF
      if (buf_read[i] = #10) then Continue;

      //set start of field value
      if not Assigned(ptrFieldStart) then ptrFieldStart:= @buf_read[i];

      if (buf_read[i] = '"') then isQuoted:= not isQuoted;

      //check for field delimeter
      if (buf_read[i] in [',',#13]) and (not isQuoted) then
      begin
        if (ptrFieldEnd=nil) then ptrFieldEnd:= @buf_read[i]; //in case we reached the end (there won't be a comma)
        Inc(FieldNum);
        ParseField;
        if (buf_read[i] = ',') then WritePChar(@buf_read[i]); //add field delimeter

        //if end of record, reset FieldNum and write CRLF
        if (buf_read[i] = #13) then
        begin
          FieldNum:= 0;
          WriteChar(#13);
          WriteChar(#10);
        end;
      end
      else ptrFieldEnd:= @buf_read[i];
    end;
(*
    //do any field specific checks here e.g.
    if not FieldAccessor.TryAsInteger(0, v)
      then LogWarning(AnsiString(format('Error on line %d field 1. Expected integer value, found %s'#13#10,[LineNum, FieldAccessor.AsString(0)])));
    // you could do range checks if needed like this : if (v < 0) or (v>= SomeMaxNumber) then LogWarning
    if not FieldAccessor.TryAsFloat(1, d)
      then LogWarning(AnsiString(format('Error on line %d field 1. Expected float value, found %s'#13#10,[LineNum, FieldAccessor.AsString(0)])));
*)
    tmp:= '';
    for i := 0 to FieldAccessor.FieldCount - 1
      do tmp:= tmp + FieldAccessor.AsString(i) + #9;
    LogWarning(AnsiString(tmp+#13#10));

    if len_bufwrite >= buf_size then FlushOutput;
  end;


begin
  sFileIn:= ParamStr(1);
  sFileOut:= ParamStr(2);

  len_bufWrite:= 0;
  len_unparsed:= 0;
  SetLength(buf_write, buf_size); //initial size; will grow as needed

  //ensure file exists
  if not FileExists(sFileIn) then
    raise exception.Create('ERROR-> File not found: '+sFileIn);

  //ensure not output not same as source
  if (sFileIn=sFileOut) then
    raise exception.Create('ERROR-> input_file and output_file may not be the same file: '+sFileIn);

  //prompt user for overwrite if file already exists
  if FileExists(sFileOut) then
  begin
    Writeln('output_file already exists. Overwrite(Y/N)?');
    ReadLn(overwrite);
    if LowerCase(overwrite) <> 'y' then
      raise exception.Create('ABORT-> Aborted by user');
  end;

  sWarnings:= 'warnings_'+ChangeFileExt(sFileOut, '.log');

  Writeln('Parsing file...');

  hFileIn:= FileOpen(sFileIn, fmOpenRead or fmShareDenyWrite); //open file, allow shared access but don't allow writes
  hFileOut:= FileCreate(sFileOut, fmOpenWrite or fmShareDenyWrite);
  hFileWarnings:= FileCreate(sWarnings);
  FieldAccessor:= TFieldAccessor.Create;
  try
    StartTick:= GetTickCount;

    LineNum:= 0;
    repeat

      bytes_read:= FileRead(hFileIn, buf_read[len_unparsed], buf_size - len_unparsed);
      ReadCompleted:= (bytes_read < buf_size - len_unparsed);
      line_last:= len_unparsed + bytes_read-1;
      //len_unparsed:= 0; not needed here, setting it below

      //work through buffer
      if (buf_read[0] = #10) then line_start:= 1 else line_start:= 0; //skip first char if #10
      line_end:= line_start; //line_end:= 0;
      while (line_end <= line_last) do
      begin
        if (buf_read[line_end] = #13) then
        begin
          ParseLine;
          line_start:= line_end+1; //set start and end markers
        end;

        Inc(line_end);
      end;

      //move unparsed to start of buf_read for next pass
      if (line_end <> line_start) then
      begin
        if (buf_read[line_start] = #10) then Inc(line_start); //skip EOL char #10
        len_unparsed:= line_end - line_start;
        move(buf_read[line_start], buf_read[0], len_unparsed);
      end
      else len_unparsed:= 0;

      //error check
      if (len_unparsed = buf_size) then //nothing was parsed!
        raise exception.Create('Error: Read buffer too small or no end of line detected');


    until ReadCompleted;
    //write last block if any
    if (len_bufwrite > 0) then FlushOutput;

    EndTick:= GetTickCount;
    Writeln('Parsing completed in '+IntToStr((EndTick-StartTick) div 1000)+'.'+IntToStr((EndTick-StartTick) mod 1000)+' seconds');

  finally
    FileClose(hFileIn);
    FileClose(hFileOut);
    FileClose(hFileWarnings);
    FieldAccessor.Free;
  end;
end;

procedure OutputUsage;
begin
  writeln('csv_parse v1.0');
  writeln('-------------------------------------------');
  writeln('USAGE: csv_parse input_file output_file');
  writeln('');
  writeln('input_file: expects a csv filename and path');
  writeln('output_file: new file to be created');
end;

{ TFieldAccessor }

procedure TFieldAccessor.AddFieldValue(FieldStart, FieldEnd: PAnsiChar);
begin
  Inc(FFieldCount);
  SetLength(FFieldAccessInfo, FFieldCount);
  FFieldAccessInfo[FFieldCount-1].FieldStart:= FieldStart;
  FFieldAccessInfo[FFieldCount-1].FieldEnd:= FieldEnd;
end;

function TFieldAccessor.AsString(FieldNum: integer): AnsiString;
var len: integer;
begin
  result:= '';
  len:= integer(FFieldAccessInfo[FieldNum].FieldEnd) - integer(FFieldAccessInfo[FieldNum].FieldStart);
  if (len > 0) then
  begin
    SetLength(result, len);
    move(FFieldAccessInfo[FieldNum].FieldStart^, result[1], len);
  end;
end;

procedure TFieldAccessor.Clear;
begin
  FFieldCount:= 0;
  SetLength(FFieldAccessInfo, 0);
end;

constructor TFieldAccessor.Create;
begin
  FFieldCount:= 0;
end;

destructor TFieldAccessor.Destroy;
begin
  Clear;
  inherited;
end;

function TFieldAccessor.TryAsFloat(FieldNum: integer;
  var Value: double): boolean;
begin
  result:= TryStrToFloat(String(AsString(FieldNum)), Value);
end;

function TFieldAccessor.TryAsInteger(FieldNum: integer;
  var Value: integer): boolean;
begin
  result:= TryStrToInt(String(AsString(FieldNum)), Value);
end;

begin
  try
    if (ParamCount <> 2)
      then OutputUsage
      else DoParse;
  except on E:Exception
    do Writeln(E.Message);
  end;
end.


0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40060870
In case you used the above code already, please note I edited it (I forgot a ^).
0
 

Author Comment

by:shawn857
ID: 40063591
Thank you Pierre - I'm going to stick with the "integer-indexed" version... I don't know why, but I keep getting that garbled field output when I use the pointer-indexed version. I think I am good now, and I owe you many thanks - it works great and is very, very fast. Before we wrap up this question, just a couple of quick things I was wondering about, please:

- I needed to move the Parseline and Parsefield procedures outside from within the scope of the DoParse procedure. Do you think this has any speed implications?

- Also, on that same note, some of the variables in your code that were local, I made global. Would that slow anything down, do you think?

Thanks!
    Shawn
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40066843
The only reason I had ParseLine and ParseField as nested functions was for readability. It may even have had a performance impact, but without testing I can't tell.

As for making some variables global, if the only reason was because ParseLine and ParseField use them then rather add parameters to these functions and pass them by reference from DoParse.

As for the performance of global vs local, I don't know. I suspect it won't make much difference, but again, without testing I can't tell.
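[Editor's note: to illustrate the pass-by-reference alternative suggested here, a minimal sketch; ParseOneLine is a made-up stand-in for the real ParseLine, not code from the posted program:]

```pascal
program VarParamDemo;
{$APPTYPE CONSOLE}

procedure ParseOneLine(var LineNum: integer);
begin
  Inc(LineNum); // mutates the caller's variable: no global needed
end;

procedure DoParse;
var LineNum: integer; // stays local to DoParse, shared with helpers by reference
begin
  LineNum := 0;
  ParseOneLine(LineNum);
  ParseOneLine(LineNum);
  Writeln('Lines parsed: ', LineNum); // prints 2
end;

begin
  DoParse;
end.
```

This keeps the state's lifetime tied to DoParse while letting the former nested routines compile as standalone procedures.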
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40066930
OK, did some quick tests.

Global vars vs local vars:
using global vars resulted in a 17% speed improvement.
It makes sense, as global vars are allocated once whereas local vars are allocated dynamically on the stack.

Nested func vs part of main func:
no performance difference noted.
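[Editor's note: for anyone wanting to repeat a comparison like this, a rough sketch of a micro-benchmark. The indirect call through the procedure pointer adds overhead of its own, so treat results as indicative only; the 17% figure above is Pierre's measurement on the actual parser, not this toy. TProcedure is the parameterless procedure type from SysUtils.]

```pascal
program GlobalVsLocal;
{$APPTYPE CONSOLE}

uses Windows, SysUtils;

var GlobalCounter: integer;

procedure UseGlobal;
begin
  Inc(GlobalCounter); // static storage, allocated once for the program
end;

procedure UseLocal;
var LocalCounter: integer;
begin
  LocalCounter := 0;  // stack slot set up on every call
  Inc(LocalCounter);
end;

procedure TimeIt(const Name: string; Proc: TProcedure);
var t: cardinal; i: integer;
begin
  t := GetTickCount; // ~10-16 ms granularity, hence the large loop count
  for i := 1 to 100000000 do Proc;
  Writeln(Name, ': ', GetTickCount - t, ' ms');
end;

begin
  TimeIt('global', UseGlobal);
  TimeIt('local ', UseLocal);
end.
```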
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40066954
Updated results on the big-file test, using global vars instead of local vars for DoParse:
Test file lines: 5 025 000
Test file size: 1.06 GB (1,139,000,320 bytes)
Test duration: 18.715 seconds (average of 3 runs)

~3.74 seconds per million records
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40067220
@Pierre

Is there a way to test your code with your hard drive's cache buffer cleared? A hybrid drive will give outstanding performance figures when reading sectors out of its cache memory. Readers with standard spinning media will probably not get such good performance.
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40067338
aiki, getting a proper speed measurement is not the point of my testing. All I need to know is whether a certain method of doing things is faster than the others, so it would actually suit me if it were using the HD cache buffer, as that gives a clearer picture of the code's performance excluding disk reads and writes. It is not intended as a benchmark, just an indication of which approach gives better speed.

Having said that, the last time you brought this up I did the same limited testing (I just run it 3 times and take the average) with 3 different input files to see if it made a difference - it made none. So maybe my HD is not caching.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40067382
The first run's time is probably more indicative of what traditional spinning media users will experience.  After that, the file will be read from the hybrid drive's cache.  You will still see better-than-average performance figures even on your first run.
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40067446
As I said above, that doesn't matter; it's not the purpose of my testing. But usually my 3 runs are pretty close to each other.
0
 

Author Closing Comment

by:shawn857
ID: 40068508
Thank you so much Pierre for your diligent work on this, and thanks to Aikimark also for your input. I am very very happy with this solution!

A+++++

Cheers!
   Shawn
0
 
LVL 14

Expert Comment

by:Pierre Cornelius
ID: 40068653
You're welcome. Good luck!
0
 

Author Comment

by:shawn857
ID: 40068669
Thanks Pierre!   :-)
0
