vuzman

asked on

File parsing strategy for huge file

I need to parse a huge file (>1 GB) in order to process its contents; the contents will be stored in a database so that statistics can be computed from the data. Does anyone know of a strategy to optimize the parsing in terms of speed?

It's a tcpdump (or similar) file consisting of a lot of captured packets. Only a fraction of the data needs to be stored: only headers of some packet types will be stored, and an index so the full packet can be retrieved from the file quickly. Other packet types will just be counted, while others will be ignored.

The processing that takes place after the parsing will also be time-consuming, so the faster the parsing, the better. How should I go about doing this? Someone who has done this before advised me to write the data as SQL statements to another file, which is run against the database after the parsing, but to me this seems like an unnecessary step.

The language is C++
sunnycoder

Hi vuzman,

> It's a tcpdump (or similar) file, consisting of a lot of captured packets. Only a fraction
> of the data needs to be stored; only headers of some packet types will be stored and an index
> so the full packet can be retrieved from the file quickly.

Open the file
While there are packets to read
    Read the next header
    If this is a packet type you care about
        Store the required information in the database
    Get the packet length from the header
    Advance the file pointer to the byte after the packet
End while
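
To make the loop concrete, here is a minimal C++ sketch. It assumes a pcap-style layout -- a 24-byte global file header followed by 16-byte per-packet record headers whose bytes 8-11 hold the captured length, in native byte order -- so the header size, field offsets and file name are assumptions to adapt to your actual capture format:

#include <cstdio>
#include <cstdint>
#include <cstring>

int main()
{
    FILE *fp = std::fopen("capture.dump", "rb");        /* illustrative file name */
    if (!fp) return 1;

    std::fseek(fp, 24, SEEK_SET);                        /* skip the assumed 24-byte global header */

    unsigned char rec[16];                               /* assumed 16-byte per-packet record header */
    while (std::fread(rec, 1, sizeof rec, fp) == sizeof rec)
    {
        uint32_t incl_len;                               /* captured length of this packet */
        std::memcpy(&incl_len, rec + 8, sizeof incl_len);

        long body_offset = std::ftell(fp);               /* where the packet body starts: this is your index */

        /* decide here whether this packet should be kept; if so, store its
           header fields plus body_offset in the database */
        (void)body_offset;

        std::fseek(fp, (long)incl_len, SEEK_CUR);        /* jump straight to the next record header */
    }

    std::fclose(fp);
    return 0;
}

Only the small record headers are read; each packet body is skipped with fseek, and the saved offset doubles as the index for pulling the full packet back out of the file later.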

Sunnycoder
PreachDotNet

If the line types etc. are similar, you could connect to the text file as an ODBC data source, then use a simple INSERT INTO statement into a second database, with a SELECT statement to filter the relevant records.
vuzman

ASKER

sunnycoder:

Yeah, that's how I'd parse a normal-size file, but for a huge file I would imagine other strategies could be better. How about reading a big chunk of the file, parsing it, uploading, then reading again, etc.? If uploading to the DB is deferred, one could combine several statements, but would this result in performance gains?

PreachDotNet:

Nice idea, but the packets in the file will not be on a single line, and I'm afraid SELECT'ing won't work.
You don't have to read the entire file in one go ...
Declare a buffer and read the file one buffer-length at a time ... Keep repeating until all the data has been processed ... something like

char buffer[BUF_LEN];
size_t n;

while ((n = fread(buffer, 1, BUF_LEN, fp)) > 0)
{
       /* parse the n bytes just read; see the boundary-handling sketch below */
}

You need to take care of packets which lie on a buffer boundary, i.e. have only been partially read into the buffer.
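
One common way to handle that boundary case is to carry the unfinished tail over to the next read. A rough sketch -- parse_one() is a made-up placeholder for your real packet parser, and BUF_LEN is just a suggestion:

#include <cstdio>
#include <cstring>

/* parse_one() is hypothetical: it should consume exactly one complete packet
   found at the start of 'data' and return how many bytes it occupied, or
   return 0 if the 'len' remaining bytes hold only part of a packet. */
size_t parse_one(const char *data, size_t len);

enum { BUF_LEN = 1 << 20 };                          /* 1 MB read buffer; tune to taste */

void process_file(FILE *fp)
{
    static char buffer[BUF_LEN];
    size_t have = 0;                                 /* bytes currently sitting in the buffer */
    size_t n;

    while ((n = fread(buffer + have, 1, BUF_LEN - have, fp)) > 0)
    {
        have += n;

        size_t used = 0, consumed;
        while ((consumed = parse_one(buffer + used, have - used)) > 0)
            used += consumed;                        /* step over each complete packet */

        memmove(buffer, buffer + used, have - used); /* keep the partial tail at the front ... */
        have -= used;                                /* ... and top the buffer up on the next read */
    }
}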

Yet another alternative is to use several threads that process different parts of the file at once.
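
One caveat with threads: you cannot start parsing at an arbitrary byte offset, because packet boundaries are only known by walking the record headers. A workable variant -- just a sketch, assuming the same pcap-style layout as before (24-byte file header, 16-byte record headers with the captured length at bytes 8-11) and an illustrative file name -- is to build an index of packet offsets in one cheap pass, then let worker threads each take a slice of that index:

#include <cstdio>
#include <cstdint>
#include <cstring>
#include <vector>
#include <thread>
#include <algorithm>

struct RecordRef { long offset; uint32_t len; };     /* where a packet body starts, and its captured length */

/* Pass 1: walk the record headers once and remember where every packet starts. */
static std::vector<RecordRef> build_index(const char *path)
{
    std::vector<RecordRef> index;
    FILE *fp = std::fopen(path, "rb");
    if (!fp) return index;

    std::fseek(fp, 24, SEEK_SET);                    /* assumed global header size */
    unsigned char rec[16];
    while (std::fread(rec, 1, sizeof rec, fp) == sizeof rec)
    {
        uint32_t incl_len;
        std::memcpy(&incl_len, rec + 8, sizeof incl_len);
        index.push_back({ std::ftell(fp), incl_len });
        std::fseek(fp, (long)incl_len, SEEK_CUR);
    }
    std::fclose(fp);
    return index;
}

/* Pass 2: each thread opens its own FILE* and works through its slice of the index. */
static void process_slice(const char *path, const std::vector<RecordRef> *idx,
                          size_t begin, size_t end)
{
    FILE *fp = std::fopen(path, "rb");
    if (!fp) return;
    std::vector<char> pkt;
    for (size_t i = begin; i < end; ++i)
    {
        if ((*idx)[i].len == 0) continue;
        pkt.resize((*idx)[i].len);
        std::fseek(fp, (*idx)[i].offset, SEEK_SET);
        if (std::fread(pkt.data(), 1, pkt.size(), fp) == pkt.size())
        {
            /* decode, count or store the packet here */
        }
    }
    std::fclose(fp);
}

int main()
{
    const char *path = "capture.dump";               /* illustrative file name */
    std::vector<RecordRef> idx = build_index(path);

    const unsigned nthreads = 4;
    const size_t chunk = idx.size() / nthreads + 1;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t)
    {
        size_t b = t * chunk, e = std::min(idx.size(), b + chunk);
        if (b >= e) break;
        pool.emplace_back(process_slice, path, &idx, b, e);
    }
    for (std::thread &th : pool) th.join();
    return 0;
}

Whether this helps depends on where the time goes: on a single disk the threads mostly compete for seeks, so it pays off mainly when the per-packet processing, not the I/O, is the bottleneck.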

>If uploading to the DB is deferred one could combine several statements, but would this result in
>performance gains?
Yes and no ... You will make fewer system calls, so more time goes into processing and less into waiting. However, there might be little or no benefit in terms of disk access, since I/O is buffered. Whatever the case, it will be an improvement over the raw approach.
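
To get the benefit on the database side, the usual trick is to batch many rows into one multi-row INSERT instead of issuing one statement per packet. A rough sketch -- the table and column names are invented, and the finished string would be handed to whatever client call you use (or appended to a dump file):

#include <cstdio>
#include <string>
#include <vector>

/* Collect rows and emit one multi-row INSERT per 'batch' rows instead of one
   statement per row -- far fewer round trips / statements to parse. */
struct BatchInserter
{
    explicit BatchInserter(size_t batch) : batch_(batch) {}

    void add(unsigned type, long offset, unsigned len)
    {
        char row[64];
        std::snprintf(row, sizeof row, "(%u,%ld,%u)", type, offset, len);
        rows_.push_back(row);
        if (rows_.size() >= batch_) flush();
    }

    void flush()
    {
        if (rows_.empty()) return;
        /* invented table and column names -- adjust to your schema */
        std::string sql = "INSERT INTO packet_index (pkt_type, file_offset, pkt_len) VALUES ";
        for (size_t i = 0; i < rows_.size(); ++i)
        {
            if (i) sql += ',';
            sql += rows_[i];
        }
        sql += ';';
        std::puts(sql.c_str());       /* stand-in for the real client call or file write */
        rows_.clear();
    }

private:
    size_t batch_;
    std::vector<std::string> rows_;
};

int main()
{
    BatchInserter ins(1000);          /* batch size is a tuning knob */
    ins.add(6, 24L, 60);              /* example rows: packet type, offset in file, length */
    ins.add(17, 100L, 128);
    ins.flush();                      /* don't forget the final partial batch */
    return 0;
}
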
Normally a programming language has instructions to read a portion of data (100 bytes, for example) from a file.
You could use those instructions/methods to read just the portion of the file you need and to reposition within it. You could use RandomAccessFile (with seek capabilities).

Hope this helps you.
Bye, Giant.
If you're only going to store some of the data, avoid using databases for as long as possible to ensure maximum speed -- the approach in your last paragraph points in the right direction, HOWEVER, don't write SQL statements to a file, as those statements (INSERTs?) will be slow to run.

My solution would be to follow sunnycoder's route, with one difference - write the required data out to a second file (comma-delimited or whatever your preferred format is), then bulk copy (BCP) or similar this data into the database. On the face of it this looks like a long-winded solution, but I suggest it would be the quickest. I've managed to reduce long-running processes from 12+ hours to minutes using this approach. Ideally, remove any network traffic from the equation (run the process on a single server/desktop for maximum gain). Remember: networks are slow, and databases are slower than files!
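
A sketch of that intermediate step -- the output file name and the three columns (packet type, file offset, captured length) are just illustrative -- producing a delimited file the database's bulk loader can then swallow in one go:

#include <cstdio>

int main()
{
    FILE *out = std::fopen("packet_index.tsv", "w");   /* illustrative output name */
    if (!out) return 1;

    /* inside the parsing loop you would emit one line like this per kept packet */
    unsigned pkt_type = 6;                              /* example values */
    long     offset   = 24L;
    unsigned length   = 60;
    std::fprintf(out, "%u\t%ld\t%u\n", pkt_type, offset, length);

    std::fclose(out);
    return 0;
}

The bulk loader then reads the delimited file directly, so the server never has to parse one INSERT statement per row.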

(Of course, you could try sunnycoder's solution first, then if you need greater speed try out this one...)
vuzman

ASKER

If SQL statements aren't written right away, I might need to parse the second file... right?

I am not familiar with bcp -- isn't it an MS SQL utility? I'm using MySQL.
ASKER CERTIFIED SOLUTION
bochgoch

This solution is only available to members.
OK, here's my LazyAzz solution.....BUT WORKZ
1) Import the entire friggin flat file (assuming the delimiter is TAB or comma) into your SQL/MySQL table using "Data Transformation Services". Takes about 10-15 mins.

2) When the import is done, go into the Design view of your new gargantuan table and start deleting the columns you don't need. Takes about 10-15 mins.

2.5) OPTION - Cluster fields you will run your Delete query criteria on.

3) Run a Delete query on the crud you don't want or need.
Takes about 30-45 mins (worst case, depending on hardware).

4) Twist open a Corona, squeeze your lime into it and have a drink while your query runs. Takes about 3-5 beers, depending on fluid I/O.

I process ISA/Proxy logs for a living, ranging from 25 million to 250 million records in 1 GB to 4 GB text files, into SQL 2000. Anything you write in C, C#, C++, VB, VBScript or (God forbid) Java to scrub your data will take hours and hours and hours and hours........

Honorable mention - BCP works just as fast but is a little cryptic to me.
Correction in 2.5....

2.5) OPTION - Index the fields you will run your Delete query criteria on. Cluster-index the most common field.
vuzman

ASKER

I wanted to return to the question with some more answers when I had gotten them myself, but I've been too busy to explore this fully.

The answer gets a 'B' grade, as only the data-to-database loading has been addressed; no real strategy for reading huge files has been given by any of the responders.
I have processed large amounts of data before. I never had to go to a database, but where the data ends up is of little consequence as long as the files are on different physical drives.

Given that, your time is mostly spent in I/O wait. The thing that slows I/O down the most is when the drive has to seek from one track to another, which takes longest when the tracks are not adjacent. This can happen when your disk is fragmented. So... start by defragging your disk drive. Make sure your data is on a different drive from your database. Then, to reduce the amount of I/O your program needs to perform, allocate the largest buffer you can without causing Windows to page parts of your program to disk. This can be 10K - 512K depending on the amount of free real RAM your system has. Then fill the buffer with one read, parse that buffer, then read the next. Be careful of end-of-buffer processing so you don't mess up the one packet that is split between two reads.

If you use this method, you will be getting nearly full use of your CPU. This method is faster than overlapped I/O.

The BIG cautions are:
1) DO NOT page. If you are using Win XP (or one of its predecessors), you can bring up a performance monitor. It will tell you if you have excessive paging activity. If so, reduce the size of the buffer or the number of programs you have sitting in memory. You would be surprised what is sitting around in there waiting on some event you don't even care about.

2) End of buffer - if you are not careful you WILL drop data. The special case here is when the end of a packet falls exactly at the end of the buffer. Test thoroughly!

If you have any questions, Just ask...

Bill