File parsing strategy for huge file

Posted on 2004-09-20
Medium Priority
Last Modified: 2013-11-13
I need to parse a huge file (>1 GB) in order to process the contents; the contents will be stored in a database so that statistics can be computed from the data. Does anyone know of a strategy to optimize the parsing (in terms of speed)?

It's a tcpdump (or similar) file, consisting of a lot of captured packets. Only a fraction of the data needs to be stored: the headers of some packet types will be stored, together with an index so the full packet can be retrieved from the file quickly. Other packet types will just be counted, while others will be ignored.

The processing that will take place after the parsing will also be time consuming, so the faster the parsing, the better. How should I go about doing this? Someone who has done this before advised me to write the data as SQL statements to another file, which is run against the database after the parsing, but to me this seems like an unnecessary step.

The language is C++
Question by:vuzman
LVL 45

Expert Comment

ID: 12100051
Hi vuzman,

> It's a tcpdump (or similar) file, consisting of a lot of captured packets. Only a fraction
> of the data needs to be stored; only headers of some packet types will be stored and an index
> so the full packet can be retreived from the file quickly.

Open the file
While there are packets to read
    Read the next header
    If this is your type
        Store the required information in the database
    Get the packet length from the header
    Advance the file pointer to the byte after the packet
End while
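
A minimal C++ sketch of that loop, assuming a classic pcap-style capture file (24-byte global header, then a per-packet record header carrying the captured length); the file name and the type check are placeholders:

#include <cstdint>
#include <fstream>

// Per-packet record header as written by libpcap (assumption: classic pcap
// format, native byte order; adjust if your capture tool writes something else).
struct PcapRecordHeader {
    uint32_t ts_sec;    // timestamp, seconds
    uint32_t ts_usec;   // timestamp, microseconds
    uint32_t incl_len;  // bytes actually captured and stored in the file
    uint32_t orig_len;  // original packet length on the wire
};

int main() {
    std::ifstream file("capture.pcap", std::ios::binary);  // hypothetical file name
    file.seekg(24);                                         // skip the pcap global header

    PcapRecordHeader hdr;
    while (file.read(reinterpret_cast<char*>(&hdr), sizeof hdr)) {
        std::streampos packetStart = file.tellg();          // offset to index for later retrieval

        // Peek at the first bytes of the packet here to decide whether it is
        // "your type" (e.g. the EtherType / IP protocol field), then store
        // the header and packetStart in the database if it is.

        // Skip over the packet payload in one seek instead of reading it.
        file.seekg(packetStart + std::streamoff(hdr.incl_len));
    }
    return 0;
}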


Expert Comment

ID: 12100075
If the line types etc. are similar, you could connect to the text file as an ODBC data source, then use a simple INSERT INTO statement against a second database with a SELECT that filters the relevant records.

Author Comment

ID: 12100111

Yeah, that's how I'd parse a normal-size file, but for a huge file I would imagine other strategies could be better. How about reading a big chunk of the file, parsing it, uploading, then reading again, and so on? If uploading to the DB is deferred, one could combine several statements, but would this result in performance gains?


Nice idea, but the packets in the file will not be on a single line, and I'm afraid SELECT'ing won't work.

LVL 45

Expert Comment

ID: 12100167
You don't have to read the entire file in one go.
Declare a buffer and read the file one buffer length at a time, repeating until all the data has been processed, something like:

while ( fread (buffer, 1, BUF_LEN, fp) != 0 )

You need to take care of packets which lie on a buffer boundary, i.e. have only been partially read into the buffer.

Yet another alternative is to use several threads that start processing the file from different locations at once.

>If uploading to the DB is deferred one could combine several statements, but would this result in
>performance gains?
Yes and no... You will make fewer system calls, so more time is spent processing and less time waiting. However, there might be little or no benefit in terms of disk access, since I/O is buffered anyway. Whatever the case, it will be an improvement over the naive approach.
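
A sketch of that buffered loop in C++. This is only a sketch under assumptions: a pcap-style record header whose incl_len field sits at byte offset 8, the 24-byte global file header has already been skipped, and no single packet is larger than the buffer; packetLength and processPacket are hypothetical placeholders for your own parsing and storing code.

#include <cstdint>
#include <cstdio>
#include <cstring>

const size_t BUF_LEN = 1 << 20;      // 1 MB chunk; tune to taste
const size_t HEADER_SIZE = 16;       // pcap record header size (assumption)

// Hypothetical helpers -- replace with your real header parsing / storing code.
size_t packetLength(const char* p) {            // record header + captured bytes
    uint32_t incl_len;
    memcpy(&incl_len, p + 8, sizeof incl_len);  // incl_len field of the record header
    return HEADER_SIZE + incl_len;
}
void processPacket(const char*) { /* count it, or store header + file index */ }

void parseFile(const char* path) {
    FILE* fp = fopen(path, "rb");
    if (!fp) return;

    static char buffer[BUF_LEN];
    size_t leftover = 0;             // bytes of a partial packet kept from the previous chunk
    size_t got;

    while ((got = fread(buffer + leftover, 1, BUF_LEN - leftover, fp)) > 0) {
        size_t avail = leftover + got;
        size_t pos = 0;

        // Consume only complete packets; stop as soon as one is cut off.
        while (pos + HEADER_SIZE <= avail) {
            size_t len = packetLength(buffer + pos);
            if (pos + len > avail) break;        // packet split across the chunk boundary
            processPacket(buffer + pos);
            pos += len;
        }

        leftover = avail - pos;                  // carry the partial packet over
        memmove(buffer, buffer + pos, leftover);
    }
    fclose(fp);
}
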
LVL 12

Expert Comment

ID: 12101385
Most programming languages have functions to read a portion of data (100 bytes, for example) from a file. You could use those functions to read just the part of the file you need and to reposition within it, e.g. Java's RandomAccessFile (which has seek capabilities).
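
The C++ equivalent is a stream opened in binary mode with seekg. For example, here is a small hypothetical helper for the "retrieve the full packet from the file quickly" part, given the offset and length saved in your index during parsing:

#include <cstddef>
#include <fstream>
#include <vector>

// Hypothetical: re-read one packet using the offset/length stored in the index.
std::vector<char> readPacketAt(const char* path, std::streamoff offset, std::size_t length) {
    std::ifstream file(path, std::ios::binary);
    file.seekg(offset);                                           // jump straight to the packet
    std::vector<char> packet(length);
    file.read(packet.data(), static_cast<std::streamsize>(length));
    return packet;
}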

Hope this helps you.
Bye, Giant.
LVL 13

Expert Comment

ID: 12101462
If you're only going to store some of the data, avoid using the database for as long as possible to ensure maximum speed. The approach in your last paragraph will give you that speed; HOWEVER, don't write SQL statements to the file, as those statements (INSERTs?) will be slow to run.

My solution would be to follow sunnycoder's route, with one difference: write the required data out to a second file (comma delimited, or whatever your preferred format is), then use bulk copy (BCP) or a similar tool to load that data into the database. On the face of it this looks like a long-winded solution, but I suggest it would be the quickest. I've managed to reduce long-running processes from 12+ hours to minutes using this approach. Ideally, remove any network traffic from the equation (run the process on a single server/desktop for maximum gain). Remember: networks are slow, and databases are slower than files!
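
A sketch of the "write a delimited file, bulk-load it afterwards" step in C++; the record fields and the tab-separated layout are just an illustration and should match whatever columns your table actually has:

#include <cstdint>
#include <fstream>

// Hypothetical record extracted during parsing: packet type, timestamp,
// captured length, and the byte offset in the capture file (the retrieval index).
struct ParsedHeader {
    int      type;
    uint32_t ts_sec;
    uint32_t length;
    uint64_t fileOffset;
};

void appendRecord(std::ofstream& out, const ParsedHeader& h) {
    // One tab-separated line per stored packet -- ready for LOAD DATA INFILE / bcp.
    out << h.type << '\t' << h.ts_sec << '\t'
        << h.length << '\t' << h.fileOffset << '\n';
}

// Usage: std::ofstream out("headers.tsv"); then call appendRecord for each kept packet.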

(Of course, you could try sunnycoder's solution first, then if you need greater speed try out this one...)

Author Comment

ID: 12103931
If SQL statements aren't written right away, I might need to parse the second file... right?

I am not familiar with bcp, isn't it an MS SQL utility?  I'm using MySQL.
LVL 13

Accepted Solution

bochgoch earned 450 total points
ID: 12109767

The MySQL equivalent of BCP is LOAD DATA INFILE; see the MySQL manual for the syntax.
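
For completeness, a hedged sketch of running that statement from C++ with the MySQL C API (the table and column names are hypothetical and assume the tab-separated file from the parsing step; you could just as well run the same statement from the mysql command-line client):

#include <mysql/mysql.h>   // MySQL C API; link with -lmysqlclient
#include <cstdio>

// Assumption: a table packet_headers(type, ts_sec, length, file_offset)
// matching the tab-separated file written during parsing.
// Note: LOCAL requires local_infile to be enabled on both client and server.
bool bulkLoad(MYSQL* conn) {
    const char* sql =
        "LOAD DATA LOCAL INFILE 'headers.tsv' "
        "INTO TABLE packet_headers "
        "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' "
        "(type, ts_sec, length, file_offset)";
    if (mysql_query(conn, sql) != 0) {            // returns 0 on success
        std::fprintf(stderr, "LOAD DATA failed: %s\n", mysql_error(conn));
        return false;
    }
    return true;
}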


Expert Comment

ID: 12128666
OK, here's my LazyAzz solution.....BUT WORKZ
1) Import the entire friggin flat file (assuming the delimiter is TAB or comma) into your SQL/MySQL table using Data Transformation Services. Takes about 10-15 mins.

2) When the import is done, go into the Design view of your new gargantuan table and start deleting the columns you don't need. Takes about 10-15 mins.

2.5) OPTION - Cluster the fields you will run your DELETE query criteria on.

3) Run a DELETE query on the crud you don't want or need.
Takes about 30-45 mins (WORST CASE, depending on hardware).

4) Twist open a Corona, squeeze your lime into it and have a drink while your query runs. Takes about 3-5 beers, depending on fluid I/O.

I process ISA/Proxy logs for a living, ranging from 25 million to 250 million records in 1 GB to 4 GB text files, into SQL 2000. Anything you write in C, C#, C++, VB, VBScript or (God forbid) Java to scrub your data will take hours and hours and hours...

Honorable mention - BCP works just as fast but is a little cryptic to me.

Expert Comment

ID: 12128677
Correction in 2.5....

2.5) OPTION - Index the fields you will run your DELETE query criteria on; put a clustered index on the most common field.

Author Comment

ID: 12363027
I wanted to return to the question with some more answers when I had gotten them myself, but I've been too busy to explore this fully.

The answer gets a 'B' grade, as only the data-to-database loading has been addressed; no real strategy for reading huge files has been given by any of the responders.

Expert Comment

ID: 12449888
I have processed large amounts of data before. I never had to go to a database, but the final destination of the data is of little consequence as long as the files involved are on different physical drives.

Given that, your time is mostly spent in I/O wait. The thing that slows I/O down the most is when the drive has to seek from one track to another, and that takes longest when the tracks are not adjacent. This can happen when your disk is fragmented. So... start by defragmenting your disk drive. Make sure your data is on a different drive from your database. Then, to reduce the amount of I/O your program needs to perform, allocate the largest buffer you can without causing Windows to page parts of your program to disk. This can be 10K - 512K depending on the amount of free physical RAM your system has. Then fill the buffer with one read. Parse that buffer, then read the next. Be careful with end-of-buffer processing so you don't mess up the one packet that is split between two reads.

If you use this method, you will be getting nearly full use of your CPU. This method is faster than overlapped I/O.

The BIG cautions are:
1) DO NOT page. If you are using Win XP (or one of its predecessors), you can bring up a performance monitor; it will tell you if you have excessive paging activity. If so, reduce the size of the buffer or the number of programs you have sitting in memory. You would be surprised what is sitting around in there waiting on some event you don't even care about.

2) End of buffer - if you are not careful you WILL drop data. The special case here is when the end of a packet lands exactly at the end of the buffer. Test thoroughly!

If you have any questions, just ask...

