Solved

File parsing strategy for huge file

Posted on 2004-09-20
13
2,213 Views
Last Modified: 2013-11-13
I need to parse a huge file (>1 GB) and process its contents; the contents will be stored in a database so that statistics can be computed from the data. Does anyone know of a strategy to optimize the parsing for speed?

It's a tcpdump (or similar) file consisting of a lot of captured packets. Only a fraction of the data needs to be stored: for some packet types only the headers will be stored, along with an index so the full packet can be retrieved from the file quickly. Other packet types will just be counted, while the rest will be ignored.

The processing that takes place after the parsing will also be time-consuming, so the faster the parsing, the better. How should I go about doing this? Someone who has done this before advised me to write the data as SQL statements to another file, which is then run against the database after the parsing, but to me this seems like an unnecessary step.

The language is C++
0
Comment
Question by:vuzman
13 Comments
 
LVL 45

Expert Comment

by:sunnycoder
ID: 12100051
Hi vuzman,

> It's a tcpdump (or similar) file, consisting of a lot of captured packets. Only a fraction
> of the data needs to be stored; only headers of some packet types will be stored and an index
> so the full packet can be retreived from the file quickly.

Open the file
While there are packets to read
      Read the next header
      If this is a packet type you want
            Store the required information in the database
      Get the packet length from the header
      Advance the file pointer to the byte after the packet
End while
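A minimal C++ sketch of that loop, assuming a hypothetical fixed-size record header with a length and a type field (the real layout of your capture format will differ, and struct padding/endianness need checking when mapping an on-disk layout onto a struct):

#include <cstdio>
#include <cstdint>

// Hypothetical per-packet record header; substitute the real layout of
// your capture format.
struct RecordHeader {
    uint32_t length;   // bytes of packet data that follow this header
    uint16_t type;     // packet type, used to decide whether to keep it
};

int main()
{
    std::FILE* fp = std::fopen("capture.dump", "rb");
    if (!fp) return 1;

    RecordHeader hdr;
    while (std::fread(&hdr, sizeof hdr, 1, fp) == 1) {
        // ftell/fseek use long; for files over 2 GB use the 64-bit variants
        // (ftello/fseeko or _ftelli64/_fseeki64).
        long packetOffset = std::ftell(fp);
        if (hdr.type == 1 /* placeholder for an "interesting" type */) {
            // store hdr fields and packetOffset in the database here
        }
        // skip the packet body and move on to the next record header
        if (std::fseek(fp, static_cast<long>(hdr.length), SEEK_CUR) != 0)
            break;
    }
    std::fclose(fp);
    return 0;
}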

Sunnycoder
0
 
LVL 6

Expert Comment

by:PreachDotNet
ID: 12100075
If the line types etc. are similar, you could connect to the text file as an ODBC data source, then use a simple INSERT INTO statement against a second database, with a SELECT statement to filter the relevant records.
0
 

Author Comment

by:vuzman
ID: 12100111
sunnycoder:

yeah, that's how I'd parse a normal-size file, but for a huge file I would imagine other strategies could be better. How about reading a big chunk of the file, parsing it, uploading, then reading again, and so on? If uploading to the DB is deferred, one could combine several statements, but would this result in performance gains?

PreachDotNet:

Nice idea, but the packets in the file will not be on a single line, and I'm afraid SELECTing won't work.
0
 
LVL 45

Expert Comment

by:sunnycoder
ID: 12100167
You don't have to read the entire file in one go ...
Declare a buffer and read the file one buffer's worth at a time ... keep repeating until all the data has been processed ... something like

while ( fread(buffer, 1, BUF_LEN, fp) > 0 )
{
       ...
}

You need to take care of packets that lie on a buffer boundary, i.e. have only been partially read into the buffer.

Yet another alternative is to use several threads that start processing the file from different locations at once.

>If uploading to the DB is deferred one could combine several statements, but would this result in
>performance gains?
Yes and no ... You will make fewer system calls, so more time is spent processing and less waiting. However, there may be little or no benefit in terms of disk access, since I/O is buffered anyway. Whatever the case, it will be an improvement over the raw approach.
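A sketch of that chunked approach, including the boundary case. The record layout here is a stand-in (a 4-byte length prefix followed by that many bytes of packet data), and it assumes no single record is larger than the buffer:

#include <cstdio>
#include <cstdint>
#include <cstring>
#include <vector>

// Placeholder record processing: consume only whole records and return the
// number of bytes used; the caller keeps whatever is left over.
std::size_t processCompleteRecords(const char* data, std::size_t len)
{
    std::size_t used = 0;
    while (used + 4 <= len) {
        std::uint32_t recLen;
        std::memcpy(&recLen, data + used, 4);
        if (used + 4 + recLen > len)
            break;                       // record spans the buffer boundary
        // ... inspect the record, count it, or queue it for the database ...
        used += 4 + recLen;
    }
    return used;
}

void parseFile(const char* path)
{
    const std::size_t BUF_LEN = 1 << 20;    // 1 MB chunks
    std::vector<char> buffer(BUF_LEN);
    std::size_t carried = 0;                // bytes of a partial record

    std::FILE* fp = std::fopen(path, "rb");
    if (!fp) return;

    std::size_t got;
    while ((got = std::fread(buffer.data() + carried, 1,
                             buffer.size() - carried, fp)) > 0) {
        std::size_t total = carried + got;
        std::size_t used  = processCompleteRecords(buffer.data(), total);
        carried = total - used;
        // move the unfinished record to the front so the next fread appends to it
        std::memmove(buffer.data(), buffer.data() + used, carried);
    }
    std::fclose(fp);
}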
0
 
LVL 12

Expert Comment

by:Giant2
ID: 12101385
Most programming languages have instructions for reading a portion of data (100 bytes, for example) from a file. You could use those instructions/methods to read just the portion of the file you need and to reposition within it. You could use RandomAccessFile (with seek capabilities).

Hope this helps.
Bye, Giant.
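In C++ the equivalent of that random access is seeking on the stream. A small sketch of pulling one packet back out of the capture, assuming the offset and length come from the index stored during parsing:

#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Read 'length' bytes starting at byte 'offset' of the capture file.
std::vector<char> readPacket(const std::string& path,
                             std::uint64_t offset, std::size_t length)
{
    std::ifstream in(path, std::ios::binary);
    in.seekg(static_cast<std::streamoff>(offset));
    std::vector<char> packet(length);
    in.read(packet.data(), static_cast<std::streamsize>(length));
    return packet;
}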
0
 
LVL 13

Expert Comment

by:bochgoch
ID: 12101462
If you're only going to store some of the data, avoid the database for as long as possible to ensure maximum speed -- the approach in your last paragraph does that. HOWEVER, don't write SQL statements to the file, as those statements (INSERTs?) will be slow to run.

My solution would be to follow sunnycoder's route, with one difference: write the required data out to a second file (comma-delimited or whatever your preferred format is), then bulk-copy (BCP or similar) that data into the database. On the face of it this looks like a long-winded solution, but I suggest it would be the quickest. I've reduced long-running processes from 12+ hours to minutes using this approach. Ideally, remove any network traffic from the equation (run the process on a single server/desktop for maximum gain). Remember: networks are slow, and databases are slower than files!

(Of course, you could try sunnycoder's solution first, then try this one if you need greater speed...)
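A sketch of that intermediate step; the columns (timestamp, packet type, file offset, length) are only an example of what the parser might emit, written tab-delimited so it can be bulk-loaded afterwards:

#include <cstdio>
#include <cstdint>

// Append one parsed header as a tab-delimited row to the intermediate file.
void writeRow(std::FILE* out, std::uint32_t tsSec, std::uint16_t type,
              std::uint64_t offset, std::uint32_t length)
{
    std::fprintf(out, "%u\t%u\t%llu\t%u\n",
                 static_cast<unsigned>(tsSec),
                 static_cast<unsigned>(type),
                 static_cast<unsigned long long>(offset),
                 static_cast<unsigned>(length));
}

Open the output file once, call this for each packet you keep, and bulk-load the result when parsing finishes.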
0
 

Author Comment

by:vuzman
ID: 12103931
If the SQL statements aren't written right away, I might need to parse the second file... right?

I am not familiar with BCP; isn't it an MS SQL utility? I'm using MySQL.
0
 
LVL 13

Accepted Solution

by:
bochgoch earned 150 total points
ID: 12109767
Yepp...

The MySQL equivalent of BCP is LOAD DATA INFILE ... see here for the syntax:

http://dev.mysql.com/doc/mysql/en/LOAD_DATA.html
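A hedged sketch of issuing that load from the parser itself with the MySQL C API (the table, columns, file path, and credentials are all made up for illustration; the same statement can just as easily be run from the mysql command-line client):

#include <mysql/mysql.h>   // header location varies by platform/packaging
#include <cstdio>

int loadHeaders()
{
    MYSQL* conn = mysql_init(NULL);
    if (!mysql_real_connect(conn, "localhost", "user", "password",
                            "captures", 0, NULL, 0)) {
        std::fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
        return 1;
    }
    // Bulk-load the tab-delimited file produced by the parser.
    const char* sql =
        "LOAD DATA INFILE '/tmp/headers.tsv' "
        "INTO TABLE packet_headers "
        "FIELDS TERMINATED BY '\\t' "
        "(ts_sec, pkt_type, file_offset, pkt_length)";
    if (mysql_query(conn, sql) != 0)
        std::fprintf(stderr, "load failed: %s\n", mysql_error(conn));
    mysql_close(conn);
    return 0;
}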
0
 

Expert Comment

by:BrokeAzz
ID: 12128666
OK, here's my LazyAzz solution ..... BUT IT WORKS.
1) Import the entire flat file (assuming the delimiter is TAB or comma) into your SQL/MySQL table using Data Transformation Services. Takes about 10-15 minutes.

2) When the import is done, go into the design view of your new gargantuan table and start deleting the columns you don't need. Takes about 10-15 minutes.

2.5) OPTIONAL - Cluster the fields you will run your DELETE query criteria on.

3) Run a DELETE query on the crud you don't want or need.
Takes about 30-45 minutes (worst case, depending on hardware).

4) Twist open a Corona, squeeze your lime into it, and have a drink while your query runs. Takes about 3-5 beers, depending on fluid I/O.

I process ISA/Proxy logs for a living, ranging from 25 million to 250 million records in 1 GB to 4 GB text files, into SQL 2000. Anything you write in C, C#, C++, VB, VBScript or, God forbid, Java to scrub your data will take hours and hours.

Honorable mention - BCP works just as fast, but it's a little cryptic to me.
0
 

Expert Comment

by:BrokeAzz
ID: 12128677
Correction to 2.5 ...

2.5) OPTIONAL - Index the fields you will run your DELETE query criteria on. Put a clustered index on the most common field.
0
 

Author Comment

by:vuzman
ID: 12363027
I wanted to return to the question with some more answers once I had found them myself, but I've been too busy to explore this fully.

The answer gets a 'B' grade, as only the data-to-database loading has been addressed; no real strategy for reading huge files has been given by any of the responders.
0
 
LVL 6

Expert Comment

by:billtouch
ID: 12449888
I have processed large amounts of data before. I never had to go to a database, but where the data ends up is of little consequence as long as the files are on different physical drives.

Given that, your time is mostly spent in I/O wait. The thing that slows I/O down the most is when the drive has to seek from one track to another, and that takes longest when the tracks are not adjacent, which can happen when your disk is fragmented. So... start by defragmenting your disk drive, and make sure your data is on a different drive from your database. Then, to reduce the amount of I/O your program needs to perform, allocate the largest buffer you can without causing Windows to page parts of your program to disk. This can be 10K - 512K depending on the amount of free physical RAM your system has. Fill the buffer with one read, parse that buffer, then read the next. Be careful with end-of-buffer processing so you don't mess up the one packet that is split between two reads.

If you use this method, you will be getting nearly full use of your CPU. This method is faster than overlapped I/O.

The BIG cautions are:
1) DO NOT page. If you are using Win XP (or one of its predecessors), you can bring up a performance monitor; it will tell you whether you have excessive paging activity. If so, reduce the size of the buffer or the number of programs you have sitting in memory. You would be surprised what is sitting around in there waiting on some event you don't even care about.

2) End of buffer - if you are not careful you WILL drop data. The special case here is when the end of a packet falls exactly at the end of the buffer. Test thoroughly!

If you have any questions, Just ask...

Bill
0
