Solved

File parsing strategy for huge file

Posted on 2004-09-20
13
2,211 Views
Last Modified: 2013-11-13
I need to parse a huge file (>1 GB) in order to process the contents; the contents will be stored in a database so that statistics can be computed from the data. Does anyone know of a strategy to optimize the parsing (in terms of speed)?

It's a tcpdump (or similar) file consisting of a lot of captured packets. Only a fraction of the data needs to be stored: for some packet types only the headers will be stored, along with an index so the full packet can be retrieved from the file quickly. Other packet types will just be counted, while others will be ignored.

The processing that will take place after the parsing will also be time consuming, so the faster the parsing, the better. How should I go about doing this? Someone who has done this before advised me to write the data as SQL statements to another file, which is then run against the database after the parsing, but to me this seems like an unnecessary step.

The language is C++.
Question by:vuzman
13 Comments
 
LVL 45

Expert Comment

by:sunnycoder
ID: 12100051
Hi vuzman,

> It's a tcpdump (or similar) file, consisting of a lot of captured packets. Only a fraction
> of the data needs to be stored; only headers of some packet types will be stored and an index
> so the full packet can be retreived from the file quickly.

open the file
while there are packets to read
    read the next header
    if this is a packet type you want
        store the required information in the database
    get the packet length from the header
    advance the file pointer to the byte after the packet
end while
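
A minimal C++ sketch of that loop, assuming the classic libpcap capture format (a 24-byte global file header followed by a 16-byte record header per packet, in the capturing machine's byte order); scanFile and the "inspect" step are placeholders for your own logic:

#include <cstdint>
#include <cstdio>

// Per-packet record header in a classic libpcap file.
struct PcapRecordHeader {
    uint32_t ts_sec;    // timestamp, seconds
    uint32_t ts_usec;   // timestamp, microseconds
    uint32_t incl_len;  // bytes of packet data stored in the file
    uint32_t orig_len;  // original packet length on the wire
};

void scanFile(const char *path) {
    std::FILE *fp = std::fopen(path, "rb");
    if (!fp) return;
    std::fseek(fp, 24, SEEK_SET);              // skip the global file header

    PcapRecordHeader hdr;
    while (std::fread(&hdr, sizeof hdr, 1, fp) == 1) {
        long offset = std::ftell(fp);          // start of this packet's data
        // ... peek at the packet here: store header fields plus offset for the
        //     types you index, bump a counter for the ones you only count ...
        std::fseek(fp, hdr.incl_len, SEEK_CUR); // jump past the packet data
    }
    std::fclose(fp);
}

Note that fseek/ftell take a long offset, so for captures past the 2 GB mark you would need the 64-bit variants (fseeko/ftello, or _fseeki64/_ftelli64 on Windows).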

Sunnycoder
 
LVL 6

Expert Comment

by:PreachDotNet
ID: 12100075
If the line types etc. are similar, you could connect to the text file as an ODBC data source, then use a simple INSERT INTO statement against a second database, with a SELECT statement to filter out the relevant records.
 

Author Comment

by:vuzman
ID: 12100111
sunnycoder:

Yeah, that's how I'd parse a normal-size file, but for a huge file I would imagine other strategies could be better. How about reading a big chunk of the file, parsing it, uploading, then reading again, etc.? If uploading to the DB is deferred, one could combine several statements, but would this result in performance gains?

PreachDotNet:

Nice idea, but the packets in the file will not be on a single line, and I'm afraid SELECTing won't work.
 
LVL 45

Expert Comment

by:sunnycoder
ID: 12100167
You don't have to read the entire file in one go ...
Declare a buffer and read the file one buffer length at a time ... keep repeating until all the data has been processed ... something like

while (fread(buffer, 1, BUF_LEN, fp) > 0)
{
       ...
}

You need to take care of packets which lie on a buffer boundary, i.e. have only been partially read into the buffer; one way to handle that is sketched below.
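
A sketch of that carry-over handling; parsePackets is a hypothetical hook standing in for your own parser, assumed to consume only complete packets and report how many bytes it used:

#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical parser stub: the real version would walk the complete packets in
// buf[0..len) and return how many bytes it consumed, never splitting a packet.
std::size_t parsePackets(const char *buf, std::size_t len) {
    (void)buf;
    return len;  // placeholder: pretend everything was consumed
}

void processFile(const char *path) {
    const std::size_t BUF_LEN = 1 << 20;       // 1 MB chunks, tune to taste
    std::vector<char> buf(BUF_LEN);
    std::size_t leftover = 0;                  // bytes of a partial packet carried over

    std::FILE *fp = std::fopen(path, "rb");
    if (!fp) return;

    std::size_t n;
    while ((n = std::fread(&buf[0] + leftover, 1, buf.size() - leftover, fp)) > 0) {
        std::size_t total = leftover + n;
        std::size_t used  = parsePackets(&buf[0], total);
        leftover = total - used;
        // Move the unconsumed tail to the front so the next fread completes it.
        std::memmove(&buf[0], &buf[0] + used, leftover);
    }
    std::fclose(fp);
}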

Yet another alternative is to use several threads that start processing the file from different locations at once.

>If uploading to the DB is deferred one could combine several statements, but would this result in
>performance gains?
Yes and no ... You will make fewer system calls, so more time goes to processing and less to waiting. However, there might be little or no benefit in terms of disk access, since the I/O is buffered anyway. Whatever the case, it will be an improvement over the raw approach.
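
If you do defer the uploads, the bigger win usually comes from batching many rows into a single statement rather than issuing one INSERT per packet; a rough sketch (the table and column names are made up, and real code would escape the string values):

#include <cstddef>
#include <ostream>
#include <string>
#include <vector>

struct Row { std::string src, dst; long offset; };

// Emit one multi-row INSERT for a whole batch; MySQL typically executes this
// much faster than the same rows as individual INSERT statements.
void writeBatch(std::ostream &out, const std::vector<Row> &rows) {
    if (rows.empty()) return;
    out << "INSERT INTO packet_index (src, dst, file_offset) VALUES ";
    for (std::size_t i = 0; i < rows.size(); ++i) {
        if (i) out << ", ";
        out << "('" << rows[i].src << "', '" << rows[i].dst << "', " << rows[i].offset << ")";
    }
    out << ";\n";
}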
 
LVL 12

Expert Comment

by:Giant2
ID: 12101385
Most programming languages have methods to read a portion of data (100 bytes, for example) from a file. You could use those methods to read just the portion of the file you need and to reposition within it; in Java you would use RandomAccessFile (with its seek capability), and C++ file streams can do the same.
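
In C++ the analogous seek-and-read could look like this; for instance, pulling a full packet back out of the capture file later, given the offset and length stored in the index during parsing (readPacket is an illustrative name):

#include <cstddef>
#include <fstream>
#include <vector>

// Fetch one packet from the capture file using the offset and length that were
// recorded in the index while parsing.
std::vector<char> readPacket(const char *path, std::streamoff offset, std::size_t length) {
    std::vector<char> data(length);
    if (length == 0) return data;
    std::ifstream in(path, std::ios::binary);
    in.seekg(offset);                    // jump straight to the stored position
    in.read(&data[0], static_cast<std::streamsize>(length));
    return data;
}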

Hope this helps you.
Bye, Giant.
 
LVL 13

Expert Comment

by:bochgoch
ID: 12101462
If you're only going to store some of the data, avoid touching the database for as long as possible to ensure maximum speed -- the solution in your last paragraph heads in that direction, HOWEVER, don't write SQL statements to a file, as those statements (INSERTs?) will be slow to run.

My solution would be to follow sunnycoder's route, with one difference: write the required data out to a second file (comma delimited or whatever your preferred format is), then bulk copy (BCP) or similar that data into the database. On the face of it this looks like a long-winded solution, but I suggest it would be the quickest. I've managed to reduce long-running processes from 12+ hours to minutes using this approach. Ideally remove any network traffic from the equation (run the process within a discrete server/desktop environment for maximum gain). Remember, networks are slow and databases are slower than files!

(Of course, you could try sunnycoder's solution first, then if you need greater speed try this one...)
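
A sketch of the writing side of that approach, emitting one tab-delimited line per packet of interest (the field names are made up); the finished file can then be loaded in a single pass, and tab happens to be the default field separator for MySQL's LOAD DATA INFILE:

#include <cstdio>

// Append one tab-delimited record per interesting packet; the resulting file is
// later bulk loaded (BCP, LOAD DATA INFILE, ...) in one shot.
void appendRecord(std::FILE *out, long fileOffset, unsigned packetType,
                  const char *src, const char *dst) {
    std::fprintf(out, "%ld\t%u\t%s\t%s\n", fileOffset, packetType, src, dst);
}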
 

Author Comment

by:vuzman
ID: 12103931
If SQL statements aren't written right away, I might need to parse the second file... right?

I am not familiar with BCP; isn't it an MS SQL utility? I'm using MySQL.
 
LVL 13

Accepted Solution

by:
bochgoch earned 150 total points
ID: 12109767
Yep...

The MySQL equivalent of BCP is LOAD DATA INFILE ... see here for the syntax:

http://dev.mysql.com/doc/mysql/en/LOAD_DATA.html
 

Expert Comment

by:BrokeAzz
ID: 12128666
OK, here's my LazyAzz solution.....BUT WORKZ
1) Import the entire friggin flat file (assuming the delimiter is tab or comma) into your SQL/MySQL table using "Data Transformation Services". Takes about 10-15 mins.

2) When the import is done, go into the Design view of your new gargantuan table and start deleting the columns you don't need. Takes about 10-15 mins.

2.5) OPTION - Cluster fields you will run your Delete query criteria on.

3) Run a DELETE query on the crud you don't want or need. Takes about 30-45 mins (worst case, depending on hardware).

4) Twist open a Corona, squeeze your lime into it and have a drink while your query runs. Takes about 3-5 beers, depending on fluid I/O.

I process ISA/Proxy logs for a living, ranging from 25 million to 250 million records in 1 GB to 4 GB text files, into SQL 2000. Anything you write in C, C#, C++, VB, VBScript or, God forbid, Java to scrub your data will take hours and hours and hours...

Honorable mention - BCP works just as fast but is a little cryptic to me.
 

Expert Comment

by:BrokeAzz
ID: 12128677
Correction in 2.5....

2.5) OPTION - Index the fields your DELETE query criteria will run on. Make the most common field the clustered index.
 

Author Comment

by:vuzman
ID: 12363027
I wanted to return to the question with some more answers when I had gotten them myself, but I've been too busy to explore this fully.

The answer gets a 'B' grade, as only the data-to-database loading has been addressed; no real strategy for reading huge files has been given by any of the responders.
 
LVL 6

Expert Comment

by:billtouch
ID: 12449888
I have processed large amounts of data before. I never had to go to a database, but where the data ends up is of little consequence, as long as the input and output files are on different physical drives.

Given that, your time is mostly spent in I/O wait. The thing that slows I/O down the most is when the drive has to seek from one track to another, which takes longest when the tracks are not adjacent. This can happen when your disc is fragmented. So... start by defragging your disk drive. Make sure your data is on a different drive from your database. Then, to reduce the amount of I/O your program needs to perform, allocate the largest buffer you can without causing Windows to page parts of your program to disc; this can be 10K - 512K depending on the amount of free real RAM your system has (a sketch of picking the size this way follows the cautions below). Then fill the buffer with one read, parse that buffer, then read the next. Be careful of end-of-buffer processing so you don't mess up the one packet that is split between two reads.

If you use this method, you will be getting nearly full use of your CPU. This method is faster than overlapped I/O.

The BIG cautions are:
1) DO NOT page. If you are using Win XP (or one of its predecessors), you can bring up a performance monitor. It will tell you if you have excessive paging activity. If so, reduce the size of the buffer or the number of programs you have sitting in memory. You would be surprised at what is sitting around in there waiting on some event you don't even care about.

2) End of buffer - if you are not careful you WILL drop data. The special case here is when the end of the packet falls exactly at the end of the buffer. Test thoroughly!
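
A sketch of choosing the buffer size from currently free physical RAM, assuming a Windows build as discussed above; the fraction is arbitrary and the clamp just follows the 10K - 512K range mentioned in the comment:

#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <cstddef>

// Pick a read-buffer size from free physical RAM so the process does not push
// the system into paging, clamped to the 10 KB - 512 KB range suggested above.
std::size_t chooseBufferSize() {
    MEMORYSTATUSEX ms = {};
    ms.dwLength = sizeof ms;
    GlobalMemoryStatusEx(&ms);
    unsigned long long candidate = ms.ullAvailPhys / 64;  // a small fraction of free RAM
    if (candidate > 512 * 1024) candidate = 512 * 1024;   // ceiling: 512 KB
    if (candidate < 10 * 1024)  candidate = 10 * 1024;    // floor: 10 KB
    return static_cast<std::size_t>(candidate);
}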

If you have any questions, Just ask...

Bill
