Solved

File parsing strategy for huge file

Posted on 2004-09-20
13
Medium Priority
2,217 Views
Last Modified: 2013-11-13
I need to parse a huge file (>1 GB) in order to process the contents; the contents will be stored in a database so statistics can be generated from the data. Does anyone know of a strategy to optimize the parsing (in terms of speed)?

It's a tcpdump (or similar) file, consisting of a lot of captured packets. Only a fraction of the data needs to be stored; only the headers of some packet types will be stored, along with an index so the full packet can be retrieved from the file quickly. Other packet types will just be counted, while others will be ignored.

The processing that will take place after the parsing will also be time-consuming, so the faster the parsing, the better. How should I go about doing this? From someone who has done this before, I have gotten the advice to write the data as SQL statements to another file, which after the parsing gets run against the database, but to me this seems like an unnecessary step.

The language is C++
0
Comment
Question by:vuzman
13 Comments
 
LVL 45

Expert Comment

by:sunnycoder
ID: 12100051
Hi vuzman,

> It's a tcpdump (or similar) file, consisting of a lot of captured packets. Only a fraction
> of the data needs to be stored; only headers of some packet types will be stored and an index
> so the full packet can be retreived from the file quickly.

Open the file
While there are packets to read
    Read next header
    If this is your type
        Store required information in database
    Get packet length from header
    Advance file pointer to byte after the packet
End while
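In C++ that might look roughly like the following. This is only a sketch, assuming a pcap-style layout (a 24-byte global file header followed by per-packet records, each with a 16-byte record header carrying the captured length); the file name is made up, and packet classification and the database insert are left out:

// Rough sketch, assuming a pcap-style capture file; adapt the record layout
// to the actual format. For files larger than 2 GB you may need 64-bit
// offsets (fseeko / _fseeki64) instead of long.
#include <cstdint>
#include <cstdio>

struct RecordHeader {            // assumed pcap-style per-packet header
    std::uint32_t ts_sec;
    std::uint32_t ts_usec;
    std::uint32_t incl_len;      // bytes actually captured for this packet
    std::uint32_t orig_len;
};

int main()
{
    std::FILE *fp = std::fopen("capture.dump", "rb");
    if (!fp)
        return 1;

    std::fseek(fp, 24, SEEK_SET);             // skip the global file header

    RecordHeader hdr;
    while (std::fread(&hdr, sizeof hdr, 1, fp) == 1) {
        long packet_offset = std::ftell(fp);  // index so the packet can be re-read later

        // Peek at just enough of the packet to classify it (link-layer / IP
        // header), then store, count, or ignore it -- omitted here.

        // Skip the rest of the packet without reading it.
        std::fseek(fp, packet_offset + static_cast<long>(hdr.incl_len), SEEK_SET);
    }

    std::fclose(fp);
    return 0;
}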

Sunnycoder
0
 
LVL 6

Expert Comment

by:PreachDotNet
ID: 12100075
If the line types etc. are similar, you could connect to the text file as an ODBC data source, then use a simple INSERT INTO statement against a second database, with a SELECT statement to filter the relevant records.
0
 

Author Comment

by:vuzman
ID: 12100111
sunnycoder:

yeah, that's how I'd parse a normal-size file, but for a huge file I would imagine other strategies could be better. How about reading a big chunk of the file, parsing it, uploading, then reading again, etc.? If uploading to the DB is deferred, one could combine several statements, but would this result in performance gains?

PreachDotNet:

Nice idea, but the packets in the file will not be on a single line, and I'm afraid SELECT'ing won't work.
0
 
LVL 45

Expert Comment

by:sunnycoder
ID: 12100167
You don't have to read the entire file in one go ...
Declare a buffer and read the file one buffer length at a time ... Keep repeating until all the data has been processed ... something like

size_t n;
while ( (n = fread (buffer, 1, BUF_LEN, fp)) > 0 )
{
       /* process the n bytes just read */
}

You need to take care of packets which lie on a buffer boundary, i.e. have been partially read into the buffer.
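Something along these lines, as a rough sketch - parse_one() here is a placeholder for your own packet parser, not a real function, and the buffer size is just an example:

// Sketch: buffered reading with carry-over of a partial packet across reads.
#include <cstdio>
#include <vector>

// Hypothetical parser: consumes one complete packet from data[0..len) and
// returns the number of bytes it used, or 0 if the packet is only partially
// present so the caller should read more data first.
std::size_t parse_one(const char *data, std::size_t len);

void parse_file(std::FILE *fp)
{
    const std::size_t BUF_LEN = 1 << 20;     // read 1 MB at a time (tune to taste)
    std::vector<char> chunk(BUF_LEN);
    std::vector<char> pending;               // leftover bytes of a split packet

    std::size_t n;
    while ((n = std::fread(chunk.data(), 1, BUF_LEN, fp)) > 0) {
        pending.insert(pending.end(), chunk.begin(), chunk.begin() + n);

        std::size_t consumed = 0;
        for (;;) {
            std::size_t used = parse_one(pending.data() + consumed,
                                         pending.size() - consumed);
            if (used == 0)                   // partial packet: wait for more data
                break;
            consumed += used;
        }

        // Drop the fully parsed bytes; whatever is left is the start of a
        // packet that straddles the buffer boundary.
        pending.erase(pending.begin(), pending.begin() + consumed);
    }
}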

Yet another alternative is to use several threads that start processing the file from different locations at once.

>If uploading to the DB is deferred one could combine several statements, but would this result in
>performance gains?
Yes and no ... You will make fewer system calls, so more time is spent processing and less time waiting. However, there may be little or no benefit in terms of disk access, since I/O is buffered anyway. Whatever the case, it will be an improvement over the raw approach.
0
 
LVL 12

Expert Comment

by:Giant2
ID: 12101385
Most programming languages have some instruction to read a portion of data (100 bytes, for example) from a file.
You could use those instructions/methods to read just the portion of the file you need and to reposition within it. You could use RandomAccessFile (with seek capabilities).
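In C++ the same idea might look like this rough sketch - it assumes the byte offset and length of the packet were saved during parsing, which is exactly the index mentioned in the question:

// Sketch: random access back into the capture file using a stored offset
// and length (both assumed to come from the index built while parsing).
#include <fstream>
#include <vector>

std::vector<char> read_packet(const char *path,
                              std::streamoff offset, std::size_t length)
{
    std::ifstream in(path, std::ios::binary);
    in.seekg(offset);                        // jump straight to the packet
    std::vector<char> packet(length);
    in.read(packet.data(), static_cast<std::streamsize>(length));
    return packet;
}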

Hope this helps you.
Bye, Giant.
0
 
LVL 13

Expert Comment

by:bochgoch
ID: 12101462
If you're only going to store some of the data, avoid touching the database for as long as possible to ensure maximum speed -- the solution in your last paragraph goes in that direction. HOWEVER, don't write SQL statements to a file, as those statements (INSERTs?) will be slow to run.

My solution would be to follow sunnycoder's route, with one difference - write the required data out to a second file (comma-delimited or whatever your preferred format is), then bulk copy (BCP) or similar to load this data into the database. On the face of it this looks like a long-winded solution, but I suggest it would be the quickest. I've managed to reduce long-running processes from 12+ hours to minutes using this approach. Ideally remove any network traffic from the equation (run the process within a discrete server/desktop environment for maximum gain). Remember networks are slow and databases are slower than files!!!!!
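As a rough sketch of that intermediate step, something like the following could stream one delimited row per stored header; the field names here are only examples of what might be kept:

// Sketch: write extracted header fields to a delimited file that can later
// be bulk-loaded into the database. Field names are illustrative only.
#include <cstdint>
#include <fstream>

struct StoredHeader {              // hypothetical subset of what gets kept
    std::uint32_t ts_sec;
    std::uint16_t src_port;
    std::uint16_t dst_port;
    long          file_offset;     // index back into the capture file
};

void append_row(std::ofstream &out, const StoredHeader &h)
{
    out << h.ts_sec << ',' << h.src_port << ',' << h.dst_port << ','
        << h.file_offset << '\n';
}

Open the output stream once and keep it open for the whole run, so a single pass over the capture produces the complete load file.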

(of course you could try sunnycoder's solution first, then if you need greater speed try out this one...)
0
 

Author Comment

by:vuzman
ID: 12103931
If SQL statements aren't written right away, I might need to parse the second file... right?

I am not familiar with bcp; isn't it an MS SQL utility? I'm using MySQL.
0
 
LVL 13

Accepted Solution

by:
bochgoch earned 450 total points
ID: 12109767
Yepp...

The MySQL equivalent to a BCP is LOAD DATA INFILE...see here for syntax:

http://dev.mysql.com/doc/mysql/en/LOAD_DATA.html
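For example, issued from C++ through the MySQL C API - a rough sketch only, with made-up file, table and column names, and minimal error handling:

// Sketch: bulk-load the delimited file produced during parsing into MySQL.
#include <mysql.h>
#include <cstdio>

int bulk_load(MYSQL *conn)
{
    const char *sql =
        "LOAD DATA LOCAL INFILE 'headers.csv' "
        "INTO TABLE packet_headers "
        "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' "
        "(ts_sec, src_port, dst_port, file_offset)";

    if (mysql_query(conn, sql) != 0) {
        std::fprintf(stderr, "LOAD DATA failed: %s\n", mysql_error(conn));
        return 1;
    }
    return 0;
}

Note that LOCAL requires local infile loading to be allowed on the client and server; otherwise drop LOCAL and put the file somewhere the server can read it.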
0
 

Expert Comment

by:BrokeAzz
ID: 12128666
OK, here's my LazyAzz solution.....BUT WORKZ
1) Import the entire friggin flat file (assuming the delimiter is TAB or comma) into your SQL/MySQL table using "Data Transformation Services". Takes about 10-15 mins

2) When the import is done, go into the Design view of your new gargantuan table and start deleting the columns you don't need. Takes about 10-15 mins

2.5) OPTION - Cluster fields you will run your Delete query criteria on.

3) Run a Delete query on the crud you don't want or need.
Takes about 30-45mins (WORST CASE Depending on Hardware)

4) Twist open a Corona, squeeze your lime into it and have a drink while your query runs. Takes about 3-5 beers, depending on fluid I/O.

I process ISA/Proxy logs for a living, ranging from 25 million to 250 million records in 1-gig to 4-gig text files, into SQL 2000. Anything you write in C, C#, C++, VB, VBScript or, God forbid, Java to scrub your data will take hours and hours and hours and hours........

Honorable mention - BCP works just as fast but is a little cryptic to me.
0
 

Expert Comment

by:BrokeAzz
ID: 12128677
Correction in 2.5....

2.5) OPTION - Index fields you will run your Delete query criteria on. Cluster Index the most Common field.
0
 

Author Comment

by:vuzman
ID: 12363027
I wanted to return to the question with some more answers when I had gotten them myself, but I've been too busy to explore this fully.

The answer gets a 'B' grade as only the data-to-database loading has been addressed; no real strategy for reading huge files has been given by any of the responders.
0
 
LVL 6

Expert Comment

by:billtouch
ID: 12449888
I have processed large amounts of data before. I never had to go to a database, but where the data ends up is of little consequence as long as the files are on different physical drives.

Given that, your time is mostly spent in I/O wait. The thing that slows I/O down the most is when the drive has to seek from one track to another. That takes the longest when the tracks are not adjacent, which can happen when your disk is fragmented. So... start by defragging your disk drive. Make sure your data is on a different drive from your database. Then, to reduce the amount of I/O your program needs to perform, allocate the largest buffer you can without causing Windows to page parts of your program to disk. This can be 10K - 512K depending on the amount of free real RAM your system has. Then fill the buffer with one read. Parse that buffer, then read the next. Be careful with end-of-buffer processing so you don't mess up the one packet that is split between two reads.

If you use this method, you will be getting nearly full use of your CPU. This method is faster than overlapped I/O.

The BIG cautions are:
1) DO NOT page. If you are using Win XP (or one of its predecessors), you can bring up a performance monitor. It will tell you if you have excessive paging activity. If so, reduce the size of the buffer or the number of programs you have sitting in memory. You would be surprised what is sitting around in there waiting on some event you don't even care about.

2) End of buffer - if you are not careful you WILL drop data. The special case here is when the end of the packet is exactly at the end of the buffer. Test thoroughly!

If you have any questions, Just ask...

Bill
0
