Solved

Tips For Parsing Large Files

Posted on 2009-05-06
305 Views
Last Modified: 2013-11-18
Hi,
Often I have a need to parse and/or compare very large files (up to 1GB sometimes), and although I have plenty of experience doing so, I'd still like to know if there are any methods or objects I could use that would help get the job done faster. Usually I stick to VB languages, but I'm not averse to using PHP, Perl, Python, or the command line if need be. For VBScript and VBA I usually stick to using multiple Dictionary objects, or a dictionary of dictionaries, as I've found this helps to store and compare large tables of data with a drastically reduced amount of processing time. For VB.Net I use a Hashtable... Though still, sometimes these processes take hours to run. Are there any other objects I should consider using, or are these the top performers and I'm not likely to find anything better to use?
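To give a clearer picture of the pattern I mean, here's a minimal sketch in Python (the file names and column layout are made up purely for illustration):

import csv

# Load a table keyed on its first column so each row can be looked up
# in constant time instead of re-scanning the file for every comparison.
def load_table(path):
    table = {}
    with open(path, newline='') as f:
        for row in csv.reader(f):
            table[row[0]] = dict(enumerate(row[1:]))  # inner dict: column index -> value
    return table

old = load_table('old_extract.csv')   # hypothetical file names
new = load_table('new_extract.csv')

# Keys present in both tables whose rows differ.
changed = [k for k in old if k in new and old[k] != new[k]]
print(len(changed), 'rows differ')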

Thanks,
Josh
Question by:jdannemann
10 Comments
 
LVL 41

Expert Comment

by:HonorGod
How complex is the parsing?
Both Python and Perl have very good text processing capabilities.
I personally prefer Python for its readability.
Both Python and Perl are quite robust, and able to load in large quantities of data, should you need to do that.

Can you provide more details about the kind of data contained in the files?

One thing I do know: I had an involved algorithm that took my Python script hours to process... I converted it to Java and was astonished that it ran in seconds. I was blown away by the difference in processing speed.
 
LVL 1

Author Comment

by:jdannemann
Hey HonorGod,

I didn't actually provide specifics as to what kind of parsing, mainly because my question was aimed at parsing large quantities of data in general. Often, though, I need not just to parse text but also to calculate averages or sums based on one criterion or another.

Anyway, thanks for your input. I hadn't previously thought of using Java for this because just about the only time I use it is in web site development. I find it surprising it actually works better than Python.

Unfortunately, though, I haven't a clue how to use Java in that context. :(
 
LVL 41

Expert Comment

by:HonorGod
Ah, an opportunity to learn!  I love it.  :-)
 
LVL 1

Author Comment

by:jdannemann
Anyone else have any suggestions? I'm currently working on what could be described as the VLOOKUP from hell. I have to take a list of 200,000 partial URLs and find each one in another list of 500,000 rows. Dictionaries work fine for storing the data, but calling the InStr function for the comparisons slows everything down so much that I had the script running for five days and it still didn't finish. Wouldn't it be fantastic if there were a hashtable or dictionary that came equipped with a built-in comparison?
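Just to put the scale in perspective, here's roughly what that loop amounts to, sketched in Python, plus one way the inner loop could be shrunk if every partial URL at least includes the hostname (that's an assumption about the data; the bucketing key would have to be whatever the two lists actually share):

from urllib.parse import urlparse
from collections import defaultdict

partials = [line.strip() for line in open('partials.txt')]   # ~200,000 partial URLs (hypothetical file)
fulls    = [line.strip() for line in open('fulls.txt')]      # ~500,000 full URLs (hypothetical file)

# Brute force: 200,000 x 500,000 = 100 billion substring tests.
# matches = [(p, f) for p in partials for f in fulls if p in f]

# Bucket the full rows by hostname so each partial is only compared
# against rows sharing its host.
by_host = defaultdict(list)
for f in fulls:
    by_host[urlparse(f).netloc].append(f)

matches = []
for p in partials:
    host = urlparse(p if '//' in p else '//' + p).netloc
    for f in by_host.get(host, ()):
        if p in f:
            matches.append((p, f))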
 
LVL 16

Expert Comment

by:JohnBPrice
If you are dealing with 200,000 to 500,000 rows, why not use a database? SQL Express, for example. Doing a find in 500,000 rows (with an index) is lightning fast, and nearly no coding is required.
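For example (sketched here with SQLite from Python purely as a stand-in for SQL Express; the table and file names are made up):

import sqlite3

con = sqlite3.connect('urls.db')   # hypothetical database file
con.execute('CREATE TABLE IF NOT EXISTS urls (url TEXT)')
con.executemany('INSERT INTO urls VALUES (?)',
                ((line.strip(),) for line in open('fulls.txt')))   # hypothetical input file
con.execute('CREATE INDEX IF NOT EXISTS idx_url ON urls(url)')
con.commit()

# An exact-match lookup walks the B-tree index and stays fast even over
# 500,000 rows:
hit = con.execute('SELECT 1 FROM urls WHERE url = ?',
                  ('http://example.com/some/page',)).fetchone()

# A contains-style match (LIKE '%fragment%') cannot use the index and has to
# scan every row, which is where substring searches get slow again.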
 
LVL 27

Accepted Solution

by:BigRat (earned 250 total points)
The contents of the file play an important role. With a file of several hundred thousand partial URLs I suspect there are a large number of duplicates. To compare against a list of known URLs I'd first sort the known URLs (keep them permanently sorted) and sort the incoming data with a standard sorting program (there are some VERY fast ones on the market). Then, as you read linearly through the partial URLs, you read linearly through the dictionary in step with them.
Having the data on separate disks saves a great deal of head movement and latency. In fact, for a function like process(infile, dictfile) => outfile, I'd try to have three disks to keep head movement to a minimum. Having the OS pre-fetch pages (which you can switch on when you process files serially) will also speed things up. Furthermore, the maximum amount of memory that can be installed in the computer should be installed.
Remember that 500,000 URLs at 30 characters each is only 15MB, so a Perl hash would work quite well if each entry is read more than once (non-sparse access). If the access is sparse, it might be slower to read the data into memory than to leave it on disk.
When loading a flat file into a database whose access is NOT going to be on the primary key (because there isn't one, or because it's duplicated), sort the data according to the access key before loading, since databases don't cluster on secondary keys.
If you have to use databases, try using a RAID-5 disk system with at least 100MB of cache in the RAID controller. The writes are slow but the reads are very, very fast.
Without knowing exactly what data is to be processed it is difficult to give specifics, but that advice comes from a shop whose customers have 20-30 database tables of 5 million rows and upwards. We do almost all of our bulk processing with flat files.
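A rough sketch of the sort-then-merge idea in Python (this assumes the partial URLs can be normalized to the same exact key the dictionary is sorted on; for files too large for memory, the sorting would be done once on disk with an external sort utility):

def merge_join(sorted_partials, sorted_fulls):
    # Walk two sorted lists in lockstep; each list is read exactly once.
    matches = []
    i = j = 0
    while i < len(sorted_partials) and j < len(sorted_fulls):
        p, f = sorted_partials[i], sorted_fulls[j]
        if p == f:
            matches.append(p)
            i += 1
        elif p < f:
            i += 1      # partial not present in the dictionary; move on
        else:
            j += 1      # advance the dictionary side
    return matches

# Hypothetical input files; de-duplicating the partials shrinks the work
# if, as suspected, there are many duplicates.
partials = sorted(set(line.strip() for line in open('partials.txt')))
fulls    = sorted(line.strip() for line in open('fulls.txt'))
print(len(merge_join(partials, fulls)), 'matches')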
 
LVL 1

Author Comment

by:jdannemann
Hey JohnBPrice,

My boss thought of that too, but to my surprise, using a database took twice as long, not to mention that there are a great many more steps just to get the data into the database. Thanks for the suggestion, though. I was surprised to see it didn't work as well that way.

-Josh
 
LVL 1

Author Closing Comment

by:jdannemann
Hey BigRat, you rock! I really hadn't thought about that. I took your suggestion, created a database, stored it on one of our public drives that uses RAID, connected the database to a server on a different RAID drive, and ran a script from there to query the database. The whole process finished in twenty minutes!

Thanks!
-Josh
 
LVL 16

Expert Comment

by:JohnBPrice
Well, a database has to be configured correctly, with the indexes and whatnot. It also depends on your algorithm: databases are great at lookups (essentially B-trees) and searching, but not so great for one-pass operations.
 
LVL 27

Expert Comment

by:BigRat
JohnBPrice: The main problem with using a database is the choice of keys. If the key is not a clustering key (and they are not always linear in access anyway) you have the problem of switching between the index and the data. This causes disk head movement and takes a lot of time, particularly if the movement is so large that you lose the track index. Then you have to wait out the rotational latency to resync with the data.

Using a RAID system, there are several disks involved with the data duplicated on each disk, and the effect is that one disk reads indexes while another reads data. The performance improvement is dramatic - which is what the questioner has experienced.

RAID controllers and disks are so cheap, I don't understand why people who do production database work don't always use them.
