
Efficient parsing from gz files

Continuing our discussion from thread:
http://www.experts-exchange.com/Programming/Languages/CPP/Q_27840974.html#a38330299

If you are a new participant, please read the above thread.
 
farzanj (Author) commented:
@hmccurdy
@evilrix: Oh, that clears up my misconception. I was under the impression that fgets read one full buffer at a time. So, sorry, let me ask again: is the advantage of fgets over getline that it is closer to the standard, or something else? And is there a C++ way to handle popen?

@evilrix:
I have one producer thread because I cannot start reading a gz file from the middle, so I am bound to read it sequentially. When the input was a flat file, I could divide it very successfully and it ran extremely fast.
I want multiple consumer threads because I am writing for real multi-processor machines and I want the threads to be mapped to the processors, so it is not pseudo-multiprocessing. The least powerful machine I have has 16 cores. There are some with about 200 cores.
 
evilrix (Senior Software Engineer, Avast) commented:
>> is the advantage of fgets over getline that it is closer to the standard, or something else?
Yes, that's about the size of it. fgets is part of the C standard, so your code will be more portable. Feel free to use getline if portability is not a concern, but its advantages over fgets are probably minimal.

>> Is there a C++ way to handle popen?
Not that I know of. You could, however, implement a Boost.Iostreams source to provide stream semantics. You could also implement your own streambuf that wraps the popen stream and instantiate a standard istream with that. If this is a one-off case, it's probably not worth doing either, TBH.
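
For illustration, here is a minimal sketch of that streambuf idea (my own code, not from this thread; it assumes a POSIX system, and "zcat logs.gz" is a hypothetical command):

#include <cstdio>
#include <istream>
#include <stdexcept>
#include <streambuf>

// Minimal read-only streambuf over a popen()ed pipe (POSIX only).
class pipebuf : public std::streambuf {
public:
    explicit pipebuf(const char* cmd) : fp_(popen(cmd, "r")) {
        if (!fp_) throw std::runtime_error("popen failed");
        setg(buf_, buf_, buf_);             // empty get area; underflow() refills
    }
    ~pipebuf() override { if (fp_) pclose(fp_); }

protected:
    int_type underflow() override {         // called when the get area is empty
        size_t n = fread(buf_, 1, sizeof buf_, fp_);
        if (n == 0) return traits_type::eof();
        setg(buf_, buf_, buf_ + n);
        return traits_type::to_int_type(*gptr());
    }

private:
    FILE* fp_;
    char buf_[64 * 1024];
};

// Usage: wrap the pipe and read it with ordinary stream semantics.
// pipebuf pb("zcat logs.gz");
// std::istream in(&pb);
// for (std::string line; std::getline(in, line); ) { /* parse line */ }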

>> I want threads to be mapped to the processors
If it's multiple consumers you have, that should be fine as long as you don't exceed the rule of thumb of two threads per CPU (at least, not without benchmarking and profiling). You'll probably find that more than one producer thread (the one that's reading from the file) makes no appreciable difference; again, profiling is the only way to be sure, and if performance really is that critical I would certainly recommend you do it, as different hardware can (and will) produce different results. The recommendations I make are all rules of thumb that should be considered best practice in the absence of profiling metrics that say otherwise. I hope that makes sense.
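
To make that shape concrete, here is a rough single-producer, multiple-consumer sketch using a plain mutex-and-condition-variable queue (my own illustration; the thread count and the dummy producer loop are placeholders to be replaced and profiled):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// One producer pushes lines; N consumers pop and parse them.
class LineQueue {
public:
    void push(std::string line) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(line)); }
        cv_.notify_one();
    }
    void close() {                           // producer signals end of input
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    bool pop(std::string& out) {             // returns false once closed and drained
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> q_;
    bool closed_ = false;
};

int main() {
    LineQueue q;
    std::thread producer([&] {               // the sequential gz reader goes here
        for (int i = 0; i < 1000; ++i) q.push("log line " + std::to_string(i));
        q.close();
    });
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;                       // fallback when the count is unknown
    std::vector<std::thread> consumers;
    for (unsigned i = 0; i < n; ++i)
        consumers.emplace_back([&] {
            std::string line;
            while (q.pop(line)) { /* parse line here */ }
        });
    producer.join();
    for (auto& t : consumers) t.join();
}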

>> There are some with about 200 cores.
Nice. They sound like ideal Quake servers :)
 
Hugh McCurdy commented:
farzanj, when you read a line, do you want to keep the \n that is at the end of the string or not? If you want to keep it, then I suggest fgets(). If you want to lose it and you are hyper-worried about speed, perhaps getline() is better (sigh).

The reason is that getline() returns the number of characters in the string, while fgets() returns a pointer to the string. Finding the \n from getline() is quick (you already know how long the string is). With fgets() you'd have to use strlen().
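
For example (a quick sketch assuming POSIX getline; fp is an already-open FILE*):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

void demo(FILE* fp) {
    // getline(): the length comes back for free, so the strip is O(1).
    char* line = NULL;
    size_t cap = 0;
    ssize_t len = getline(&line, &cap, fp);
    if (len > 0 && line[len - 1] == '\n')
        line[len - 1] = '\0';

    // fgets(): must pay for a strlen() scan first.
    char buf[4096];
    if (fgets(buf, sizeof buf, fp)) {
        size_t n = strlen(buf);
        if (n > 0 && buf[n - 1] == '\n')
            buf[n - 1] = '\0';
    }
    free(line);
}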

However, one thing I have learned is that programmers are lousy predictors of efficiency. As evilrix suggested, I'd write it a couple of different ways and then use a profiler.

I'm assuming you are processing a large amount of data and you are in a hurry. I don't know your average string length. (Still, I'd be guessing, and if speed matters, it's better to ask a profiler than a programmer.)
 
farzanj (Author) commented:
Yes, I am hyper-worried about speed. The problem my company was facing was that they were not able to process a day's worth of logs in one day of running. So I made it fast enough that they could manage.

I don't need the newlines because I will only be parsing and extracting the important parts of the data.

Lines can be very long. I tried 1024 bytes but it failed. Now I am going with 4 KB for the line buffer and hope it will not fail. Sizes vary a lot.
 
evilrix (Senior Software Engineer, Avast) commented:
If you are using getline you don't need to worry about a fixed-size buffer (in fact, the documentation states that the buffer MUST be allocated with malloc), provided you reuse the same buffer between calls. The getline function uses malloc and realloc to grow the buffer to the size you need. It will automatically expand the buffer until it is at a size that can accommodate all your strings.

e.g.:

#include <stdio.h>   /* getline() is POSIX; fp is an already-open FILE* */
#include <stdlib.h>

char *line = (char *) malloc(1024);  /* initial buffer; getline may realloc it */
size_t n = 1024;                     /* current buffer capacity */
ssize_t len = 0;                     /* length of the line just read */
while ((len = getline(&line, &n, fp)) >= 0)
{
   /* process line (len characters, including any trailing '\n') */
}
free(line);                          /* free once, after the last call */

 
Hugh McCurdy commented:
Yes, getline() will use malloc() as needed. I would guess that the algorithm is efficient but perhaps not tailored to your problem. It might be more efficient to write your own. (If not now, perhaps eventually.) It might not be more efficient, either. That's what a profiler is for.

If speed is hyper-important, why is the file compressed? (I'm aware that one good reason is that your bottleneck is reading from the disk, and compression means less reading from the disk.) Clearly you have a huge amount of data. Still, as speed is hyper-important, perhaps you should have two programs, one that reads the compressed files and one that reads uncompressed files, and have them race. (Profiling.)
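
For reference, the compressed-file reader in that race could use zlib's gzgets instead of going through popen (a sketch assuming zlib is available; "logs.gz" is a hypothetical file name):

#include <stdio.h>
#include <zlib.h>

int main() {
    // Decompress in-process rather than shelling out to zcat.
    gzFile gz = gzopen("logs.gz", "rb");
    if (!gz) { fprintf(stderr, "gzopen failed\n"); return 1; }

    char buf[8192];                   // fixed line buffer, fgets-style
    while (gzgets(gz, buf, sizeof buf)) {
        // process one line (or a partial line, if longer than the buffer)
    }
    gzclose(gz);
    return 0;
}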

I know you want to learn C++, but I wonder if it would be as fast as a well-written C program.
 
farzanj (Author) commented:
I am working at one of the world's major telecom companies. Network traffic creates hundreds, if not thousands, of categories of log files. Each category's volume is about half a petabyte a day. Storing this massive amount of data creates huge problems. The data is captured using rsyslog (a logging daemon), compressed with maximum compression (like gzip -9), and stored on filers. The logs then have to be parsed, and the useful information extracted for network quality and many other purposes and put into database tables, data warehouses, NoSQL stores, etc.

I had remodeled and reprogrammed, in Perl, a parsing app written by a student of a prestigious CS program, increasing both the accuracy of the information and the speed in my first attempt. Then I kept increasing the speed one way or another. My latest parser was a very plain, structured C++ program, which increased speed yet again. Yes, I want to learn C++, though up until 2002 I used to be fluent in it; maybe nowhere close to you guys. My parsing was initially based on regular expressions with look-arounds, which ran pretty fast, and since the grammar of the logs is sometimes complicated, that worked well. Because I was comfortable with regex, I was able to produce a solution quickly. Then I switched to plain C++ string operations. My Perl programs were multithreaded, but threads have huge overhead in Perl. I coded in Python as well, and it was faster than Perl.

I studied some C versus C++ discussions online, and they mostly concluded that C has no speed advantage over C++. It is a little counter-intuitive, but I can profile it.
 
Hugh McCurdy commented:
So it sounds like you are rusty with C++ rather than trying to learn it for the first time. It is also obvious you are interested in speed.

It may be true that C has no speed advantage over C++, but that leads to the question of why the Linux kernel is written in C. I have two other anecdotal experiences that suggest C might be faster than C++ when highly optimized.

1. I was recruited for a position at a major company that needs highly optimized (for speed) C code. The work would involve the stock market. The position would require that I move, and that's currently "off the table." Still, the company is rather successful.

2. I applied for a job, which I didn't get, writing software that runs simulations for physics research. Simulating objects seems about as object-oriented as you can possibly get, yet because they need speed, they use C.

I'm not saying that highly optimized C would beat C++, because I don't have direct experience, but I have encountered indirect evidence that some smart people think so.

Now on to something positive. Have you read Effective C++ and More Effective C++ by Scott Meyers? If not, considering what you are trying to accomplish, I recommend those books. If you haven't read either, read Effective C++ first; it's more intermediate. More Effective C++ is advanced. Neither is introductory (not even close). In fact, for some topics, if you are weak, he directs you back to your "favorite introductory C++ book."
 
Hugh McCurdy commented:
Oh, thanks for answering the question of why the files are compressed. I thought it might be something like that, but I think it's good to ask.

Since the file read must be sequential, I suggest considering writing your own getline()-type function once the rest of the program is working. I don't know how getline() manages memory, but you might be able to do better, since you have some idea of what your data looks like.

Disclaimers: programmers are horrible at predicting bottlenecks.  OTOH, algorithm classes are taught in college...

Let's say getline()'s authors were thinking of 80-character lines with a little room for overflow. The first call to malloc() could request 128 bytes. If that's not enough, double it. But the doubling requires a copy. And if 256 isn't enough, double that. If 99.9% of your lines are under 128 characters and every now and then you have a 200-character line, this is efficient. If, however, you have many 6000-character lines, all this copying is inefficient. Perhaps you should start with an 8192-byte buffer instead.

If you write your own getline(), you can change the memory-allocation parameters and profile execution to find what runs fastest.
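
A minimal sketch of that idea (my own illustration; INITIAL_CAP and GROWTH are the knobs to tune while profiling, and the character-at-a-time read is only to keep the sketch short):

#include <stdio.h>
#include <stdlib.h>

enum { INITIAL_CAP = 8192, GROWTH = 2 };   /* the allocation parameters to tune */

/* getline()-style reader: returns line length, or -1 at EOF/error. */
long my_getline(char **buf, size_t *cap, FILE *fp) {
    if (*buf == NULL) {
        *cap = INITIAL_CAP;
        *buf = (char *) malloc(*cap);
        if (*buf == NULL) return -1;
    }
    size_t len = 0;
    int c = EOF;
    while ((c = fgetc(fp)) != EOF) {
        if (len + 2 > *cap) {                  /* room for c plus '\0' */
            size_t ncap = *cap * GROWTH;
            char *nbuf = (char *) realloc(*buf, ncap);
            if (nbuf == NULL) return -1;
            *buf = nbuf;
            *cap = ncap;
        }
        (*buf)[len++] = (char) c;
        if (c == '\n') break;                  /* keep the '\n', like getline() */
    }
    (*buf)[len] = '\0';
    return (len == 0 && c == EOF) ? -1 : (long) len;
}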

In closing, programmers are notoriously bad at predicting bottlenecks. Scott Meyers writes about this in one of those books. (I'll find out which one if you want.) Rewriting getline() might not help you. But it might.
 
farzanj (Author) commented:
I want to keep going with your insightful ideas. This is pure gold for me, and with your help I hope to shake off my rust real fast.
 
Hugh McCurdy commented:
Glad we could help.