Solved

reading faulty gzip data line by line

Posted on 2014-01-17
231 Views
Last Modified: 2014-09-29
At my work we have massive amounts of data in gzip files with corrupted headers.

gzip -l ewifi.log-pb-inet2111.rly2.company-20140116-anclogging2102.txt.gz
         compressed        uncompressed  ratio uncompressed_name
          210974992               83353 -253010.2% ewifi.log-pb-inet2111.rly2.company-20140116-anclogging2102.txt



This data has to be read line by line using both C++ and Java.

I have tried this before and asked about it here before as well. Any library that relies on the header information won't work.

The Linux utility gunzip works, but the other libraries I tried would only read the first few hundred KB and then quit.

In the past I always did it in C by opening a pipe (popen) to the gunzip -c command. I wonder if you could use streams to read it line by line or something.
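For reference, this is roughly what that popen approach looks like (a sketch from memory; the file name and buffer size are just examples):

// readgz.cpp - read gunzip's output line by line over a pipe (POSIX popen)
#include <stdio.h>

int main()
{
    // spawn gunzip and read its stdout through the pipe
    FILE *fp = popen("gunzip -c ewifi.log.gz", "r");
    if (!fp) { perror("popen"); return 1; }

    char buf[65536];    // assumes no log line exceeds 64 KB
    while (fgets(buf, sizeof buf, fp) != NULL)
    {
        // each iteration delivers one '\n'-terminated line; parse it here
    }
    pclose(fp);
    return 0;
}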
Question by:farzanj
45 Comments
 
LVL 86

Expert Comment

by:CEHJ
Yes, you could read it from a pipe using Java. There's definitely something wrong there though - the compressed size seems to be larger than uncompressed ;)

mkfifo log.gz.pipe
zcat log.gz >log.gz.pipe


(Read from log.gz.pipe using normal Java IO)

LVL 31

Author Comment

by:farzanj
Also notice the negative sign on the ratio.

Could you please give me some code? I have never done pipes in Java. And do you think a pipe is the only way out?

LVL 86

Expert Comment

by:CEHJ
I don't think it's the only way out but it's certainly a good one if you're sure that the libgzip utils work.

You don't need code - just imagine you're reading a text file in Java. It's the same.

LVL 31

Author Comment

by:farzanj
Well, this isn't even a pipe. You simply uncompressed the file and then you are reading it.

This will involve shell commands as well, which is clearly not acceptable for my work requirements.

Anyone, please help. Is there any library, for both Java and C++?

LVL 86

Expert Comment

by:CEHJ
"Well, this isn't even a pipe"
What makes you think that?

LVL 31

Author Comment

by:farzanj
A pipe is a fundamental IPC concept in Unix/Linux, and it has been borrowed from Unix by other OSes as well. It is a system call and an OS-level facility.

It works just the way you open a pipe in the C language; it is one of the oldest ways for two processes to communicate. In your suggestion, the two processes never talk to one another.

Here's an intro to it.
http://www.tldp.org/LDP/lpg/node7.html

Java appears to have pipes as well, but I haven't used them.
http://tutorials.jenkov.com/java-nio/pipe.html

LVL 86

Expert Comment

by:CEHJ
From the article you posted: http://www.tldp.org/LDP/lpg/node17.html#SECTION00732000000000000000

Notice any similarities between that and the code I posted?
http://linux.die.net/man/1/mkfifo

LVL 31

Author Comment

by:farzanj
Those are named pipes, and I am not talking about those either. I am talking about plain old pipes.

I don't even like named pipes, because I would have to create a file on the filesystem.

Sorry to point you to a whole book. There are many sections you can look at, for example
http://www.tldp.org/LDP/lpg/node12.html#SECTION00723000000000000000

Sorry about the confusion: I am dealing with huge files and need a clean solution, with no extra files created for reading.

Yes, it is a pipe, and my comment that it wasn't a pipe was incorrect, but those are named pipes.

LVL 86

Assisted Solution

by:CEHJ
CEHJ earned 72 total points
I'm well aware of the fact that a named pipe is different from piping between processes.

You mentioned "I wonder if you could use streams to read it line by line or something", and I've shown you a way to do that using a (named) pipe.

You could also read the stream on stdin in Java:

zcat log.gz | java YourApp



Alternatively, start that using Runtime.exec and read the process's output stream.

LVL 57

Expert Comment

by:giltjr
Do you happen to know how big the uncompressed file really should be?  Offhand it looks like you are trying to process a file larger than 4GB on a 32-bit system.

LVL 34

Assisted Solution

by:Duncan Roe
Duncan Roe earned 143 total points
Any 32-bit program compiled with -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 will handle 4GB+ files - that is not an issue.
Getting back on track: are you saying that the file gunzips all right despite being corrupted? You are lucky indeed if it does. Usually gunzip will stop part-way through a corrupted file with a checksum error. If you run
cat oldfile.gz|gunzip|gzip >newfile.gz


does gunzip -l newfile.gz show a good header?

LVL 31

Author Comment

by:farzanj
Thank you experts for your time.

To answer your questions: my company creates logs for all kinds of metrics, and rsyslog is responsible for creating the compressed logs. It has some configuration issue: while the gzip-compressed data is always good, the headers are always bad. Yes, the files can be very big and the sizes are variable; some services in particular create very large files, tens of GB per file uncompressed at least. I write fast parsers for the log files.

Problem: I need a clean solution, one that doesn't create any temp files anywhere. Most libraries that I tried (for Perl, C++, Java) for reading GZ files use the header information. All of these libraries read only the first few MB of the file and would not get the entire file. The GNU utility
gunzip -c filename.gz

does not use the header at all. It always gets me the entire file uncompressed without a problem (-c writes to STDOUT).

So I was wondering if there is any other library or utility for C++ and Java that gets the entire text. I have to read line by line.

@giltjr

The uncompressed files can reach around 100GB.

@Duncan Roe
The files are always good; only the headers are bad. If I uncompress a file and compress it again, yes, the header is repaired. But I cannot do that. My parser has to parse the data in the least possible time, because it parses 100s of TB, or even EB, a day, and the parsing must not fall behind.

So the hint is: gunzip always gets me the entire file correctly, but no C++ or Java library does.

LVL 86

Expert Comment

by:CEHJ
"So the hint is: gunzip always gets me the entire file correctly, but no C++ or Java library does."
That's why my last recommendations drive the Java solution with the applications you know to work.

LVL 31

Author Comment

by:farzanj
@CEHJ: I hear you and I don't blame you. But a pipe in Java is better than a named pipe, and I just want to see if anyone can come up with something better. Even I came up with popen, which first reads into a buffer and then gets me the data line by line, and I know there are far more experienced people out there.

It should be a last resort.

LVL 86

Expert Comment

by:CEHJ
It might help if you could post a smallish example of your header-corrupt files.

LVL 31

Author Comment

by:farzanj
Like this:
gzip -l sip.log-2013110710-sip6110.voip6.companyname-anclogging6001.txt.gz
         compressed        uncompressed  ratio uncompressed_name
         1074138921              364122 -294894.2% sip.log-2013110710-sip6110.voip6.companyname-anclogging6001.txt


Notice the negative ratio

LVL 31

Author Comment

by:farzanj
Experts,

Here's how you can help me: show me all the ways you know of to read GZ files line by line. I will test them and let you know.

LVL 86

Expert Comment

by:CEHJ
"Like this:"
By 'example' I meant: please attach a sample file.

LVL 31

Author Comment

by:farzanj
I wish I could, but the logs contain sensitive information about company customers; I just asked about it again, and it would be a big legal issue.

I also do not know how to reproduce the issue myself. I cannot change the files without uncompressing them first, and if I do that and recompress, the issue resolves itself.

LVL 57

Expert Comment

by:giltjr
Wouldn't the simplest solution be to fix whatever is corrupting the header in the first place?

LVL 31

Author Comment

by:farzanj
It is not simple :)  I tried for years, and we are finally leaving that system behind as Hadoop takes over, so no one will touch it.

First, the system admins do not accept that it is an issue. Second, hundreds of servers would need to be touched, and even if someone found out what is wrong with rsyslog, it would take extreme red tape, change management, etc. No one would ever support me; even before Hadoop existed, they didn't accept the problem.

LVL 86

Expert Comment

by:CEHJ
"I wish I could, but the logs contain sensitive information about company customers ..."
Yes, I understand that. Perhaps if you find rsyslog acting on something else with the same result, you can let us know. If it's not doing it with other files, that might be a clue to the real problem.

LVL 32

Accepted Solution

by:sarabande
sarabande earned 213 total points
Like this:
gzip -l sip.log-2013110710-sip6110.voip6.companyname-anclogging6001.txt.gz
         compressed        uncompressed  ratio uncompressed_name
         1074138921              364122 -294894.2% sip.log-2013110710-sip6110.voip6.companyname-anclogging6001.txt


Notice the negative ratio

If the compressed size is correct, it is more than 1 GB. The uncompressed file will then be a multiple of that, which exceeds the 32-bit boundary of 2 GB when dealing with signed integers, or 4 GB for unsigned. The gzip -l output is of no value in that case, because it obviously shows 32-bit signed integers, which explains the negative value for the ratio. For this issue I found the following at http://lists.gnu.org/archive/html/bug-gzip/2010-08/msg00009.html:

The most often-reported bug for GNU gzip is that gzip -l reports sizes
modulo 2**32, instead of full sizes.  This is because the gzip format
specifies a 4-byte (32-bit) size field.

A similar problem in gzip format is that it supports only nonzero
32-bit time stamps, which limits it to the range from 1970-01-01
00:00:01 through 2106-02-07 06:28:15 UTC.  OK, so this is not as
pressing a bug, but it wouldn't hurt to fix this while we're at it.
If the above (the info is from 2010) still applies to your version of gzip, it would mean that you were using the wrong tool for creating the compressed files: it simply was not capable of handling such file sizes, even though the data part of the file can exceed the 32-bit boundary and can still be read by gunzip.

So in my opinion your headers are not corrupt; some of the sizes in them have simply wrapped around the 2^32 maximum. I cross-checked http://www.ietf.org/rfc/rfc1952.txt, where gzip is specified, and it says:

ISIZE (Input SIZE)
  This contains the size of the original (uncompressed) input data modulo 2^32.
That means it is not a bug at all: the gzip header simply cannot store the real size of the uncompressed data, which makes some unzip tools unhappy, at least when they are asked to show the uncompressed file size (information which is no longer available).

To solve the issue you could use a library with a compression format that can properly create and extract files > 4 GB, for example the Zip64File library, which implements the 64-bit extension of the zip standard.

You could also use gunzip -c and catch the stdout in a C program:

gunzip -c foo.gz | myreader


Here myreader is a console (shell) program which gets the uncompressed data from stdin:

// myreader.cpp
#include <iostream>
#include <string>
#include <vector>

int main()
{
      std::vector<std::string> data;
      std::string line;
      size_t sz = 0;
      // note: the stream comes first - std::getline(std::cin, line)
      while (std::getline(std::cin, line))
      {
            data.push_back(line);
            sz += line.size();
            // check if a maximum bucket size was exceeded
            if (sz > 1024 * 1024)
            {
                  // evaluate the data gathered so far, then release it
                  ...
                  data.clear();
                  sz = 0;
            }
      }
      return 0;
}


At end of file the getline should fail. If it does not, you would need a watchdog thread to prevent a hang after the data has been fully read.

The above program is simple and should work. Unfortunately, stdout and stdin are very slow; I guess the above is at least 10 times slower than using a suitable library function. If you do the decompression yourself (the files are compressed with gzip's 'deflate' mode), you could save even more time.

Sara

LVL 34

Expert Comment

by:Duncan Roe
"the stdout and stdin are very slow" - is that because of using C++ I/O? Can you try fgets() instead? stdin/stdout should not be slow - they're very widely used. Like
cat foo.gz|gzip -d|myreader


where myreader uses fgets(), which is certainly reasonably efficient.
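For instance, a minimal fgets() version of myreader might look like this (a sketch; the 64 KB buffer is an arbitrary choice and assumes no longer lines):

// myreader_fgets.cpp - fgets() variant, fed by: cat foo.gz | gzip -d | myreader
#include <stdio.h>

int main()
{
    char buf[65536];
    unsigned long long lines = 0;
    while (fgets(buf, sizeof buf, stdin) != NULL)
    {
        ++lines;    // parse the line here instead of just counting
    }
    fprintf(stderr, "%llu lines\n", lines);
    return 0;
}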

LVL 34

Expert Comment

by:Duncan Roe
I wonder why the libraries care about the size of the input file. Are they perhaps allocating an array to hold the entire contents? For purely sequential processing, that is absolutely unnecessary.

LVL 86

Expert Comment

by:CEHJ
It would be interesting to see an example of this. I understand that farzanj can't post one of the originals, but it would be instructive to see one. It would certainly be a good test of the 'bad headers' theory.

LVL 32

Assisted Solution

by:sarabande
sarabande earned 213 total points
"Is that because of using C++ I/O? Can you try fgets() instead? stdin/stdout should not be slow - they're very widely used."

Probably a misunderstanding: fgets doesn't use a different internal resource than std::getline, and both should have the same performance. What I compared was an unnecessary write to stdout by gunzip, plus an unnecessary read from stdin by myreader, against a direct call to a library function which simply returns a pointer to memory. The former is at least 10 times slower, which is not measurable for one line, and possibly not for 100 lines, but certainly is when you read gigabytes.

Sara

LVL 57

Expert Comment

by:giltjr
Based on sarabande's post, I would expect that all you need to recreate the "problem" is a file that is larger than 4 GB when unzipped.

LVL 57

Expert Comment

by:giltjr
On Windows with gzip from Cygwin:

C:\>dir bigger.txt

01/20/2014  07:27 PM     5,793,187,840 bigger.txt

C:\>gzip bigger.txt

C:\>dir bigger.txt.gz
01/20/2014  07:27 PM     2,598,375,097 bigger.txt.gz

C:\>gzip -l bigger.txt.gz
         compressed        uncompressed  ratio uncompressed_name
         2598375097          1498220544 -73.4% bigger.txt

The header is not corrupted; it is what sarabande posted. The uncompressed file size, 5,793,187,840, is bigger than 4 GB, and 5,793,187,840 mod 2^32 = 1,498,220,544, which is exactly the 'uncompressed' value gzip -l reports.

LVL 31

Author Comment

by:farzanj
@giltjr
So, can you use any C++/Java library to read it to its end? I already worked out a solution a few years back in C, which uses popen on gunzip -c filename.gz, and it works for me, but I wanted something more sophisticated.

BTW, what version of gzip are you using? Part of my problem is that all my Linux machines have a newer version of gzip, which compresses just fine with a correct header. It is the servers, I think, that are using the older version of gzip.

LVL 34

Assisted Solution

by:Duncan Roe
Duncan Roe earned 143 total points
Sara - you mention a library function which simply returns a pointer to memory. Wouldn't that mean the library has first uncompressed the entire file to memory? For the large files that @farzanj has, that means either gigabytes of RAM/swap or an mmap'd temporary file, which he particularly didn't want (at least the latter); see http:#a39790589
Or am I off track?

LVL 32

Assisted Solution

by:sarabande
sarabande earned 213 total points
"Wouldn't that mean the library has first uncompressed the entire file to memory?"
No, but the read function would surely do both the reading of the binary data and the decompression in chunks of at least 4k, probably more. So when you ask for the next (text) line via a library call, it normally just returns a pointer into already-decompressed data.

When the library function does read and decompress a new chunk, nearly all of the time goes to the I/O part. Reading the zip file plus decompressing costs no additional time compared to reading the data uncompressed, if you take into account that the compressed data is much smaller.

Sara
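As an illustration of the chunked pattern described above, here is a sketch using zlib's gz* interface (gzopen/gzread are real zlib calls; the buffer size and file name are illustrative, and a real reader would have to carry a partial last line over into the next chunk):

// chunkreader.cpp - read decompressed data in chunks and split it into lines
// build: g++ chunkreader.cpp -lz -o chunkreader
#include <cstdio>
#include <cstring>
#include <zlib.h>

int main()
{
    gzFile gz = gzopen("foo.gz", "rb");    // inflates as you read
    if (!gz) { std::fprintf(stderr, "cannot open foo.gz\n"); return 1; }

    char chunk[64 * 1024];
    int n;
    while ((n = gzread(gz, chunk, sizeof chunk)) > 0)
    {
        // chunk[0..n) is already-decompressed data
        for (char *p = chunk, *end = chunk + n; p < end; )
        {
            char *nl = static_cast<char *>(std::memchr(p, '\n', end - p));
            if (!nl)
                break;    // partial line: a real reader keeps it for the next chunk
            // process the line [p, nl) here
            p = nl + 1;
        }
    }
    gzclose(gz);
    return 0;
}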

LVL 34

Expert Comment

by:Duncan Roe
I think decompressing the data takes a lot longer than simply passing it over, all the same. So reading through a pipe is not going to be much slower, if indeed measurably so at all.

LVL 86

Expert Comment

by:CEHJ
C:\>dir bigger.txt.gz
01/20/2014  07:27 PM     2,598,375,097 bigger.txt.gz
OK, if you or anyone can put something like that at a public URL temporarily (I don't have the resources, unfortunately), I'll see how it goes, stream-wise, with Java.

LVL 32

Expert Comment

by:sarabande
I think using program-to-program I/O via a pipe in the shell (command interpreter) is much slower than calling a library function in a C program.

You could test it as follows:

Take a gzip file and make 3 copies of it (to avoid caching effects).

(1) Run gunzip -c on the 1st file and redirect the output to the null device. Take the total time as t1.

(2) Run gunzip -c on the 2nd file and redirect the output to a file. t2 should be greater than t1, and d = t2 - t1 is the time needed to write the temporary file.

(3) Open the temp file with an editor. t3 is the time it takes until the editor is ready for input.

(4) Open the 3rd gzip file for editing in a zip tool with a GUI. The text editor should be the same as in step (3). t4 is the total time between launching the action and the editor being ready for input.

If you compare t1 with t5 = t4 - t3 - d, you should know which one is faster; t5 is the time the zip tool needed.

Sara

LVL 57

Assisted Solution

by:giltjr
giltjr earned 72 total points
Using gzip 1.4, part of Cygwin.

We have determined that the header is not corrupt.

farzanj, you stated:

--> This data has to be read line by line using both C++ and Java.

Is this because you feel the header record is corrupt?

Or is there some other reason you can't just unzip the file?

LVL 31

Author Comment

by:farzanj
The files are huge, and they are read-only on the filers. They must be read as they are, not uncompressed and then recompressed again. Storage is huge, and log file volume is a big issue.

Line by line, because the data is understood line by line. Parsing has to be done line by line to make any sense of it and to extract the important information.


Also, if you issue the command
gzip -V
you get the version. Versions earlier than 1.3.x have this issue; version 1.4 is fixed.
This notice describes the versions with the problem:
http://www.gzip.org/#faq10

For CEHJ: I am looking for an older version of gzip, and I presume he is a Windows user, so I am trying to locate some older Windows binaries so that he can see what is happening.

LVL 31

Author Comment

by:farzanj
@CEHJ, I have a link for older source code for you
ftp://ftp.netscape.com/pub/unsupported/gnu/gzip-1.2.4.tar.Z

But if that is trouble for you, I can probably get you a "bad" gz file, though I am not sure I will be able to upload it on EE; probably not.

LVL 86

Expert Comment

by:CEHJ
"@CEHJ, I have a link for older source code for you"
No - I don't want the source of gzip; I want a link to a large file with a 'bad' header.

You're not accusing me of being a Windows user are you? ;)

LVL 31

Author Comment

by:farzanj
"You're not accusing me of being a Windows user are you? ;)"

No sir, I wouldn't dare :)  I just wanted to make it easier for you. I don't know where to upload this file :(

LVL 86

Expert Comment

by:CEHJ
"I don't know where to upload this file"
What have you got, and how big is it?

LVL 57

Expert Comment

by:giltjr
I think he was referring to me as the Windows user. My work computer is Windows; we aren't allowed to run Linux on our desktop computers, so I run Cygwin to get a Linux environment on Windows.

At home I run Linux, and I also have access to Linux hosts at work.

So you don't want to uncompress the whole file, due to its size. You want to uncompress a chunk, process that chunk, then uncompress another chunk, and continue until you have processed the whole file.

The header is not really corrupted. The field that represents the uncompressed file size is only a 32-bit counter. The patch allows gzip to process source files of 4 GB or larger, but the header will still show the wrapped value.
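You can see that 32-bit field for yourself: per RFC 1952 (which sarabande linked above), ISIZE is the last 4 bytes of the file, little-endian, and holds the uncompressed size modulo 2^32. A minimal sketch that dumps it (assumes a single-member .gz; compile with -D_FILE_OFFSET_BITS=64 for files over 2 GB):

// isize.cpp - print the ISIZE trailer field of a single-member .gz file
#include <cstdio>

int main(int argc, char **argv)
{
    if (argc != 2) { std::fprintf(stderr, "usage: %s file.gz\n", argv[0]); return 1; }
    std::FILE *f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }

    std::fseek(f, -4L, SEEK_END);    // ISIZE = last 4 bytes, little-endian
    unsigned char b[4];
    if (std::fread(b, 1, 4, f) != 4) { std::fclose(f); return 1; }
    std::fclose(f);

    unsigned long isize = (unsigned long)b[0]
                        | ((unsigned long)b[1] << 8)
                        | ((unsigned long)b[2] << 16)
                        | ((unsigned long)b[3] << 24);
    std::printf("ISIZE (uncompressed size mod 2^32): %lu\n", isize);
    return 0;
}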

LVL 34

Expert Comment

by:Duncan Roe
Regular pipes should be fine then. It doesn't matter that the shell created them: they're just pipes (and as such live wholly in memory).
I used to program with zlib quite a bit - if you want to do it that way, the specs are all in /usr/include/zlib.h; a sketch of that route follows. Personally I'd rather pipe from gzip -d (the same as gunzip -c, but shorter to type ;))
BTW @Sara, I believe writes to /dev/null are especially fast because Linux optimizes them to no-ops.
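If you do go through zlib, its gz* convenience layer already does the line splitting. gzopen()/gzgets() are real zlib calls, and since they inflate the stream as they go they should never need the wrapped ISIZE trailer field - though that last point is my assumption, worth testing against one of the real files. A minimal sketch:

// gzlines.cpp - read a .gz file line by line via zlib's gz* API
// build: g++ gzlines.cpp -lz -o gzlines
#include <cstdio>
#include <zlib.h>

int main(int argc, char **argv)
{
    if (argc != 2) { std::fprintf(stderr, "usage: %s file.gz\n", argv[0]); return 1; }
    gzFile gz = gzopen(argv[1], "rb");
    if (!gz) { std::fprintf(stderr, "cannot open %s\n", argv[1]); return 1; }

    char line[65536];    // assumes lines fit in 64 KB
    while (gzgets(gz, line, (int)sizeof line) != NULL)
    {
        // one '\n'-terminated log line per call; parse it here
    }
    gzclose(gz);
    return 0;
}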

LVL 31

Author Closing Comment

by:farzanj
Thank you, everyone.