
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 234

Using lseek on a compressed file

I am trying to save space on my client's unix box by compressing several files (each is at least 1 gig before compression).  My problem is that I have a C program that uses these files, one at a time, as input.  This program uses 'lseek' to find a particular location within the file, but this will present a problem once the files are compressed.  Is lseek able to operate on a compressed file?  If so, which method (compress, gzip, pack, etc...) is most compatible?
1 Solution
>>Is lseek able to operate on a compressed file?

lseek itself will run on any open file descriptor - it just moves a byte offset - but an offset taken from the uncompressed data will not point at the same data in the compressed file.

>>If so, which method (compress, gzip, pack, etc...) is most compatible?

I am not sure what you mean by "most compatible".  See next statement.

When you open the file in binary mode you are essentially just looking at bytes.  lseek is not going to know that the file is compressed or anything about the data that the fp is pointing to.  It is going to be up to you to tell it where to seek to.
How does the program determine which location to lseek to in the uncompressed file?
A seek into a compressed file would generally not be particularly meaningful.
Can you uncompress before seeking?
betsywrAuthor Commented:
The program determines the location to lseek by querying a database table. All three lseek parameters (file name, offset and origin) are fields in that table. I am not able to change the information that gets populated in the table, so I'm trying to figure out a way (without running this program through a unix shell script - major performance hit) to utilize this data and still save space.  Any ideas?

By 'most compatible', I mean easiest to work with (i.e. which method of compressing would produce a file that is the least problematic for the lseek function to operate on).

since you are compressing "several" files, I would suggest that the easiest thing to do is to uncompress one of them when the program starts to run (or just before you want to seek). The others will remain compressed & if your program's run time is short, you will only have one uncompressed for a short while.

To do what you want, the only way that I can see is to get hold of the freeware source of the compression code (such things are available) & tweak it. Thus, when compressing, you can store the filepos of certain key data & use that to lseek afterwards. However, you would then need to uncompress the data & since it's not the whole file which you're uncompressing, just a bit, you *must* use the same algorithm to uncompress as to compress.

I *strongly* recommend working on only whole files.

betsywrAuthor Commented:
Uncompressing only one file at a time sounds like a great idea.  However, this is an online customer care system that I'm working with, and there are hundreds of customer service reps hitting these files at any given time.  Therefore, it wouldn't save me any space given that all of the files could be uncompressed at the same time.  Not to mention that I would have major contention problems if someone was trying to uncompress a file which is already uncompressed and being used by another customer service rep.  Thanks for the suggestion though.  

I'm really hoping that someone will know of a form of compression that wouldn't require uncompressing the file to do an lseek on it.  

If this is at ALL possible, please let me know.
well, then we get back to the idea of taking a publicly available compression mechanism & changing it to your own ends (try visiting http://www.gnu.org & getting the gzip code).

The simplest form of compression is "Run Length Encoding". In this method, you simply replace, say, 20 spaces with three bytes: one to say "here comes some compression", one to say "space" and one to say "20 of them".  It's not the most efficient, of course, but which method is best depends on your data.
betsywrAuthor Commented:
gzip could help, but it still doesn't help my problem of having to perform an lseek WITHOUT uncompressing the files first.  I think that my co-workers and I have a possible solution, but it is quite messy.  I am most in favor of simply purchasing more disk space!
hmm, on the one hand - thanks for the points - they pushed me over the 30k mark & I am now eligible for another stripe on my T-shirt. Otoh, that wasn't really an answer. I guess that you are like me - I ask questions in order to bounce ideas off of others, then end up implementing my own original idea, but awarding points to everyone who participated in the discussion.

Gzip may not directly help, but tweaking the code so that when compressing you store the offsets to important records in a second, index file might help. I think however that you have the correct idea - storage is cheap (but apparently not as cheap as your boss <g>) - buy more hard drives.

best wishes,

