Using lseek on a compressed file

I am trying to save space on my client's unix box by compressing several files (each is at least 1 gig before compression).  My problem is that I have a C program that uses these files, one at a time, as input.  This program uses 'lseek' to find a particular location within the file, but this will present a problem once the files are compressed.  Is lseek able to operate on a compressed file?  If so, which method (compress, gzip, pack, etc...) is most compatible?

>>Is lseek able to operate on a compressed file?


>>If so, which method (compress, gzip, pack, etc...) is most compatible?

I am not sure what you mean by "most compatible".  See next statement.

When you open the file in binary mode you are essentially just looking at bytes.  lseek is not going to know that the file is compressed, or anything else about the data the descriptor refers to.  It is going to be up to you to tell it where to seek to.
How does the program determine which location to lseek to in the uncompressed file?
A seek into a compressed file would generally not be particularly meaningful.
Can you uncompress before seeking?
betsywrAuthor Commented:
The program determines the location to lseek by querying a database table. All three lseek parameters (file name, offset and origin) are fields in that table. I am not able to change the information that gets populated in the table, so I'm trying to figure out a way (without running this program through a unix shell script - major performance hit) to utilize this data and still save space.  Any ideas?

By 'most compatible', I mean easiest to work with (i.e. which method of compressing would produce a file that is the least problematic for the lseek function to operate on).

since you are compressing "several" files, I would suggest that the easiest thing to do is to uncompress one of them when the program starts to run (or just before you want to seek). The others will remain compressed, and if your program's run time is short, you will only have one uncompressed for a short while.

To do what you want to, the only way that I can see is to get hold of the freeware source of the compression code (such things are available) & tweak it. Thus, when compressing, you can store the filepos of certain key data & use that to lseek afterwards. However, you would then need to uncompress the data, & since it's not the whole file which you're uncompressing, just a bit, you *must* use the same algorithm to uncompress as to compress.

I *strongly* recommend working on only whole files.
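A minimal sketch of the decompress-on-demand idea (the path handling, the helper name, and the `gzip -dk` invocation are illustrative assumptions; `-k` keeps the .gz and requires a reasonably recent gzip):

```c
/* Sketch: decompress one file on demand, then lseek as before.
 * Assumes "path" names the uncompressed file and "path.gz" is the
 * compressed copy; error handling is minimal for clarity. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int open_at_offset(const char *path, off_t offset) {
    char cmd[512];
    if (access(path, R_OK) != 0) {
        /* Not yet uncompressed: run gzip, keeping the .gz around (-k). */
        snprintf(cmd, sizeof cmd, "gzip -dk '%s.gz'", path);
        if (system(cmd) != 0) return -1;
    }
    int fd = open(path, O_RDONLY);
    if (fd >= 0 && lseek(fd, offset, SEEK_SET) == (off_t)-1) {
        close(fd);
        fd = -1;
    }
    return fd;  /* positioned at offset, ready to read */
}
```

After the program finishes with the file, it can simply delete the uncompressed copy, since the .gz original was kept.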

betsywrAuthor Commented:
Uncompressing only one file at a time sounds like a great idea.  However, this is an online customer care system that I'm working with, and there are hundreds of customer service reps hitting these files at any given time.  Therefore, it wouldn't save me any space given that all of the files could be uncompressed at the same time.  Not to mention that I would have major contention problems if someone was trying to uncompress a file which is already uncompressed and being used by another customer service rep.  Thanks for the suggestion though.  

I'm really hoping that someone will know of a form of compression that wouldn't require uncompressing the file to do an lseek on it.  

If this is at ALL possible, please let me know.
well, then we get back to the idea of taking a publicly available compression mechanism & changing it to your own ends (try visiting http://www.gnu.org & getting the gzip code).

The simplest form of compression is "Run Length Encoding". In this method, you simply replace, say, 20 spaces with three bytes: one to say "here comes some compression", one to say "space", and one to say "20 of them".  It's not the most efficient, of course, but which method is best depends on your data.
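As a toy illustration of that scheme (the escape byte, the run threshold, and the function names are arbitrary choices for this sketch, not part of any standard format):

```c
/* Minimal run-length encoder/decoder sketch: runs of 4+ identical
 * bytes (up to 255) become ESC, byte, count; a literal ESC byte is
 * always escaped so the decoder can't misread it. */
#include <stddef.h>
#include <string.h>

#define ESC 0xFF

size_t rle_encode(const unsigned char *in, size_t n, unsigned char *out) {
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        size_t run = 1;
        while (i + run < n && in[i + run] == in[i] && run < 255) run++;
        if (run >= 4 || in[i] == ESC) {
            out[o++] = ESC;
            out[o++] = in[i];
            out[o++] = (unsigned char)run;
        } else {
            for (size_t k = 0; k < run; k++) out[o++] = in[i];
        }
        i += run;
    }
    return o;  /* compressed length */
}

size_t rle_decode(const unsigned char *in, size_t n, unsigned char *out) {
    size_t o = 0;
    for (size_t i = 0; i < n; ) {
        if (in[i] == ESC) {
            memset(out + o, in[i + 1], in[i + 2]);  /* expand the run */
            o += in[i + 2];
            i += 3;
        } else {
            out[o++] = in[i++];
        }
    }
    return o;  /* original length */
}
```

With this scheme, 20 spaces encode to exactly the three bytes described above.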

betsywrAuthor Commented:
gzip could help, but it still doesn't help my problem of having to perform an lseek WITHOUT uncompressing the files first.  I think that my co-workers and I have a possible solution, but it is quite messy.  I am most in favor of simply purchasing more disk space!
hmm, on the one hand - thanks for the points - they pushed me over the 30k mark & I am now eligible for another stripe on my T-shirt. Otoh, that wasn't really an answer. I guess that you are like me - I ask questions in order to bounce ideas off of others, then end up implementing my own original idea, but awarding points to everyone who participated in the discussion.

Gzip may not directly help, but tweaking the code so that when compressing you store the offsets to important records in a second, index file might help. I think however that you have the correct idea - storage is cheap (but apparently not as cheap as your boss <g>) - buy more hard drives.
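The index-file idea might look roughly like this (the structure and field names are made up for illustration; the tweaked compressor would emit one entry per compressed block, and the reader would decompress only the block it lands on):

```c
/* Sketch of a side index that maps offsets in the uncompressed file
 * to offsets in the compressed file, so a read can lseek straight to
 * the right compressed block. */
#include <stddef.h>

struct idx_entry {
    long orig_off;  /* offset of block start in the uncompressed file */
    long comp_off;  /* offset of the same block in the compressed file */
    long comp_len;  /* compressed length of the block */
};

/* Return the entry for the block containing orig_off, or NULL.
 * Linear scan for clarity; a sorted index could use binary search. */
const struct idx_entry *idx_lookup(const struct idx_entry *idx,
                                   size_t n, long orig_off) {
    const struct idx_entry *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (idx[i].orig_off <= orig_off &&
            (best == NULL || idx[i].orig_off > best->orig_off))
            best = &idx[i];
    return best;
}
```

The caller would then lseek to `comp_off` in the compressed file, read `comp_len` bytes, decompress that block alone, and index into it with `orig_off - entry->orig_off`.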

best wishes,
