chsalvia asked:

Remove data from file

I recently asked a similar question - but the issue has become more relevant to me as I continue working on my program, so I want to ask the question again in greater detail.

After reading over the GNU docs regarding file manipulation, there seems to be no facility for removing data from a file.  My program creates a file which stores various records in no particular order.  If I want my program to delete an arbitrary record, it seems my only option would be to recreate the entire file with one less record, and then delete the old file.  

But recreating the entire file seems like a very expensive operation, especially if the file is large. Surely database programs which store thousands of records do not recreate the data file every time a single record is removed. The only other conceivable way to remove an arbitrary record from a data file would be to seek to the record and then move all subsequent records backwards by one record. This is still a rather expensive operation, but I guess it's the only option with a disk-based data structure.

But is it even possible to do this?  I realize there is no such facility defined in ANSI C, but I'm not worried so much about portability.  Still, I can't find anything mentioned in the GNU docs either.  So, what are some ways to remove data from a file without recreating the entire file on a GNU Linux platform?  
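(For reference, the "recreate the whole file" approach described above might look roughly like the following minimal sketch. It assumes fixed-length records; the record size, file names and function name are illustrative, not from the thread, and error handling is kept minimal.)

#include <stdio.h>

#define RECORD_SIZE 128  /* assumed fixed record length */

/* Copy every record except the one at index `skip` into a new file,
   then replace the original.  This is the "recreate the file" option. */
int delete_record_rewrite(const char *path, long skip)
{
    FILE *in  = fopen(path, "rb");
    FILE *out = fopen("records.tmp", "wb");
    if (!in || !out) {
        if (in)  fclose(in);
        if (out) fclose(out);
        return -1;
    }

    char buf[RECORD_SIZE];
    long idx = 0;
    while (fread(buf, RECORD_SIZE, 1, in) == 1) {
        if (idx++ != skip)
            fwrite(buf, RECORD_SIZE, 1, out);
    }

    fclose(in);
    fclose(out);
    remove(path);                        /* delete the old file */
    return rename("records.tmp", path);  /* move the new file into place */
}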
jkr replied:
What kind of records are you using? If they are fixed-length, you can just add a status header (e.g. 'valid'/'deleted'), overwrite that header, and reuse the space. Things get more difficult with variable-length records, though.
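A minimal sketch of that idea, assuming a hypothetical fixed-length record layout (the field names and sizes are illustrative, not from the thread; the file must be opened in "r+b" mode):

#include <stdio.h>

enum { REC_VALID = 1, REC_DELETED = 0 };

struct record {
    unsigned char status;   /* REC_VALID or REC_DELETED */
    char          key[32];
    char          data[91]; /* pad to a fixed total size */
};

/* Mark the record at index `idx` as deleted by overwriting only its
   status byte; the slot can later be reused for a new record. */
int mark_deleted(FILE *fp, long idx)
{
    unsigned char status = REC_DELETED;
    if (fseek(fp, idx * (long)sizeof(struct record), SEEK_SET) != 0)
        return -1;
    return fwrite(&status, 1, 1, fp) == 1 ? 0 : -1;
}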
SOLUTION by Naveen Kumar (members-only, not shown)
chsalvia (Asker) replied:
Well, for the time being the records are fixed-length, but the problem is that:

1. I plan on making a variable-length data structure later, so eventually I'll need to do something about this.

2. The data structure uses a hash index. Collisions are resolved by chaining, appending nodes at the end of the file. If I want to delete a record that is inside the table, this is no problem: I can just mark it as "deleted" or "empty". But if the record is part of a chain in the "overflow" area, I wouldn't be able to remove it. Of course, I could mark it as empty and then connect the two surrounding records, but this would still keep the deleted record in the file, forever useless and just taking up space.

Perhaps a better way to ask my question is, what options are available to a programmer to remove data from a file?  So far, I know of 3 potential ways:

1. Recreate entire file and delete old file (very slow)
2. Overwrite data by moving all subsequent data back the length of one record (better, but still slow)
3. Mark deleted records as "deleted" but keep them in file (fast, but causes file to become cluttered with useless data)

Of course, option 3 is ideal for a fixed-length hash table which resolves collisions internally, but if you resolve collisions externally, or if individual nodes are variable length, then your file will become cluttered with deleted records over time.
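A rough sketch of the "mark deleted and relink the chain" idea for the overflow area, under the assumption that each node stores the file offset of the next node (0 meaning end of chain); the node layout and function name are illustrative, not from the thread:

#include <stdio.h>

/* Assumed on-disk node layout for the overflow chain; 'next' is the
   file offset of the following node, or 0 for end of chain. */
struct node {
    long next;
    int  deleted;           /* tombstone flag */
    char payload[116];
};

/* Unlink the node at 'victim_off' from its chain: point the predecessor
   at the victim's successor and tombstone the victim.  The slot stays
   in the file until some later compaction pass reclaims it. */
int unlink_node(FILE *fp, long prev_off, long victim_off)
{
    struct node prev, victim;

    if (fseek(fp, victim_off, SEEK_SET) != 0 ||
        fread(&victim, sizeof victim, 1, fp) != 1)
        return -1;
    if (fseek(fp, prev_off, SEEK_SET) != 0 ||
        fread(&prev, sizeof prev, 1, fp) != 1)
        return -1;

    prev.next = victim.next;   /* bypass the victim in the chain */
    victim.deleted = 1;        /* mark the slot reusable */

    if (fseek(fp, prev_off, SEEK_SET) != 0 ||
        fwrite(&prev, sizeof prev, 1, fp) != 1)
        return -1;
    if (fseek(fp, victim_off, SEEK_SET) != 0 ||
        fwrite(&victim, sizeof victim, 1, fp) != 1)
        return -1;

    return fflush(fp) == 0 ? 0 : -1;
}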
SOLUTION (members-only, not shown)
ASKER CERTIFIED SOLUTION (members-only, not shown)
>Also, an improvement over your 3rd method would be to run a periodic defrag kind of routine ...

Right, that was something I was thinking about as well. Memory mapping a file sounds interesting. How does that work exactly - does that actually COPY the data to memory, or does it just treat the disk space as if it were virtual memory?

Also, is the 2nd method I mentioned above even workable?  (Overwriting file data by moving all subsequent data back the length of one record)  I mean, even if I did this using fread and then fwrite, it still wouldn't actually reduce the file size.

In fact...I see no way of actually reducing a file size at all.  The only option seems to be to create another file.  
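As a side note, POSIX (and therefore GNU/Linux) does provide ftruncate(), which can shrink a file in place, so the 2nd method can be made to actually reduce the file size. A minimal sketch, again assuming fixed-length records and illustrative names:

#include <stdio.h>
#include <unistd.h>   /* ftruncate */

#define RECORD_SIZE 128  /* assumed fixed record length */

/* Overwrite the record at index `idx` by sliding every later record
   back one slot, then cut the now-unused tail off with ftruncate(). */
int delete_record_shift(FILE *fp, long idx)
{
    char buf[RECORD_SIZE];
    long src = (idx + 1) * RECORD_SIZE;   /* record after the victim */
    long dst = idx * RECORD_SIZE;         /* victim's slot */

    for (;;) {
        if (fseek(fp, src, SEEK_SET) != 0) return -1;
        size_t n = fread(buf, 1, RECORD_SIZE, fp);
        if (n == 0) break;                /* reached end of file */
        if (fseek(fp, dst, SEEK_SET) != 0) return -1;
        if (fwrite(buf, 1, n, fp) != n) return -1;
        src += RECORD_SIZE;
        dst += RECORD_SIZE;
    }

    fflush(fp);                           /* push stdio buffers to the fd */
    return ftruncate(fileno(fp), dst);    /* drop the duplicated last slot */
}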

SOLUTION (members-only, not shown)
Is I/O usually faster when using mmap as opposed to fseek, fread, and fwrite?
SOLUTION (members-only, not shown)
chsalvia,
> Is I/O usually faster when using mmap as opposed to fseek, fread, fwrite?
No, it's more comfortable - it "virtualizes" file storage and memory storage into one. You can directly access file data via structure pointers.

Stefan
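A minimal sketch of what Stefan describes: the file is mapped into the address space (pages are brought in on demand rather than copied up front) and records are accessed through an ordinary structure pointer. The record layout and file name are illustrative, not from the thread:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct record {          /* assumed fixed-length record layout */
    unsigned char status;
    char          key[32];
    char          data[91];
};

int main(void)
{
    int fd = open("records.dat", O_RDWR);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0) return 1;

    /* Map the whole file; the kernel pages it in on demand rather
       than copying it all into memory up front. */
    struct record *recs = mmap(NULL, st.st_size,
                               PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);
    if (recs == MAP_FAILED) return 1;

    size_t nrec = st.st_size / sizeof(struct record);
    if (nrec > 0)
        recs[0].status = 0;   /* e.g. tombstone the first record in place */

    munmap(recs, st.st_size);
    close(fd);
    return 0;
}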
Well - I'm more interested in speed.  So, you're saying that normal stream operations like fseek, fread, etc. are faster than using mmap to manipulate file data?
SOLUTION (members-only, not shown)
Typically, mmap is a lot faster compared to fseek, fread, etc. However, it is possible to create degenerate cases where mmap performance is poor. This happens when bringing in the memory-mapped file flushes out code and data that you will need again soon, which results in a large number of page faults, and page faults are expensive to service.