Avatar of chsalvia
chsalvia
 asked on

Remove data from file

I recently asked a similar question - but the issue has become more relevant to me as I continue working on my program, so I want to ask the question again in greater detail.

After reading over the GNU docs regarding file manipulation, there seems to be no facility for removing data from a file.  My program creates a file which stores various records in no particular order.  If I want my program to delete an arbitrary record, it seems my only option would be to recreate the entire file with one less record, and then delete the old file.  

But recreating the entire file seems like a very expensive operation, especially if the file is large.  Surely database programs which store thousands of records do not recreate the data file every time a single record is removed.  But the only other conceivable way to remove an arbitrary record from a data file would be to seek the record, and then move all subsequent records backwards by one record.  This is still a rather expensive operation, but I guess it's the only option with a disk-based data structure.  

But is it even possible to do this?  I realize there is no such facility defined in ANSI C, but I'm not worried so much about portability.  Still, I can't find anything mentioned in the GNU docs either.  So, what are some ways to remove data from a file without recreating the entire file on a GNU Linux platform?  
C

Avatar of undefined
Last Comment
sunnycoder

8/22/2022 - Mon
jkr

What kind of records are you using? If they are fixed-length, you can just add a status header (e.g. 'valid'/'deleted') and overwrite that header and reuse that space.l Things are getting more difficult with variable length records, though.
SOLUTION
Naveen Kumar

Log in or sign up to see answer
Become an EE member today7-DAY FREE TRIAL
Members can start a 7-Day Free trial then enjoy unlimited access to the platform
Sign up - Free for 7 days
or
Learn why we charge membership fees
We get it - no one likes a content blocker. Take one extra minute and find out why we block content.
Not exactly the question you had in mind?
Sign up for an EE membership and get your own personalized solution. With an EE membership, you can ask unlimited troubleshooting, research, or opinion questions.
ask a question
chsalvia

ASKER
Well, for the time being the records are fixed length - but the problem is that:

1. I plan on making a variable length data structure later, so eventually I'll need to do something about this

2. The data structure uses a hash index.  Collisions are resolved using chaining by appending nodes at the end of the file.  If I want to delete a record which is inside the table, this is no problem - I can just mark it as "deleted" or "empty".  But if the record is part of a chain in the "overflow" area, I wouldn't be able to remove it.  Of course, I could mark it as empty and then connect the two surrounding records, but this would still keep the deleted record in the file - forever useless and just taking up space.

Perhaps a better way to ask my question is, what options are available to a programmer to remove data from a file?  So far, I know of 3 potential ways:

1. Recreate entire file and delete old file (very slow)
2. Overwrite data by moving all subsequent data back the length of one record (better, but still slow)
3. Mark deleted records as "deleted" but keep them in file (fast, but causes file to become cluttered with useless data)

Of course, option 3 is ideal for a fixed-length hash table which resolves collisions internally, but if you resolve collisions externally, or if individual nodes are variable length, then your file will become cluttered with deleted records over time.
SOLUTION
Log in to continue reading
Log In
Sign up - Free for 7 days
Get an unlimited membership to EE for less than $4 a week.
Unlimited question asking, solutions, articles and more.
ASKER CERTIFIED SOLUTION
Log in to continue reading
Log In
Sign up - Free for 7 days
Get an unlimited membership to EE for less than $4 a week.
Unlimited question asking, solutions, articles and more.
chsalvia

ASKER
>Also, an improvement over your 3rd method would be to run a period defrag kind of routine ...

Right, that was something I was thinking about as well.  Memory mapping a file sounds interesting.  How does that work exactly - does that actually COPY the data to memory, or does it just treat the disk space as if it was virtual memory?  

Also, is the 2nd method I mentioned above even workable?  (Overwriting file data by moving all subsequent data back the length of one record)  I mean, even if I did this using fread and then fwrite, it still wouldn't actually reduce the file size.

In fact...I see no way of actually reducing a file size at all.  The only option seems to be to create another file.  

Experts Exchange has (a) saved my job multiple times, (b) saved me hours, days, and even weeks of work, and often (c) makes me look like a superhero! This place is MAGIC!
Walt Forbes
SOLUTION
Log in to continue reading
Log In
Sign up - Free for 7 days
Get an unlimited membership to EE for less than $4 a week.
Unlimited question asking, solutions, articles and more.
chsalvia

ASKER
Is I/0 usually faster when using mmap as opposed to fseek, fread, fwrite?
SOLUTION
Log in to continue reading
Log In
Sign up - Free for 7 days
Get an unlimited membership to EE for less than $4 a week.
Unlimited question asking, solutions, articles and more.
stefan73

chsalvia,
> Is I/0 usually faster when using mmap as opposed to fseek, fread, fwrite?
No, it's more comfortable - it "virtualizes" file storage and memory storage into one. You can directly access file data via structure pointers.

Stefan
chsalvia

ASKER
Well - I'm more interested in speed.  So, you're saying that normal stream operations like fseek, fread, etc. are faster than using mmap to manipulate file data?
Get an unlimited membership to EE for less than $4 a week.
Unlimited question asking, solutions, articles and more.
SOLUTION
Log in to continue reading
Log In
Sign up - Free for 7 days
Get an unlimited membership to EE for less than $4 a week.
Unlimited question asking, solutions, articles and more.
sunnycoder

Typically, mmap is lot faster as compared to fseek, fread etc. However, it is possible to create some degenerate cases where mmap performance is low ... This done by flushing out code and data which you would need soon to bring in the memory mapped file ... This results in large number of page faults which are expensive to serve.