databoks
asked on
Incremental Backup
How can I perform incremental backup of files? Some say that I can use a CRC, or simply read the file in blocks and compare them against the last full backup.
Which one is easier? Is there a library? Please help :)
ASKER
It's not only text files, but every kind of file.
In many cases, incremental backups are very useful. A lot of software with incremental backup, though, does a full copy of every file that has changed and simply skips files that haven't changed since the last backup. I think some use the file's timestamp, but most use some kind of hash (CRC, MD5, etc.).
To do an incremental backup of an individual file, you could get as complex as you wanted. If you look into the research done on compression, you'll get tons of ideas.
I would just use a hash to see if the file changed at all, and then check blocks to see whether the first portion of the file still matches. If not, I'd just update the whole thing.
But if you look into research on partial string matching (it works on binary data too, not just text strings), you will get some good ideas on how to find matching portions.
Basically, high-level incremental backups are almost identical to compression algorithms, so some research in that field would be of use.
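As a minimal sketch of the hash-based check described above (using MD5 from System.Security.Cryptography; the class and method names here are illustrative, not from any particular backup library):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class ChangeDetector
{
    // Hashes both files and compares the digests; true means the
    // current file differs from the backup copy, so it needs re-backup.
    public static bool HasChanged(string currentFile, string backupCopy)
    {
        using (var md5 = MD5.Create())
        using (var s1 = File.OpenRead(currentFile))
        using (var s2 = File.OpenRead(backupCopy))
        {
            byte[] h1 = md5.ComputeHash(s1);
            byte[] h2 = md5.ComputeHash(s2);
            // Both digests are 16 bytes for MD5, so compare byte by byte.
            for (int i = 0; i < h1.Length; i++)
                if (h1[i] != h2[i]) return true;
            return false;
        }
    }
}
```

Any hash would do for "did it change at all"; block-level checks would then narrow down where.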
ASKER
I have looked at CRC, and I will do some research on compression.
I thought about reading the files in blocks and then comparing them, but that will not always help, since the changed bytes may occupy the same positions. For example, a user replaces a line with another line of the same length but with different characters.
I could use both a CRC and block-by-block reads.
Any recommendation?
You could check the files for long common substrings to see where they are the same. There are many algorithms out there for this. It would also help in the case where the replacement is longer than the original.
What you need to decide is where you want to land in the balance between processor time and data size. If you build a complicated incremental algorithm using partial string matching or long common substrings, you will have little data to transmit, but it will take a long time to compute.
If you are doing offsite backup over dialup on a >3 GHz computer to a small backup drive, then shrink it down as much as possible.
If you are backing up to a local 1 TB drive, then maybe you want something simpler and faster: just replace the whole file for any that changed.
If you are trying to create a backup program to market and sell, then do both and let the user set a parameter to choose which to use.
ASKER
Okay.
It has to be kind of professional, so I can't just replace the files; I have to create deltas.
Imagine a customer who has 1000 text files. If the backup is scheduled to run every day, then I have to open 1000 text files and compute the substrings against the previous day's versions, which will take too long.
Maybe I have to base it on blocks of bytes and a CRC at the same time.
Yes, you could compute checksums on blocks if the whole-file checksums didn't match.
You could also loop through both files once, and when they diverge, keep moving through the larger one until they start matching again. In this way you could catch all simple insertions OR deletions. If a file had both, you would need the long, complex approach that you don't want, so in that case you would just replace the file.
That would be more effective than doing it in blocks and would take about the same amount of time.
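The one-pass resync described above could be sketched like this for the insertion case (a deletion is the same call with the arguments swapped; names and the tuple return are illustrative):

```csharp
using System;

static class DeltaFinder
{
    // Detects a single contiguous insertion in 'newer' relative to 'older'.
    // Returns the offset and length of the inserted run, or null if the
    // change is more complex (in which case, per the discussion, you would
    // just replace the whole file).
    public static (int offset, int length)? FindInsertion(byte[] older, byte[] newer)
    {
        if (newer.Length <= older.Length) return null;
        int gap = newer.Length - older.Length;

        // Walk forward until the two files first disagree.
        int start = 0;
        while (start < older.Length && older[start] == newer[start]) start++;

        // Skip the presumed insertion in the larger file and verify
        // that the tails match from there on.
        for (int i = start; i < older.Length; i++)
            if (older[i] != newer[i + gap]) return null;

        return (start, gap);
    }
}
```

A simple replacement (same length, different bytes) or combined insert+delete makes the tail check fail, which is exactly the "just replace it" case.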
ASKER
Thanks for the reply.
So I first have to use a CRC, and afterwards use blocks and see if the checksum has changed.
Can you please help me with some code?
Maybe with the CRC and reading the file in blocks. Bear in mind that I will use Microsoft Volume Shadow Copy to take the backup of these files.
CRC is a type of checksum. You could use any kind of checksum you wanted, but since you have both files right there, why use a checksum at all? Why not just compare the blocks? Checksums are for comparing later, to verify that there was no transmission error. Why compute the checksum of both files when you can simply compare the files themselves?
The code would just look something like:
open(file1)
open(file2)
for i from 0 to blocksize
    if (file1[i] != file2[i])
        block is different!
ASKER
You are right.
I'll stick to the block comparison only. However, I have to create deltas later on, but that is another day and another headache :)
I will try some code and post it here.
Thanks, Tommy.
You just need to come up with a format for your deltas. You could just use a bitmask that says which blocks changed, but again, if the file sizes can change, your deltas should list the inserts and deletes.
Something like insert(location, size, data) and delete(location, size). Then you would store the original file plus the inserts and deletes.
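A hypothetical shape for such a delta format, with records replayed in order against the original file to reproduce the new one (all names here are made up for illustration):

```csharp
using System;
using System.Collections.Generic;

enum DeltaOp { Insert, Delete }

class DeltaRecord
{
    public DeltaOp Op;
    public int Location;  // byte offset in the file being patched
    public byte[] Data;   // payload for inserts; null for deletes
    public int Size;      // bytes to remove for deletes
}

static class DeltaPatcher
{
    // Applies the records in order; each record's Location refers to
    // the buffer as it stands after the previous records ran.
    public static byte[] Apply(byte[] original, IList<DeltaRecord> delta)
    {
        var result = new List<byte>(original);
        foreach (var rec in delta)
        {
            if (rec.Op == DeltaOp.Insert)
                result.InsertRange(rec.Location, rec.Data);
            else
                result.RemoveRange(rec.Location, rec.Size);
        }
        return result.ToArray();
    }
}
```

The backup set would then hold the original file plus one such record list per incremental run.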
ASKER
Excellent, Tommy. I promise to get back to you with a sample on the blocks.
Thank you in advance.
ASKER
Hi Tommy,
Here is my code below. I have to trim it a bit and then make it more advanced.
What do you think of the code, and of the principle of detecting block-level changes?
It works on every kind of file.
static void Main(string[] args)
{
    byte[] bytefile = File.ReadAllBytes("D:\\licens.txt");
    byte[] bytefile2 = File.ReadAllBytes("D:\\licens2.txt");

    if (bytefile.Length != bytefile2.Length)
    {
        // Sizes differ, so a byte-for-byte walk is not enough on its own.
        Console.WriteLine("File sizes differ..");
    }
    else
    {
        // Sizes match; look for differing bytes.
        bool changed = false;
        for (int i = 0; i < bytefile.Length; i++)
        {
            if (bytefile[i] != bytefile2[i])
            {
                Console.WriteLine("Block change detected at byte offset " + i);
                changed = true;
            }
        }
        if (!changed)
        {
            Console.WriteLine("No changes to the file..");
        }
    }
    Console.ReadLine();
}
ASKER CERTIFIED SOLUTION
ASKER
Thank you.
I will try it, even though my code worked in different scenarios. :)
ASKER
Perfect ;)
Here is a whole project dedicated to it
http://www.codeproject.com/KB/recipes/DiffAlgorithmCS.aspx
and another
http://www.codeproject.com/KB/recipes/diffengine.aspx