Incremental Backup

databoks (Denmark) asked:

How can I perform an incremental backup of files? Some say that I can use a CRC, or simply read the files in blocks and then compare them against the last full backup.

Which approach is easier? Is there a library for it? Please help :)
r3nder (United States):
Do you really want incremental backups? That can get quite complex. Personally, I would just replace version 1 with version 2, or compare the file sizes (there is some direction on that below), but if you really want to do it the other way, by comparing the content:
Here is a whole project dedicated to it
http://www.codeproject.com/KB/recipes/DiffAlgorithmCS.aspx
and another
http://www.codeproject.com/KB/recipes/diffengine.aspx

using System;
using System.IO;

class Program
{
    static void Main()
    {
        // The name of the file.
        const string fileName = "test.txt";

        // Create new FileInfo object and get the Length.
        FileInfo f = new FileInfo(fileName);
        long s1 = f.Length;

        // Change something in the file. Just for giggles.
        File.AppendAllText(fileName, " More characters.");

        // Create another FileInfo object and get the Length.
        FileInfo f2 = new FileInfo(fileName);
        long s2 = f2.Length;

        // Print out the length of the file before and after.
        Console.WriteLine("Before and after: " + s1.ToString() +
            " " + s2.ToString());

        // Get the difference between the two sizes.
        long change = s2 - s1;
        Console.WriteLine("Size increase: " + change.ToString());
    }
}

=== Program output ===

Before and after: 3 20
Size increase: 17


databoks (Asker):
It's not only text files; it has to work on every kind of file.
TommySzalapski:
In many cases, incremental backups are very useful. A lot of software that offers incremental backup, though, does a full update of every file that has changed and simply skips files that haven't changed since the last backup. I think some use timestamps on the files, but most use some kind of hash (CRC, MD5, etc.).

To do an incremental backup of an individual file, you could get as complex as you want. If you look into the research done on compression, you'll get tons of ideas.
I would just use a hash to see if the file changed at all, and then check blocks to see if the first portion of the file still matches; if not, I'd just update the whole thing. There is a minimal sketch of the hash check below.
But if you look into research on partial string matching (it works on binary data too, not just text strings), you will get some good ideas on how to find matching portions.

Basically, high-level incremental backups are almost identical to compression algorithms, so some research in that field would be of use.
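For illustration, a minimal sketch of the whole-file hash check in C#. MD5 (from System.Security.Cryptography) stands in for whichever checksum you pick, and the file names are just placeholders:

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

class HashCheck
{
    // Returns true if the two files hash to the same value.
    // Any checksum (CRC32, MD5, SHA-1, ...) is used the same way here.
    static bool SameHash(string pathA, string pathB)
    {
        using (var md5 = MD5.Create())
        using (var a = File.OpenRead(pathA))
        using (var b = File.OpenRead(pathB))
        {
            byte[] hashA = md5.ComputeHash(a);
            byte[] hashB = md5.ComputeHash(b);
            return hashA.SequenceEqual(hashB);
        }
    }

    static void Main()
    {
        // "backup.dat" and "current.dat" are hypothetical file names.
        Console.WriteLine(SameHash("backup.dat", "current.dat")
            ? "Unchanged since last backup" : "Changed; back it up");
    }
}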
databoks (Asker):

I have looked at CRC. I will do some research on compression.

I thought about reading the files in blocks and then comparing them, but size alone will not always help, since a change can leave the size the same. For example, a user replaces a line with another line of the same length but different characters.

I could use both a CRC and block-by-block reads.

Any recommendation?
TommySzalapski:

You could check the files for long common substrings to see where they are the same. There are many algorithms out there for this, and a cheap version is sketched below. This would also help in the case where the change is longer than the original.
What you need to decide is where you want to end up in the balance between processor time and data size. If you build a complicated incremental algorithm using partial string matching or long common substrings, you will have little data to transmit, but it will take a long time to compute.

If you are doing offsite backup over dialup on a >3 GHz computer to a small backup drive, then shrink the data down as much as possible.
If you are backing up to a local 1 TB drive, then maybe you want something simpler and faster: just update the whole file for any that changed.
If you are trying to create a backup program to market and sell, then do both and let the user set a parameter to choose which to use.
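As a cheap stand-in for a full substring search, here is a sketch that trims the common prefix and common suffix and treats whatever is left in the middle as the changed region. It assumes a single contiguous edit, and the method name is made up:

using System;

class ChangedRegion
{
    // Trims the common prefix and suffix of a and b.
    // On return, a[start..endA) was replaced by b[start..endB).
    static void FindChangedRegion(byte[] a, byte[] b,
                                  out int start, out int endA, out int endB)
    {
        // Walk forward over the common prefix.
        start = 0;
        int max = Math.Min(a.Length, b.Length);
        while (start < max && a[start] == b[start])
            start++;

        // Walk backward over the common suffix.
        endA = a.Length;
        endB = b.Length;
        while (endA > start && endB > start && a[endA - 1] == b[endB - 1])
        {
            endA--;
            endB--;
        }
    }

    static void Main()
    {
        byte[] a = { 10, 20, 30, 40 };
        byte[] b = { 10, 99, 99, 30, 40 }; // one byte replaced by two
        int start, endA, endB;
        FindChangedRegion(a, b, out start, out endA, out endB);
        Console.WriteLine("a[{0}..{1}) was replaced by b[{0}..{2})",
            start, endA, endB);
    }
}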
databoks (Asker):

Okay.

It has to be kind of professional, so I can't just replace the files. I have to create deltas.

Imagine a customer who has 1000 text files. If the backup is scheduled to run every day, then I have to open 1000 text files and compute the substrings against the day before; that will take too long.

Maybe I have to base it on blocks of bytes and a CRC at the same time.
TommySzalapski:

Yes, you could compute checksums on blocks if the whole-file checksums didn't match.
You could also loop through both files once and, when they become different, keep moving through the larger one until they start matching again. In this way you could catch all simple insertions OR deletions; a sketch of that scan is below. If a file had both, you would need the long, complex algorithms that you don't want, so in that case you would just replace the file.
That would be more effective than doing it in blocks and take about the same amount of time.
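A sketch of that single-pass scan for the insertion case (for a deletion, swap the arguments; the names and the single-edit assumption are mine):

using System;

class ResyncScan
{
    // Catches one contiguous insertion: on success, offset is where the
    // inserted run starts and insertedLength is how many bytes it added.
    static bool TryFindSingleInsert(byte[] small, byte[] large,
                                    out int offset, out int insertedLength)
    {
        offset = 0;
        insertedLength = large.Length - small.Length;
        if (insertedLength <= 0) return false;

        // Walk both files until the first mismatch.
        while (offset < small.Length && small[offset] == large[offset])
            offset++;

        // Skip the suspected insertion in the larger file;
        // everything after it must line up again.
        for (int j = offset; j < small.Length; j++)
            if (small[j] != large[j + insertedLength])
                return false; // more than one change: fall back to a full copy

        return true;
    }

    static void Main()
    {
        byte[] oldData = { 1, 2, 3, 4 };
        byte[] newData = { 1, 2, 9, 9, 3, 4 }; // two bytes inserted at offset 2
        int at, len;
        if (TryFindSingleInsert(oldData, newData, out at, out len))
            Console.WriteLine("Inserted {0} byte(s) at offset {1}", len, at);
    }
}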

databoks (Asker):

Thanks for the reply.

So I first use a CRC, and afterwards compare blocks to see where the checksum has changed.

Can you please help me with some code? Maybe for the CRC and for reading the files in blocks. Bear in mind that I will use Microsoft Volume Shadow Copy to take the backup of these files.

TommySzalapski:

A CRC is a type of checksum. You could use any kind of checksum you wanted, but since you have both files right there, why use a checksum? Why not just compare the blocks? Checksums are for comparing later, to verify that there was no transmission error. Why go through and compute the checksum of both files when you can just compare both files directly?
The code would just look something like:

open(file1)
open(file2)

for i from 0 to blocksize
  if (file1[i] != file2[i])
    block is different!
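In actual C#, that loop might look something like this. It is only a sketch: the file names and the 4 KB block size are assumptions, and a size difference shows up as differing (or missing) blocks at the tail:

using System;
using System.IO;

class BlockCompare
{
    const int BlockSize = 4096; // assumed block size

    static void Main()
    {
        // Hypothetical names for the last backup and the current file.
        using (var f1 = File.OpenRead("old.dat"))
        using (var f2 = File.OpenRead("new.dat"))
        {
            var b1 = new byte[BlockSize];
            var b2 = new byte[BlockSize];
            int block = 0;
            while (true)
            {
                int n1 = f1.Read(b1, 0, BlockSize);
                int n2 = f2.Read(b2, 0, BlockSize);
                if (n1 == 0 && n2 == 0) break; // both files exhausted

                // A block differs if the byte counts differ
                // or any byte within it differs.
                bool differs = n1 != n2;
                for (int i = 0; !differs && i < n1; i++)
                    differs = b1[i] != b2[i];

                if (differs)
                    Console.WriteLine("Block {0} is different", block);
                block++;
            }
        }
    }
}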
databoks (Asker):

You are right.

I will stick to the block comparison. However, I have to create deltas later on, so that is that time, and that headache :)

I will try some code and post it here.

Thanks, Tommy.
TommySzalapski:

You just need to come up with a format for your deltas. You could just have a bitmask that says which blocks changed, but again, if the file sizes can change, your deltas should list the inserts and deletes.
Something like insert(location, size, data) and delete(location, size). Then you would store the original file plus the inserts and deletes.
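One possible shape for those records in C# (the type and field names are made up, not any standard format):

using System.Collections.Generic;

// Which edit a record describes.
enum DeltaOp { Insert, Delete }

class DeltaRecord
{
    public DeltaOp Op;
    public long Location;  // byte offset in the base version
    public long Size;      // number of bytes inserted or deleted
    public byte[] Data;    // payload for inserts; null for deletes
}

// The delta for one file: the base version plus an ordered
// list of insert/delete records, applied in sequence.
class FileDelta
{
    public string BaseFile;  // path or ID of the full backup copy
    public List<DeltaRecord> Records = new List<DeltaRecord>();
}

class Example
{
    static void Main()
    {
        var delta = new FileDelta { BaseFile = "licens.txt" };
        delta.Records.Add(new DeltaRecord
        {
            Op = DeltaOp.Insert,
            Location = 100,
            Size = 3,
            Data = new byte[] { 0x41, 0x42, 0x43 }
        });
    }
}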
databoks (Asker):

Excellent, Tommy. I promise to get back to you with a sample on the blocks.

Thank you in advance.
databoks (Asker):

Hi Tommy,

this is my code below. I have to trim it a bit and then make it more advanced.

What do you think of the code, and of the principle of detecting the block changes?

This works on every kind of file.
using System;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        byte[] bytefile = File.ReadAllBytes("D:\\licens.txt");
        byte[] bytefile2 = File.ReadAllBytes("D:\\licens2.txt");

        // Check the lengths once, outside the loop.
        if (bytefile.Length != bytefile2.Length)
        {
            Console.WriteLine("File sizes differ: " + bytefile.Length +
                " vs " + bytefile2.Length);
        }
        else
        {
            // Same size; compare byte by byte to locate the changes.
            bool changed = false;
            for (int i = 0; i < bytefile.Length; i++)
            {
                if (bytefile[i] != bytefile2[i])
                {
                    Console.WriteLine("Change detected at byte " + i);
                    changed = true;
                }
            }

            if (!changed)
            {
                Console.WriteLine("No changes to the file..");
            }
        }

        Console.ReadLine();
    }
}


ASKER CERTIFIED SOLUTION
TommySzalapski (United States)
databoks (Asker):

Thank you.

I will try it, even though my code worked in different scenarios. :)
TommySzalapski:

Perfect ;)