Algorithm for file compare

Good afternoon,
Not sure where really to put this question, so hopefully it's OK here!

I've been asked to write some backup software that compares yesterday's copy of a file with today's and shows the differences at the binary level. I can do that, but it means storing yesterday's file and comparing it to today's, which results in a large database.

I've been looking at software such as Backup Exec and Super Flexible File Synchronizer, which seem to do binary comparisons, but their databases aren't nearly as big as mine.

Does anyone have any suggestions on how to reduce my historic file database for comparing?

I did think of using checksums: compare the first 500 KB of a file, and if the checksum isn't the same, back up that entire 500 KB again. It seems a strange way of doing it, though.
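For reference, the block-checksum idea can be sketched roughly as below. This is only an illustration, not how Backup Exec or any other product works: the function names, the 500 KB block size and the choice of SHA-256 are all assumptions. Only the list of digests from yesterday needs to be stored, not yesterday's file.

```python
import hashlib

BLOCK_SIZE = 500 * 1024  # assumed block size: the 500 KB mentioned in the question


def block_hashes(path, block_size=BLOCK_SIZE):
    """Return one SHA-256 hex digest per fixed-size block of the file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            digests.append(hashlib.sha256(block).hexdigest())
    return digests


def changed_blocks(old_digests, new_digests):
    """Return indexes of blocks in the new file that differ from the old
    digest list, or that lie beyond its end."""
    changed = []
    for i, digest in enumerate(new_digests):
        if i >= len(old_digests) or old_digests[i] != digest:
            changed.append(i)
    return changed
```

Only the blocks whose index comes back from changed_blocks would need to be backed up again. One known weakness of fixed-size blocks: inserting a single byte near the start of a file shifts every later block, so they all look changed.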

Any suggestions?

Thank you

Someone asked for a Unique Identifier (UID) for a file. Perhaps you may find something useful in this discussion.

Jose Parrot (Graphics Expert) commented:
If you need to compare yesterday's files (Y) against today's (T), you need both the Y and T files at the moment of comparison, so a full daily backup is mandatory. Alternatively, you need a hash of the full file (not only of a portion of it). Hashes aren't perfect, but the probability of two different files generating the same hash is negligible!
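A minimal sketch of a full-file hash, assuming SHA-256 from Python's standard hashlib module (the function name and chunk size are just illustrative choices). The file is read in chunks so that large files do not have to fit in memory:

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash the whole file, reading it one chunk at a time."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Storing this one digest per file per day would be enough to tell whether a file changed, though not what changed inside it.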

The first thing is to decide on a process, the right comparison scheme:
1. Cut-off time is 9:00 PM; no writing allowed; the full T backup starts at 9:01 PM
2. The Y x T comparison starts at 9:01 PM
3. The T backup ends at 11:xx PM; at 11:59 PM the file area is reopened for writing
4. The comparison finishes the next day, at any time
5. The comparator program generates a report
6. Modified Y files are backed up to the "Ymodified" area
7. The full T backup is renamed to Y

Some special procedures can be adopted, depending on the characteristics of the files or the environment:
1. All the monitored files are in directories A, B and C, so only those are backed up; or the pathnames of the monitored files are kept in a list (to minimize backups).
2. File accesses are monitored. If they are Office files, use the $ copy opened by Office as a trigger to back up such files to a candidate area; or set Office to record and store the modifications, so you only need to compare file sizes. Optionally, keep only the last version of such documents by erasing the previous versions.
3. The first check is file size, depending on the file type. The second is the last-modification date and time.
4. If backups are controlled by a program, say Control-M or Backup Exec, add your own program to the script.
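The size and modification-time checks in point 3 could look something like this. It is a sketch only: the function names are made up for illustration, and it relies on whatever the operating system reports through os.stat.

```python
import os


def quick_signature(path):
    """Return (size in bytes, last-modification time in nanoseconds)."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def probably_unchanged(path, stored_signature):
    """Cheap first check: same size and same modification time as stored."""
    return quick_signature(path) == stored_signature
```

This never reads the file contents, so it is fast, but a program that rewrites a file and resets its timestamp would slip past it.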

Probably, as you have a huge number of files, a full backup is impractical, so hashing would be the right choice, followed by the "last modification datetime" from the operating system, if that is enough for you.
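One possible way to combine the two, sketched under some assumptions: keep a small "manifest" per day (path, size, modification time, SHA-256) instead of yesterday's files, and re-hash a file only when its size or modification time differs from the previous manifest. The names build_manifest and diff_manifests and the dictionary layout are hypothetical, not taken from any backup product.

```python
import hashlib
import os


def sha256_of(path):
    """Full-file SHA-256, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root, previous=None):
    """Map relative path -> {"size", "mtime_ns", "sha256"} for files under root.

    If size and mtime match the entry in `previous`, the old hash is reused
    and the file is not read again.
    """
    previous = previous or {}
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            st = os.stat(full)
            old = previous.get(rel)
            if old and old["size"] == st.st_size and old["mtime_ns"] == st.st_mtime_ns:
                digest = old["sha256"]
            else:
                digest = sha256_of(full)
            manifest[rel] = {"size": st.st_size, "mtime_ns": st.st_mtime_ns, "sha256": digest}
    return manifest


def diff_manifests(yesterday, today):
    """Return (added, removed, changed) lists of relative paths."""
    added = sorted(set(today) - set(yesterday))
    removed = sorted(set(yesterday) - set(today))
    changed = sorted(p for p in set(today) & set(yesterday)
                     if today[p]["sha256"] != yesterday[p]["sha256"])
    return added, removed, changed
```

The manifest is a plain dictionary of strings and integers, so it could be saved with the json module between runs. Only the files listed as added or changed would then need to go into that day's backup.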

Are you running Windows 7 or Windows Server? If so, shadow copies (which you are sort of describing) are built in and just need to be configured. They will keep however many old versions you want and have all kinds of tricks built in to keep the sizes down.

Why write your own? There are many good ones out there that will be a lot better than anything you or I could write on short notice. Crashplan has good rolling backup technology for free for personal use and for low cost for business use. Acronis has a really nice (but more pricey) backup suite.