I need to synchronize various files from one machine to another. This is for a disaster recovery solution, so the assumption is that we have two machines talking over a network, and one machine crashes with a bad hard drive. The goal is to have something better than "last night's backup".
I have come up with a few scenarios which appear to work well up to a certain point.
1) A looping thread that monitors the status of files in a folder. With this, I can check the status of each file in the watched folder and fire an event when a file has been added, deleted, or modified, so that the event handlers can copy, remove, or overwrite the corresponding file on the remote side. I have all sorts of flexibility with this model: doing MD5 comparisons to see whether a file has really changed, falling back on the file size, handling a sync folder that sits on an FTP server with a different clock, and so on.
The biggest limitation of this model is files that are too large to be copied quickly, or files that are modified frequently. I am also concerned about exclusive access: some files, obviously, will never be synchronized if they are held open exclusively by another process.
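To make scenario 1 concrete, here is a minimal sketch of the polling loop I have in mind. The function and handler names (`snapshot`, `diff_states`, `on_added`, etc.) are my own placeholders, and the `OSError` catch is a crude stand-in for real exclusive-access handling:

```python
import hashlib
import os
import time

def md5_of(path, chunk_size=65536):
    """MD5 a file in chunks so large files don't exhaust memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot(folder):
    """Map each regular file in the folder to its current MD5."""
    state = {}
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            try:
                state[name] = md5_of(path)
            except OSError:
                pass  # held open exclusively; skip until the next pass
    return state

def diff_states(previous, current):
    """Return (added, deleted, modified) file names between two snapshots."""
    added = sorted(current.keys() - previous.keys())
    deleted = sorted(previous.keys() - current.keys())
    modified = sorted(n for n in current.keys() & previous.keys()
                      if current[n] != previous[n])
    return added, deleted, modified

def watch(folder, on_added, on_deleted, on_modified, interval=5.0):
    """Poll forever, firing the supplied handlers on each change."""
    previous = snapshot(folder)
    while True:
        time.sleep(interval)
        current = snapshot(folder)
        added, deleted, modified = diff_states(previous, current)
        for name in added:
            on_added(name)
        for name in deleted:
            on_deleted(name)
        for name in modified:
            on_modified(name)
        previous = current
```

Swapping the MD5 for an `(mtime, size)` pair in `snapshot` would make each pass much cheaper at the cost of occasionally missing an in-place edit that preserves both.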
2) Implementing an rsync-like solution. I got this idea from a Google Groups thread. Break the large file down into, say, 10 or 100 clumps, and send the MD5 of each clump to be compared against the MD5 of the corresponding clump of the remote file. If every clump matches except the 5th, then only the actual data for that 5th clump needs to be sent. This obviously saves a lot of network overhead. If clump #5 is modified and data is ADDED to the file, then all following clumps will mismatch (which can be detected by comparing the two file sizes), but even here there are strategies, such as comparing the MD5 of the last clump and working backward toward the middle, which still leaves much less of the file to synchronize.
The limitation here is that it only raises the file-size limit; I still face problems with files that are VERY large and accessed on a regular basis. My example scenario, and the one that typifies the problem (this is the crux of what I am trying to solve), is a database file. If you have a DB file hundreds of MB in size whose large tables receive UPDATEs on a regular basis, then I could easily get identical MD5s on the first 5 clumps and update the 6th clump over the network, all while data is being written to the DB file in one of the first 5 clumps. So now, instead of a synchronized file, I have a corrupted file which no longer matches at all. It gets corrected on the next pass, but if the system crashes before that, I have nothing useful.
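The clump comparison in scenario 2 can be sketched like this. For clarity this works on the whole file contents in memory and assumes the remote side has already sent its clump MD5s; a real version would stream from disk, and (as noted above) fixed offsets break down when bytes are inserted mid-file — the real rsync algorithm uses a rolling checksum to resynchronize after insertions. `clump_md5s` and `clumps_to_send` are hypothetical names:

```python
import hashlib

def clump_md5s(data, clump_size):
    """MD5 of each fixed-size clump of a file's contents."""
    return [hashlib.md5(data[i:i + clump_size]).hexdigest()
            for i in range(0, len(data), clump_size)]

def clumps_to_send(local_data, remote_md5s, clump_size):
    """Return (index, bytes) for each clump whose MD5 differs remotely.

    Clumps past the end of the remote list (the file grew) are always sent.
    """
    to_send = []
    for i, digest in enumerate(clump_md5s(local_data, clump_size)):
        if i >= len(remote_md5s) or remote_md5s[i] != digest:
            to_send.append((i, local_data[i * clump_size:(i + 1) * clump_size]))
    return to_send
```

The MD5 list is tiny compared to the file (32 hex chars per clump), so exchanging it each pass costs almost nothing next to shipping the clumps themselves.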
3) Which leads me to my third solution: monitoring hard-disk writes. If everything I have written above is reasonable (and I know there are elements I've glossed over and may find problematic as I continue my research), the best solution I have for really large, regularly updated files like SQL DB files would be to hook into if/when data is written to the hard drive, somehow reconstruct whether that data belongs to a file I am monitoring, and, after reassembling the fragments of that file, determine that I have to update Index1 through Index2 of that file with a provided block of data.
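As a rough sketch of the end state I'm after in scenario 3, here is an application-level approximation: a wrapper that records the (offset, length) of every write so a sync pass can ship only those byte ranges. This only captures writes that go through the wrapper itself — genuine block-level interception of another process's writes (e.g. a DB engine's) would need OS-level hooks outside Python. `WriteLoggingFile` and `merge_extents` are hypothetical names:

```python
class WriteLoggingFile:
    """Wrap a file opened for update and log each write's extent.

    self.dirty accumulates (offset, length) pairs; a sync pass can read
    just those ranges back and transmit them instead of the whole file.
    """

    def __init__(self, path):
        self._f = open(path, "r+b")
        self.dirty = []  # list of (offset, length) write extents

    def write_at(self, offset, data):
        self._f.seek(offset)
        self._f.write(data)
        self.dirty.append((offset, len(data)))

    def close(self):
        self._f.close()

def merge_extents(extents):
    """Coalesce overlapping or adjacent (offset, length) ranges so the
    sync pass transfers each dirty region of the file exactly once."""
    merged = []
    for off, length in sorted(extents):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            last_off, last_len = merged[-1]
            end = max(last_off + last_len, off + length)
            merged[-1] = (last_off, end - last_off)
        else:
            merged.append((off, length))
    return merged
```

The hard part, of course, is getting this extent log for writes I don't control, which is exactly question (b) below.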
Knowing full well that this may require unique libraries or custom code for Python, and assuming this is the only route I have: a) is it even feasible to do some or all of this in Python, b) are there already libraries in existence that provide some of this functionality, and c) what is the best direction to take from here?