
Performing an rsync before rdiff-backup; caveats?

junipllc asked
Last Modified: 2013-12-01
Hi all,

I'm in the middle of implementing a (very) custom offsite backup solution for two of my clients. I am currently backing up their servers to my local storage servers here at my location. It's part of a bigger strategy.

However, there is quite a bit of data (more than 1TB) involved and pushing all of that over the 'Net just wasn't going to cut it. So I physically drove to both locations and rsynced their machines to a large portable eSATA.

Now, all of their data has been placed on the storage servers where the backups are going to end up. I've rsynced their servers with the local storage servers as well just to ensure I have the most up to date copies of all of their files. For all intents and purposes I have mirror copies of their servers on mine.

Here's my question (finally): The backup solution is going to use rdiff-backup, and I'm wondering how it will behave when I invoke it.

--> If the copies of the files on both ends are identical, will rdiff-backup have to push all the data again, or will it just do a quicker comparison like rsync's algorithm does? I know the initial rdiff-backup backup has to complete in order to get the incrementals rolling. It's the initial backup I'm concerned about.

I know this is the rsync zone, but I couldn't find one about rdiff-backup (or much of anything other than commercial software...). I tried a few tests but they came up inconclusive since I can't actually "see" what rdiff-backup is doing in the background.

Thanks everybody!

Mike


CERTIFIED EXPERT

Commented:
First, 1TB is not considered a huge amount of data these days. It could easily be sent over the network with compression.

Second, for the first transfer rsync behaves just like a plain copy over ssh, but in your situation now, this is the beauty of rsync: it will examine the diffs and send only the changes in the data.
CERTIFIED EXPERT

Commented:
From the man page of rsync
"It is famous for its delta-transfer algorithm, which reduces the amount of data sent over the network by sending only the differences between the source files and the existing files in the destination."

Command:


rsync -avz sourcedir  user@server:/dest/path


CERTIFIED EXPERT

Author

Commented:
Thank you, farzanj, but that doesn't answer my question. :P

I know how to use rsync and all the benefits of it. It's a wonderful tool, I use it daily. Also, 1TB is most definitely a lot of data in this instance since my clients want the backups to start immediately, and they do not have fast connections where they are. 1TB would take more than 2.5 months to complete over their connection, even with compression. I think they'd fire me instead of waiting 2.5 months.
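
As a rough check of that estimate, the sustained link speed it implies can be worked out directly. This is only a back-of-the-envelope sketch: it assumes 1TB means 10**12 bytes and takes 2.5 months as 75 days, and the function name is made up for illustration. The actual connection speed is not stated in the thread.

```python
def implied_link_mbps(total_bytes: float, days: float) -> float:
    """Average speed in Mbit/s needed to move total_bytes in the given number of days."""
    seconds = days * 24 * 60 * 60
    return total_bytes * 8 / seconds / 1_000_000


# Assumptions: 1 TB = 10**12 bytes, 2.5 months taken as 75 days.
print(f"{implied_link_mbps(10**12, 75):.2f} Mbit/s")
```

Under those assumptions the figure works out to roughly 1.2 Mbit/s of sustained upstream throughput, which is why seeding the destination from a portable drive is attractive here.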

The thing I'm wondering about is what happens after an rsync (when both sides are identical) when I run an rdiff-backup from source --> destination. Will rdiff-backup have to retransmit all of the data from source to destination, or will it simply use both ends to do a comparison (rdiff-backup uses librsync), thus avoiding all of the hassle of re-sending all the data?

Thanks again!

Mike
CERTIFIED EXPERT
Commented:
I don't understand your question.
Why don't you just try to run rsync and see whether it needs 2.5 months or much less?

If you use rsync -av you get enough verbose output to see what it does.

Or use the -n option (dry run) to see what it would do.

- Did you use rsync to copy the data onto your portable hard disk?
- Did your hard disk have the same file system as the hosts to be backed up (or at least a file system which stores the modification times correctly)?

Normally (if I remember well) rsync compares the time stamps and the sizes to decide whether a file has to be transferred or not. So if you had, for example, a bad file system on your hard disk, then rsync might consider the files to be different and retransmit them.
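
The size-and-timestamp comparison described above can be illustrated with a small sketch. This is not rsync's own code, only an illustration of the idea; the function name is hypothetical, and the `modify_window` parameter is loosely modeled on rsync's --modify-window option to tolerate file systems with coarse timestamps.

```python
import os


def quick_check_differs(src: str, dst: str, modify_window: float = 0.0) -> bool:
    """Return True if dst is missing or differs from src in size or mtime."""
    if not os.path.exists(dst):
        return True
    s, d = os.stat(src), os.stat(dst)
    # A size mismatch always means the file needs attention.
    if s.st_size != d.st_size:
        return True
    # Timestamps count as equal when they are within modify_window seconds.
    return abs(s.st_mtime - d.st_mtime) > modify_window
```

If the portable drive's file system had rounded or dropped modification times, a check like this would flag every file as changed even though the contents are identical.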

rsync can use another algorithm (it calculates checksums for each file on the source and destination and transfers only if the checksums are different).

To do this you would have to use

rsync -cav src dst
instead of
rsync -av src dst

However, for each backup run rsync would then have to calculate the md5sums for every file (which is perhaps a little too much work).
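
To show why the checksum approach costs more, here is a minimal sketch of a content comparison that ignores time stamps. It is an illustration only: the function names are hypothetical, MD5 is used because the comment above mentions md5sums, and which checksum rsync really uses depends on its version.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def content_differs(src: str, dst: str) -> bool:
    """Return True when the two files have different contents."""
    return file_md5(src) != file_md5(dst)
```

Every byte of every file has to be read on both sides, so with more than 1TB of data this is bound by disk and CPU rather than by the network.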

If the time stamps were 'lost' while copying to disk, you might consider writing a small program that gathers the modification times of the files on the source host and applies them on the backup host.
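
Such a program could look roughly like the following sketch. The function names are hypothetical, and it assumes both trees use the same relative paths and that the mapping is moved between the hosts somehow (for example as a JSON file).

```python
import os
from typing import Dict


def collect_mtimes(root: str) -> Dict[str, float]:
    """Map each file's path relative to root to its modification time."""
    mtimes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            mtimes[os.path.relpath(full, root)] = os.stat(full).st_mtime
    return mtimes


def apply_mtimes(root: str, mtimes: Dict[str, float]) -> int:
    """Set mtimes on files that exist under root; return how many were updated."""
    updated = 0
    for rel, mtime in mtimes.items():
        full = os.path.join(root, rel)
        if os.path.isfile(full):
            # Keep the access time as it is and change only the mtime.
            os.utime(full, (os.stat(full).st_atime, mtime))
            updated += 1
    return updated
```

Run `collect_mtimes` on the source host and `apply_mtimes` on the backup host. Files that exist only on one side are simply skipped.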

If you are talking about the differential storage on your hard disk, and your concern is the --copy-dest option, then just read the man page.

It says that a local copy will be performed instead of a remote copy if the file exists in the dest dir.





CERTIFIED EXPERT
Commented:
Hi Mike,

In theory, rdiff-backup appears to follow a similar philosophy to rsync. It also works with incremental backups.
http://www.gnu.org/savannah-checkouts/non-gnu/rdiff-backup/

In theory it should NOT copy the whole thing over although I have never used it.  I stayed with rsync, tar, etc.
CERTIFIED EXPERT

Author

Commented:
Thank you to both of you.

@gelonida (and for future googlers who might need it): I'm basically trying to do this, where source is the client's server and destination is the ultimate destination of the backup.

source ---(rsync)---> portable drive
 (to speed up the process of the initial "seed", which over the WAN would take an unacceptable amount of time)

drive from client to the location of my local network
 (a slightly faster version of "SneakerNet", depending on highway traffic) ;)

portable drive ---(rsync)---> destination
 (to move the "seed" files to the final location -- on my local LAN for speed)

source ---(rsync)---> destination
 (over the WAN, to update any files that may have changed between the first step and this one -- they are constantly changing files on their server; this would effectively end with an exact mirror of the source in the destination. this step only takes a few minutes to transfer just the changed files, which means rsync is working properly by not re-transmitting the data)

Now comes the part where the issue is. If I:

source ---(rdiff-backup)---> destination

will it have to re-sync the files over the WAN, which basically nullifies the steps I've done to "seed" the destination?

@farzanj, I agree in theory this should work, as rsync and rdiff-backup both use the same algorithms, but when I run the rdiff-backup it does, indeed, run through all of the files (verbosity set at 5).

However, it's not clear if the files are actually being transferred across the WAN, or if it's just doing an rsync-style "comparison" on both ends (not network bound, but more CPU intensive) and updating the rdiff-backup database. There are pauses for larger files, and I don't see much going across the firewall, but due to my environment here the tests are not in any way accurate.

@gelonida: sadly, no, the filesystems are different, but not different enough to cause problems. The timestamps seem to be correct. The first "SneakerNet" steps are done on UFS volumes, and the local rsync to destination is between UFS--->ext3. So the order is UFS (source) --> UFS (portable) --> ext3 (destination). The fact that the source ---(rsync)---> destination step works properly would lead me to believe that all is good with that.

And the golden question you both might be asking: Why don't I just use rsync (only)?  This was ruled out due to space limitations at the destination, and the requirement to keep all past versions of all files. With rsync this would have been cost- and space-prohibitive. rdiff-backup only stores the diffs of the files, and not hard links/multiple copies like rsync. However, please correct me if I'm wrong, and rsync would do the job. I'd prefer to use it if it will accomplish the same goal.

My question was basically a feeler to see if anybody else has tried this, or knows both packages (rsync and rdiff-backup) in practice. The documentation I've been able to find has not touched on this subject.

Sorry for the long post here, I'm just a wordy dude.

I'll do some more testing. Thanks again to you both!

Mike

CERTIFIED EXPERT

Author

Commented:
Hey all,

It turns out that rdiff-backup does not have to re-push the data if it is the same. I did some detailed network analysis on the machines in question over a 12 hour period and the only real traffic going over the network was the coordinating data -- the "hey, server2, is this different from your copy?" "no server1, looks the same" "ok, server2, i'll move on to the next file" stuff.

Thanks for your help!

Mike
CERTIFIED EXPERT

Author

Commented:
Your answers helped me solve the problem, thank you.