junipllc asked:

Performing an rsync before rdiff-backup; caveats?

Hi all,

I'm in the middle of implementing a (very) custom offsite backup solution for two of my clients. I am currently backing up their servers to my local storage servers here at my location. It's part of a bigger strategy.

However, there is quite a bit of data (more than 1TB) involved and pushing all of that over the 'Net just wasn't going to cut it. So I physically drove to both locations and rsynced their machines to a large portable eSATA drive.

Now, all of their data has been placed on the storage servers where the backups are going to end up. I've rsynced their servers with the local storage servers as well just to ensure I have the most up to date copies of all of their files. For all intents and purposes I have mirror copies of their servers on mine.

Here's my question (finally): The backup solution is going to use rdiff-backup, and I'm wondering how it will behave when I invoke it.

--> If the copies of the files on both ends are identical, will rdiff-backup have to push all the data again, or will it just do a quicker comparison like rsync's algorithm does? I know the initial rdiff-backup backup has to complete in order to get the incrementals rolling. It's the initial backup I'm concerned about.

I know this is the rsync zone, but I couldn't find one about rdiff-backup (or much of anything other than commercial software...). I tried a few tests but they came up inconclusive since I can't actually "see" what rdiff-backup is doing in the background.

Thanks everybody!

Mike

farzanj

First, 1TB is not considered a huge amount of data these days. It could easily be sent over the network with compression.

Second, for a first-time transfer rsync behaves much like a plain copy over ssh, but in your current situation this is the beauty of rsync: it will examine the differences and send only the changes in the data.
From the man page of rsync:
"It is famous for its delta-transfer algorithm, which reduces the amount of data sent over the network by sending only the differences between the source files and the existing files in the destination."

Command:


rsync -avz sourcedir  user@server:/dest/path


junipllc (ASKER)

Thank you, farzanj, but that doesn't answer my question. :P

I know how to use rsync and all the benefits of it. It's a wonderful tool, I use it daily. Also, 1TB is most definitely a lot of data in this instance since my clients want the backups to start immediately, and they do not have fast connections where they are. 1TB would take more than 2.5 months to complete over their connection, even with compression. I think they'd fire me instead of waiting 2.5 months.
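
For a sense of scale, the sustained rate that moves 1TB in 2.5 months can be worked out quickly. This is only a rough sketch under assumptions that are not from the thread (1TB taken as 10^12 bytes, a month as 30 days, and `required_rate` is just an illustrative helper name); it comes out to roughly 1.2 Mbit/s of effective throughput.

```shell
#!/bin/sh
# Back-of-the-envelope: what sustained rate moves 1 TB in 2.5 months?
# Assumptions (not from the thread): 1 TB = 10^12 bytes, 1 month = 30 days.
required_rate() {
    # $1 = bytes to transfer, $2 = days available
    awk -v bytes="$1" -v days="$2" 'BEGIN {
        secs = days * 86400
        printf "%.0f kB/s (%.2f Mbit/s)\n", bytes / secs / 1000, bytes * 8 / secs / 1e6
    }'
}

# 2.5 months at 30 days each = 75 days
required_rate 1000000000000 75
```

Plugging in a different byte count or number of days gives the rate a link would need to sustain around the clock to finish in that window.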

What I'm wondering about is what happens after an rsync (when both sides are identical) when I run rdiff-backup from source --> destination. Will rdiff-backup have to retransmit all of the data from source to destination, or will it simply use both ends to do a comparison (rdiff-backup uses librsync), thus avoiding the hassle of re-sending all the data?

Thanks again!

Mike
ASKER CERTIFIED SOLUTION
gelonida

This solution is only available to members of Experts Exchange.
SOLUTION
This solution is only available to members of Experts Exchange.

Thank you to both of you.

@gelonida (and for future googlers who might need it): I'm basically trying to do this, where source is the client's server and destination is the ultimate destination of the backup.

source ---(rsync)---> portable drive
 (to speed up the process of the initial "seed", which over the WAN would take an unacceptable amount of time)

drive from client to the location of my local network
 (a slightly faster version of "SneakerNet", depending on highway traffic) ;)

portable drive ---(rsync)---> destination
 (to move the "seed" files to the final location -- on my local LAN for speed)

source ---(rsync)---> destination
 (over the WAN, to update any files that may have changed between the first step and this one -- they are constantly changing files on their server. This effectively ends with an exact mirror of the source in the destination. This step takes only a few minutes because just the changed files are transferred, which means rsync is working properly by not re-transmitting the data)

Now comes the part where the issue is. If I:

source ---(rdiff-backup)---> destination

will it have to re-sync the files over the WAN, which basically nullifies the steps I've done to "seed" the destination?
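
The seeding steps above can be sketched as a script. This is a minimal sketch, not the exact commands used: every host name and path is a hypothetical placeholder, the `--delete` flag on the WAN rsync is an assumption (added so the destination ends up as an exact mirror), and in practice step 1 runs at the client site while the rest run on the storage network. Nothing is executed unless `RUN=1` is set; by default the commands are only printed.

```shell
#!/bin/sh
# Sketch of the seeding workflow. All hosts and paths are hypothetical
# placeholders; commands are only printed unless RUN=1 is set.
SRC_HOST="client.example.com"     # hypothetical client server
SRC_PATH="/srv/data/"             # trailing slash: copy the contents, not the dir itself
PORTABLE="/mnt/portable/seed/"    # hypothetical mount point of the portable drive
DEST_PATH="/backup/client/"       # hypothetical final location on the storage server

run() {
    # Print the command; execute it only when RUN=1.
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

seed_workflow() {
    # 1. At the client site: seed the portable drive from the source.
    run rsync -a "$SRC_PATH" "$PORTABLE"
    # 2. Back on the local network: copy the seed to its final location.
    run rsync -a "$PORTABLE" "$DEST_PATH"
    # 3. Over the WAN: bring the seed up to date with the live source.
    #    (--delete is an assumption, to end with an exact mirror.)
    run rsync -az --delete "$SRC_HOST:$SRC_PATH" "$DEST_PATH"
    # 4. First rdiff-backup run against the already-populated destination,
    #    at verbosity 5 as in the tests described in this thread.
    run rdiff-backup -v5 "$SRC_HOST::$SRC_PATH" "$DEST_PATH"
}

seed_workflow
```

Depending on the rdiff-backup version, the first run against a non-empty directory that has no rdiff-backup-data subdirectory may be refused unless `--force` is given; it is worth checking the man page before relying on this.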

@farzanj, I agree in theory this should work, as rsync and rdiff-backup both use the same algorithms, but when I run the rdiff-backup it does, indeed, run through all of the files (verbosity set at 5).

However, it's not clear if the files are actually being transferred across the WAN, or if it's just doing an rsync-style "comparison" on both ends (not network bound, but more CPU intensive) and updating the rdiff-backup database. There are pauses for larger files, and I don't see much going across the firewall, but due to my environment here the tests are not in any way accurate.

@gelonida: sadly, no, the filesystems are different, but not different enough to cause problems. The timestamps seem to be correct. The first "SneakerNet" steps are done on UFS volumes, and the local rsync to destination is between UFS--->ext3. So the order is UFS (source) --> UFS (portable) --> ext3 (destination). The fact that the source ---(rsync)---> destination step works properly would lead me to believe that all is good with that.

And the golden question you both might be asking: Why don't I just use rsync (only)?  This was ruled out due to space limitations at the destination, and the requirement to keep all past versions of all files. With rsync this would have been cost- and space-prohibitive. rdiff-backup only stores the diffs of the files, and not hard links/multiple copies like rsync. However, please correct me if I'm wrong, and rsync would do the job. I'd prefer to use it if it will accomplish the same goal.
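
To make the space argument concrete, here is a rough estimate with entirely made-up numbers (a 1000 MB file, about 10 MB of delta per daily change, 30 versions kept; `estimate` is an illustrative helper, not part of either tool). It compares hard-link snapshots, as with rsync's `--link-dest`, where every changed file is stored again in full, against one current copy plus a delta per older version, which is the rdiff-backup approach. Real delta sizes depend on how the file changes.

```shell
#!/bin/sh
# Illustrative storage estimate only; every number here is a made-up assumption.
# Keeping N versions of one large file that changes a little each day:
#   - hard-link snapshots (e.g. rsync --link-dest): a changed file is stored whole
#   - reverse diffs (rdiff-backup): one current copy plus one delta per older version
estimate() {
    # $1 = file size in MB, $2 = approx. delta per change in MB, $3 = versions kept
    awk -v size="$1" -v delta="$2" -v n="$3" 'BEGIN {
        printf "snapshots: %d MB, diffs: %d MB\n", size * n, size + delta * (n - 1)
    }'
}

estimate 1000 10 30
```

Files that never change cost almost nothing extra under either scheme, so the gap only matters for large files that are modified often.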

My question was basically a feeler to see if anybody else has tried this, or knows both packages (rsync and rdiff-backup) in practice. The documentation I've been able to find has not touched on this subject.

Sorry for the long post here, I'm just a wordy dude.

I'll do some more testing. Thanks again to you both!

Mike

Hey all,

It turns out that rdiff-backup does not have to re-push the data if it is the same. I did some detailed network analysis on the machines in question over a 12 hour period and the only real traffic going over the network was the coordinating data -- the "hey, server2, is this different from your copy?" "no server1, looks the same" "ok, server2, i'll move on to the next file" stuff.
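
For future googlers who want to repeat this kind of check, a rough sketch follows. It is not necessarily how the measurement above was done. It assumes a Linux host that exposes byte counters under /sys/class/net, and the interface name `eth0` and the helper names are hypothetical. It counts all traffic on the interface, so it is only meaningful on an otherwise quiet link.

```shell
#!/bin/sh
# Rough way to see how much data a backup run moves over the network.
# Assumptions: Linux with /sys/class/net/<iface>/statistics, and a
# hypothetical interface name. Unrelated traffic is counted too.
IFACE="${IFACE:-eth0}"

read_counter() {
    # $1 = rx_bytes or tx_bytes; prints 0 if the counter is not available.
    cat "/sys/class/net/$IFACE/statistics/$1" 2>/dev/null || echo 0
}

to_mib() {
    # Convert a byte count to MiB with one decimal place.
    awk -v b="$1" 'BEGIN { printf "%.1f MiB\n", b / 1048576 }'
}

measure() {
    # Run the given command and report bytes received/sent while it ran.
    rx0=$(read_counter rx_bytes); tx0=$(read_counter tx_bytes)
    "$@"
    rx1=$(read_counter rx_bytes); tx1=$(read_counter tx_bytes)
    echo "received: $(to_mib $((rx1 - rx0)))  sent: $(to_mib $((tx1 - tx0)))"
}
```

Usage would look like `measure rdiff-backup -v5 client.example.com::/srv/data/ /backup/client/` (placeholder host and paths). If the totals are a small fraction of the size of the tree being backed up, the run was exchanging comparison data rather than the file contents.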

Thanks for your help!

Mike
Your answers helped me solve the problem, thank you.