Syncing large amounts of data between servers for backup/archive.

Okay, I have something that I can't seem to find a good answer for.

I have a main storage server with too much data on it to back up to tape.  I can back up a lot of it, but not all of it.  The main storage server is running Server 2003 R2.

Three folders on the same volume of the main storage server hold about 1.5TB of data, made up of about 35 million TIFF images.

I took some extra hardware that I have, and built a second server that I am calling the Vault.  It has 3.6TB of disk on it and I will be growing it even larger when we order our new SAN.   The Vault server is running Server 2008.

So, here is what I am trying to do.  Since I can't get the data on tape, my idea is to at least have a second copy of it, just in case something happens to the first one.

The servers are connected via Gigabit Ethernet, and are located in different parts of the building.

I have copied all of the data to the new server, so I have a starting point where I don't have to do the initial copy during my sync.

I have tried:

Robocopy - It takes about 50+ hours to complete.
SyncToy - Never finishes and I give up.
DFS - Never seems to complete the initial sync.

Another server that I want to back up is also full of TIFFs, but is not currently on the domain.  For that one I am using Robocopy and it takes 2 hours to sync 725GB of encrypted TIFFs.  That one is going to keep growing and will hit about 2TB by the end of the year.

What can I do?  Is there a way for me to do this?   Or do I look at purchasing a SAN that can do this for me?
MarkJenksAsked:
 
gurutcCommented:
Hi,

I'd put Linux on that second server instead of Windows.  Then I'd use multiple instances of RSYNC to synchronize the data.  RSYNC only sends changed blocks of files, not the whole files, and is lean and mean which speeds things up in many instances.

On the source side, I'd install Deltacopy first.  This installs the Cygwin.dll and the RSYNC modules for use on a Windows server as well as a nice GUI to set things up.  If things work with the GUI then I'd switch to manual configuration of the RSYNC client operations from this server.

http://www.aboutmyip.com/AboutMyXApp/DeltaCopy.jsp

On the backup server side, I'd configure the Linux environment to optimize file system performance.  I'd do what Google is doing for this and use the Ext4 file system which is very fast and handles huge amounts of huge files gracefully.  You can configure multiple instances of the RSYNC server daemon listening on different ports to allow best use of this backup server's cpu and network i/o resources.  You'll want to tweak your network settings including RWIN values through testing to get the best performance.

To allow you to use multiple client-side threads for backup you configure include/exclude options for the RSYNC client side.  For example, we'll assume your files start with A* through Z*.  RSYNC script one will include A*-E* files and send to RSYNC on the backup server listening on the standard RSYNC  port 873.  Script two will send F*-J* to RSYNC on the backup server listening on port 874 and so on.
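To make the include/exclude split above concrete, here is a minimal sketch that only builds the rsync command lines, one per letter range and port. The source path, host name and module name are hypothetical placeholders, and it assumes one rsync daemon per port is already listening on the backup server. File names starting with digits or other characters are not covered by these ranges and would need their own job.

```python
import string


def letter_ranges(chunk_size=5):
    """Split A..Z into consecutive ranges such as ('A', 'E'), ('F', 'J'), ..."""
    letters = string.ascii_uppercase
    return [
        (letters[i], letters[min(i + chunk_size, len(letters)) - 1])
        for i in range(0, len(letters), chunk_size)
    ]


def build_rsync_jobs(source, host, module, base_port=873, chunk_size=5):
    """Return one rsync argument list per letter range, each on its own port."""
    jobs = []
    for offset, (first, last) in enumerate(letter_ranges(chunk_size)):
        # Match both upper- and lower-case first letters, e.g. [A-Ea-e]*
        pattern = f"[{first}-{last}{first.lower()}-{last.lower()}]*"
        jobs.append([
            "rsync", "-a",
            "--include=*/",          # keep descending into subfolders
            f"--include={pattern}",  # files whose names start in this range
            "--exclude=*",           # everything else stays out of this job
            f"--port={base_port + offset}",
            source,
            f"{host}::{module}/",    # rsync daemon destination (hypothetical)
        ])
    return jobs


if __name__ == "__main__":
    # Hypothetical source path, host and module, for illustration only.
    for job in build_rsync_jobs("/cygdrive/d/tiffs/", "vault", "backup"):
        print(" ".join(job))
```

Each job still walks the whole folder tree because of the `*/` include, so this splits the file work across jobs, not the directory scan.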

References for RSYNC are at http://rsync.samba.org

But for starters, you can test to see if this is something you want to pursue just by using your current two Windows servers.  Put Deltacopy on each configuring source as client and backup server as server.  It may work well enough out of the box and using the GUI.  

I hope this helps you.

Good Luck,
- gurutc
 
TripyreCommented:
Try Allwaysync. I use it personally and find it works great.

https://allwaysync.com/business.html

I have a big number of files to sync. Does the synchronizer have any limitations on the number of synced files?
The sync algorithm has no limitations on the number of the synced files but the synchronizer depends on windows restrictions on the amount of the operative memory that can be used by application at once.

64-bit Windows editions can use more memory than 32-bit editions. So if you have a lot of the files that should be synced at once, there is the reason to use 64-bit edition of the synchronizer

 
TripyreCommented:
You may want to break up the files into different folders and run jobs at different times against the folders.  You may be running up against a memory constraint.
 
MarkJenksAuthor Commented:
I think the problem I am having with the tools that do a compare and only copy the changes is that it takes so long to work out what has changed in the structure.

Remember, we are talking about 35 Million Tiffs that do not compress.
 
MarkJenksAuthor Commented:
I am very familiar with rsync.  (I remember before there was a mouse next to the keyboard, lol)

But I'm not sure that is even going to work.   Unless I have separate jobs doing folders by ASCII chars.   1* 2* 3* 4* a* b* c* etc.
 
gurutcCommented:
Hi,

The strength of RSYNC is that you can do the separate jobs, even if the config is tedious.  Also, it doesn't have to read whole files to compare source/dest but does a pretty good job of finding changed blocks instead.  Client side and Server side each do a local checksum on file blocks and only compare that to decide what to transfer.  That's one reason I like Linux on the Server side since it's not as hamstrung as a Windows file system.
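As a rough illustration of the block-checksum idea, here is a simplified sketch that hashes fixed-size blocks of two versions of a file and reports which blocks differ. This is not rsync's actual algorithm (rsync uses a rolling checksum so it can find matching blocks at any offset); the block size and function names are just assumptions for the example.

```python
import hashlib


def block_digests(data, block_size=4096):
    """Return one SHA-256 digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old, new, block_size=4096):
    """Return indexes of blocks in `new` that differ from, or are missing in, `old`."""
    old_sums = block_digests(old, block_size)
    new_sums = block_digests(new, block_size)
    return [
        i for i, digest in enumerate(new_sums)
        if i >= len(old_sums) or digest != old_sums[i]
    ]


if __name__ == "__main__":
    old = b"a" * 8192 + b"b" * 4096
    new = b"a" * 4096 + b"X" * 4096 + b"b" * 4096
    # Only the middle block was modified, so only that block would be sent.
    print(changed_blocks(old, new))
```

Only the digests need to cross the wire for the comparison, which is why each side can do its share of the work locally.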

I haven't transferred 35 million files in one folder yet with RSYNC, but I have done 50 million files from a file system having 1 million nested folders and it works pretty well.  I know the selective include does in fact speed things up.  Non-matching files stay out of the mix.

You can also gain some goody by getting out of the Deltacopy interface and running separate and distinct CMD threads for each iteration on the Client side.
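One way to run the separate jobs side by side without the GUI is a small launcher script. This is only a sketch: it starts each command as its own process and waits for all of them, and the demo uses stand-in commands where the real rsync command lines would go.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_jobs(commands, max_workers=4):
    """Run each command as its own process; return exit codes in input order."""
    def run_one(cmd):
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Stand-in commands; in practice each entry would be one rsync command line.
    demo = [[sys.executable, "-c", "pass"] for _ in range(3)]
    print(run_jobs(demo))
```

`max_workers` caps how many jobs run at once, which matters if too many concurrent scans start to hurt disk performance.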

I really hope you try it because I'm very curious.

Regards,
- gurutc
 
robtkCommented:
Have you tried RichCopy? It's a fast, multi-threaded file copy utility and free.
http://technet.microsoft.com/en-us/magazine/2009.04.utilityspotlight.aspx
 
 
MarkJenksAuthor Commented:
I'm giving richcopy a try on a smaller subset of data and will compare that speed to robocopy.
 
gurutcCommented:
One thing I see is both Robocopy and Richcopy put all the work on one side of the transfer.  There's no client and server side to these that splits up the compare work.

Good Luck,

- gurutc
 
MarkJenksAuthor Commented:
I'm not against rsync at all, but I'd rather have a windows app so other people understand what is going on.
Not enough linux/unix people in the world for my liking.  lol

So far, since Richcopy can multi-thread, it seems to be working pretty well.

I'm running it with 1mil files and about 16GB right now and trying to get a feel for it.
 
MarkJenksAuthor Commented:
Okay, Richcopy did a mirror copy in 35mins, where robocopy took 1 hour.

Full copy without mirror(purge) took 10mins.

It took quite a while before the copy portion took off due to the matching, so I'm guessing 25mins for the match and 10mins for the copy.  That was for 875000 files.

So, 35mil/875k = 40, and 25min x 40 = 1000min, or about 16.67 hours for the match, and then the copy time.  Unless I try to add more threads to it.  I have Richcopy set to 30/30/30.
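The same back-of-the-envelope estimate as a small sketch. It assumes the match time scales linearly with the number of files, which may not hold at 35 million; the function name is made up for the example.

```python
def estimate_match_hours(total_files, sample_files, sample_match_minutes):
    """Extrapolate match time linearly from a sample run to the full file set."""
    scale = total_files / sample_files        # 35,000,000 / 875,000 = 40
    return scale * sample_match_minutes / 60  # minutes to hours


if __name__ == "__main__":
    hours = estimate_match_hours(35_000_000, 875_000, 25)
    print(f"{hours:.2f} hours")  # 16.67 hours
```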

Looks possible, but still going to be too slow.   Only about 1% of the data changes in a week.

I'm going to try and add more threads to it and see how much it changes.

Might be looking at rsync yet.

-Mark

 
MarkJenksAuthor Commented:
Well, 40/40/40 threads got it down to 32mins complete.

Still way better than robocopy (which is the best answer I had so far)

I was really hoping dfs was going to work for me since it's so simple to setup, but even that doesn't like what I'm doing..

Off to try Deltacopy now..

-Mark

 
MarkJenksAuthor Commented:
Okay, deltacopy.   4mins..

Off to try the big stuff now!
 
gurutcCommented:
yeah! uh huh!  and so on!

- gurutc
 
MarkJenksAuthor Commented:
Okay, here goes the final answer..    

DeltaCopy is the answer, but you have to be careful about the number of files, not the size of them.

I had to split it into 2 chunks, not by file size but by number of files.

If you push too many files at it, it does really well up to a certain point and then performance just dies.

So, all said and done, running 2 jobs one after the other, it takes about 6.5 hours to pull this off.  The next thing to try is both at the same time, and I'll give that a try this week.
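For anyone wanting to do the same split, here is a rough sketch of balancing folders across jobs by file count rather than by size. The folder names and counts in the demo are hypothetical, and the greedy assignment is just one simple way to do it, not what DeltaCopy itself does.

```python
import os


def count_files(root):
    """Count the files under root, including all subfolders."""
    return sum(len(files) for _, _, files in os.walk(root))


def split_by_file_count(folder_counts, jobs=2):
    """Greedily assign folders to jobs so each job gets a similar number of files."""
    buckets = [{"folders": [], "files": 0} for _ in range(jobs)]
    # Place the biggest folders first, always into the currently lightest job.
    for folder, count in sorted(folder_counts.items(),
                                key=lambda kv: kv[1], reverse=True):
        target = min(buckets, key=lambda b: b["files"])
        target["folders"].append(folder)
        target["files"] += count
    return buckets


if __name__ == "__main__":
    # Hypothetical file counts for three folders totalling 35 million files.
    counts = {"FolderA": 20_000_000, "FolderB": 9_000_000, "FolderC": 6_000_000}
    for number, bucket in enumerate(split_by_file_count(counts), start=1):
        print(number, bucket["folders"], bucket["files"])
```

With real data, `count_files` could fill in the counts, though walking 35 million files is itself slow, so it is worth caching the result.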

Thanks gurutc...

-Mark
 
MarkJenksAuthor Commented:
Pointed me in the right direction, but even he didn't have an answer on how to handle the mass amount of data.

I hope everyone finds this question and learns from it.
 
gurutcCommented:
Glad to help.  The RSYNC evangelism continues!

- gurutc
 
MarkJenksAuthor Commented:
Broke it up into 2 jobs.   They are both scheduled to start at the same time, but they don't actually run at the same time.

So, it takes 4.65 hours to sync the whole thing.
Question has a verified solution.
