Efficiently copying lots of small files over a network

Hi everybody,

Every once in a while I have a customer's computer that I need to quickly make a backup of. For this purpose I have a Linux file server running Samba, with a high-performance RAID storage array, connected via gigabit Ethernet.

Normally I just use robocopy and copy all of the necessary files over the network, and this works pretty well.

The problem, though, is that if the client has hundreds of thousands of tiny files - e.g. if I am copying Windows system folders, cache folders, temp data, etc. - the copy becomes very slow and inefficient.

Is there a more efficient file copying utility that I can use to make these copies that will:

1) FAST and easy installation on the client's machine. I don't want to go through a big Windows installer or complicated configuration (e.g. the Cygwin-based cwRsync installation). Ideally it would be a single .EXE that I can just run.

2) Recursively handle large volumes of files (e.g. 500,000+) and lots of deeply nested folders - in other words, a robust, efficient, quality copying program

3) Gracefully handle errors like files being in use, keep pushing through the copy, and ideally produce a logfile of the results afterwards

4) Efficiently copy to a Linux file system without getting bogged down by lots of tiny files

I can set up whatever server-side stuff is necessary on the Linux machine. I have Samba set up, but I could set up SSH, NFS, something proprietary, etc. - whatever is necessary.
duffme Commented:
I had similar constraints when I needed to replicate 50 GB a night between two Server 2008 machines. I wound up using a mix of Robocopy and RichCopy. RichCopy is not completely stable and occasionally crashes or acts a bit flaky. Multi-threading is key: older versions of Robocopy are single-threaded, and I think only Windows 7 and Server 2008 R2 ship the version with the /MT option, which can't be installed on earlier Windows. If you have one of those, make sure to use /MT. There are some third-party apps, but I wasn't allowed to use them.

Use exclude filters. If you are wasting bandwidth copying useless cache data and such, just don't copy it. If you are looking for a total backup, a hot imaging solution like Acronis can be great, but it isn't cheap.

Only copy deltas. If you are using Robocopy, the /MIR option and the time parameters let you copy only what has changed. You could possibly do this with archive bits too.
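A rough sketch of what all three suggestions can look like in a single Robocopy command (the \\fileserver\backup path and the excluded folder names are placeholders for your own; /MT needs the Windows 7 / Server 2008 R2 version of Robocopy):

    robocopy C:\Users \\fileserver\backup\client1\Users /MIR /MT:16 /R:1 /W:1 ^
        /XD Temp Cache "Temporary Internet Files" /XF *.tmp ^
        /LOG:C:\robocopy.log /NP

/MIR copies only what has changed on re-runs (and removes orphans on the destination), /MT:16 runs sixteen copy threads, /R:1 /W:1 stops in-use files from stalling the whole job, /XD and /XF are the exclude filters, and /LOG with /NP produces a readable logfile.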

Can you pull the copy using rsync or cpio through a Samba/CIFS client accessing the Windows root shares (C$, etc.)?
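If that route is viable, a minimal sketch from the Linux side (the client IP, the credentials file, and the target paths are all assumptions):

    # Mount the client's administrative share read-only over CIFS
    sudo mount -t cifs //192.168.1.50/C$ /mnt/client -o credentials=/root/client.cred,ro
    # Pull with rsync; the delta algorithm gains nothing through a mounted
    # filesystem, so copy whole files and keep the stats for a log
    rsync -a --whole-file --stats /mnt/client/Users/ /raid/backups/client1/ > /var/log/client1-backup.log
    sudo umount /mnt/client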
Have a look at RichCopy http://en.wikipedia.org/wiki/RichCopy
I would set up an FTP server, then use, for example, the FileZilla client to transfer the files efficiently.

FileZilla Server and client for Linux and Windows: http://filezilla-project.org/
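If you go the FTP route, getting a writable server up on a Debian/Ubuntu-style Linux box is roughly this (a sketch; package and service names vary by distribution):

    sudo apt-get install vsftpd
    # In /etc/vsftpd.conf, make sure local users can log in and write:
    #   local_enable=YES
    #   write_enable=YES
    sudo service vsftpd restart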

You might consider GoodSync, which handles differential copies of your files. The first run may be long, but after that it only copies (synchronizes) the files that have changed, and I'm pretty sure that out of 500,000 files the vast majority don't change.

It works with a heap of service types, and on the server side you could write a script to keep the different versions if you need versioning.

hrr1963 Commented:
I still recommend using an FTP solution; then tweak the client to transfer 10 files at the same time.
Frosty555 (Author) Commented:
Hi everyone,

It looks like multithreading the copy is really the way to go (or at least finding a copy utility that takes advantage of that).

I've never heard of RichCopy before. I'll check it out.

I've tried the FTP approach... and I also tried WinSCP, but I found it to actually be slower than regular file copies due to the constant back-and-forth handshaking the FTP protocol needs. It was fine for large files, but like the others it choked on lots of small files. It might work if I told FileZilla to copy 10 at a time - I haven't actually tried that yet. Will FileZilla stay robust and fast with half a million files in the queue? I'm not sure it was ever designed to handle that kind of file volume.

I can't pull a copy of the files using the administrative share - while that works on all of *MY* computers, it doesn't necessarily work on my clients' machines, which have a wide array of firewalls, weird Windows settings and broken services.

Also, I wasn't aware of the /MT option in Robocopy. Some of my clients are on Windows XP, but that won't be the case forever; I'll look into /MT.
As far as the admin shares go, you can create your own root shares with specific security, using a service account, though this may not be possible on client boxes. Remember that RichCopy is not the most stable utility, though many of us still find a use for it. Some third-party apps use multithreading and offer great features for not too much money, but /MT with Robocopy should work great if your Windows boxes are new enough.
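A sketch of what creating such a share could look like from an elevated prompt (BackupC and backupsvc are made-up names; the /GRANT switch needs Vista/2008 or later):

    net share BackupC=C:\ /GRANT:backupsvc,READ
    REM ... run the pull against \\clientpc\BackupC from the fileserver ...
    net share BackupC /DELETE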
duffme Commented:
By the way, I was reminded today: in RichCopy there are separate multithreading options for searching, directory, and file operations. Multithreading file operations will try to use multiple threads for a single file copy, which can cause a lot of errors. The other options allow multiple simultaneous copies (each in its own thread) and multiple threads for file comparison, directory search, etc. The point is, you may need to tweak the particular multithread options depending on what utility you wind up using.