Efficiently copying lots of small files over a network

Posted on 2011-09-17
Last Modified: 2012-06-21
Hi everybody,

Every once in a while I have a customer's computer that I need to quickly make a backup of. For this purpose I have a Linux fileserver running Samba, with a high-performance RAID storage array, connected via gigabit Ethernet.

Normally I just use Robocopy to copy all of the necessary files over the network, and this works pretty well.

The problem, though, is that if the client has hundreds of thousands of tiny files - e.g. when I am copying Windows itself, cache folders, temp data, etc. - the copy becomes inefficient and quite slow.

Is there a more efficient file copying utility that I can use to make these copies that will:

1) Fast, easy installation on the client's machine. I don't want to go through a whole big Windows installer or complicated configuration (e.g. a Cygwin installation of cwRsync). Ideally just a single .EXE that I can run.

2) Recursively handle large volumes of files (e.g. 500,000+) and lots of deeply nested folders - i.e. a quality, robust, efficient copying program.

3) Gracefully handle errors like files being in use, keep pushing through the copy, and ideally produce a logfile of the results afterwards.

4) Efficiently copy to a Linux file system without getting bogged down by lots of tiny files.

I can set up whatever server-side stuff is necessary on the Linux machine. I have Samba set up, but I could set up SSH, NFS, something proprietary, etc. - whatever is necessary.
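For reference, the kind of thing I have in mind with the SSH option is streaming everything as one continuous tar stream, so tiny files don't each pay a per-file protocol round trip. A rough sketch, demonstrated locally between two directories (the hostname and server paths in the comment are made up):

```shell
# Sketch: many small files travel as a single tar stream instead of
# per-file round trips. In practice the second tar would run on the
# fileserver over SSH, e.g. (example hostname/paths only):
#   tar -cf - -C /source/dir . | ssh backupserver 'tar -xf - -C /srv/backups/client1'
mkdir -p /tmp/smallfiles-demo/src /tmp/smallfiles-demo/dst
for i in 1 2 3; do echo "file $i" > "/tmp/smallfiles-demo/src/file$i.txt"; done
# Pack the source tree and unpack it on the other end in one pipeline
tar -cf - -C /tmp/smallfiles-demo/src . | tar -xf - -C /tmp/smallfiles-demo/dst
ls /tmp/smallfiles-demo/dst
```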
Question by:Frosty555

Expert Comment

ID: 36554885
Have a look at RichCopy

Expert Comment

ID: 36554997
I would set up an FTP server, then use e.g. the FileZilla client to transfer the files efficiently.

FileZilla Server and client for Linux and Windows.

Expert Comment

ID: 36555956
You might consider GoodSync, which handles differential copies of your files.
The first run may take a long time, but subsequent runs only copy (synchronize) the files that have changed - and out of 500,000 files, the vast majority probably don't change between runs.

It works over a wide range of service types, and on the server side you could script keeping different versions of the backups if you need versioning.


Accepted Solution

duffme earned 334 total points
ID: 36556165
I had similar constraints when I needed to replicate 50GB a night between two Server 2008 machines.  I wound up using a mix of Robocopy and RichCopy.  RichCopy is not completely stable and occasionally crashes or acts a bit flaky.  Multithreading is key: older versions of Robocopy are single-threaded.  I believe only Win7 and Server 2008 R2 ship the /MT option for Robocopy, and that version can't be installed on earlier Windows.  If you have one of these, make sure to use /MT.  There are some third-party apps, but I wasn't allowed to use them.

Use exclude filters.  If you are wasting bandwidth copying useless cache and such then just don't copy it.  If you are looking for a total backup then a hot imaging solution like Acronis can be great, but not cheap.

Only copy deltas.  If you are using Robocopy you can use the /MIR option and time parameters to copy only what has changed.  You could possibly do this with archive bits too.
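For example, something along these lines (the drive letter, share name and excluded folders are made up for illustration; /MT needs the Win7 / Server 2008 R2 version of Robocopy):

```shell
:: Sketch: multithreaded mirror with Robocopy. Paths are examples only.
:: /MIR   mirror the tree (copies deltas, removes deleted files)
:: /MT:16 use 16 copy threads
:: /R:1 /W:1  retry a locked file once, wait 1 second, then move on
:: /LOG:  write the results to a logfile
:: /XD    skip directories that are just cache/temp noise
robocopy C:\Users \\fileserver\backups\client1\Users /MIR /MT:16 ^
    /R:1 /W:1 /LOG:C:\robocopy.log /XD Temp "Temporary Internet Files"
```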

Can you pull the copy using rsync or cpio through SAMBA client accessing the Windows root shares (C$, etc.)?
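The pull direction would look roughly like this on the Linux box - a sketch only; the hostname, share and paths are made up, and it needs root plus the client's admin credentials:

```shell
# Sketch: mount the Windows admin share over CIFS on the Linux server,
# then rsync locally out of the mount. Names and paths are examples only.
mount -t cifs '//client-pc/C$' /mnt/client -o username=Administrator
# -a preserves attributes/recurses; --delete mirrors removals too
rsync -a --delete /mnt/client/Users/ /srv/backups/client1/Users/
umount /mnt/client
```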

Assisted Solution

hrr1963 earned 166 total points
ID: 36556447
I still recommend an FTP solution - then tweak the client to transfer 10 files at the same time.
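If the direction were reversed (a small FTP server on the Windows box, with the Linux side pulling), the same parallel-transfer idea can be scripted with lftp instead of the FileZilla GUI. A sketch, with the hostname, user and paths made up:

```shell
# Sketch: pull a whole tree over FTP with 10 parallel transfers using
# lftp's mirror command. Hostname, user and paths are examples only.
lftp -u backupuser ftp://client-pc -e \
    "mirror --parallel=10 /Users /srv/backups/client1/Users; quit"
```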

Author Comment

ID: 36568133
Hi everyone,

It looks like multithreading the copy is really the way to go (or at least finding a copy utility that takes advantage of that).

I've never heard of richcopy before. I'll check it out.

I've tried the FTP approach... and I also tried WinSCP, but I found it to actually be slower than regular file copies due to the constant back-and-forth handshaking the FTP protocol needs per file. It was fine for large files but, like the others, choked on lots of small files. It might work if I told FileZilla to copy 10 at a time - I haven't actually tried that yet. Will FileZilla stay robust and fast with half a million files in the queue? I'm not sure it was ever designed for that kind of file volume.

I can't pull a copy of the files using the administrative share - while that works on all of *MY* computers, it doesn't necessarily work on my clients' machines, which have a wide array of firewalls, weird Windows settings and broken services.

Also, I wasn't aware of the /MT option in Robocopy. Some of my clients are on Windows XP, but that won't be the case forever - I'll look into /MT.

Expert Comment

ID: 36569221
As for the admin shares, you can create your own root shares with specific security using a service account, but this may not be possible on client boxes.  Remember that RichCopy is not the most stable utility, though many of us still find a use for it. Some third-party apps use multithreading and offer great features for not too much money, but /MT with Robocopy should work great if your Windows boxes are new enough.

Assisted Solution

duffme earned 334 total points
ID: 36590051
By the way, I was reminded today: RichCopy has separate multithreading options for searching, directory, and file operations.  Multithreading file operations will try to use multiple threads per single file copy, which can cause a lot of errors.  The other options allow multiple simultaneous copies (each in its own thread) and multiple threads for file comparison, directory search, etc.  The point is, you may need to tweak the particular multithreading options depending on what utility you wind up using.
