Efficient way to copy / backup millions of small files


I would like to know how people handle the situation where
you have millions of small files (I've seen this with log files, Lotus Notes
email or Outlook files):

In the past I have dreaded copying or backing up directories with
over a million small files (using Windows Explorer or the Windows
or Unix OS copy commands).

Q1:
Does Veritas NetBackup or HP Data Protector take very long to back up
such a directory to tape?

Q2:
Would it be faster to use SAN or snapshot technology to just take a
quick snapshot of the entire SAN disk partition (where these small files
reside), or would it be more efficient to place these millions of small
files on a Solid State Disk (SSD), since SSDs do not have the seek/rotational
delays of conventional disks (SCSI or IDE)?
sunhux asked:

Kent Olsen (DBA) commented:
Hi Sun,

How volatile are the small files?  That is, if you were to perform a full backup today, how many new files would there be tomorrow?  How many of the existing small files would have changed?

If files are "relatively static", that is, the existing files don't change very often, an incremental backup would seem to solve your problem.


Kent

Sjef Bosman (Groupware Consultant) commented:
Small files? How small? Small enough to ZIP the lot in one single file?
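
A minimal sketch of the "ZIP the lot" idea, using Python's standard `zipfile` module. The function name `pack_directory` and the example paths are hypothetical; the point is only that the backup tool then sees one large file instead of millions of small ones. The archive is assumed to be written outside the source directory so that it does not try to include itself.

```python
import zipfile
from pathlib import Path

def pack_directory(src_dir, archive_path):
    """Pack every file under src_dir into a single ZIP archive.

    archive_path should live outside src_dir. Returns the number of files added.
    """
    src = Path(src_dir)
    count = 0
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir, with forward slashes.
                zf.write(path, arcname=path.relative_to(src).as_posix())
                count += 1
    return count

# Example (hypothetical paths):
# pack_directory("/data/maildir", "/backup/maildir.zip")
```

Building the archive still has to read every small file once, so this mainly helps when the archive is then copied or backed up several times.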
sunhux (author) commented:


Most of them are static, but about 120,000 files are updated or newly
created each day.  File sizes range from 7 KB to 90 KB.

Kent Olsen (DBA) commented:
Hi Sun,

Almost all commercial backup tools can handle this with an "incremental backup".  If the "average" size of the files is about 50K, the total size of the backup should be about 6GB.  That's not too much to ask of a backup.  :)

Veritas should handle that very nicely.  Each file would have to be read from disk and written to the backup device, which could be tape, CD, DVD, or other disk storage.  For a relatively small backup (6GB) I would expect this to complete fairly quickly.  But with 120K files to be backed up, and certainly more than that in your file system, Veritas will perform quite a few reads of the file catalogs (directories) as well as at least 120,000 seeks/reads from the files to be backed up.  Still, I would expect this to finish in under 15 minutes.
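
The arithmetic behind these figures can be checked with a short sketch. The inputs (120,000 changed files per day, about 50 KB average size, a 15-minute window) are the numbers quoted in this thread; the function name and the use of decimal units are assumptions made for illustration.

```python
# Back-of-the-envelope estimate for the daily incremental backup.
# Decimal units are assumed: 1 MB = 1,000 KB and 1 GB = 1,000,000 KB.

def incremental_estimate(files_changed, avg_size_kb, window_minutes):
    """Return total size and the sustained rates needed to fit the window."""
    total_kb = files_changed * avg_size_kb
    seconds = window_minutes * 60
    return {
        "total_gb": total_kb / 1_000_000,
        "mb_per_sec": (total_kb / 1000) / seconds,
        "files_per_sec": files_changed / seconds,
    }

est = incremental_estimate(120_000, 50, 15)
print(est)
```

This gives 6 GB in total, which needs roughly 6.7 MB/sec and about 133 files per second to finish in 15 minutes. The files-per-second figure is the one to watch, since each small file costs at least one seek on a conventional disk.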


Kent
Sjef Bosman (Groupware Consultant) commented:
What is the purpose of making the backup? To be able to revert to an earlier situation? Or to secure the current data? Could you maybe achieve the same using a RAID disk and weekly backups? Just asking...
Thomas Rush commented:
Any commercial business-oriented backup program should be able to handle these files; the challenge is the time it will take to complete a backup, as I suspect the OP knows.  Small files are terrible for performance, and may bring even a fast server and fast storage to its knees, with throughput of under 10MB/sec.

There are two ways I'd recommend performing the backup.  One, supported by most backup applications, is to perform an image backup, where you back up the sectors on the disk in the order they're laid out, without reference to the files or file system.    This is fast because there's only the absolute minimum of head movement -- we read all of cylinder 0, then cylinder 1, 2, 3, ...    The downside?   Restores can be slow, because the application has to reconstruct the directory information by dereferencing directory pointers on tape (a full restore may still be fast, but restoring a directory with many files, or a large fragmented file, could take much longer than you expect).

The other option, which Data Protector supports and which works well, is to perform an incremental forever backup to disk, and periodically create synthetic full backups to tape.   In this method, you'll perform a full backup from disk only once; after that you only do incremental backups.   Once a week (or as often as business processes require), you tell Data Protector to create the synthetic full backup; it will use the information it has stored to spin off a backup tape that is exactly as if you'd done a full backup from disk to tape at that point in time.    The advantage is that all backups after the first happen reasonably fast since they are only incremental, and that you still get your files backed up to tape, with the ability to restore individual files or to a particular archive copy.   The disadvantage is the (relatively minor) cost of the additional space required for the initial backup-to-disk target.
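
The idea of a synthetic full backup can be illustrated with a toy model. This is not how Data Protector is implemented; it only shows the principle of replaying incrementals over one full backup. Each backup is modelled as a dict from path to file version, and `DELETED` is a hypothetical marker for files removed since the previous run.

```python
# Toy model of "incremental forever" with a synthetic full backup.
DELETED = object()  # hypothetical marker: file was removed since the last backup

def synthetic_full(full, incrementals):
    """Merge one full backup with later incrementals, applied oldest first."""
    merged = dict(full)
    for inc in incrementals:
        for path, version in inc.items():
            if version is DELETED:
                merged.pop(path, None)   # drop files that no longer exist
            else:
                merged[path] = version   # newest version wins
    return merged

full = {"a.log": "v1", "b.log": "v1", "c.log": "v1"}
monday = {"a.log": "v2", "d.log": "v1"}
tuesday = {"b.log": DELETED, "d.log": "v2"}

result = synthetic_full(full, [monday, tuesday])
print(result)  # {'a.log': 'v2', 'c.log': 'v1', 'd.log': 'v2'}
```

The merge works only on data the backup server already holds, so the production file system with its millions of small files is not scanned again.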
sunhux (author) commented:

> One, supported by most backup applications, is to perform an image backup, where you back
> up the sectors on the disk in the order they're laid out, without reference to the files or file system

Does DataProtector support the above method (image backup by sector)?  I happen to have one
though I'm not the administrator
Kent Olsen (DBA) commented:

Hi Sun,

I'm not sure that you really want an image backup.  It provides an outstanding restore point, as restoring the image puts your system, including mass storage, exactly as it was at the time of the backup.  But unless you need that kind of capability, or intend to clone the image to another machine, it's overkill.

There are more efficient ways to take backups.


Kent
Thomas Rush commented:
Data Protector does allow for image backups.

I don't know if Kdo has ever run benchmarks of backups -- as I have over the last ten years as part of my job -- but there is *no* more efficient way to perform a backup than by doing it sector by sector without regard to the file system, and this is particularly true when the disk has many small files and/or a complex directory structure.  Because of the minimal disk head movement, the disk will read at its full streaming speed, and be most likely to feed the tape drive at the tape's full streaming speed.

Image backup by a traditional backup application is not "overkill", but it does have downsides as I have mentioned previously, the biggest being that the files are not stored in file order, and a restore will tend to take longer -- sometimes significantly longer -- than a restore from a traditional file backup would.    If you need to restore from backup only in rare cases, or if the restores are not time-critical, this is probably not a huge issue... and in any case, it's always about tradeoffs: you can get low cost, fast backups, or fast restores, but you can generally only pick two.
kevinhsieh commented:
I have volumes with millions of small files. There are several things that you can do, depending on why you're trying to do the backup, and for how long you want to keep the backups. Putting the files on a SAN or filer and taking snapshots is one option. You can also move them to a virtual machine and then take a backup of the VHD or VMDK file. It's not very space efficient to do that, but backups are a lot faster because you are backing up a few large files instead of searching through millions of little files.

I use robocopy for my backups, because my files don't change, so I don't need to worry about going back to a previous version from backup. It takes about 30 minutes to go through the filesystem to do the backup. I run it a few times a day.
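
For readers without robocopy, the skip-unchanged approach can be sketched in Python. This is an illustration of the general technique, not robocopy's actual algorithm: the function name `copy_changed` is hypothetical, and a file is assumed to be unchanged when its size and whole-second modification time match the copy in the destination.

```python
import shutil
from pathlib import Path

def copy_changed(src_dir, dst_dir):
    """Copy files that are missing from dst_dir or differ in size or mtime.

    Returns the number of files copied.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    copied = 0
    for path in src.rglob("*"):
        if not path.is_file():
            continue
        target = dst / path.relative_to(src)
        s = path.stat()
        if target.exists():
            t = target.stat()
            if t.st_size == s.st_size and int(t.st_mtime) == int(s.st_mtime):
                continue  # same size and timestamp: treat as unchanged
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copy2 also preserves the modification time
        copied += 1
    return copied
```

Every file still has to be examined on each run, so the directory walk is likely to dominate the run time even when very little is copied.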

You can use shadow copies on your file server to go back to earlier versions of files, and then combine that with robocopy or a few SAN/filer snapshots in case the $#!& hits the fan.

I think that if you let us know what your restore requirements are we can better help you.
sunhux (author) commented:
excellent insights