CarlosMMartins asked:

How to handle millions of files? (Bogs down the Windows file system)

I'm developing a program that will need to keep a record of millions of images, created at a rate of about 5 per second.
Right now, I'm simply writing each image into its own separate file in a single folder.

However, once the folder reaches 5, 10, 20 thousand images, trying to open it in Explorer makes the desktop stop responding for a few seconds, and it gets progressively worse as the number of images grows.


So, what I need to know is: what's the best way to handle such a large number of files?

I tried setting the folder as compressed, to see if the internal management of a large number of files would be more efficient - but it's the same.

My plausible options are:
1) splitting them up into different folders, each holding only about 1000 images (but that creates a "management" nightmare on the software side - a rough sketch of this approach follows below)
2) storing the images inside a MySQL database (haven't tried it yet - I don't know how efficient it would be at storing binary images at those rates, ending up with a 200GB database)
3) any options to speed up Windows folder "parsing"?
4) Whatever you can think of that might be of help

(using NTFS)
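For reference, option 1 could be as simple as bucketing images by a running sequence number so that no folder ever holds more than about 1000 files. A minimal Win32 sketch - the root path, file naming scheme and the way the image bytes arrive are all hypothetical placeholders:

#include <windows.h>
#include <cstdio>

// Option 1 sketch: bucket images by sequence number so each folder holds
// at most ~1000 files.  Root path, naming scheme and image payload are
// placeholders, not the real application's.
void SaveImageBucketed(const char* root, unsigned long imageIndex,
                       const void* data, DWORD size)
{
    char dir[MAX_PATH], path[MAX_PATH];

    unsigned long bucket = imageIndex / 1000;            // 0..999 -> folder 00000, etc.
    _snprintf(dir, MAX_PATH, "%s\\%05lu", root, bucket);
    CreateDirectoryA(dir, NULL);                          // fails harmlessly if it already exists

    _snprintf(path, MAX_PATH, "%s\\img_%08lu.jpg", dir, imageIndex);
    HANDLE h = CreateFileA(path, GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h != INVALID_HANDLE_VALUE) {
        DWORD written = 0;
        WriteFile(h, data, size, &written, NULL);
        CloseHandle(h);
    }
}

int main()
{
    const char payload[] = "not a real JPEG - placeholder bytes";
    SaveImageBucketed("D:\\capture", 12345, payload, sizeof(payload));
    return 0;
}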
SOLUTION by Jessie Gill, CISSP (member-only content)
SOLUTION (member-only content)
CarlosMMartins (Asker) replied:

Thanks for your assistance.
@jessiepak:

Yes, I'm running on "maxed out" hardware: quad-core, 4GB RAM, RAID 0 spanning 4 SATA disks, etc.
I also have every possible and imaginable option disabled (thumbnails, last file access time, etc.)

@rurne

The thing is exactly that: it seems like Windows itself "gets stuck" just parsing the NTFS file structure - which is likely not well suited for this kind of application with millions of files.

However, I couldn't find any good info on whether any other file system (Linux or otherwise) is better suited for this kind of application.
(That's what I hoped I could achieve - to some degree - by compressing the folder, effectively turning it into a single file with better "parsing" - but it didn't perform as expected. It was slightly faster, but not nearly at acceptable levels.)

Until now we were developing with only about 1000 test images, so there was no problem with that.
Only when I started testing with 10,000+ files did it become apparent that this was going to get tricky...
We were hoping to use about 200GB of disk space for that "last images captured" buffer (allowing over 2 million individual files), so I'm now trying to figure out the best way to do it.
(If possible without that #1 approach ;)
SOLUTION (member-only content)
I'm not a Linux expert - can anyone tell me if the same would happen with its file systems, or do they handle this number of files well?
It depends :-)  The Linux kernel itself has no problem with that many files.  Any console app will blaze right through them.  Enumerating a file list is a kernel call, so it's fast and efficient.

That being said, desktop managers such as KDE or GNOME might try to do the same thing Explorer does -- get info about those files that goes beyond a simple kernel call.  On both Windows and Linux, it's an application problem, not an OS one (although NTFS is a bit less efficient at searching and traversing a b-tree).

The good news with Linux is that you're not stuck with the kool-aid they give you like you are with Explorer in Windows.  I don't know how KDE and GNOME react to that situation, as I've not tested it with them.  But the chances are high that if it were a problem, someone somewhere would've gotten pissed off by now and threaded it out better :-)
@cuziyq

That made me think about something... Maybe I really can do without the standard Windows Explorer?
The program will be running on a sort of "closed up" PC, running my software 24/7 - I was even considering tweaking the registry so that it starts my program as the shell and keeps Explorer from running at all.

As a last resort I could code a very simple alternative "file explorer" (though there might already be some out there I could use)

I'll code a sample program to try and assess the performance difference when the folder parsing is not being done by Explorer - just getting the file names instead of the full file info, image dimensions, etc.
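For reference, the shell swap mentioned above is controlled by the Winlogon "Shell" registry value. A minimal Win32 sketch - the path to the capture program is a placeholder, and the change needs admin rights and takes effect at the next logon:

#include <windows.h>
#include <cstdio>
#include <cstring>

int main()
{
    // Placeholder path to the capture application that should replace Explorer.
    const char* shellCmd = "C:\\MyApp\\capture.exe";

    HKEY key;
    LONG rc = RegOpenKeyExA(HKEY_LOCAL_MACHINE,
                            "SOFTWARE\\Microsoft\\Windows NT\\CurrentVersion\\Winlogon",
                            0, KEY_SET_VALUE, &key);
    if (rc == ERROR_SUCCESS) {
        // "Shell" defaults to "explorer.exe"; pointing it elsewhere makes the
        // custom program the shell at the next logon.
        rc = RegSetValueExA(key, "Shell", 0, REG_SZ,
                            (const BYTE*)shellCmd,
                            (DWORD)strlen(shellCmd) + 1);
        RegCloseKey(key);
    }

    if (rc == ERROR_SUCCESS)
        printf("Shell value updated - takes effect at the next logon.\n");
    else
        printf("Failed to update Shell value (error %ld) - needs admin rights.\n", rc);
    return 0;
}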
Here's a blast from the past for ya... all versions of Windows up to Win2000 came with the good ol' classic File Manager app. Remember it from the days of Win3.1 and NT 3.51? Anyway, it does everything you'd want it to do. Wish I'd thought of it sooner. I'm sure that if you just copy it from a Win2k CD, it should work on XP/2K3 Server. It supports long filenames, allows you to change permissions, and lets you copy/move things from one place to another. It even lets you map network drives.
SOLUTION (member-only content)
Even when doing a "dir" at the command line (which should avoid all the "slow" stuff Explorer does) the process freezes for a long time - so it seems to be file-system related.

I'll try coding a benchmark util to time how long it takes to get the folder file list, from 1 to 50,000 files in steps of 100, to see if it increases linearly or if there's a "sweet spot" after which performance becomes unacceptable.

If that's the case, I'll have no other option but to split the files into multiple folders... :/
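A minimal sketch of that benchmark using the raw Win32 enumeration API - the folder path is a placeholder, and note that repeat runs hit the file-system cache, so only the first pass after a reboot reflects cold-cache behaviour:

#include <windows.h>
#include <cstdio>

int main()
{
    // Placeholder folder; run this against directories holding 100, 200, ...
    // 50,000 files to see how the enumeration time scales.
    const char* pattern = "D:\\capture\\*";

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);

    WIN32_FIND_DATAA fd;
    unsigned long count = 0;
    HANDLE h = FindFirstFileA(pattern, &fd);
    if (h != INVALID_HANDLE_VALUE) {
        do {
            if (!(fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY))
                ++count;            // names only - no per-file opens, no thumbnails
        } while (FindNextFileA(h, &fd));
        FindClose(h);
    }

    QueryPerformanceCounter(&t1);
    double ms = (t1.QuadPart - t0.QuadPart) * 1000.0 / freq.QuadPart;
    printf("%lu files enumerated in %.1f ms\n", count, ms);
    return 0;
}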
What OS is this? I have a file server on Server 2003 with single folders holding several hundred thousand files, and I have little problem working with them. I prefer some organization myself and would recommend splitting up that number of files for effective management anyway, but I would sure like to see what the cause of the problem is so that we can hopefully improve file-system performance for you regardless of the usage.
Win XP SP2 fully updated, using NTFS
Here's a question. You have RAID 0 across four disks. That speeds things up but also adds an element of danger: if one disk fails, you lose the entire volume.

Also, since you are using NTFS, what is:

1) the average size of each file?
2) the cluster size you chose when you formatted the partition?

There could be a correlation there too!

Here are a couple of links you could try for some ideas, now that I have searched the net for something I read a long time ago:

http://www.tweakxp.com/article37043.aspx
http://www.windowsdevcenter.com/pub/a/windows/2005/02/08/NTFS_Hacks.html

Look at the option of disabling the last-access-time feature...
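Both of those NTFS tweaks live as DWORD values under HKLM\SYSTEM\CurrentControlSet\Control\FileSystem. A small sketch to check what is currently set - changing NtfsDisableLastAccessUpdate needs a reboot to take effect, and NtfsDisable8dot3NameCreation only affects files created afterwards:

#include <windows.h>
#include <cstdio>

// NtfsDisableLastAccessUpdate = 1 stops last-access timestamp updates;
// NtfsDisable8dot3NameCreation = 1 stops short-name (8.3) generation, which
// gets expensive in folders with very many similarly named files.
static void PrintDwordValue(HKEY key, const char* name)
{
    DWORD value = 0, size = sizeof(value), type = 0;
    LONG rc = RegQueryValueExA(key, name, NULL, &type, (BYTE*)&value, &size);
    if (rc == ERROR_SUCCESS && type == REG_DWORD)
        printf("%s = %lu\n", name, value);
    else
        printf("%s not set\n", name);
}

int main()
{
    HKEY key;
    if (RegOpenKeyExA(HKEY_LOCAL_MACHINE,
                      "SYSTEM\\CurrentControlSet\\Control\\FileSystem",
                      0, KEY_QUERY_VALUE, &key) == ERROR_SUCCESS) {
        PrintDwordValue(key, "NtfsDisableLastAccessUpdate");
        PrintDwordValue(key, "NtfsDisable8dot3NameCreation");
        RegCloseKey(key);
    }
    return 0;
}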
ASKER CERTIFIED SOLUTION (member-only content)
Thanks for all your suggestions. A "server OS" is not an option - this application will be deployed as an embedded system and will ultimately have to run on Windows XP Embedded.

I coded a small "benchmark" program to try and figure out how performance varies with the number of files, but Windows caches the results, so it would require rebooting after each test to get reliable numbers - something that would take far longer than I'm willing to spend right now.

I'll implement a "year/month/day/hour/15min/" folder structure, which gives me about 2500 files per folder - an amount that shows up nicely without delays.
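A minimal sketch of that layout - the root path is a placeholder, and the helper assumes the root folder itself already exists:

#include <windows.h>
#include <cstdio>
#include <cstring>

// Builds <root>\YYYY\MM\DD\HH\Q for the current local time, where Q is the
// 15-minute slot (00, 15, 30 or 45), and creates each level of the tree.
void BuildBucketPath(const char* root, char* out, size_t outSize)
{
    SYSTEMTIME st;
    GetLocalTime(&st);
    unsigned slot = (st.wMinute / 15) * 15;

    _snprintf(out, outSize, "%s\\%04u\\%02u\\%02u\\%02u\\%02u",
              root, (unsigned)st.wYear, (unsigned)st.wMonth, (unsigned)st.wDay,
              (unsigned)st.wHour, slot);

    // Create every intermediate folder; CreateDirectoryA simply fails with
    // ERROR_ALREADY_EXISTS when a level is already there.
    char partial[MAX_PATH];
    size_t rootLen = strlen(root);
    for (size_t i = 0; out[i] != '\0' && i < MAX_PATH - 1; ++i) {
        if (out[i] == '\\' && i > rootLen) {
            memcpy(partial, out, i);
            partial[i] = '\0';
            CreateDirectoryA(partial, NULL);
        }
    }
    CreateDirectoryA(out, NULL);    // the leaf (15-minute) folder itself
}

int main()
{
    char folder[MAX_PATH];
    BuildBucketPath("D:\\capture", folder, MAX_PATH);   // placeholder root
    printf("Writing images into %s\n", folder);
    return 0;
}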