How to handle millions of files? (Bogs down the Windows file system)

I'm developing a program that will need to keep a record of millions of images, created at a rate of about 5 per second.
Right now, I'm simply writing them out separately (each image in its own file) into a single folder.

However, as soon as the folder reaches 5, 10, 20 thousand images, trying to open it in Explorer makes the desktop stop responding for a few seconds, and it gets increasingly worse as the number of images grows.


So, what I need to know is: what's the best way to handle such a large number of files?

I tried setting the folder as compressed, to see if the internal management of a large number of files would be more efficient - but it's the same.

My plausible options are:
1) Splitting the images across different folders, each holding only about 1000 images (but creating a "management" nightmare on the software side)
2) Storing the images inside a MySQL database (haven't tried it - I don't know how efficient it would be at storing binary images at those rates, ending up with a 200GB database)
3) Any options to speed up Windows folder "parsing"?
4) Whatever else you can think of that might be of help

(using NTFS)
CarlosMMartins asked:

Jessie Gill, CISSP, Technical Architect, commented:
Well, one way is to beef up your machine if it isn't already beefed up; this will help the folder open faster.
Quad core, 4 GB RAM, fast hard drives (7200 rpm minimum) in a stripe, or RAID 5 or 10 mirror. Also make sure the folder view is set to Details with no thumbnails; that will help speed it up.
Rurne commented:
Realistically, you're going to be stuck with option #1.  From my experience:

#2 won't work, especially as the database grows past physical RAM.  You typically can adjust buffer pool sizes to try to keep your table index in memory as much as possible, but any query for several images at once is going to slow considerably (especially if your MYI files fragment, which inevitably happens as they get huge).

#3 is an NTFS issue.  If you're previewing thumbnails, you'd want to disable that, but otherwise there isn't much you can do.  Again, fragmentation, plus a relatively slow file table lookup, means you have to wait for it to parse out the entire listing of the folder before it can even try to render.

How are you currently looking up the files in your management application?
CarlosMMartins (Author) commented:
Thanks for your assistance.
@jessiepak:

Yes, I'm running on "maxed out" hardware: quad-core, 4GB RAM, RAID 0 spanning 4 SATA disks, etc.
I also have all the possible and imaginable options disabled (thumbnails, last file access time, etc.)

@rurne

The thing is exactly that: it seems like Windows itself "gets stuck" just parsing the NTFS file structure - which is likely not well suited to this kind of application with millions of files.

However, I couldn't find any good info on whether any other file system (Linux or otherwise) is better suited to this kind of application.
(That's what I hoped I could achieve - to some degree - by compressing the folder, effectively turning it into a single file with better "parsing" - but it didn't perform as expected. It was slightly faster, but not nearly to acceptable levels.)

Until now we were developing with just about 1000 test images, so there was no problem with that.
Only when I started testing with 10,000+ files did it become apparent that this was going to get tricky....
And since we were hoping to use about 200GB of disk space for that "last images captured" buffer (allowing over 2 million individual files), I'm now trying to figure out the best way to do it.
(If possible without the #1 approach ;)

cuziyq commented:
There's really nothing you can do about it.  I have the same problem.  What's happening is that Explorer is parsing the file list and then trying to make sense of each file (i.e. getting its doc type, size, icon, etc.).  All of this executes in a single thread, which brings down Explorer until it finishes.

You can't do much to solve the problem, but you can mitigate it somewhat.  Open up My Computer and select Tools -> Folder Options from the menu bar.  Then hit the View tab.  Select the checkbox "Launch folder windows in a separate process" and hit OK.  This will make the window spawn in a separate process.

The window will still lock up while it's doing its thing, but other Explorer windows will still be usable.  Note that file copy/paste operations in other windows will still wait for that thread to finish, so it's not a perfect solution.

This is one of the things that is supposed to be solved by WinFS, which got scrapped in Vista but is supposed to be a part of Windows 7 when it comes out.
CarlosMMartins (Author) commented:
I'm not a Linux expert - can anyone tell me whether the same would happen with its file systems, or do they handle this number of files well?
cuziyq commented:
It depends :-)  The Linux kernel itself has no problem with that many files.  Any console app will blaze right through them.  Enumerating a file list is a kernel call, so it's fast and efficient.

That being said, desktop managers such as KDE or GNOME might try to do the same thing Explorer does -- get info about those files that goes beyond a simple kernel call.  On both Windows and Linux, it's an application problem, not an OS one (although NTFS is a bit less efficient at searching and traversing a b-tree).

The good news with Linux is that you're not stuck with the kool-aid they give you like you are with Explorer in Windows.  I don't know how KDE and GNOME react to that situation, as I've not tested it with them.  But the chances are high that if it were a problem, someone somewhere would've gotten pissed off by now and threaded it out better :-)
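
For illustration, a bare directory listing on Linux is just a loop over readdir() - something like the rough sketch below (my own, not from this thread; the folder path is a placeholder). It touches nothing but the entry names, which is why a console tool stays fast even on huge folders:

#include <dirent.h>
#include <cstdio>

int main()
{
    // Placeholder folder; any directory with a very large number of files will do.
    DIR* d = opendir("/var/images");
    if (!d)
    {
        std::perror("opendir");
        return 1;
    }

    unsigned long count = 0;
    struct dirent* e;
    while ((e = readdir(d)) != NULL)
    {
        (void)e->d_name;   // only the name is looked at - no stat(), no icons, no thumbnails
        ++count;
    }
    closedir(d);

    std::printf("%lu entries\n", count);
    return 0;
}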
CarlosMMartins (Author) commented:
@cuziyq

That made me think about something... Maybe I really can do without the standard Windows Explorer?
The program will be running on a sort of "closed up" PC, running my software 24/7 - I was even considering tweaking the registry so that it starts my program and removes Explorer from running at all.

As a last resort I could code a very simple alternative "file explorer" (though there might already be some out there I could use).

I'll code a sample program to try to assess the performance difference when the folder parsing is not being done by Explorer - just getting the file names instead of the file info, image dimensions, etc.
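
Something along these lines should do as a first test - a rough Win32 sketch (the folder name "D:\images" is just a placeholder), timing one FindFirstFile/FindNextFile pass that touches only the names, with none of the per-file work the shell adds:

#include <windows.h>
#include <cstdio>

int main()
{
    // Placeholder folder full of images.
    const char* pattern = "D:\\images\\*";

    WIN32_FIND_DATAA fd;
    DWORD start = GetTickCount();

    HANDLE h = FindFirstFileA(pattern, &fd);
    if (h == INVALID_HANDLE_VALUE)
    {
        std::printf("FindFirstFile failed: %lu\n", GetLastError());
        return 1;
    }

    unsigned long count = 0;
    do
    {
        ++count;    // only the returned name is counted - no icons, no dimensions
    } while (FindNextFileA(h, &fd));
    FindClose(h);

    std::printf("%lu entries listed in %lu ms\n", count, GetTickCount() - start);
    return 0;
}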
cuziyq commented:
Here's a blast from the past for ya... all versions of Windows up to Win2000 came with the good ol' classic File Manager app.  Remember that from the days of Win 3.1 and NT 3.51?  Anyway, it does everything you should want it to do.  Wish I'd thought of it sooner.  I am sure that if you just copy it from a Win2k CD, it should work in XP/2K3 Server.  It supports long filenames, allows you to change permissions, and lets you copy/move things from one place to another.  It even lets you map network drives.
naldiian commented:
It looks like you are already on the right track as far as the cause here, but I wanted to reiterate: this is clearly not a file system issue, but rather a GUI/Explorer issue. Are the files stored on and accessed from the same machine, or will this be separate systems?

The reason there is so much image management software out there is that storing the metadata about files in a database - one that is not re-reading the metadata every time it is viewed - is much preferred to the Explorer method. I would definitely go with a shell for your app that is not Explorer, and use a separate database to keep the metadata elsewhere if you need that.

Too bad the Indexing Service on Windows systems does not seem to be used all that actively by Explorer itself. The Windows client OSes are much worse than the server editions when it comes to viewing folders and such, but there are ways to tweak their behaviour as well, so you can likely get much better performance with the options set, as mentioned, to not show some details, thumbnails, and such.
CarlosMMartins (Author) commented:
Even when doing a "dir" on the command line (which would prevent it from doing all the "slow" stuff Explorer does), the process freezes for a long time - so it does seem to be file system related.

I'll try coding a benchmark utility to time how long it takes to get the folder file list, with 1 to 50,000 files in steps of 100, to see if it increases linearly or if there's a "sweet spot" after which performance becomes unacceptable.

If that's the case, I'll have no other option but to split the files into multiple folders... :/
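
The benchmark could look roughly like this sketch (mine, with a placeholder folder and zero-byte dummy files standing in for real images): create 100 files per step, then time the same bare name-only listing after each step:

#include <windows.h>
#include <cstdio>

// Time one bare name-only pass over the folder, in milliseconds.
static DWORD TimeListing(const char* pattern, unsigned long* files)
{
    WIN32_FIND_DATAA fd;
    DWORD start = GetTickCount();
    unsigned long count = 0;
    HANDLE h = FindFirstFileA(pattern, &fd);
    if (h != INVALID_HANDLE_VALUE)
    {
        do { ++count; } while (FindNextFileA(h, &fd));
        FindClose(h);
    }
    *files = count;
    return GetTickCount() - start;
}

int main()
{
    const char* dir = "D:\\bench";              // placeholder test folder
    CreateDirectoryA(dir, NULL);

    char name[MAX_PATH];
    for (int total = 0; total < 50000; )
    {
        // Add 100 empty dummy files per step.
        for (int i = 0; i < 100; ++i, ++total)
        {
            std::sprintf(name, "%s\\img%06d.jpg", dir, total);
            HANDLE f = CreateFileA(name, GENERIC_WRITE, 0, NULL,
                                   CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
            if (f != INVALID_HANDLE_VALUE)
                CloseHandle(f);
        }

        std::sprintf(name, "%s\\*", dir);
        unsigned long files;
        DWORD ms = TimeListing(name, &files);
        std::printf("%lu files -> %lu ms\n", files, ms);
    }
    return 0;
}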
naldiian commented:
What OS is this? I have a file server on Server 2003 with single folders containing several hundred thousand files, and have little problem working with them. I prefer some organization myself and would recommend you go the route of splitting up that number of files for effective management anyway, but I would sure like to see what the cause of the problem is, and hopefully we can improve file system performance for you regardless of the usage.
CarlosMMartins (Author) commented:
Win XP SP2 fully updated, using NTFS
za_mkh, IT Manager, commented:
Here's a question. You have RAID 0 across four disks. That speeds things up but also adds an element of danger: if one disk fails, you lose the entire volume.

Also, since you are using NTFS, what are

1) the average size of each file
2) The cluster size when you formatted your partition

There could be a correlation there too!

za_mkh, IT Manager, commented:
Here are a couple of links you could try for some ideas, now that I have searched the net for something I read a long time ago:

http://www.tweakxp.com/article37043.aspx
http://www.windowsdevcenter.com/pub/a/windows/2005/02/08/NTFS_Hacks.html

Look at the option of disabling the last access time feature...
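
For reference, here is a small sketch (mine, not taken from those articles) that sets the documented NtfsDisableLastAccessUpdate registry value programmatically; you can apply the same tweak by hand with regedit or with "fsutil behavior set disablelastaccess 1":

#include <windows.h>
#include <cstdio>

// Link with Advapi32.lib. Needs administrative rights, and a reboot
// before the change takes effect.
int main()
{
    HKEY key;
    DWORD one = 1;

    LONG rc = RegOpenKeyExA(HKEY_LOCAL_MACHINE,
                            "SYSTEM\\CurrentControlSet\\Control\\FileSystem",
                            0, KEY_SET_VALUE, &key);
    if (rc != ERROR_SUCCESS)
    {
        std::printf("RegOpenKeyEx failed: %ld\n", rc);
        return 1;
    }

    rc = RegSetValueExA(key, "NtfsDisableLastAccessUpdate", 0, REG_DWORD,
                        reinterpret_cast<const BYTE*>(&one), sizeof(one));
    RegCloseKey(key);

    std::printf(rc == ERROR_SUCCESS ? "Last access updates disabled (reboot required).\n"
                                    : "RegSetValueEx failed.\n");
    return rc == ERROR_SUCCESS ? 0 : 1;
}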
za_mkh, IT Manager, commented:
I slept on this and have a few more ideas. I recall that we have a server running Windows 2000 with a folder containing an excessive number (>1 million) of files, each about 1-10 KB in size, that I was worried about last year when we migrated to Active Directory. I thought ADMT would take an entire weekend to update the security ACLs on it, but it took only 1.5 hours. If I remember the report correctly, the total number of files on that server was in excess of 9 million.

The drive is hosted on a SCSI RAID 5 partition (not the best for performance), but I have not heard anybody complain about Explorer crashing when viewing that folder, so I would advise you to also consider this:

1) If you have access to a Windows 2000/2003 Server OS, try your tests on that OS.
2) Try this test on a SCSI disk system - SCSI is good since it doesn't 'interrupt' the processor when a disk read/write has to occur. I still haven't read up on SATA - it could still follow the IDE methodology.

I will go to work, check up on that server, and come back to you with my findings.
CarlosMMartins (Author) commented:
Thanks for all your suggestions. A "Server OS" is not an option - this application will be deployed as an embedded system and will have to work with Embedded XP in the end.

I coded a small "benchmark" program to try to figure out how performance varies with the number of files; however, Windows caches the results, and it would require rebooting after each test to get reliable numbers - something that would take far longer than I'm willing to spend right now.

I'll implement a "year/month/day/hour/15min" folder structure, which will give me about 2500 files per folder - that seems to show up nicely without delays.
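
For illustration, the bucketing could look roughly like this sketch (my own example - the path separator and naming scheme are assumptions, not the project's actual code):

#include <cstdio>
#include <ctime>
#include <string>

// Returns a relative folder such as "2008\06\17\14\45" for a capture time,
// so each leaf folder only ever holds one 15-minute slice of images.
std::string BucketFolder(time_t capturedAt)
{
    struct tm t = *std::localtime(&capturedAt);
    int bucket = (t.tm_min / 15) * 15;             // 0, 15, 30 or 45
    char path[64];
    std::sprintf(path, "%04d\\%02d\\%02d\\%02d\\%02d",
                 t.tm_year + 1900, t.tm_mon + 1, t.tm_mday, t.tm_hour, bucket);
    return path;
}

int main()
{
    std::printf("%s\n", BucketFolder(std::time(NULL)).c_str());
    return 0;
}

At a capture rate of roughly 5 images per second, each leaf folder stays in the low thousands of files, well below the point where the listings were bogging down.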