Multi-threaded file / directory search

I want to design and develop a multi-threaded file search process that searches a drive for files whose names match a token.  The search should begin in folder c:\xxxxxx\.... and cover all sub-folders.  However, I want to be able to start one thread for each sub-folder that is one level below the beginning folder.  So:
C:\xxxxx\a\...
C:\xxxxx\b\...
C:\xxxxx\c\...
C:\xxxxx\d\.....
Searching c:\xxxx above will start 4 threads, one for a, b, c, d.
Then the results from each thread should be aggregated for presentation.
My problem is that I've tried this in the past, and I believe I ran into a problem with multi-threaded access to the file system.  In other words, can the file system be searched in a multi-threaded manner?  Does it support this type of search?  I seem to remember finding that it did not.  Since the file system sits on a hardware device, can this type of search, if it can be done at all, be expected to be slower than a single thread?  I'm thinking of the drive being accessed by multiple threads searching different locations, with seek time increasing as a result.  Anyway, I need some way to speed up file searching in a large CDN environment, where users can have thousands or millions of files.
Any ideas?
jparlato Asked:
wdosanjos Commented:
You can use Parallel.ForEach to handle the threading.  The file system does support multi-threaded access.  Here is some sample code:
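using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

var folder = @"C:\temp\";
var subfolders = Directory.GetDirectories(folder);
var allfiles = new List<string>();

// One work item per first-level subfolder; the runtime decides how many run at once.
// Files sitting directly in the starting folder are not covered by this loop.
Parallel.ForEach(subfolders, subfolder => 
    {
        // "*" matches every file name; substitute a pattern such as "*token*" to filter.
        // AllDirectories makes each work item descend into its own subtree.
        var files = Directory.GetFiles(subfolder, "*", SearchOption.AllDirectories);
        
        // List<string> is not thread-safe, so serialize the merge.
        lock(allfiles)
        {
            allfiles.AddRange(files);
        }
    }
);

foreach (var file in allfiles)
{
    Console.WriteLine(file);
}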



I hope this helps.

HooKooDooKu Commented:
The general rule is that making an application multi-threaded doesn't make it run faster UNLESS you've got the hardware to support multiple threads.

In other words, back in the days of single-core CPUs, making an application multi-threaded did NOT make it run faster, because only one of the threads could ever run at one time.  What it did do was make the application more responsive, because it allowed the application to interact with the user while some background task was executing.

In today's world of multi-core CPUs, from what I understand, multiple threads DO run at the same time.  But if you only have one physical hard drive, two threads can't access the drive at the same time.

Where multi-threading CAN speed things up, even with a single disk, is when you have a bunch of things that have to be done between accesses to the drive.  In that situation, you hopefully get one thread accessing the disk while the other thread is processing the data.

The setup I would expect to run the fastest is one where only one thread accesses the disk while another thread does any processing.  Set up some "hand-shaking" between the two threads and have the 1st one search for files and, if files need to be read, load each file into a memory buffer.  Then while the secondary thread(s) process the file(s) from the memory buffer, the 1st thread can do additional file searching and file reading.
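
For anyone who wants to try the single-reader arrangement described above, here is a minimal sketch, not a definitive implementation.  It assumes the "processing" is only a case-insensitive match on the file name, and it uses a BlockingCollection as the hand-shake between the threads.  The class name PipelinedSearch, the method Find, and the queue capacity of 1024 are all made up for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class PipelinedSearch
{
    // Hypothetical helper: one producer walks the disk, several consumers filter names.
    public static List<string> Find(string root, string token, int consumers = 2)
    {
        // Bounded queue so the disk reader cannot run arbitrarily far ahead.
        var queue = new BlockingCollection<string>(boundedCapacity: 1024);
        var matches = new ConcurrentBag<string>();

        // Producer: the only thread that touches the file system.
        // Note: EnumerateFiles throws if it reaches a folder it may not read.
        var producer = Task.Run(() =>
        {
            try
            {
                foreach (var path in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
                    queue.Add(path);
            }
            finally
            {
                // Always signal completion so the consumers can exit.
                queue.CompleteAdding();
            }
        });

        // Consumers: purely in-memory work (here, a name match on the token).
        var workers = Enumerable.Range(0, consumers).Select(_ => Task.Run(() =>
        {
            foreach (var path in queue.GetConsumingEnumerable())
            {
                if (Path.GetFileName(path).IndexOf(token, StringComparison.OrdinalIgnoreCase) >= 0)
                    matches.Add(path);
            }
        })).ToArray();

        Task.WaitAll(workers);
        producer.Wait(); // rethrows any enumeration failure
        return matches.ToList();
    }
}
```

A call would look like PipelinedSearch.Find(@"C:\temp\", "report").  With work as cheap as a name match, the consumers will mostly sit idle; the arrangement pays off only when each file needs real processing after it is found.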
käµfm³d 👽 Commented:
Definitely agree with HooKooDooKu on this:  the disk drive is the bottleneck. If you were searching multiple drives, then maybe you could get some benefit from threading.
wdosanjos Commented:
I think the comments around limitations on multi-threaded access to disk devices do not take into account high-end RAID-5 and RAID-10 devices that do support concurrent access.

I would avoid premature optimization of the code.  Try to make it work multi-threaded first, and only if the performance is not what you expect, add thread synchronization around the directory/file operations as recommended by HooKooDooKu.  It's possible that, thanks to buffering and your hardware configuration, multi-threaded directory/file access will work just fine.
käµfm³d 👽 Commented:
"I think the comments around limitations on multi-threaded access to disk devices do not take into account high-end RAID-5 and RAID-10 devices that do support concurrent access."
...and you know the author is using a "high-end RAID-5 [or] RAID-10" device how?
käµfm³d 👽 Commented:
Besides...

From http://en.wikipedia.org/wiki/RAID :
RAID 5 requires at least three disks.
RAID 1+0: (a.k.a. RAID 10) mirrored sets in a striped set (minimum four drives; even number of drives)...

Both of which would corroborate my claim of:

If you were searching multiple drives, then maybe you could get some benefit from threading.

= )
wdosanjos Commented:
Hi @kaufmed,

I was talking in generic terms, as the first comments give the impression that no disk device can support simultaneous access, which is not the case.

I think the author should first try without synchronization between the threads; there is a chance it will perform well.  Synchronizing the directory/file operations may actually slow down the process, depending on the particular hardware configuration.  If possible, the author should try both to determine which one works better for his/her case.
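
One way to run that comparison is sketched below.  This is only an illustration: the class SearchTiming and its methods are hypothetical names, and it merely counts matching files under each first-level subfolder, once sequentially and once with Parallel.ForEach, and times each with a Stopwatch.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class SearchTiming
{
    // Single-threaded baseline: walk each first-level subfolder in turn.
    public static int CountSequential(string root, string pattern) =>
        Directory.GetDirectories(root)
                 .Sum(sub => Directory.EnumerateFiles(sub, pattern, SearchOption.AllDirectories).Count());

    // Parallel variant: one work item per first-level subfolder.
    public static int CountParallel(string root, string pattern)
    {
        int total = 0;
        Parallel.ForEach(Directory.GetDirectories(root), sub =>
        {
            int n = Directory.EnumerateFiles(sub, pattern, SearchOption.AllDirectories).Count();
            Interlocked.Add(ref total, n); // thread-safe accumulation without a lock
        });
        return total;
    }

    // Runs a search once and reports how long it took.
    public static TimeSpan Time(Func<int> search)
    {
        var sw = Stopwatch.StartNew();
        search();
        sw.Stop();
        return sw.Elapsed;
    }
}
```

For example, SearchTiming.Time(() => SearchTiming.CountParallel(@"C:\temp\", "*.log")) returns the elapsed time for the parallel run.  Keep in mind that the operating system caches directory data, so whichever variant runs second tends to look faster; repeating each measurement several times, and alternating the order, gives a fairer picture.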
jparlato (Author) Commented:
Both answers were very helpful... as well as the comments from others.  I got exactly what I needed and am testing the code now.  I will post back as soon as I know if I was able to improve performance.  Thanks to everyone that contributed.
C#
