Solved

Multi-threaded file / directory search

Posted on 2012-04-03
Last Modified: 2012-04-04
I want to design and develop a multi-threaded file search process that searches a drive for files matching a name token.  The search should begin in folder c:\xxxxxx\.... and search all sub-folders.  However, I want to be able to start one thread for each sub-folder that is one level below the beginning folder.  So:
C:\xxxxx\a\...
C:\xxxxx\b\...
C:\xxxxx\c\...
C:\xxxxx\d\.....
Searching c:\xxxx above will start 4 threads, one for a, b, c, d.
Then the results from each thread should be aggregated for presentation.
My problem is that I've tried this in the past and I believe I ran into a problem with multi-threaded access to the file system.  In other words, can the file system be searched in a multi-threaded manner?  Does it support this type of search?  I seem to remember finding that it did not...  Since the file system sits on a hardware device, can this type of search, if it can be done at all, be expected to be slower than a single thread?  I'm thinking of the drive being accessed by multiple threads searching different locations, and the seek time increasing as a result...  Anyway, I need some way to speed up file searching in a large CDN environment, where users can have thousands or even millions of files.
Any ideas?
Question by:jparlato
8 Comments
 

Accepted Solution

by:wdosanjos
wdosanjos earned 200 total points
You can use Parallel.ForEach to handle the threading.  The file system does support multi-threaded access.  Here is some sample code:
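using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

var folder = @"C:\temp\";
var pattern = "*token*";   // placeholder wildcard for the name token being searched for
var subfolders = Directory.GetDirectories(folder);

// Files directly in the starting folder are not covered by the per-subfolder loop below.
var allfiles = new List<string>(Directory.GetFiles(folder, pattern));

// One work item per first-level subfolder; each one searches its whole subtree.
Parallel.ForEach(subfolders, subfolder => 
    {
        // Throws UnauthorizedAccessException if any folder in the subtree cannot be read.
        var files = Directory.GetFiles(subfolder, pattern, SearchOption.AllDirectories);
        
        // List<T> is not thread-safe, so the merge step is serialized.
        lock(allfiles)
        {
            allfiles.AddRange(files);
        }
    }
);

foreach (var file in allfiles)
{
    Console.WriteLine(file);
}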


I hope this helps.
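
The sample above relies on Directory.GetFiles, which stops with an exception as soon as it reaches a folder it cannot read. In a large environment that may well happen, so here is a rough sketch of a per-subfolder search that skips unreadable folders instead. SafeEnumerate and SearchPerSubfolder are hypothetical helper names, and the wildcard pattern is an assumption about how the name token would be expressed; this is one possible shape, not the only one.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

// Hypothetical helper: walks one subtree and skips folders that cannot be read.
IEnumerable<string> SafeEnumerate(string root, string pattern)
{
    var pending = new Stack<string>();
    pending.Push(root);
    while (pending.Count > 0)
    {
        var dir = pending.Pop();
        string[] files, subdirs;
        try
        {
            files = Directory.GetFiles(dir, pattern);
            subdirs = Directory.GetDirectories(dir);
        }
        catch (UnauthorizedAccessException) { continue; } // no permission: skip this folder
        catch (DirectoryNotFoundException) { continue; }  // folder vanished during the scan

        foreach (var file in files) yield return file;
        foreach (var sub in subdirs) pending.Push(sub);
    }
}

// Hypothetical helper: one parallel work item per first-level subfolder.
List<string> SearchPerSubfolder(string root, string pattern)
{
    var results = new ConcurrentBag<string>(); // thread-safe, so no explicit lock is needed

    // Files sitting directly in the starting folder.
    foreach (var file in Directory.GetFiles(root, pattern)) results.Add(file);

    Parallel.ForEach(Directory.GetDirectories(root), subfolder =>
    {
        foreach (var file in SafeEnumerate(subfolder, pattern)) results.Add(file);
    });

    return new List<string>(results);
}
```

A call such as SearchPerSubfolder(@"C:\temp\", "*token*") would return the matching paths in no particular order, since the subfolders finish at different times.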

Assisted Solution

by:HooKooDooKu
HooKooDooKu earned 200 total points
The general rule is that making an application multi-threaded doesn't make it run faster UNLESS you've got the hardware to support multiple threads.

In other words, back in the days of single-core CPUs, making an application multi-threaded did NOT make it run faster, because only one thread could run at any one time.  What it did do was make the application more responsive, because the application could keep interacting with the user while some background task was executing.

In today's world of multi-core CPUs, from what I understand, multiple threads DO run at the same time.  But if you only have one physical hard drive, the two threads can't access the drive at the same time.

Where multi-threading CAN speed things up, even for access to a single disk, is when you have a bunch of things to do between disk accesses.  In that situation, you hopefully get one thread accessing the disk while the other thread is processing the data.

The thing I would think might make the application run fastest is to set up a situation where only one thread accesses the disk while another thread does any processing.  Set up some "hand-shaking" between the two threads: have the 1st one search for files and, if files need to be read, load each file into a memory buffer.  Then, while the secondary thread(s) process the file(s) from the memory buffer, the 1st thread can do additional file searching and file reading.
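
The single-reader idea described above could be sketched roughly like this, using BlockingCollection from System.Collections.Concurrent as the "hand-shaking" mechanism. SearchWithSingleReader, the queue capacity of 1000 and the consumer count are all hypothetical choices for illustration. The per-file work here is only a name comparison, which is trivial; the pattern would only pay off if the consumers did real processing.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper: one producer walks the disk, several consumers process the paths.
List<string> SearchWithSingleReader(string root, string token, int consumerCount)
{
    // Bounded queue: the reader blocks when the consumers fall behind.
    var queue = new BlockingCollection<string>(boundedCapacity: 1000);
    var matches = new ConcurrentBag<string>();

    // Producer: the only code that touches the file system.
    var producer = Task.Run(() =>
    {
        try
        {
            foreach (var path in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
                queue.Add(path);
        }
        finally
        {
            queue.CompleteAdding(); // lets the consumers finish even if enumeration fails
        }
    });

    // Consumers: in-memory work only (here, a case-insensitive name match).
    var consumers = Enumerable.Range(0, consumerCount).Select(_ => Task.Run(() =>
    {
        foreach (var path in queue.GetConsumingEnumerable())
        {
            if (Path.GetFileName(path).IndexOf(token, StringComparison.OrdinalIgnoreCase) >= 0)
                matches.Add(path);
        }
    })).ToArray();

    Task.WaitAll(consumers);
    producer.Wait(); // rethrows any enumeration error
    return matches.ToList();
}
```

For example, SearchWithSingleReader(@"C:\temp\", "token", 4) would keep all disk access on one task while four tasks do the matching.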

Expert Comment

by:käµfm³d 👽
Definitely agree with HooKooDooKu on this:  the disk drive is the bottleneck. If you were searching multiple drives, then maybe you could get some benefit from threading.

Expert Comment

by:wdosanjos
I think the comments around limitations on multi-threaded access to disk devices do not take into account high-end RAID-5 and RAID-10 devices that do support concurrent access.

I would avoid early optimization of the code.  Try to make it work multi-threaded first, and only if performance falls short of expectations add thread synchronization around the directory/file operations, as recommended by HooKooDooKu.  It's possible that, thanks to buffering and your hardware configuration, multi-threaded directory/file access will work just fine.

Expert Comment

by:käµfm³d 👽
wdosanjos wrote: "I think the comments around limitations on multi-threaded access to disk devices do not take into account high-end RAID-5 and RAID-10 devices that do support concurrent access."
...and you know the author is using a "high-end RAID-5 [or] RAID-10" device how?

Expert Comment

by:käµfm³d 👽
Comment Utility
Besides...

From http://en.wikipedia.org/wiki/RAID :
RAID 5 requires at least three disks.
RAID 1+0: (a.k.a. RAID 10) mirrored sets in a striped set (minimum four drives; even number of drives)...

Both of which would corroborate my claim of:

If you were searching multiple drives, then maybe you could get some benefit from threading.

= )

Expert Comment

by:wdosanjos
Hi @kaufmed,

I was talking in generic terms as the first comments give the impression that all disk devices cannot support simultaneous access, which is not the case.

I think the author should first try without synchronization between the threads; there is a chance it will perform well.  Synchronizing the directory/file operations may actually slow the process down, depending on the particular hardware configuration.  If possible, the author should try both to determine which one works better for his/her case.
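
One way to try both is a small timing harness along these lines. CountSequential, CountParallel and Measure are hypothetical helper names, and counting files stands in for whatever the real search does. Treat it as a sketch: the operating system caches directory data, so whichever variant runs second may look faster than it really is. Running each variant several times, in alternating order, should give a fairer picture.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: single-threaded recursive count of matching files.
int CountSequential(string root, string pattern)
{
    return Directory.EnumerateFiles(root, pattern, SearchOption.AllDirectories).Count();
}

// Hypothetical helper: one parallel work item per first-level subfolder.
int CountParallel(string root, string pattern)
{
    int total = Directory.EnumerateFiles(root, pattern).Count(); // files directly in root
    Parallel.ForEach(Directory.GetDirectories(root), subfolder =>
    {
        int n = Directory.EnumerateFiles(subfolder, pattern, SearchOption.AllDirectories).Count();
        Interlocked.Add(ref total, n); // thread-safe accumulation
    });
    return total;
}

// Times one search and prints the file count and elapsed wall-clock time.
TimeSpan Measure(string label, Func<int> search)
{
    var watch = Stopwatch.StartNew();
    int count = search();
    watch.Stop();
    Console.WriteLine($"{label}: {count} files in {watch.ElapsedMilliseconds} ms");
    return watch.Elapsed;
}
```

Usage would look like Measure("sequential", () => CountSequential(@"C:\temp\", "*token*")) followed by Measure("parallel", () => CountParallel(@"C:\temp\", "*token*")), repeated a few times on the actual hardware.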

Author Closing Comment

by:jparlato
Both answers were very helpful... as well as the comments from others.  I got exactly what I needed and am testing the code now.  I will post back as soon as I know if I was able to improve performance.  Thanks to everyone that contributed.