Solved

Multi-threaded file / directory search

Posted on 2012-04-03
Last Modified: 2012-04-04
I want to design and develop a multi-threaded file search process that searches a drive for files whose names match a token.  The search should begin in folder c:\xxxxxx\.... and cover all sub-folders.  However, I want to start one thread for each sub-folder that is one level below the beginning folder.  So:
C:\xxxxx\a\...
C:\xxxxx\b\...
C:\xxxxx\c\...
C:\xxxxx\d\.....
Searching c:\xxxx above will start 4 threads, one for a, b, c, d.
Then the results from each thread should be aggregated for presentation.
My problem is that I've tried this in the past, and I believe I ran into a problem with multi-threaded access to the file system.  In other words, can the file system be searched from multiple threads at once?  Does it support this type of search?  I seem to remember finding that it did not.  Since the file system sits on a hardware device, can this type of search, if it can be done at all, be expected to be slower than a single thread?  I'm thinking of the drive being accessed by multiple threads searching different locations, and the seek time increasing as a result.  Anyway, I need some way to speed up file searching in a large CDN environment, where users can have thousands or millions of files.
Any ideas?
Question by:jparlato
8 Comments
 
LVL 23

Accepted Solution

by:
wdosanjos earned 800 total points
ID: 37802481
You can use Parallel.ForEach to handle the threading.  The file system does support multi-threaded access.  Here is some sample code:
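using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

var folder = @"C:\temp\";
// One parallel work item per first-level subfolder.
var subfolders = Directory.GetDirectories(folder);
var allfiles = new List<string>();

Parallel.ForEach(subfolders, subfolder => 
    {
        // AllDirectories makes the search descend into every nested sub-folder;
        // replace "*" with a wildcard pattern to filter by file name.
        var files = Directory.GetFiles(subfolder, "*", SearchOption.AllDirectories);
        
        // List<string> is not thread-safe, so serialize the merge.
        lock(allfiles)
        {
            allfiles.AddRange(files);
        }
    }
);

// Present the aggregated results once all subfolders are done.
foreach (var file in allfiles)
{
    Console.WriteLine(file);
}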
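A variant of the sample above, as a minimal sketch rather than a definitive implementation: it filters by a name token, includes files that sit directly in the starting folder, and collects results in a ConcurrentBag instead of a locked list. The class name TokenSearch, the method Find, and the "*token*" wildcard are assumptions made for illustration. Note that Parallel.ForEach schedules work on thread-pool threads, so it does not guarantee exactly one dedicated thread per subfolder.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public static class TokenSearch
{
    // Hypothetical helper: returns every file under root whose name contains token.
    public static List<string> Find(string root, string token)
    {
        var matches = new ConcurrentBag<string>();
        var pattern = "*" + token + "*";

        // Files directly in the starting folder are not covered by any subfolder work item.
        foreach (var file in Directory.EnumerateFiles(root, pattern))
            matches.Add(file);

        // One parallel work item per first-level subfolder.
        Parallel.ForEach(Directory.GetDirectories(root), subfolder =>
        {
            try
            {
                foreach (var file in Directory.EnumerateFiles(subfolder, pattern, SearchOption.AllDirectories))
                    matches.Add(file);
            }
            catch (UnauthorizedAccessException)
            {
                // A protected folder ends enumeration of this subtree;
                // matches found before that point are kept.
            }
        });

        // Sort so the presentation order does not depend on thread timing.
        return matches.OrderBy(f => f, StringComparer.OrdinalIgnoreCase).ToList();
    }

    public static void Main(string[] args)
    {
        // Usage: TokenSearch <startFolder> <token>
        foreach (var file in Find(args[0], args[1]))
            Console.WriteLine(file);
    }
}
```

EnumerateFiles streams results as it walks the tree, so matches start accumulating before a large subfolder has been fully scanned.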

I hope this helps.
 
LVL 16

Assisted Solution

by:HooKooDooKu
HooKooDooKu earned 800 total points
ID: 37802668
The general rule is that making an application multi-threaded doesn't make it run faster UNLESS you've got the hardware to support multiple threads.

In other words, back in the days of single-core CPUs, making an application multi-threaded did NOT make it run faster, because only one thread could run at any one time.  What it did do was make the application more responsive, because the application could keep interacting with the user while a background task was executing.

In today's world of multi-core CPUs, from what I understand, multiple threads DO run at the same time.  But if you only have one physical hard drive, two threads can't access the drive at the same time.

Where multi-threading CAN speed things up, even with a single disk, is when there is a lot of work to do between disk accesses.  In that situation, you hopefully get one thread accessing the disk while another thread is processing the data.

What I would expect to run fastest is a setup where only one thread accesses the disk while other threads do the processing.  Set up some "hand-shaking" between the threads: have the 1st thread search for files and, if files need to be read, load each file into a memory buffer.  Then, while the secondary thread(s) process the file(s) from the memory buffer, the 1st thread can carry on with file searching and file reading.
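The "hand-shaking" described above could be sketched with a BlockingCollection, assuming .NET 4.5 or later. This is one possible arrangement, not the only one: a single producer task is the only code that touches the disk, and a configurable number of worker tasks consume what it finds. The names ScanPipeline, Run, workerCount, and process are hypothetical, and the queue carries file paths only; loading file contents into a buffer in the producer would follow the same pattern.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ScanPipeline
{
    // Hypothetical helper: one task enumerates the disk, workerCount tasks call process(path).
    public static void Run(string root, int workerCount, Action<string> process)
    {
        // Bounded queue so the disk reader cannot run arbitrarily far ahead of the workers.
        using (var queue = new BlockingCollection<string>(boundedCapacity: 1000))
        {
            var producer = Task.Run(() =>
            {
                try
                {
                    foreach (var file in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
                        queue.Add(file); // blocks while the queue is full
                }
                finally
                {
                    queue.CompleteAdding(); // always signal completion so the workers can exit
                }
            });

            var workers = Enumerable.Range(0, workerCount)
                .Select(_ => Task.Run(() =>
                {
                    // Blocks until an item arrives; ends once adding is complete and the queue is empty.
                    foreach (var file in queue.GetConsumingEnumerable())
                        process(file);
                }))
                .ToArray();

            Task.WaitAll(workers);
            producer.Wait();
        }
    }

    public static void Main(string[] args)
    {
        int count = 0;
        Run(args[0], 4, path => Interlocked.Increment(ref count));
        Console.WriteLine(count + " files processed");
    }
}
```

The process callback runs on several threads at once, so it has to be thread-safe, and it should handle its own exceptions so that a failing worker does not leave the producer blocked on a full queue.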
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 37804116
Definitely agree with HooKooDooKu on this:  the disk drive is the bottleneck. If you were searching multiple drives, then maybe you could get some benefit from threading.
LVL 23

Expert Comment

by:wdosanjos
ID: 37804364
I think the comments about limitations on multi-threaded access to disk devices do not take into account high-end RAID-5 and RAID-10 devices, which do support concurrent access.

I would avoid optimizing the code too early.  Try to make it work multi-threaded first, and only if the performance falls short of what you expect add thread synchronization around the directory/file operations, as recommended by HooKooDooKu.  It's possible that, thanks to buffering and your hardware configuration, multi-threaded directory/file access will work just fine.
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 37804380
"I think the comments about limitations on multi-threaded access to disk devices do not take into account high-end RAID-5 and RAID-10 devices, which do support concurrent access."
...and you know the author is using a "high-end RAID-5 [or] RAID-10" device how?
 
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 37804402
Besides...

From http://en.wikipedia.org/wiki/RAID :
RAID 5 requires at least three disks.
RAID 1+0: (a.k.a. RAID 10) mirrored sets in a striped set (minimum four drives; even number of drives)...

Both of which would corroborate my claim: "If you were searching multiple drives, then maybe you could get some benefit from threading."

= )
 
LVL 23

Expert Comment

by:wdosanjos
ID: 37804429
Hi @kaufmed,

I was talking in generic terms, as the first comments give the impression that no disk device can support simultaneous access, which is not the case.

I think the author should first try without synchronization between the threads; there is a chance it will perform well.  Synchronizing the directory/file operations may actually slow the process down, depending on the particular hardware configuration.  If possible, the author should try both to determine which one works better for his/her case.
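One way to "try both" is a small timing harness. The sketch below is an assumption about how such a comparison might look, not a rigorous benchmark: it counts files under a folder once on a single thread and once with Parallel.ForEach over the first-level subfolders, timing each with Stopwatch. The names SearchTiming, CountSequential, CountParallel, and Time are hypothetical.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class SearchTiming
{
    // Single-threaded baseline: one recursive enumeration of the whole tree.
    public static int CountSequential(string root)
    {
        return Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories).Count();
    }

    // Parallel variant: one work item per first-level subfolder.
    public static int CountParallel(string root)
    {
        // Start with the files that sit directly in the starting folder.
        int total = Directory.EnumerateFiles(root).Count();

        Parallel.ForEach(Directory.GetDirectories(root), subfolder =>
        {
            int n = Directory.EnumerateFiles(subfolder, "*", SearchOption.AllDirectories).Count();
            Interlocked.Add(ref total, n); // thread-safe accumulation
        });

        return total;
    }

    // Runs one search and reports how long it took.
    public static TimeSpan Time(Func<string, int> search, string root, out int count)
    {
        var watch = Stopwatch.StartNew();
        count = search(root);
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Main(string[] args)
    {
        int n;
        var sequential = Time(CountSequential, args[0], out n);
        Console.WriteLine("sequential: " + n + " files in " + sequential.TotalMilliseconds + " ms");
        var parallel = Time(CountParallel, args[0], out n);
        Console.WriteLine("parallel:   " + n + " files in " + parallel.TotalMilliseconds + " ms");
    }
}
```

Whichever variant runs second benefits from directory data the operating system has already cached, so it is worth repeating the runs several times and swapping their order before drawing conclusions.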
 

Author Closing Comment

by:jparlato
ID: 37805435
Both answers were very helpful... as well as the comments from others.  I got exactly what I needed and am testing the code now.  I will post back as soon as I know if I was able to improve performance.  Thanks to everyone that contributed.