Using parallelization to find all directories containing a specific filename

nightshadz
I am developing an application that takes a root directory, finds all subdirectories created the prior day, then adds to a List all of those directories that contain a file with a specific filename.

I was researching the fastest way to do this and came across PLINQ. PLINQ is new to me, so I'm not sure if I'm missing something important. Am I using PLINQ correctly?

static void Main(string[] args)
{
    // ConcurrentBag<T> instead of List<T>: List<T>.AddRange is not
    // thread-safe, and Parallel.ForEach writes from multiple threads.
    var processThis = new ConcurrentBag<FileInfo>();

    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();

    DirectoryInfo root = new DirectoryInfo(@"\\server\Shared\CC\Jobs");

    // ToList() forces a single enumeration of the share; otherwise
    // Count() and Parallel.ForEach would each walk it separately.
    var dirs = root.EnumerateDirectories()
        .Where(x => x.CreationTime.Date == DateTime.Now.AddDays(-1).Date)
        .ToList();
    Console.WriteLine(dirs.Count);

    Parallel.ForEach(dirs, dir =>
    {
        foreach (var file in dir.EnumerateFiles().Where(f => f.Name.Contains("LUADUHC1")))
            processThis.Add(file);
    });
    Console.WriteLine(processThis.Count);

    stopWatch.Stop();
    Console.WriteLine(stopWatch.Elapsed);
    Console.ReadKey();
}
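
For comparison, here is my guess at what the actual PLINQ form would look like (AsParallel is new to me, so treat this as an unverified sketch; it needs System.Linq):

    // PLINQ sketch: AsParallel() lets PLINQ gather results itself,
    // so no shared List<T> is written to from multiple threads.
    var plinqResults = dirs.AsParallel()
        .SelectMany(dir => dir.EnumerateFiles()
            .Where(f => f.Name.Contains("LUADUHC1")))
        .ToList();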


David Favor, Fractional CTO
Distinguished Expert 2018
Commented:
If you try to parallelize any type of disk traversal, all you'll do is slow your code to a crawl.

Better to use a system command like find, which is optimized to walk large directory hierarchies quickly.

Using .NET primitives might work + it's best to test the speed difference between the .NET code + the command-line tool.

If you parallelize this access, all you'll do is have many threads causing random disk head seeks: 100s of threads all fighting with each other. Running disk heads chaotically across platters will be far slower than starting a find at the top of your disk hierarchy in a single thread, where disk reads occur as sequentially as possible.
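
If you stay in .NET, a minimal single-threaded sketch of the same scan (reusing the share path and filename from the question; untested, and it needs System, System.IO, and System.Linq) would be:

    // Single-threaded, single-pass walk: one thread issuing sequential
    // reads instead of many threads competing for the same disk.
    var yesterday = DateTime.Now.AddDays(-1).Date;
    var matches = new DirectoryInfo(@"\\server\Shared\CC\Jobs")
        .EnumerateDirectories()
        .Where(d => d.CreationTime.Date == yesterday)
        .Where(d => d.EnumerateFiles("*LUADUHC1*").Any())
        .Select(d => d.FullName)
        .ToList();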

Author

Commented:
I thought Parallel.ForEach was acting on an in-memory IEnumerable of directories and then enumerating the filenames in each directory. I'm curious why a regular foreach is slower than Parallel.ForEach in this instance?

    // slower: plain foreach, one directory at a time
    foreach (var dir in dirs)
    {
        if (dir.EnumerateFiles().Any(f => f.Name.Contains("LUADUHC1")))
            processThis.Add(dir.FullName);   // processThis is a List<string> here
    }

    // faster in my tests, but note: List<string>.Add is not
    // thread-safe inside Parallel.ForEach (see sketch below)
    Parallel.ForEach(dirs, dir =>
    {
        if (dir.EnumerateFiles().Any(f => f.Name.Contains("LUADUHC1")))
            processThis.Add(dir.FullName);
    });

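As an aside on the thread-safety caveat flagged in the faster version above: a ConcurrentBag<string> (from System.Collections.Concurrent) accepts concurrent Add calls, so a sketch along these lines would avoid lost writes:

    // Untested sketch: ConcurrentBag<T> is safe for concurrent Add,
    // unlike List<T>, so parallel workers can collect results directly.
    var safeResults = new ConcurrentBag<string>();
    Parallel.ForEach(dirs, dir =>
    {
        if (dir.EnumerateFiles().Any(f => f.Name.Contains("LUADUHC1")))
            safeResults.Add(dir.FullName);
    });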

kaufmed
Most Valuable Expert 2011
Top Expert 2015
Commented:
"running disk heads chaotically across platters will be far slower"
Do SSDs have platters now? 😜

But in all seriousness, as David alludes to, you could have issues with accessing information from a single drive in parallel. You're dealing with how the disk controller (whether the disk is spinning platter or solid-state) manages access to the bits. You'll just need to test what works for your environment.

Also, since you're accessing an I/O device, you may want to consider looking into async/await. You'll be able to let your CPU do meaningful work while it waits for I/O to complete.
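
A rough sketch of that idea follows. One caveat: DirectoryInfo has no async enumeration API, so this wraps the blocking scans in Task.Run (thread-pool offloading) rather than true async disk I/O, and FindMatchesAsync is just an illustrative name (needs System.Linq and System.Threading.Tasks):

    // Sketch: offload each blocking directory scan to the thread pool
    // and await them together.
    static async Task<List<string>> FindMatchesAsync(IEnumerable<DirectoryInfo> dirs)
    {
        var tasks = dirs.Select(dir => Task.Run(() =>
            dir.EnumerateFiles().Any(f => f.Name.Contains("LUADUHC1"))
                ? dir.FullName
                : null));

        // The calling thread stays free while the scans run.
        var results = await Task.WhenAll(tasks);
        return results.Where(path => path != null).ToList();
    }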

David Favor, Fractional CTO
Distinguished Expert 2018

Commented:
With SSDs you have a similar problem: you only have so much SSD cache memory and so many OS file buffers.

If parallel threads constantly eject all your buffers, your process still slows down.

With SSD drives you won't "feel" this slowness as badly as with mechanical drives + the net effect is you end up with complex code producing less speed than simple code.

Now if you have an array of 1000s of SSD drives + can arrange one thread per drive, then likely this will be faster.

For one drive or one array, best to use simple + single-threaded directory walking.
David Favor, Fractional CTO
Distinguished Expert 2018

Commented:
Actually, kaufmed brings up an additional slowing effect.

With many parallel threads you'll continually involve the I/O subsystem (interrupts, context switching, driver serialization, CPU buffer cache ejection), which for SSD drives will likely be a more pronounced slowdown than continual I/O buffer ejection.

Good call kaufmed.

Author

Commented:
Thank you both for the deep explanation. This gives me a greater understanding of how these things work. I'll stick with File.Copy.
David Favor, Fractional CTO
Distinguished Expert 2018

Commented:
You're welcome!
