Using parallelization to find all directories containing a specific filename

I am developing an application that takes a root directory, finds all subdirectories created the prior day, and then adds to a List every directory that contains a file with a specific filename.

I was researching the fastest way to do this and came across PLINQ. PLINQ is new to me so I'm not sure if I'm missing something important. Am I using PLINQ correctly?

static void Main(string[] args)
{
    List<FileInfo> processThis = new List<FileInfo>();

    //List<string> paths = new List<string>();
    Stopwatch stopWatch = new Stopwatch();

    DirectoryInfo root = new DirectoryInfo(@"\\server\Shared\CC\Jobs");

    var dirs = root.EnumerateDirectories()
                   .Where(x => x.CreationTime.Date == DateTime.Now.AddDays(-1).Date);

    Parallel.ForEach(dirs, dir =>
        processThis.AddRange(dir.EnumerateFiles().Where(f => f.Name.Contains("LUADUHC1"))));
}
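One caveat with the snippet above: List&lt;FileInfo&gt; is not thread-safe, so calling AddRange from multiple Parallel.ForEach threads can corrupt the list. A minimal sketch of the same search using ConcurrentBag&lt;T&gt;, which supports concurrent writes (the share path and "LUADUHC1" pattern are taken from the question):

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class ParallelSearch
{
    static void Main()
    {
        DirectoryInfo root = new DirectoryInfo(@"\\server\Shared\CC\Jobs");
        DateTime yesterday = DateTime.Now.AddDays(-1).Date;

        var dirs = root.EnumerateDirectories()
                       .Where(d => d.CreationTime.Date == yesterday);

        // ConcurrentBag allows adds from multiple threads without
        // external locking, unlike List<T>.
        var processThis = new ConcurrentBag<FileInfo>();

        Parallel.ForEach(dirs, dir =>
        {
            foreach (var f in dir.EnumerateFiles()
                                 .Where(f => f.Name.Contains("LUADUHC1")))
                processThis.Add(f);
        });

        Console.WriteLine($"Found {processThis.Count} file(s).");
    }
}
```

Whether the parallel version is actually faster is a separate question (see the discussion below), but if you do run it in parallel, the collection needs to be safe for concurrent writes.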

David FavorLinux/LXD/WordPress/Hosting SavantCommented:
If you try to parallelize any type of disk traversal, all you'll do is slow your code to a crawl.

Better to use a system command like find, which is optimized to quickly walk large directory hierarchies.

Using the .NET primitives might work; best to test the speed difference between the .NET code and the command-line tool.

If you parallelize this access, all you'll do is have many threads causing random disk head seeks. Hundreds of threads all fighting with each other, running the disk heads chaotically across the platters, will be far slower than starting a find at the top of your disk hierarchy in a single thread, so that disk reads occur sequentially as much as possible.
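A minimal single-threaded version of the same search (using the share path and "LUADUHC1" pattern from the question) would look something like this sketch:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class SequentialSearch
{
    static void Main()
    {
        // Root share from the question; substitute your own path.
        DirectoryInfo root = new DirectoryInfo(@"\\server\Shared\CC\Jobs");

        DateTime yesterday = DateTime.Now.AddDays(-1).Date;

        // One thread walks the hierarchy, so reads stay as sequential
        // as the on-disk layout allows.
        List<FileInfo> processThis = root
            .EnumerateDirectories()
            .Where(d => d.CreationTime.Date == yesterday)
            .SelectMany(d => d.EnumerateFiles())
            .Where(f => f.Name.Contains("LUADUHC1"))
            .ToList();

        Console.WriteLine($"Found {processThis.Count} file(s).");
    }
}
```

EnumerateDirectories/EnumerateFiles stream results lazily, so nothing is buffered beyond what the LINQ pipeline needs.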

nightshadzAuthor Commented:
I thought Parallel.ForEach was acting on an in-memory IEnumerable of directories, then enumerating the filenames in each directory. I'm curious why a regular foreach is slower than Parallel.ForEach in this instance?

            foreach (var dir in dirs)
            {
                if (dir.EnumerateFiles().Where(f => f.Name.Contains("LUADUHC1")).Any())
                    processThis.AddRange(dir.EnumerateFiles().Where(f => f.Name.Contains("LUADUHC1")));
            }

            Parallel.ForEach(dirs, dir =>
            {
                if (dir.EnumerateFiles().Where(f => f.Name.Contains("LUADUHC1")).Any())
                    processThis.AddRange(dir.EnumerateFiles().Where(f => f.Name.Contains("LUADUHC1")));
            });

kaufmed   ( ⚆ _ ⚆ )Commented:
"running disk heads chaotically across platters will be far slower"
Do SSDs have platters now?    😜

But in all seriousness, as David alludes to, you could have issues with accessing information from a single drive in parallel. You're dealing with how the disk controller--whether the disk is spinning platter or solid-state--manages access to the bits. You'll just need to test what works for your environment.

Also, since you're accessing an I/O device, you may consider looking into async/await. You'll be able to let your CPU do meaningful work while it's waiting for I/O to complete.
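Directory enumeration itself has no async API in the base class library, but if the job goes on to read the matched files, async I/O frees the thread while each read is in flight. A sketch, using a hypothetical path in place of the directories found earlier:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class AsyncReadExample
{
    static async Task Main()
    {
        // Hypothetical directory; substitute one found by the search.
        DirectoryInfo dir = new DirectoryInfo(@"\\server\Shared\CC\Jobs\job1");

        foreach (FileInfo file in dir.EnumerateFiles()
                                     .Where(f => f.Name.Contains("LUADUHC1")))
        {
            // The thread returns to the pool while the read completes,
            // instead of blocking on network or disk I/O.
            string contents = await File.ReadAllTextAsync(file.FullName);
            Console.WriteLine($"{file.Name}: {contents.Length} chars");
        }
    }
}
```

This doesn't make the disk faster; it just keeps the CPU available for other work during the waits.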

David FavorLinux/LXD/WordPress/Hosting SavantCommented:
With SSDs you have a similar problem: you only have so much SSD cache memory and so many OS file buffers.

If you constantly eject all your buffers because of parallel threads, your process still slows down.

With SSD drives you won't "feel" this slowness as badly as with mechanical drives, but the net effect is that you end up with complex code producing less speed than simple code.

Now, if you have an array of thousands of SSD drives and can arrange a single thread per drive, then this will likely be faster.

For one drive or one array, it's best to use simple, single-threaded directory walking.
David FavorLinux/LXD/WordPress/Hosting SavantCommented:
Actually kaufmed brings up an additional slowing effect.

With many parallel threads you'll continually involve the I/O subsystem (interrupts, context switching, driver serialization, CPU buffer cache eviction), which for SSD drives will likely be a more pronounced slowdown than continual I/O buffer ejection.

Good call kaufmed.
nightshadzAuthor Commented:
Thank you both for the deep explanation. This gives me a greater understanding of how these things work. I'll stick with File.Copy.
David FavorLinux/LXD/WordPress/Hosting SavantCommented:
You're welcome!