Solved

Efficient use of LINQ to FileInfo objects

Posted on 2008-06-09
Last Modified: 2013-12-17
I have a folder containing a very large number of files.
I have written a small console application, shown in the code snippet below, that uses LINQ to Objects.

My aim is to count the number of files in the folder that were created between two given dates.
Unfortunately, LINQ does not seem to be very efficient here: it still enumerates every file in the folder, even though I am trying to help it by sorting the collection and filtering it by creation date.

Is there a way to make this code more efficient?
class Program
{
    static void Main(string[] args)
    {
        DateTime FromDate = DateTime.Parse(args[0]);
        DateTime ToDate = DateTime.Parse(args[1]);

        GetStatistics(FromDate, ToDate);
    }

    private static void GetStatistics(DateTime FromDate, DateTime ToDate)
    {
        string Location = ConfigurationManager.AppSettings["InboxLocation"];
        DirectoryInfo diInbox = new DirectoryInfo(Location);
        FileInfo[] InboxFiles = diInbox.GetFiles();
        var files = from file in InboxFiles
                    where file.CreationTime >= FromDate //&& file.CreationTime <= ToDate
                    orderby file.CreationTime descending
                    select file;
        int count = files.Count();
        Console.WriteLine("Inbox Count: {0}", count);
        Console.ReadLine();
    }
}

Question by:Dabas
19 Comments
 
LVL 6

Accepted Solution

by:
cottsak earned 2000 total points
ID: 21748757
For the LINQ to Objects query to return all the results that match your 'where' condition, it MUST enumerate all the FileInfo objects, much as SQL would scan a table (though SQL uses indexes to speed that up). Your orderby will not speed up the process; that's not how LINQ works.
I don't think you can get around this.
 
LVL 16

Expert Comment

by:CuteBug
ID: 21749838
Hi,
     You can try the following query:

     var files = from file in InboxFiles
                 where file is FileInfo
                 let fileCr = (DateTime)file.CreationTime
                 where (fileCr.CompareTo(FromDate) >= 0) && (fileCr.CompareTo(ToDate) <= 0)
                 select file;

 
LVL 16

Expert Comment

by:CuteBug
ID: 21749858
Hi,
   Here is a further simplified query:

var files = from file in InboxFiles
            where (file.LastWriteTime.CompareTo(FromDate) >= 0) && (file.LastWriteTime.CompareTo(ToDate) <= 0)
            select file;


 
LVL 16

Expert Comment

by:CuteBug
ID: 21749869
Hi,
    Regarding my last comment: you have to use file.CreationTime instead of file.LastWriteTime.
    Sorry for the mistake!

    So the simplified query would look like this:

var query = from file in InboxFiles
            where (file.CreationTime.CompareTo(FromDate) >= 0) && (file.CreationTime.CompareTo(ToDate) <= 0)
            select file;

 
LVL 27

Author Comment

by:Dabas
ID: 21755703
Thanks CuteBug

Why do you think that your code is more efficient than my original one?
Take into account that this folder has thousands of files, and I am only interested in a few

 
LVL 16

Expert Comment

by:CuteBug
ID: 21755986
Hi Dabas,
        You were using the >= and <= operators to compare two DateTime objects.

        Using the DateTime.CompareTo() method is a safer and better option.
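
For reference, the DateTime comparison operators and CompareTo are expected to give the same answer, since both compare the underlying tick values, so neither form should change which files match or how fast the query runs. A minimal sketch of that check follows; the class and method names are illustrative and do not appear in the thread.

```csharp
using System;

public static class DateCompareCheck
{
    // Returns true when the operator form and the CompareTo form agree on
    // whether 'value' lies inside the inclusive range [from, to].
    public static bool FormsAgree(DateTime value, DateTime from, DateTime to)
    {
        bool viaOperators = value >= from && value <= to;
        bool viaCompareTo = value.CompareTo(from) >= 0 && value.CompareTo(to) <= 0;
        return viaOperators == viaCompareTo;
    }
}
```

Calling FormsAgree with dates inside, on the edge of, and outside the range should return true each time, which fits the later observation that the two forms made no significant difference.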
 
LVL 27

Author Comment

by:Dabas
ID: 21756028
Is it quicker?
It is taking LINQ AGES to traverse ALL of the files in the folder, when all I want is a count of the ones that meet the criteria. Is there any way to do something similar to SQL's TOP 100?
 
LVL 16

Expert Comment

by:CuteBug
ID: 21756069
Sorry Dabas,
        I don't know much about SQL.

        No matter which method you use, you will have to traverse all the files in order to compare their creation time with the given FromDate and ToDate.
        LINQ provides a simple-to-declare query syntax, whereas using foreach will require you to write more code to achieve the same result.

        Why don't you try running this file query in a separate thread? This will take the load off the main thread!
 
LVL 27

Author Comment

by:Dabas
ID: 21756113
It is the only thread... And it is taking too long! And as time goes by it will only take longer :(
 
LVL 6

Expert Comment

by:cottsak
ID: 21756203
Dabas,

CuteBug's code is not any quicker. LINQ will still need to enumerate ALL the file objects.

LINQ does have an equivalent of SQL's TOP; it's called .Take(). It won't help you either: for a sorted query like yours, LINQ will need to build the whole set in memory (ALL file objects) before it can give you the top 100 with .Take(100).

Perhaps you need to accept that this will be a time-consuming process and implement some sort of feedback system for the user. For this I suggest the BackgroundWorker: http://www.albahari.com/threading/part3.html#_BackgroundWorker
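
To make the Take() behaviour concrete, here is a small sketch that counts how many source items are actually pulled. The helper names (TakeDemo, Counted) are made up for illustration. On an unsorted sequence Take() is lazy and stops early, but once an orderby sits in front of it the whole source has to be consumed before the first item can come out. In the code from the question, GetFiles() has already loaded every file into an array, so Take() would have little left to save in either case.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TakeDemo
{
    // Yields 0..count-1 and invokes onPull each time an item is requested.
    public static IEnumerable<int> Counted(int count, Action onPull)
    {
        for (int i = 0; i < count; i++)
        {
            onPull();
            yield return i;
        }
    }

    // How many source items Take(n) pulls when there is no ordering.
    public static int PulledWithoutOrderBy(int total, int n)
    {
        int pulled = 0;
        Counted(total, () => pulled++).Take(n).ToList();
        return pulled;
    }

    // How many source items Take(n) pulls when the query is sorted first.
    public static int PulledWithOrderBy(int total, int n)
    {
        int pulled = 0;
        Counted(total, () => pulled++).OrderByDescending(x => x).Take(n).ToList();
        return pulled;
    }
}
```

With a source of 1000 items and n = 10, the unsorted version should pull only the first handful of items, while the sorted version pulls all 1000.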
 
LVL 27

Author Comment

by:Dabas
ID: 21756232
Thanks cottsak

As per my answer to CuteBug's suggestion of using a separate thread, same goes for the BackgroundWorker.
The utility is for my own purposes, and has just this one purpose: To find out the number of files that belong to the same day.

I will try your Take() suggestion. As far as I know, in SQL, when you specify top 100, it does NOT load the whole table into memory. If I am lucky, the same will happen here.

 
LVL 6

Expert Comment

by:cottsak
ID: 21756305
I don't think you will be lucky.

I think you need to look at this from another angle. For a start, don't use LINQ; I think the sorting part of the LINQ query is taking the most time. Try a simple foreach over diInbox.GetFiles() and conditionally increment a counter when the properties of each file object meet your criteria. Benchmark that! I think you may get better results. :D
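
A minimal sketch of that foreach approach, assuming the goal is a count per calendar day as described earlier in the thread. The DayCounter class and CountPerDay method are hypothetical names; the method takes plain DateTime values so it can be fed from any source of creation times.

```csharp
using System;
using System.Collections.Generic;

public static class DayCounter
{
    // Single pass, no sorting: tallies how many timestamps fall on each
    // calendar day within the inclusive range [from, to].
    public static Dictionary<DateTime, int> CountPerDay(
        IEnumerable<DateTime> creationTimes, DateTime from, DateTime to)
    {
        var counts = new Dictionary<DateTime, int>();
        foreach (DateTime created in creationTimes)
        {
            if (created < from || created > to)
                continue;

            int soFar;
            counts.TryGetValue(created.Date, out soFar); // soFar is 0 for a new day
            counts[created.Date] = soFar + 1;
        }
        return counts;
    }
}
```

It could be called as DayCounter.CountPerDay(diInbox.GetFiles().Select(f => f.CreationTime), FromDate, ToDate), which needs System.Linq for the Select. This still visits every file once; it only avoids the sort.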
 
LVL 27

Author Comment

by:Dabas
ID: 21756341
OK. Will do and let you know
 
LVL 27

Author Comment

by:Dabas
ID: 21827016
Here is my current code.
It achieves its goal, but as you all have pointed out, it still traverses through ALL of the files in the folder before coming to a result.
This also is true if you use Take() as suggested and mentioned by cottsak.
(I would have expected the code to do the equivalent of a Dir /o-d to order the files in the folder first, and then apply the LINQ statement only to the first files specified.)

CuteBug: There was no significant difference between using <= directly or CompareTo

Cottsak: I do not think that using a foreach to achieve the same will be significantly faster. LINQ does make this code quite easy to write (and read) though!
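
One option that appeared after this thread was written is DirectoryInfo.EnumerateFiles, available from .NET Framework 4.0. It streams directory entries one at a time instead of building the whole FileInfo[] up front, as GetFiles() does. The sketch below assumes that framework version; InboxStats and CountCreatedBetween are illustrative names. It does not avoid visiting every file, because a directory listing has no index on creation date, and even Dir /o-d has to read every entry before it can sort them.

```csharp
using System;
using System.IO;
using System.Linq;

public static class InboxStats
{
    // Streams directory entries rather than materialising a FileInfo[] first.
    // Assumes .NET Framework 4.0 or later, where EnumerateFiles exists.
    public static int CountCreatedBetween(string folder, DateTime from, DateTime to)
    {
        var inbox = new DirectoryInfo(folder);
        return inbox.EnumerateFiles()
                    .Count(f => f.CreationTime >= from && f.CreationTime <= to);
    }
}
```

Whether this is measurably faster on a given folder would need benchmarking; the main expected gain is lower memory use and an earlier start to the filtering.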


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Configuration;
using System.IO;
using System.Timers;
namespace PSCRate
{
    class Program
    {
        static void Main(string[] args)
        {
            DateTime FromDate = DateTime.Parse(args[0]);
            DateTime ToDate = DateTime.Parse(args[1]);
 
            GetStatistics(FromDate, ToDate);
        }
 
        private static void GetStatistics(DateTime FromDate, DateTime ToDate)
        {
            try
            {
                DateTime dt = DateTime.Now;
                string Location = ConfigurationManager.AppSettings["InboxLocation"];
 
                DirectoryInfo diInbox = new DirectoryInfo(Location);
                FileInfo[] InboxFiles = diInbox.GetFiles();
                Console.WriteLine(DateTime.Now - dt);
                var files = from file in InboxFiles
                            where file.CreationTime.CompareTo(FromDate) >=0 && file.CreationTime.CompareTo(ToDate) <= 0
                            group file by file.CreationTime.Date into g
                            select new
                            {
                                Count = g.Count(),
                                Date = g.Key
                            };
 
 
                foreach (var g in files)
                {
                    Console.WriteLine("{0} {1}", g.Date, g.Count);
                }
                
                Console.WriteLine(DateTime.Now - dt);
                Console.ReadLine();
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
                System.IO.StreamWriter sw = new StreamWriter("Error.log", false);
                sw.WriteLine(ex.ToString());
                sw.Close();
                Console.ReadLine();
         
            }
 
        }
    }
}

 
LVL 6

Expert Comment

by:cottsak
ID: 21827767
Do whatever feels more comfortable to you. It's very important that a developer can read his or her own code. As for the slowness of LINQ: the sorting is the slowest part, as with sorting any collection. Are you happy with the code you have now?
 
LVL 27

Author Comment

by:Dabas
ID: 21827862
cottsak:

Thanks. I will test one or more alternatives and then allocate the points
 
LVL 27

Author Comment

by:Dabas
ID: 21863001
I think I have an improvement that incorporates Take()

My benchmarks so far show that using Take IS effectively quicker than if I leave Take out (not by much)
        private static void GetStatistics(DateTime FromDate, DateTime ToDate, int Top)
        {
            try
            {
                DateTime dt = DateTime.Now;
                string Location = ConfigurationManager.AppSettings["InboxLocation"];
 
                DirectoryInfo diInbox = new DirectoryInfo(Location);
                FileInfo[] InboxFiles = diInbox.GetFiles();
                Console.WriteLine(DateTime.Now - dt);
 
                //First we get the files and select only their dates while we order them descending
                var files = from file in InboxFiles
                            orderby file.CreationTime.Date descending
                            select new
                            {
                                Date = file.CreationTime.Date
                            };
 
                //Attempt using TakeWhile.. did not work
                //var files1 = files.TakeWhile((file, Date) => (file.Date >= FromDate && file.Date <= ToDate));
 
                //We take a subset by using Take. Top is the number of files we want to restrict our search on
                var files1 = files.Take(Top);
 
                //We group the results just found
                var files2 = from file in files1
                             where file.Date >= FromDate && file.Date <= ToDate
                             group file by file.Date into g
                             
                             select new
                            {
                                Count = g.Count(),
                                Date = g.Key
                            };
 
                foreach (var g in files2)
                {
                    Console.WriteLine("{0} {1}", g.Date, g.Count);
                }
                //Console.WriteLine(ts.Seconds);
                Console.WriteLine(DateTime.Now - dt);
                Console.ReadLine();
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
                System.IO.StreamWriter sw = new StreamWriter("Error.log", false);
                sw.WriteLine(ex.ToString());
                sw.Close();
                Console.ReadLine();
         
            }
 
        }
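
A likely reason the commented-out TakeWhile attempt did not work: TakeWhile stops at the first element that fails its predicate, so on a newest-first list it returns nothing as soon as the newest file is later than ToDate. In addition, the two-parameter lambda form binds its second parameter to the element's index, not to a date. One way to express the intended window on an already sorted sequence is SkipWhile followed by TakeWhile. The sketch below uses hypothetical names (RangeWindow, Between) and plain DateTime values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RangeWindow
{
    // Expects dates sorted newest-first. Skips entries newer than 'to',
    // then keeps entries until the first one older than 'from'.
    public static List<DateTime> Between(
        IEnumerable<DateTime> newestFirst, DateTime from, DateTime to)
    {
        return newestFirst
            .SkipWhile(d => d > to)
            .TakeWhile(d => d >= from)
            .ToList();
    }
}
```

This only tidies the filtering step. The orderby that produces the newest-first sequence still has to read every file before the first element is available, so it is not expected to reduce the overall running time.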

 
LVL 6

Expert Comment

by:cottsak
ID: 21863368
Wow. Surprising... good work.
My scepticism still tries to tell me that the benchmark improvement you discovered is a statistical outlier that might not be reproducible with testing at scale... but don't listen to that voice. Good work!!! :D
 
LVL 27

Author Comment

by:Dabas
ID: 21941644
I finally dedicated more time to test my theory further.
Unfortunately, my previous conclusion seems to be wrong. Pity

It seems that it does not significantly matter if you use Take or not.

Back to the drawing board... :(

I will keep you posted, should I find something different.