Solved

Efficient use of LINQ to FileInfo objects

Posted on 2008-06-09
4,445 Views
Last Modified: 2013-12-17
I have a folder with a very large number of files in it.
I have written a small console application, shown in the code snippet below, using LINQ to Objects.

My aim is to count the number of files in the folder whose creation dates fall between two given dates.
Unfortunately, LINQ does not seem to be very efficient here: it still enumerates all of the files in the folder, even though I am trying to help it along by sorting the collection and filtering it by creation date.

Is there a way to make this code more efficient?
class Program
{
    static void Main(string[] args)
    {
        DateTime FromDate = DateTime.Parse(args[0]);
        DateTime ToDate = DateTime.Parse(args[1]);

        GetStatistics(FromDate, ToDate);
    }

    private static void GetStatistics(DateTime FromDate, DateTime ToDate)
    {
        string Location = ConfigurationManager.AppSettings["InboxLocation"];
        DirectoryInfo diInbox = new DirectoryInfo(Location);
        FileInfo[] InboxFiles = diInbox.GetFiles();

        var files = from file in InboxFiles
                    where file.CreationTime >= FromDate //&& file.CreationTime <= ToDate
                    orderby file.CreationTime descending
                    select file;

        int count = files.Count();
        Console.WriteLine("Inbox Count: {0}", count);
        Console.ReadLine();
    }
}


Question by: Dabas

19 Comments
 
Accepted Solution by: cottsak (earned 500 total points)

For the LINQ to Objects query to return all the results that match your 'where' condition, it MUST enumerate every file object, much as SQL would scan a table (though SQL can use indexes to speed that up). Your orderby will not speed up the process; that is not how LINQ works.
I don't think you can get around this.
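
To illustrate (a sketch against your existing InboxFiles array, not a way around the enumeration): if all you need is the count, dropping the orderby at least saves the sort, because Count() does not need the results ordered.

// Sketch only: same filter, no sort. Still enumerates every FileInfo,
// but skips the O(n log n) ordering that a plain count never needs.
int count = InboxFiles.Count(file => file.CreationTime >= FromDate &&
                                     file.CreationTime <= ToDate);
Console.WriteLine("Inbox Count: {0}", count);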

Expert Comment by: CuteBug

Hi,
     You can use the following query:

     var files = from file in InboxFiles
                 where file is FileInfo
                 let fileCr = (DateTime)file.CreationTime
                 where (fileCr.CompareTo(FromDate) >= 0) && (fileCr.CompareTo(ToDate) <= 0)
                 select file;

Expert Comment by: CuteBug

Hi,
   Here is a further simplified query:

   var files = from file in InboxFiles
               where (file.LastWriteTime.CompareTo(FromDate) >= 0) && (file.LastWriteTime.CompareTo(ToDate) <= 0)
               select file;

Expert Comment by: CuteBug

Hi,
    Regarding my last comment: you have to use file.CreationTime instead of file.LastWriteTime.
    Sorry for the mistake!

    So the simplified query would look like this:

    var query = from file in InboxFiles
                where (file.CreationTime.CompareTo(FromDate) >= 0) && (file.CreationTime.CompareTo(ToDate) <= 0)
                select file;

Author Comment by: Dabas

Thanks CuteBug.

Why do you think your code is more efficient than my original?
Bear in mind that this folder has thousands of files, and I am only interested in a few.

Expert Comment by: CuteBug

Hi Dabas,
        You were using the >= and <= operators to compare two DateTime objects.

        Using the DateTime.CompareTo() method is a safer and better option.

Author Comment by: Dabas

Is it quicker?
It is taking LINQ AGES to traverse ALL of the files in the folder, when all I want is a count of the ones that meet the criteria. Is there any way to do something similar to SQL's TOP 100?

Expert Comment by: CuteBug

Sorry Dabas,
        I don't know much about SQL.

        No matter which method you use, you will have to traverse all the files in order to compare their creation times with the given FromDate and ToDate.
        LINQ provides a simple-to-declare query syntax, whereas using foreach requires you to write more code just to achieve the same result.

        Why don't you try running this file query in a separate thread? That would take the load off the main thread!

Author Comment by: Dabas

It is the only thread... And it is taking too long! And as time goes by it will only take longer :(

Expert Comment by: cottsak

Dabas,

CuteBug's code is not any quicker. LINQ will still need to enumerate ALL the file objects.

LINQ does have an SQL TOP equivalent; it's called .Take(). But it won't help you either: LINQ will need to build the range set in memory (ALL the file objects) before it can give you the top 100 with .Take(100).

Perhaps you need to accept that this will be a time-consuming process and implement some sort of feedback system for the user. For that I suggest the BackgroundWorker: http://www.albahari.com/threading/part3.html#_BackgroundWorker
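
Something along these lines (a rough sketch only; the folder path and date range below are placeholders, and in a console app the completed handler fires on a thread-pool thread rather than being marshalled back to a UI thread):

// Rough sketch of the BackgroundWorker idea. The path and date range are
// placeholders; in WinForms RunWorkerCompleted would run on the UI thread,
// in a console app it runs on a thread-pool thread.
using System;
using System.ComponentModel;
using System.IO;
using System.Linq;

class WorkerSketch
{
    static void Main()
    {
        DateTime fromDate = DateTime.Today.AddDays(-7);   // placeholder range
        DateTime toDate = DateTime.Today;

        BackgroundWorker worker = new BackgroundWorker();

        worker.DoWork += delegate(object sender, DoWorkEventArgs e)
        {
            // The slow part: enumerate the folder and count the matches.
            FileInfo[] files = new DirectoryInfo(@"C:\Inbox").GetFiles(); // placeholder path
            e.Result = files.Count(f => f.CreationTime >= fromDate &&
                                        f.CreationTime <= toDate);
        };

        worker.RunWorkerCompleted += delegate(object sender, RunWorkerCompletedEventArgs e)
        {
            Console.WriteLine("Inbox Count: {0}", e.Result);
        };

        worker.RunWorkerAsync();
        Console.WriteLine("Counting in the background; press Enter to exit.");
        Console.ReadLine();
    }
}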

Author Comment by: Dabas

Thanks cottsak.

As per my answer to CuteBug's suggestion of using a separate thread, the same goes for the BackgroundWorker.
The utility is for my own purposes and has just this one purpose: to find out the number of files that belong to the same day.

I will try your Take() suggestion. As far as I know, when you specify TOP 100 in SQL, it does NOT load the whole table into memory. If I am lucky, the same will happen here.

Expert Comment by: cottsak

I don't think you will be lucky.

I think you need to look at this from another angle, like not using LINQ for a start; I think the sorting part of the LINQ query is taking the most time. Try a simple foreach over diInbox.GetFiles() and conditionally increment a counter wherever a file's properties meet your criteria. Benchmark that! I think you may get better results. :D
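
Roughly like this (a sketch reusing the diInbox, FromDate and ToDate from your code):

// Sketch only: plain loop and counter instead of a LINQ query,
// reusing diInbox, FromDate and ToDate from the code above.
int count = 0;
foreach (FileInfo file in diInbox.GetFiles())
{
    if (file.CreationTime >= FromDate && file.CreationTime <= ToDate)
    {
        count++;
    }
}
Console.WriteLine("Inbox Count: {0}", count);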

Author Comment by: Dabas

OK. Will do and let you know

Author Comment by: Dabas

Here is my current code.
It achieves its goal, but as you all have pointed out, it still traverses ALL of the files in the folder before coming to a result.
That is also true if I use Take() as cottsak suggested.
(I would have expected the code to do the equivalent of Dir /o-d first, ordering the files in the folder, and only then apply the LINQ statement to the first files specified.)

CuteBug: there was no significant difference between using <= directly and CompareTo.

cottsak: I do not think that using a foreach to achieve the same thing will be significantly faster. LINQ does make this code quite easy to write (and read), though!


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Configuration;
using System.IO;
using System.Timers;

namespace PSCRate
{
    class Program
    {
        static void Main(string[] args)
        {
            DateTime FromDate = DateTime.Parse(args[0]);
            DateTime ToDate = DateTime.Parse(args[1]);

            GetStatistics(FromDate, ToDate);
        }

        private static void GetStatistics(DateTime FromDate, DateTime ToDate)
        {
            try
            {
                DateTime dt = DateTime.Now;
                string Location = ConfigurationManager.AppSettings["InboxLocation"];

                DirectoryInfo diInbox = new DirectoryInfo(Location);
                FileInfo[] InboxFiles = diInbox.GetFiles();
                Console.WriteLine(DateTime.Now - dt);

                var files = from file in InboxFiles
                            where file.CreationTime.CompareTo(FromDate) >= 0 && file.CreationTime.CompareTo(ToDate) <= 0
                            group file by file.CreationTime.Date into g
                            select new
                            {
                                Count = g.Count(),
                                Date = g.Key
                            };

                foreach (var g in files)
                {
                    Console.WriteLine("{0} {1}", g.Date, g.Count);
                }

                Console.WriteLine(DateTime.Now - dt);
                Console.ReadLine();
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
                System.IO.StreamWriter sw = new StreamWriter("Error.log", false);
                sw.WriteLine(ex.ToString());
                sw.Close();
                Console.ReadLine();
            }
        }
    }
}


Expert Comment by: cottsak

Do whatever feels most comfortable to you; it's very important that a developer can read his or her own code. As for the slowness of LINQ: the sorting is the slowest part, as it is with sorting any collection. Are you happy with the code you have now?

Author Comment by: Dabas

cottsak:

Thanks. I will test one or more alternatives and then allocate the points

Author Comment by: Dabas

I think I have an improvement that incorporates Take().

My benchmarks so far show that using Take IS slightly quicker than leaving Take out (not by much).
private static void GetStatistics(DateTime FromDate, DateTime ToDate, int Top)
{
    try
    {
        DateTime dt = DateTime.Now;
        string Location = ConfigurationManager.AppSettings["InboxLocation"];

        DirectoryInfo diInbox = new DirectoryInfo(Location);
        FileInfo[] InboxFiles = diInbox.GetFiles();
        Console.WriteLine(DateTime.Now - dt);

        // First we get the files and select only their dates while ordering them descending
        var files = from file in InboxFiles
                    orderby file.CreationTime.Date descending
                    select new
                    {
                        Date = file.CreationTime.Date
                    };

        // Attempt using TakeWhile... did not work
        //var files1 = files.TakeWhile((file, Date) => (file.Date >= FromDate && file.Date <= ToDate));

        // We take a subset by using Take. Top is the number of files we want to restrict our search to
        var files1 = files.Take(Top);

        // We group the results just found
        var files2 = from file in files1
                     where file.Date >= FromDate && file.Date <= ToDate
                     group file by file.Date into g
                     select new
                     {
                         Count = g.Count(),
                         Date = g.Key
                     };

        foreach (var g in files2)
        {
            Console.WriteLine("{0} {1}", g.Date, g.Count);
        }

        //Console.WriteLine(ts.Seconds);
        Console.WriteLine(DateTime.Now - dt);
        Console.ReadLine();
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.ToString());
        System.IO.StreamWriter sw = new StreamWriter("Error.log", false);
        sw.WriteLine(ex.ToString());
        sw.Close();
        Console.ReadLine();
    }
}


Expert Comment by: cottsak

Wow, surprising... good work.
My scepticism still tells me that the benchmark improvement you found could be a statistical outlier that might not be reproducible with scaled testing... but don't listen to that voice. Good work!!! :D

Author Comment by: Dabas

I finally dedicated more time to testing my theory further.
Unfortunately, my previous conclusion seems to be wrong. Pity.

It seems that it does not matter significantly whether you use Take or not.

Back to the drawing board... :(

I will keep you posted should I find something different.
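
One avenue I may explore, assuming a move to a later .NET version (4.0 or newer), is DirectoryInfo.EnumerateFiles, which streams entries instead of building the whole FileInfo[] up front. A minimal sketch, reusing the same Location, FromDate and ToDate:

// Sketch only: requires .NET 4.0 or later. EnumerateFiles yields FileInfo
// objects lazily rather than materialising the whole array first; every
// file is still visited once, but the count can start consuming results
// immediately.
int count = new DirectoryInfo(Location)
    .EnumerateFiles()
    .Count(f => f.CreationTime >= FromDate && f.CreationTime <= ToDate);
Console.WriteLine("Inbox Count: {0}", count);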