Dabas (Australia) asked:

Efficient use of LINQ to FileInfo objects

I have a folder containing a very large number of files.
I have written a small console application, shown in the code snippet below, using LINQ to Objects.

My aim is to count the number of files in the folder whose creation dates fall between two given dates.
Unfortunately, LINQ does not seem to be very efficient here: it still enumerates all of the files in the folder, even though I am trying to help it along by sorting the collection and filtering it by creation date.

Is there a way to make this code more efficient?
using System;
using System.Configuration;
using System.IO;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        DateTime FromDate = DateTime.Parse(args[0]);
        DateTime ToDate = DateTime.Parse(args[1]);

        GetStatistics(FromDate, ToDate);
    }

    private static void GetStatistics(DateTime FromDate, DateTime ToDate)
    {
        string Location = ConfigurationManager.AppSettings["InboxLocation"];
        DirectoryInfo diInbox = new DirectoryInfo(Location);
        FileInfo[] InboxFiles = diInbox.GetFiles();

        // Filter by creation date, sort descending, then count the matches.
        var files = from file in InboxFiles
                    where file.CreationTime >= FromDate //&& file.CreationTime <= ToDate
                    orderby file.CreationTime descending
                    select file;
        int count = files.Count();
        Console.WriteLine("Inbox Count: {0}", count);
        Console.ReadLine();
    }
}


ASKER CERTIFIED SOLUTION by cottsak (Australia)
(This solution is only available to Experts Exchange members.)
Hi,
     You can use the following query:
     var files = from file in InboxFiles
                 where file is FileInfo
                 let fileCr = (DateTime)file.CreationTime
                 where (fileCr.CompareTo(FromDate) >= 0) && (fileCr.CompareTo(ToDate) <= 0)
                 select file;


Hi,
   Here is a further simplified query
var files = from file in InboxFiles
            where (file.LastWriteTime.CompareTo(FromDate) >= 0) && (file.LastWriteTime.CompareTo(ToDate) <= 0)
            select file;


Hi,
    Regarding the last comment, you have to use file.CreationTime instead of file.LastWriteTime.
    Sorry for the mistake!

    So the simplified query would look like this

   
var query = from file in InboxFiles
            where (file.CreationTime.CompareTo(FromDate) >= 0) && (file.CreationTime.CompareTo(ToDate) <= 0)
            select file;


Dabas (Asker):

Thanks CuteBug

Why do you think that your code is more efficient than my original one?
Take into account that this folder has thousands of files, and I am only interested in a few.

Hi Dabas,
        You were using the >= and <= operators to compare two DateTime objects.

        Using the DateTime.CompareTo() method is a safer and better option.
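
For what it's worth, the two comparison styles are interchangeable for DateTime values, which matches the benchmark observation later in the thread; a minimal sketch (not from the thread) illustrating the equivalence:

using System;

class CompareDemo
{
    static void Main()
    {
        DateTime a = new DateTime(2008, 6, 1);
        DateTime b = new DateTime(2008, 6, 15);

        // Relational operators and CompareTo both compare the underlying ticks,
        // so these two expressions always yield the same result for DateTime.
        bool viaOperator  = a <= b;                // true
        bool viaCompareTo = a.CompareTo(b) <= 0;   // true

        Console.WriteLine(viaOperator == viaCompareTo); // True
    }
}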
Dabas (Asker):

Is it quicker?
It is taking LINQ AGES to traverse through ALL of the files in the folder, when all I want is a count of the ones that meet the criteria. Any way to do something similar to SQL's TOP 100?
Sorry Dabas,
        I don't know much about SQL.

        No matter which method you use, you will have to traverse through all the files in order to compare their creation time with the given FromDate and ToDate.
        LINQ provides a simple-to-declare query syntax, whereas using foreach requires you to write more code to achieve the same result...

        Why don't you try running this file query in a separate thread? That will take the load off the main thread!
Dabas (Asker):

It is the only thread... And it is taking too long! And as time goes by it will only take longer :(
Dabas,

CuteBug's code is not any quicker: LINQ will still need to enumerate ALL the file objects.

LINQ does have an SQL TOP equivalent, it's called .Take(), but it won't help you either; LINQ will need to build the range set in memory (ALL file objects) before it can give you the top 100 via .Take(100).

Perhaps you need to consider that this will be a time-consuming process and implement some sort of feedback system for the user. For this I suggest the BackgroundWorker: http://www.albahari.com/threading/part3.html#_BackgroundWorker
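
A rough sketch of that idea, assuming it is applied to this console utility: the enumeration runs on a BackgroundWorker and reports progress while it churns. The folder path and the single-day filter here are purely illustrative, not taken from the question.

using System;
using System.ComponentModel;
using System.IO;
using System.Threading;

class WorkerSketch
{
    static void Main()
    {
        var done = new ManualResetEvent(false);
        var worker = new BackgroundWorker { WorkerReportsProgress = true };

        worker.DoWork += (s, e) =>
        {
            // The slow part: enumerating every FileInfo in the folder.
            FileInfo[] files = new DirectoryInfo(@"C:\Inbox").GetFiles(); // illustrative path
            int count = 0;
            for (int i = 0; i < files.Length; i++)
            {
                if (files[i].CreationTime.Date == DateTime.Today)
                    count++;
                if (i % 1000 == 0)
                    worker.ReportProgress((int)(100L * i / files.Length));
            }
            e.Result = count;
        };

        worker.ProgressChanged += (s, e) => Console.Write("\r{0}% scanned", e.ProgressPercentage);
        worker.RunWorkerCompleted += (s, e) =>
        {
            Console.WriteLine();
            Console.WriteLine("Matching files: {0}", e.Result);
            done.Set();
        };

        worker.RunWorkerAsync();
        Console.WriteLine("Scanning in the background; the main thread stays free for feedback...");
        done.WaitOne();
    }
}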
Dabas (Asker):

Thanks cottsak

As per my answer to CuteBug's suggestion of using a separate thread, same goes for the BackgroundWorker.
The utility is for my own purposes, and has just this one purpose: to find out the number of files that belong to the same day.

I will try your Take() suggestion. As far as I know, in SQL, when you specify top 100, it does NOT load the whole table into memory. If I am lucky, the same will happen here.

I don't think you will be lucky.

I think you need to look at this from another angle. Don't use LINQ, for a start; I think the sorting part of the LINQ query is taking the most time. Try a simple foreach over diInbox.GetFiles() and conditionally increment a counter where the properties of each file object meet your criteria (see the sketch below). Benchmark that! I think you may get better results. :D
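
A minimal sketch of that foreach approach, reusing the same InboxLocation setting and command-line dates from the question (the timing/benchmark code is left out):

using System;
using System.Configuration;
using System.IO;

class ForeachCount
{
    static void Main(string[] args)
    {
        DateTime fromDate = DateTime.Parse(args[0]);
        DateTime toDate = DateTime.Parse(args[1]);

        string location = ConfigurationManager.AppSettings["InboxLocation"];
        DirectoryInfo diInbox = new DirectoryInfo(location);

        // Plain loop: no sorting, no intermediate sequence - just count matches.
        int count = 0;
        foreach (FileInfo file in diInbox.GetFiles())
        {
            if (file.CreationTime >= fromDate && file.CreationTime <= toDate)
                count++;
        }

        Console.WriteLine("Inbox Count: {0}", count);
    }
}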
Dabas (Asker):

OK. Will do and let you know
Dabas (Asker):

Here is my current code.
It achieves its goal, but as you all have pointed out, it still traverses through ALL of the files in the folder before coming to a result.
This also is true if you use Take() as suggested and mentioned by cottsak.
(I would have expected the code to do a Dir /o-d first to order the files in the folder, and then only apply the LINQ statement to the first files specified.)

CuteBug: There was no significant difference between using <= directly or CompareTo.

Cottsak: I do not think that using a foreach to achieve the same will be significantly faster. LINQ does make this code quite easy to write (and read) though!


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Configuration;
using System.IO;
using System.Timers;
namespace PSCRate
{
    class Program
    {
        static void Main(string[] args)
        {
            DateTime FromDate = DateTime.Parse(args[0]);
            DateTime ToDate = DateTime.Parse(args[1]);
 
            GetStatistics(FromDate, ToDate);
        }
 
        private static void GetStatistics(DateTime FromDate, DateTime ToDate)
        {
            try
            {
                DateTime dt = DateTime.Now;
                string Location = ConfigurationManager.AppSettings["InboxLocation"];
 
                DirectoryInfo diInbox = new DirectoryInfo(Location);
                FileInfo[] InboxFiles = diInbox.GetFiles();
                Console.WriteLine(DateTime.Now - dt);
                var files = from file in InboxFiles
                            where file.CreationTime.CompareTo(FromDate) >= 0 && file.CreationTime.CompareTo(ToDate) <= 0
                            group file by file.CreationTime.Date into g
                            select new
                            {
                                Count = g.Count(),
                                Date = g.Key
                            };
 
 
                foreach (var g in files)
                {
                    Console.WriteLine("{0} {1}", g.Date, g.Count);
                }
                
                Console.WriteLine(DateTime.Now - dt);
                Console.ReadLine();
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
                // Log the error before waiting for the user to close the console.
                using (StreamWriter sw = new StreamWriter("Error.log", false))
                {
                    sw.WriteLine(ex.ToString());
                }
                Console.ReadLine();
            }
 
        }
    }
}


Do whatever feels more comfortable to you; it's very important that a developer can read his/her own code. As for the slowness of LINQ: the sorting is the slowest part, as with any sorting of any ordered collection. Are you happy with the code you have now?
Dabas (Asker):

cottsak:

Thanks. I will test one or more alternatives and then allocate the points
Dabas (Asker):

I think I have an improvement that incorporates Take().

My benchmarks so far show that using Take IS effectively quicker than leaving Take out (not by much).
        private static void GetStatistics(DateTime FromDate, DateTime ToDate, int Top)
        {
            try
            {
                DateTime dt = DateTime.Now;
                string Location = ConfigurationManager.AppSettings["InboxLocation"];
 
                DirectoryInfo diInbox = new DirectoryInfo(Location);
                FileInfo[] InboxFiles = diInbox.GetFiles();
                Console.WriteLine(DateTime.Now - dt);
 
                //First we get the files and select only their dates while we order them descending
                var files = from file in InboxFiles
                            orderby file.CreationTime.Date descending
                            select new
                            {
                                Date = file.CreationTime.Date
                            };
 
                //Attempt using TakeWhile.. did not work
                //var files1 = files.TakeWhile((file, Date) => (file.Date >= FromDate && file.Date <= ToDate));
 
                //We take a subset by using Take. Top is the number of files we want to restrict our search on
                var files1 = files.Take(Top);
 
                //We group the results just found
                //We group the results just found
                var files2 = from file in files1
                             where file.Date >= FromDate && file.Date <= ToDate
                             group file by file.Date into g
                             select new
                             {
                                 Count = g.Count(),
                                 Date = g.Key
                             };
 
                foreach (var g in files2)
                {
                    Console.WriteLine("{0} {1}", g.Date, g.Count);
                }
                //Console.WriteLine(ts.Seconds);
                Console.WriteLine(DateTime.Now - dt);
                Console.ReadLine();
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
                // Log the error before waiting for the user to close the console.
                using (StreamWriter sw = new StreamWriter("Error.log", false))
                {
                    sw.WriteLine(ex.ToString());
                }
                Console.ReadLine();
            }
 
        }


Wow. Surprising... good work.
My scepticism still tries to tell me that the benchmark increase you discovered is a statistical outlier that might not be reproducible with scaled testing... but don't listen to that voice. Good work!!! :D
Dabas (Asker):

I finally dedicated more time to test my theory further.
Unfortunately, my previous conclusion seems to be wrong. Pity.

It seems that it does not significantly matter whether you use Take or not.

Back to the drawing board... :(

I will keep you posted, should I find something different.