Dabas
asked on
Efficient use of LINQ to FileInfo objects
I have a folder with a very large quantity of files in it.
I have written a small console application, as per the code snippet, using LINQ to Objects.
My aim is to count the number of files in the folder that were created between two given dates.
Unfortunately, LINQ does not seem to be very efficient, as it still enumerates all of the files in the folder, even though I am trying to aid its work by sorting the collection and filtering it by creation date.
Is there a way to make this code more efficient?
class Program
{
    static void Main(string[] args)
    {
        DateTime FromDate = DateTime.Parse(args[0]);
        DateTime ToDate = DateTime.Parse(args[1]);
        GetStatistics(FromDate, ToDate);
    }

    private static void GetStatistics(DateTime FromDate, DateTime ToDate)
    {
        string Location = ConfigurationManager.AppSettings["InboxLocation"];
        DirectoryInfo diInbox = new DirectoryInfo(Location);
        FileInfo[] InboxFiles = diInbox.GetFiles();
        var files = from file in InboxFiles
                    where file.CreationTime >= FromDate //&& file.CreationTime <= ToDate
                    orderby file.CreationTime descending
                    select file;
        int count = files.Count();
        Console.WriteLine("Inbox Count: {0}", count);
        Console.ReadLine();
    }
}
Hi,
Here is a further simplified query
var files = from file in InboxFiles
where (file.LastWriteTime.CompareTo(FromDate) >= 0) && (file.LastWriteTime.CompareTo(ToDate) <= 0)
select file;
Hi,
Regarding my last comment, you have to use file.CreationTime instead of file.LastWriteTime.
Sorry for the mistake!
So the simplified query would look like this:
var query = from file in InboxFiles
where (file.CreationTime.CompareTo(FromDate) >= 0) && (file.CreationTime.CompareTo(ToDate) <= 0)
select file;
ASKER
Thanks CuteBug
Why do you think that your code is more efficient than my original one?
Take into account that this folder has thousands of files, and I am only interested in a few
Hi Dabas,
You were using the >= and <= operators to compare two DateTime objects.
Using the DateTime.CompareTo() method is a safer and better option.
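For anyone comparing the two styles: as far as I know, the DateTime comparison operators and DateTime.CompareTo() both compare the same underlying tick values, so the two range checks should give identical results. A minimal sketch to check that yourself (the class and method names are hypothetical, made up for this illustration):

```csharp
using System;

public static class DateRangeCheck
{
    // Inclusive range test written with the comparison operators.
    public static bool InRangeOperators(DateTime value, DateTime from, DateTime to)
    {
        return value >= from && value <= to;
    }

    // The same inclusive range test written with CompareTo.
    public static bool InRangeCompareTo(DateTime value, DateTime from, DateTime to)
    {
        return value.CompareTo(from) >= 0 && value.CompareTo(to) <= 0;
    }
}
```

Either form can go in the where clause; the choice is a matter of readability rather than correctness or speed.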
ASKER
Is it quicker?
It is taking LINQ AGES to traverse through ALL of the files in the folder, when all I want is a count of the ones that meet the criteria. Is there any way to do something similar to SQL's TOP 100?
Sorry Dabas,
I don't know much about SQL.
No matter which method you use, you will have to traverse all the files in order to compare their creation time with the given FromDate and ToDate.
LINQ provides a simple-to-declare query method, whereas using foreach will require you to write more code just to achieve the same result.
Why don't you try running this file query in a separate thread? That will take the load off the main thread!
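Since every file has to be visited anyway, one thing that may help is to avoid building the whole FileInfo[] and sorting it before counting. A rough sketch, assuming .NET Framework 4.0 or later (where DirectoryInfo.EnumerateFiles() is available); the class name InboxCounter and the folder parameter are made up for this illustration:

```csharp
using System;
using System.IO;
using System.Linq;

public static class InboxCounter
{
    // Counts the files in 'folder' whose creation time lies in [from, to].
    // EnumerateFiles() hands back entries as the directory is read, so no
    // complete FileInfo[] is materialised and nothing is sorted.
    public static int CountCreatedBetween(string folder, DateTime from, DateTime to)
    {
        return new DirectoryInfo(folder)
            .EnumerateFiles()
            .Count(f => f.CreationTime >= from && f.CreationTime <= to);
    }
}
```

This still looks at every file once; it only drops the up-front array and the orderby, neither of which a plain count needs.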
ASKER
It is the only thread... And it is taking too long! And as time goes by it will only take longer :(
Dabas,
CuteBug's code is not any quicker. LINQ will still need to enumerate ALL the file objects.
LINQ does have an SQL TOP equivalent, it's called .Take(), but it won't help you either. LINQ will need to build the range set in memory (ALL file objects) before it can give you the top 100 with .Take(100).
Perhaps you need to consider that this will be a time-consuming process and implement some sort of feedback system for the user. For this I suggest the BackgroundWorker - http://www.albahari.com/threading/part3.html#_BackgroundWorker
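To see why .Take() does not save work once an orderby is involved, here is a small sketch that counts how many source items actually get read. It uses an in-memory sequence rather than real files, and TakeDemo and its methods are hypothetical names for this illustration only:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TakeDemo
{
    // Yields 0..count-1 and calls onRead for every item handed out,
    // so the caller can observe how far the enumeration got.
    public static IEnumerable<int> CountedSource(int count, Action onRead)
    {
        for (int i = 0; i < count; i++)
        {
            onRead();
            yield return i;
        }
    }

    // Take() on its own stops pulling items once it has 'top' of them.
    public static int ItemsReadWithTakeOnly(int total, int top)
    {
        int read = 0;
        CountedSource(total, () => read++).Take(top).ToList();
        return read;
    }

    // A sort must see every item before it can produce the first one,
    // so Take() after it cannot cut the enumeration short.
    public static int ItemsReadWithOrderByThenTake(int total, int top)
    {
        int read = 0;
        CountedSource(total, () => read++).OrderByDescending(x => x).Take(top).ToList();
        return read;
    }
}
```

With 1000 items and a top of 100, the first method reads 100 items and the second reads all 1000. In the file case, GetFiles() has also loaded every entry before the query even starts.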
ASKER
Thanks cottsak
As per my answer to CuteBug's suggestion of using a separate thread, same goes for the BackgroundWorker.
The utility is for my own purposes, and has just this one purpose: To find out the number of files that belong to the same day.
I will try your Take() suggestion. As far as I know, in SQL, when you specify top 100, it does NOT load the whole table into memory. If I am lucky, the same will happen here.
I don't think you will be lucky.
I think you need to look at this from another angle. For a start, don't use LINQ; I think the sorting part of the LINQ query is taking the most time. Try a simple foreach over diInbox.GetFiles() and conditionally increment a counter when the properties of each file object meet your criteria. Benchmark that! I think you may get better results. :D
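The foreach idea could look roughly like this. It is only a sketch: ForeachCounter and Measure are made-up names, and Stopwatch is used for the timing as an assumption about how one might benchmark it:

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class ForeachCounter
{
    // Single pass over the files, no sorting: bump a counter for each
    // file whose creation time lies in [from, to].
    public static int CountCreatedBetween(FileInfo[] files, DateTime from, DateTime to)
    {
        int count = 0;
        foreach (FileInfo file in files)
        {
            DateTime created = file.CreationTime;
            if (created >= from && created <= to)
            {
                count++;
            }
        }
        return count;
    }

    // Runs 'action' once and returns how long it took.
    public static TimeSpan Measure(Action action)
    {
        Stopwatch watch = Stopwatch.StartNew();
        action();
        watch.Stop();
        return watch.Elapsed;
    }
}
```

Wrapping the GetFiles() call and the counting loop in separate Measure() calls should show which of the two dominates the run time.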
ASKER
OK. Will do and let you know
ASKER
Here is my current code.
It achieves its goal, but as you all have pointed out, it still traverses through ALL of the files in the folder before coming to a result.
This also is true if you use Take() as suggested and mentioned by cottsak.
(I would have expected the code to do a Dir /o-d first to order the files in the folder first, then only apply the LINQ statement to the first files specified.)
CuteBug: There was no significant difference between using <= directly or CompareTo
Cottsak: I do not think that using a foreach to achieve the same will be significantly faster. LINQ does make this code quite easy to write (and read) though!
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Configuration;
using System.IO;
using System.Timers;

namespace PSCRate
{
    class Program
    {
        static void Main(string[] args)
        {
            DateTime FromDate = DateTime.Parse(args[0]);
            DateTime ToDate = DateTime.Parse(args[1]);
            GetStatistics(FromDate, ToDate);
        }

        private static void GetStatistics(DateTime FromDate, DateTime ToDate)
        {
            try
            {
                DateTime dt = DateTime.Now;
                string Location = ConfigurationManager.AppSettings["InboxLocation"];
                DirectoryInfo diInbox = new DirectoryInfo(Location);
                FileInfo[] InboxFiles = diInbox.GetFiles();
                Console.WriteLine(DateTime.Now - dt);
                var files = from file in InboxFiles
                            where file.CreationTime.CompareTo(FromDate) >= 0 && file.CreationTime.CompareTo(ToDate) <= 0
                            group file by file.CreationTime.Date into g
                            select new
                            {
                                Count = g.Count(),
                                Date = g.Key
                            };
                foreach (var g in files)
                {
                    Console.WriteLine("{0} {1}", g.Date, g.Count);
                }
                Console.WriteLine(DateTime.Now - dt);
                Console.ReadLine();
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
                System.IO.StreamWriter sw = new StreamWriter("Error.log", false);
                sw.WriteLine(ex.ToString());
                sw.Close();
                Console.ReadLine();
            }
        }
    }
}
Do whatever feels more comfortable to you. It's very important that a developer can read his/her own code. As for the slowness of LINQ, the sorting is the slowest part, as with sorting any collection. Are you happy with the code you have now?
ASKER
cottsak:
Thanks. I will test one or more alternatives and then allocate the points
ASKER
I think I have an improvement that incorporates Take()
My benchmarks so far show that using Take IS effectively quicker than if I leave Take out (not by much)
private static void GetStatistics(DateTime FromDate, DateTime ToDate, int Top)
{
    try
    {
        DateTime dt = DateTime.Now;
        string Location = ConfigurationManager.AppSettings["InboxLocation"];
        DirectoryInfo diInbox = new DirectoryInfo(Location);
        FileInfo[] InboxFiles = diInbox.GetFiles();
        Console.WriteLine(DateTime.Now - dt);
        //First we get the files and select only their dates while we order them descending
        var files = from file in InboxFiles
                    orderby file.CreationTime.Date descending
                    select new
                    {
                        Date = file.CreationTime.Date
                    };
        //Attempt using TakeWhile.. did not work
        //var files1 = files.TakeWhile((file, Date) => (file.Date >= FromDate && file.Date <= ToDate));
        //We take a subset by using Take. Top is the number of files we want to restrict our search on
        var files1 = files.Take(Top);
        //We group the results just found
        var files2 = from file in files1
                     where file.Date >= FromDate && file.Date <= ToDate
                     group file by file.Date into g
                     select new
                     {
                         Count = g.Count(),
                         Date = g.Key
                     };
        foreach (var g in files2)
        {
            Console.WriteLine("{0} {1}", g.Date, g.Count);
        }
        //Console.WriteLine(ts.Seconds);
        Console.WriteLine(DateTime.Now - dt);
        Console.ReadLine();
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.ToString());
        System.IO.StreamWriter sw = new StreamWriter("Error.log", false);
        sw.WriteLine(ex.ToString());
        sw.Close();
        Console.ReadLine();
    }
}
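On the commented-out TakeWhile attempt above: the thread does not say how it failed, so this is a guess. TakeWhile stops at the first element that fails the test, and with the dates sorted newest first, the very first element is likely newer than ToDate, which would give an empty result. Also, in the two-parameter lambda the second parameter is the element's index, not a date. One possible way around it is to skip the too-new entries first; a sketch with hypothetical names, run on plain DateTime values rather than files:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DescendingRange
{
    // Expects 'newestFirst' to be sorted in descending order.
    // Skips entries newer than 'to', then takes entries until the
    // first one that is older than 'from'.
    public static IEnumerable<DateTime> Between(IEnumerable<DateTime> newestFirst, DateTime from, DateTime to)
    {
        return newestFirst
            .SkipWhile(d => d > to)
            .TakeWhile(d => d >= from);
    }
}
```

Note that this still relies on the orderby to produce the sorted input, so every file is read and sorted before SkipWhile sees the first element; it only tidies up the range selection.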
wow. surprising... good work.
My scepticism still tries to tell me that the benchmark improvement you discovered is a statistical outlier that might not be reproducible with scaled testing... but don't listen to that voice. Good work!!! :D
ASKER
I finally dedicated more time to test my theory further.
Unfortunately, my previous conclusion seems to be wrong. Pity
It seems that it does not significantly matter if you use Take or not.
Back to the drawing board... :(
I will keep you posted, should I find something different.
You can give the following query