state of the art FILE search in Delphi


I want to search thousands of files for a fixed string. Case Insensitive, exact match.

Boyer Moore Horspool is the obvious thought and there is a nice looking bit of code by Angus Johnson at

This uses file mapping which I guess is cool but every file is mapped into memory separately according to its size. This seems like there must be a lot of memory reallocs going on behind the scenes .. one time I have a file of 42 bytes, the next 42mb.

Q1. Is filemapping the best way to go for searching large numbers of files of varying sizes

Q2. Is there a way of using and reusing a fixed say 64K buffer instead of asking for large amounts of memory, but still using file mapping

Q3. Am I correct in my assumption that file reading and memory mapping will be the bottleneck, and that therefore Boyer-Moore is as good as anything ?

All insights and mega fast code gratefully received

Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Wim ten BrinkSelf-employed developerCommented:
Filemapping isn't always a good idea. Especially if you're going through the file only once, in one single direction. Furthermore, filemapping has some nasty size limitations since the filesize is considered to be RAM for a temporary moment, thus increasing the amount of memory your system has available but this cannot grow above 4 GB. Thus, huge files will be a big trouble to read.

A filemap is a kind of window on a part of a file. You could create a new map to some other portion of this file, thus reading it part by part.

If the BM method is working correctly, it would perform just as fast with a filestream as a filemap. By using the TFilestream class you can go through it pretty fast. Use two buffers, though. Read the first 32KB in the first buffer and the next 32 KB in the second buffer. Once you're done walking through the first buffer then you can read the next 32 in the first buffer while processing the second buffer. Once you're done with the second buffer you go back to the first buffer while reading the next 32 KB in the second buffer. Continue until end of file...
Mutley2003Author Commented:

Alex, sounds like you are suggesting 2 threads?

If so, that would make things a bit messy. I'd have to have the 2 buffers shared between the threads and maybe some locking mechanism.  And the buffer boundary issue seems like a problem.

Or are you just suggesting a simple sequential approach in a single thread
.. cant see it

Or is it

  SearchBuffer1; ReadBuffer1;
  SearchBuffer2; ReadBuffer2;

thanks for the insight into filemapping

I guess with the buffer method, there are buffer boundary issues with BM. I guess this can be solved by keeping at most M-1 bytes (where M is the length of search string) from the tail of buffer 1 and M-1 from the beginning of buffer 2.. and BM searching that .. hmm

OK, let's step back a bit. The slow bit is the read op, the first read op being slowest because of head positioning. So for a file >32K ideally we are searching the first buffer while reading the second? But then we have boundary issues .. we cannot actually finish searching the first buffer until we get the second and that buffer overlap issue continues on forever
ie we need to search
   buffer1 tail + buffer 2 head
   buffer2 tail + buffer 3 head

A solution would be nice. Code even better<g>

Wim ten BrinkSelf-employed developerCommented:
Yes, you could use two threads, or just some kind of asynchronous filereads. You don't want to really wait for the buffer to be fully read while you're processing the second buffer. It is a bit complicated technique, though, since you might process the buffer faster than the other buffer is read, making things a bit complicated.

Basically, the order is:


And the / is where you use either two threads or asynchronous reads to fill the other buffer. The challenge will be to find a buffersize with the highest efficiency though. Maybe performance is better with a 4 MB buffer...

Basically, to walk through the file, all you have to do is go from 0 to Filesize-1. But the position in the buffer is calculated by "BufPos := I mod (2*Buffersize)" and if BufPos is less than BufferSize then you're in buffer1, else in buffer2. However, you will also need to check if you're passed a buffer border, thus you also need to calculate "I div BufferSize" and compare it with an old value. (initialized at 0 since you start with one full buffer already.) If the value changes then check the new value. If even, you need to reread buffer2, else reread buffer1.

This method could become very fast but requires a lot of calculations to come up with a good algorithm. Personally, just using a TFileStream would be a lot easier and will be very fast too. In general, the Windows filesystem will do some buffering for you already so using your own buffers might not even have much effects...

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
Mutley2003Author Commented:
some good ideas there Alex, thanks, I will experiment
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today

From novice to tech pro — start learning today.