asked on

state of the art FILE search in Delphi

hi

I want to search thousands of files for a fixed string. Case Insensitive, exact match.

Boyer Moore Horspool is the obvious thought and there is a nice looking bit of code by Angus Johnson at http://www.users.on.net/johnson/delphi/searches.html

This uses file mapping which I guess is cool but every file is mapped into memory separately according to its size. This seems like there must be a lot of memory reallocs going on behind the scenes .. one time I have a file of 42 bytes, the next 42mb.

Q1. Is filemapping the best way to go for searching large numbers of files of varying sizes

Q2. Is there a way of using and reusing a fixed say 64K buffer instead of asking for large amounts of memory, but still using file mapping

Q3. Am I correct in my assumption that file reading and memory mapping will be the bottleneck, and that therefore Boyer-Moore is as good as anything ?

All insights and mega fast code gratefully received

Wim ten Brink

Filemapping isn't always a good idea. Especially if you're going through the file only once, in one single direction. Furthermore, filemapping has some nasty size limitations since the filesize is considered to be RAM for a temporary moment, thus increasing the amount of memory your system has available but this cannot grow above 4 GB. Thus, huge files will be a big trouble to read.

A filemap is a kind of window on a part of a file. You could create a new map to some other portion of this file, thus reading it part by part.

If the BM method is working correctly, it would perform just as fast with a filestream as a filemap. By using the TFilestream class you can go through it pretty fast. Use two buffers, though. Read the first 32KB in the first buffer and the next 32 KB in the second buffer. Once you're done walking through the first buffer then you can read the next 32 in the first buffer while processing the second buffer. Once you're done with the second buffer you go back to the first buffer while reading the next 32 KB in the second buffer. Continue until end of file...

Mutley2003

ASKER

Alex, sounds like you are suggesting 2 threads?

If so, that would make things a bit messy. I'd have to have the 2 buffers shared between the threads and maybe some locking mechanism. And the buffer boundary issue seems like a problem.

Or are you just suggesting a simple sequential approach in a single thread
ReadBuffer1;
SearchBuffer1;
ReadBuffer2;
SearchBuffer2;
.. cant see it

Or is it
ReadBuffer1;
ReadBuffer2;

SearchBuffer1; ReadBuffer1;
SearchBuffer2; ReadBuffer2;

thanks for the insight into filemapping

I guess with the buffer method, there are buffer boundary issues with BM. I guess this can be solved by keeping at most M-1 bytes (where M is the length of search string) from the tail of buffer 1 and M-1 from the beginning of buffer 2.. and BM searching that .. hmm

OK, let's step back a bit. The slow bit is the read op, the first read op being slowest because of head positioning. So for a file >32K ideally we are searching the first buffer while reading the second? But then we have boundary issues .. we cannot actually finish searching the first buffer until we get the second and that buffer overlap issue continues on forever
ie we need to search
buffer1 tail + buffer 2 head
buffer2 tail + buffer 3 head

A solution would be nice. Code even better<g>

ASKER CERTIFIED SOLUTION

Wim ten Brink

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

Mutley2003

ASKER

some good ideas there Alex, thanks, I will experiment