Solved

state of the art FILE search in Delphi

Posted on 2004-10-27
230 Views
Last Modified: 2010-04-05
hi

I want to search thousands of files for a fixed string. Case Insensitive, exact match.

Boyer Moore Horspool is the obvious thought and there is a nice looking bit of code by Angus Johnson at http://www.users.on.net/johnson/delphi/searches.html

This uses file mapping which I guess is cool but every file is mapped into memory separately according to its size. This seems like there must be a lot of memory reallocs going on behind the scenes .. one time I have a file of 42 bytes, the next 42mb.

Q1. Is filemapping the best way to go for searching large numbers of files of varying sizes

Q2. Is there a way of using and reusing a fixed say 64K buffer instead of asking for large amounts of memory, but still using file mapping

Q3. Am I correct in my assumption that file reading and memory mapping will be the bottleneck, and that therefore Boyer-Moore is as good as anything ?

All insights and mega fast code gratefully received

0
Question by:Mutley2003
    4 Comments
     
    LVL 17

    Expert Comment

    by:Wim ten Brink
    Filemapping isn't always a good idea. Especially if you're going through the file only once, in one single direction. Furthermore, filemapping has some nasty size limitations since the filesize is considered to be RAM for a temporary moment, thus increasing the amount of memory your system has available but this cannot grow above 4 GB. Thus, huge files will be a big trouble to read.

    A filemap is a kind of window on a part of a file. You could create a new map to some other portion of this file, thus reading it part by part.

    If the BM method is working correctly, it would perform just as fast with a filestream as a filemap. By using the TFilestream class you can go through it pretty fast. Use two buffers, though. Read the first 32KB in the first buffer and the next 32 KB in the second buffer. Once you're done walking through the first buffer then you can read the next 32 in the first buffer while processing the second buffer. Once you're done with the second buffer you go back to the first buffer while reading the next 32 KB in the second buffer. Continue until end of file...
    0
     

    Author Comment

    by:Mutley2003

    Alex, sounds like you are suggesting 2 threads?

    If so, that would make things a bit messy. I'd have to have the 2 buffers shared between the threads and maybe some locking mechanism.  And the buffer boundary issue seems like a problem.


    Or are you just suggesting a simple sequential approach in a single thread
      ReadBuffer1;
      SearchBuffer1;
      ReadBuffer2;
      SearchBuffer2;
    .. cant see it

    Or is it
      ReadBuffer1;
      ReadBuffer2;

      SearchBuffer1; ReadBuffer1;
      SearchBuffer2; ReadBuffer2;




    thanks for the insight into filemapping

    I guess with the buffer method, there are buffer boundary issues with BM. I guess this can be solved by keeping at most M-1 bytes (where M is the length of search string) from the tail of buffer 1 and M-1 from the beginning of buffer 2.. and BM searching that .. hmm

    OK, let's step back a bit. The slow bit is the read op, the first read op being slowest because of head positioning. So for a file >32K ideally we are searching the first buffer while reading the second? But then we have boundary issues .. we cannot actually finish searching the first buffer until we get the second and that buffer overlap issue continues on forever
    ie we need to search
       buffer1 tail + buffer 2 head
       buffer2 tail + buffer 3 head

    A solution would be nice. Code even better<g>





    0
     
    LVL 17

    Accepted Solution

    by:
    Yes, you could use two threads, or just some kind of asynchronous filereads. You don't want to really wait for the buffer to be fully read while you're processing the second buffer. It is a bit complicated technique, though, since you might process the buffer faster than the other buffer is read, making things a bit complicated.

    Basically, the order is:

    ReadBuffer1
    SearchBuffer1/ReadBuffer2
    SearchBuffer2/ReadBuffer1
    SearchBuffer1/ReadBuffer2
    SearchBuffer2/ReadBuffer1
    SearchBuffer1/ReadBuffer2

    And the / is where you use either two threads or asynchronous reads to fill the other buffer. The challenge will be to find a buffersize with the highest efficiency though. Maybe performance is better with a 4 MB buffer...

    Basically, to walk through the file, all you have to do is go from 0 to Filesize-1. But the position in the buffer is calculated by "BufPos := I mod (2*Buffersize)" and if BufPos is less than BufferSize then you're in buffer1, else in buffer2. However, you will also need to check if you're passed a buffer border, thus you also need to calculate "I div BufferSize" and compare it with an old value. (initialized at 0 since you start with one full buffer already.) If the value changes then check the new value. If even, you need to reread buffer2, else reread buffer1.

    This method could become very fast but requires a lot of calculations to come up with a good algorithm. Personally, just using a TFileStream would be a lot easier and will be very fast too. In general, the Windows filesystem will do some buffering for you already so using your own buffers might not even have much effects...
    0
     

    Author Comment

    by:Mutley2003
    some good ideas there Alex, thanks, I will experiment
    0

    Write Comment

    Please enter a first name

    Please enter a last name

    We will never share this with anyone. Privacy Policy Terms of Use

    Featured Post

    How your wiki can always stay up-to-date

    Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
    - Increase transparency
    - Onboard new hires faster
    - Access from mobile/offline

    The uses clause is one of those things that just tends to grow and grow. Most of the time this is in the main form, as it's from this form that all others are called. If you have a big application (including many forms), the uses clause in the in…
    Objective: - This article will help user in how to convert their numeric value become words. How to use 1. You can copy this code in your Unit as function 2. than you can perform your function by type this code The Code   (CODE) The Im…
    Need more eyes on your posted question? Go ahead and follow the quick steps in this video to learn how to Request Attention to your question. *Log into your Experts Exchange account *Find the question you want to Request Attention for *Go to the e…
    Hi everyone! This is Experts Exchange customer support.  This quick video will show you how to change your primary email address.  If you have any questions, then please Write a Comment below!

    875 members asked questions and received personalized solutions in the past 7 days.

    Join the community of 500,000 technology professionals and ask your questions.

    Join & Ask a Question

    Need Help in Real-Time?

    Connect with top rated Experts

    8 Experts available now in Live!

    Get 1:1 Help Now