Want to win a PS4? Go Premium and enter to win our High-Tech Treats giveaway. Enter to Win

x
?
Solved

state of the art FILE search in Delphi

Posted on 2004-10-27
4
Medium Priority
?
240 Views
Last Modified: 2010-04-05
hi

I want to search thousands of files for a fixed string. Case Insensitive, exact match.

Boyer Moore Horspool is the obvious thought and there is a nice looking bit of code by Angus Johnson at http://www.users.on.net/johnson/delphi/searches.html

This uses file mapping which I guess is cool but every file is mapped into memory separately according to its size. This seems like there must be a lot of memory reallocs going on behind the scenes .. one time I have a file of 42 bytes, the next 42mb.

Q1. Is filemapping the best way to go for searching large numbers of files of varying sizes

Q2. Is there a way of using and reusing a fixed say 64K buffer instead of asking for large amounts of memory, but still using file mapping

Q3. Am I correct in my assumption that file reading and memory mapping will be the bottleneck, and that therefore Boyer-Moore is as good as anything ?

All insights and mega fast code gratefully received

0
Comment
Question by:Mutley2003
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
  • 2
4 Comments
 
LVL 17

Expert Comment

by:Wim ten Brink
ID: 12426867
Filemapping isn't always a good idea. Especially if you're going through the file only once, in one single direction. Furthermore, filemapping has some nasty size limitations since the filesize is considered to be RAM for a temporary moment, thus increasing the amount of memory your system has available but this cannot grow above 4 GB. Thus, huge files will be a big trouble to read.

A filemap is a kind of window on a part of a file. You could create a new map to some other portion of this file, thus reading it part by part.

If the BM method is working correctly, it would perform just as fast with a filestream as a filemap. By using the TFilestream class you can go through it pretty fast. Use two buffers, though. Read the first 32KB in the first buffer and the next 32 KB in the second buffer. Once you're done walking through the first buffer then you can read the next 32 in the first buffer while processing the second buffer. Once you're done with the second buffer you go back to the first buffer while reading the next 32 KB in the second buffer. Continue until end of file...
0
 

Author Comment

by:Mutley2003
ID: 12430960

Alex, sounds like you are suggesting 2 threads?

If so, that would make things a bit messy. I'd have to have the 2 buffers shared between the threads and maybe some locking mechanism.  And the buffer boundary issue seems like a problem.


Or are you just suggesting a simple sequential approach in a single thread
  ReadBuffer1;
  SearchBuffer1;
  ReadBuffer2;
  SearchBuffer2;
.. cant see it

Or is it
  ReadBuffer1;
  ReadBuffer2;

  SearchBuffer1; ReadBuffer1;
  SearchBuffer2; ReadBuffer2;




thanks for the insight into filemapping

I guess with the buffer method, there are buffer boundary issues with BM. I guess this can be solved by keeping at most M-1 bytes (where M is the length of search string) from the tail of buffer 1 and M-1 from the beginning of buffer 2.. and BM searching that .. hmm

OK, let's step back a bit. The slow bit is the read op, the first read op being slowest because of head positioning. So for a file >32K ideally we are searching the first buffer while reading the second? But then we have boundary issues .. we cannot actually finish searching the first buffer until we get the second and that buffer overlap issue continues on forever
ie we need to search
   buffer1 tail + buffer 2 head
   buffer2 tail + buffer 3 head

A solution would be nice. Code even better<g>





0
 
LVL 17

Accepted Solution

by:
Wim ten Brink earned 2000 total points
ID: 12437182
Yes, you could use two threads, or just some kind of asynchronous filereads. You don't want to really wait for the buffer to be fully read while you're processing the second buffer. It is a bit complicated technique, though, since you might process the buffer faster than the other buffer is read, making things a bit complicated.

Basically, the order is:

ReadBuffer1
SearchBuffer1/ReadBuffer2
SearchBuffer2/ReadBuffer1
SearchBuffer1/ReadBuffer2
SearchBuffer2/ReadBuffer1
SearchBuffer1/ReadBuffer2

And the / is where you use either two threads or asynchronous reads to fill the other buffer. The challenge will be to find a buffersize with the highest efficiency though. Maybe performance is better with a 4 MB buffer...

Basically, to walk through the file, all you have to do is go from 0 to Filesize-1. But the position in the buffer is calculated by "BufPos := I mod (2*Buffersize)" and if BufPos is less than BufferSize then you're in buffer1, else in buffer2. However, you will also need to check if you're passed a buffer border, thus you also need to calculate "I div BufferSize" and compare it with an old value. (initialized at 0 since you start with one full buffer already.) If the value changes then check the new value. If even, you need to reread buffer2, else reread buffer1.

This method could become very fast but requires a lot of calculations to come up with a good algorithm. Personally, just using a TFileStream would be a lot easier and will be very fast too. In general, the Windows filesystem will do some buffering for you already so using your own buffers might not even have much effects...
0
 

Author Comment

by:Mutley2003
ID: 12442835
some good ideas there Alex, thanks, I will experiment
0

Featured Post

Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

A lot of questions regard threads in Delphi.   One of the more specific questions is how to show progress of the thread.   Updating a progressbar from inside a thread is a mistake. A solution to this would be to send a synchronized message to the…
Introduction Raise your hands if you were as upset with FireMonkey as I was when I discovered that there was no TListview.  I use TListView in almost all of my applications I've written, and I was not going to compromise by resorting to TStringGrid…
Visualize your data even better in Access queries. Given a date and a value, this lesson shows how to compare that value with the previous value, calculate the difference, and display a circle if the value is the same, an up triangle if it increased…
We’ve all felt that sense of false security before—locking down external access to a database or component and feeling like we’ve done all we need to do to secure company data. But that feeling is fleeting. Attacks these days can happen in many w…
Suggested Courses

610 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question