Link to home
Start Free TrialLog in
Avatar of tadzio_blah
tadzio_blah

asked on

Searching through a big file.

Hi

I was wondering what can be the most efficient way to search through the contents of a large binary data file (eg. 10GB) to find a specified string ?
Avatar of Geert G
Geert G
Flag of Belgium image

any specifics about the binary file ?
Avatar of tadzio_blah
tadzio_blah

ASKER

nope... none... some random data - just need to find the specified string in it.
start to search from several locations with threads
keep in mind that the io will be the bottleneck
Try the code below:

John

function MemPos(const Filename, TextToFind: string; iFrom: cardinal = 0;
         CaseSensitive: boolean = False): int64;
var
  buffer : string;
  len : integer;
  MS : TMemoryStream;
  charsA, charsB : set of char;
  p, pEnd : PChar;
begin
  result := -1;
  if (TextToFind <> '') and FileExists(Filename) then
   begin
    len := Length(TextToFind);
    SetLength(buffer, len);
    if CaseSensitive then
     begin
      charsA := [TextToFind[1]];
      charsB := [TextToFind[len]];
    end
    else
     begin
      charsA := [ AnsiLowerCase(TextToFind)[1],   AnsiUpperCase(TextToFind)[1] ];
      charsB := [ AnsiLowerCase(TextToFind)[len], AnsiUpperCase(TextToFind)[len]];
    end;
    MS := TMemoryStream.Create;
    try
      MS.LoadFromFile(Filename);
      if (len <= MS.Size) and (iFrom <= (MS.Size - len) + 1) then
       begin
        p := MS.Memory;
        pEnd := p;
        Inc(p, iFrom);
        Inc(pEnd, (MS.Size - len) + 1);
        while (p <= pEnd) and (result < 0)
        do
         begin
          if (p^ in charsA) and ((p +len -1)^ in charsB) then
           begin
            MoveMemory(@buffer[1], p, len);
            if (CaseSensitive and (AnsiCompareStr(TextToFind, buffer) = 0)) or
                 (AnsiCompareText(TextToFind, buffer) = 0) then
                   result := p -MS.Memory;
           end;
          Inc(p);
         end;
       end;
    finally
      MS.Free;
    end;
  end;
end;
 
 
// Usage
 
if MemPos(Path + File.Name, TextToFind.Text, 0 , CaseSense.Checked) > 0 then
 begin
   .... do whatever
 end;

Open in new window

PS..

I have tried this on CD ISO files 650 MB and a few DVD ISOs > 1GB.

For really large files, it will take some time! Not certain how to optimize it further.

John
Let's say that it can't load more than 100mb of that file at one time (because i don't think that loading). The code you provided only works when file size < system memory size.
Did you try it?

Windows will cache memory to the hard disk. Your virtual memory settings will control the amount.

John
yes, i've tried it, that's way i wrote the previous comment. i'm getting an out of memory exception.
Snap1.jpg
Hmmm... your PC should have cached it in virtual memory... I think. Check your settings there and increase as necessary.

And, sorry I didn't get out of your comment that you tried it as by writing, "Let's say that it can't load more than 100mb...' sounds hypothetical. The misleading part was "let's say".

I do have a lot of RAM on my development PCs so that could very well be a problem for you as it is an in memory function.

Sorry this did not work out for you. I tried but everyone has different needs and machines!

John
I have 4GB of mem on my dev laptop.
Did you trie to load a file >6GB with that function ?
Well, the way I see it is that your going to have read the file in blocks. At the 10gb size using a memory stream that size is not a good idea. Your going to have to read the files in blocks(16-128Kb)'s at a time and search through each of those blocks.

I'll give it a whirl but the catch here is that the data might be split between two blocks.
I do have a question, is the data your searching for normal strings? I.E. 'This is a string'#0?
i won't load more than 2 gig or maybe 4 gig (windows boundary)

you need to use a TFileStream and start reading from a certain point

i'll try an example
Nope... my largest was a DVD ISO image. 1.2 gig + or -.

John
oops, looks there is more than 1 expert with that idea
and my vmem size is to min 6gb and max 20gb... so that shouldn't be a problem
btw i'm using D2007 - maybe that's the problem (maybe they've chenged something in the memstream that it has to fit the available physical mem).
i'm searching for a standard delphi string in those files
This is my go, probably have some errors as I didn't get to test it.
function StrChr(pStr: PChar; ch: char; Length: DWORD): PChar;
var
  EndPtr : Pointer;
begin
  EndPtr := pStr + Length;
  Result := nil;
  While (Addr(pStr) <> EndPtr) Do
    begin
    If (pStr^ = ch) Then
      begin
      Result := pStr;
      Exit;
    end;
    Inc(pStr);
  end;
end;
 
function GetTextOffset(const FileName: PChar; TextToFind: PChar; TextLength: DWORD; BlockSize: DWORD = $4000): Int64;
var
  FileHandle : DWORD;
  Buffer : PChar;
  Data : PChar;
  BytesRead : DWORD;
  BytesToRead : DWORD;
  CharStart,
  CharEnd : Char;
  DoLoop : Bool;
  Offset : Int64;
  Original : Pointer;
  DoInit : Boolean;
begin
  //Initialize result. -1 = String not found
  Result := -1;
  //Open the file read only
  FileHandle := CreateFile(FileName,GENERIC_READ,0,nil,OPEN_EXISTING,0,0);
  //If we could not open the file exit
  If FileHandle = INVALID_HANDLE_VALUE Then Exit;
  Try
    //Allocate memory the size specified
    Buffer := AllocMem(BlockSize);
    //We couldn't get the memory, exit
    If @Buffer = nil Then Exit;
    Try
      //Mark the original buffer address
      Original := @Buffer;
      //Initialize the variable that will exit our loop
      DoLoop := True;
      //Initialize the DoInit variable to false, we'll use this after
      //  the first loop
      DoInit := False;
      //Save this for later
      BytesToRead := BlockSize;
      Repeat
        //initialize the constants back to normal if we didn't find a partial
        If DoInit Then
          begin
          Buffer := Original;
          BytesToRead := BlockSize;
        end;
        //zero the memory of the buffer every reiteration
        ZeroMemory(Buffer,BytesToRead);
        //read the data from the file
        ReadFile(FileHandle,Buffer,BytesToRead,BytesRead,nil);
        //if we didn't read enough we hit the end of the file
        //  so we are done with the loop
        If BytesRead < BytesToRead Then DoLoop := False;
        //We refilled the buffer, let's set it back
        If Not(DoInit) Then
          begin
          Buffer := Original;
        end;
        //increment our current position(easier than calling SetFilePointer)
        Inc(Offset,BytesRead);
        //Search the buffer for the first char in TextToFind
        Buffer := StrChr(Buffer,TextToFind^,BytesRead);
        //the first char isn't there, so restart the loop
        If @Buffer = nil Then Continue;
        //we hit the end of the buffer, not enough room for the string anyway
        If (Buffer + TextLength) >= Original Then
          begin
          //Put what we found, which could be a partial of the TextToFind
          //  in the beginning of the buffer
          Move(Buffer,Original,BlockSize - (Original - Buffer));
          Buffer := Original;
          Inc(Buffer,(BlockSize - (Original - Buffer)));
          Dec(BytesToRead,BlockSize - (Original - Buffer));
          DoInit := False;
          Continue;
        end;
        //Check to see if the last char and string length is matching
        If Buffer[TextLength - 1] = TextToFind[TextLength - 1] Then
          begin
          //Match the actual string
          If StrComp(Buffer,TextToFind) = 0 Then
            begin
            //It matches. Bring the offset back to where Buffer is
            Result := Offset - (Buffer - Original);
            //Exit, we found what we're looking for
            Exit;
          end;
        end;
        DoInit := True;
      Until (Not(DoLoop));
    Finally
      //Deallocate the memory, we're done with it
      FreeMem(Buffer,BlockSize);
    end;
  Finally
    //Close the handle to the file so windows will like us
    CloseHandle(FileHandle);
  end;
end;

Open in new window

ASKER CERTIFIED SOLUTION
Avatar of Russell Libby
Russell Libby
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Russell,

Nice function! Works well. I am anxious to do some speed tests on  < 1 gig files comparing the two functions provided.

One question.....

// Buffer read size: keep in multiples of 1 MB
const
  BLOCK_SIZE        =  1048576;

Shouldn't block_size = 1024000?

That is actually 1 meg...

Just wondering.

John
Russell,

Never mind the stupid brain dead question.  duh.... 2 - 20 power = 1,048,576.

Sorry.

You got my points!

John
John, no problem ;-)

Hopefully you should find the two pretty close (performance wise) on smaller files. Let me know how it goes.

Russell
ThievingSix:
your code doesn't work. i'm getting access violation on the line:

        If (Buffer + TextLength) >= Original Then
          begin
          //Put what we found, which could be a partial of the TextToFind
          //  in the beginning of the buffer
>>>          Move(Buffer,Original,BlockSize - (Original - Buffer));
          Buffer := Original;
          Inc(Buffer,(BlockSize - (Original - Buffer)));
          Dec(BytesToRead,BlockSize - (Original - Buffer));
          DoInit := False;
          Continue;
        end;

can you fix that please ? i'd like to do a timetest on both provided functions.

Btw, if you plan on timing the two functions (which is a good idea), then you need to modify my routine to "even" things out, as I return ALL matches, not just the first one.


                            // Check match
                             if bMatch then
                             begin
                                // Get current size
                                dwIndex:=Length(Found);
                                // Allocate space on array
                                SetLength(Found, Succ(dwIndex));
                                // Set entry
                                Found[dwIndex]:=dwHold + (dwBlock - dwSearch);
                                //  Done
                                exit;
                             end;
----
Russell
Don't worry ;) I've done it allready.
Just need the working version of ThievingSix's function.
Let's see. Add a variable I, type Integer. Then replace the Move with the following:
For I := 0 To (BlockSize - (Original - Buffer)) Do
  begin
  PByteArray(Original)[I] := Buffer[I];
end;

Open in new window

error in the code. should be:

For i := 0 To (BlockSize - (Original - Buffer)) Do
  begin
  PByteArray(Original)[i] := byte(Buffer[i]);
end;

, but even now i'm getting access violation on: PByteArray(Original)[i] := byte(Buffer[i]); line.
off topic:
>>Russell
nice to have your site back
Geert,
Thanks, didn't know anyone still used my site.

ThievingSix,
You might want to take another pass through your code. There are a number of pointer errors (still can't get it to run my system). Eg:

//Mark the original buffer address
Original := @Buffer; // should be Original := Buffer;

Also, the ReadFile is failing to read anything due to the missing ^ . Change to:
ReadFile(FileHandle, Buffer^, BytesToRead, BytesRead, nil);

----

Regards,
Russell
I've got tired of waiting, and used Russell's code in my app. Thanks.
I had no problem with you waiting, regardless if you already started using my routine.

I will share this final note though...

In timing tests where I placed the match string at the end of a 1 GB file, 13500 ms of the routine was spent loading the file (avg speed of 60~70 MB / sec, which was consistent with the avg throughput of my SATA 300 drives), and 500ms was spent performing the actual search though the 1 GB of data, for a combined total of 14 seconds. In the end you would find that most of the search routines are fairly comparable; (like I said earlier) the disk IO is the biggest bottleneck.

Regards,
Russell
your function works fine. i've made some changes to it to better fit my needs.
it returns only one position, it always searches case insensitive, and you can give a starting position of the search (for use with incremental searching).

function szukaj(FileName, Search: string; StartPos : int64) : int64;
const
  BLOCK_SIZE = 10485760;
 
var
  lpShiftTable:  Array [#0..#255] of LongWord;
  lpCompTable:   Array [#0..#255] of LongWord;
  hOpen:         THandle;
  dwIndex:       LongWord;
  dwRead:        LongWord;
  dwBlock:       LongWord;
  dwSearch:      LongWord;
  dwHold:        Int64;
  lpBuffer:      PChar;
  bMatch:        Boolean;
  cChar:         Char;
  SP:            LARGE_INTEGER;
 
begin
  Result := -1;
  dwSearch:=Length(Search);
  if (dwSearch > 0) then
    begin
      hOpen:=CreateFile(PChar(FileName), GENERIC_READ, FILE_SHARE_READ or FILE_SHARE_WRITE, nil, OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, 0);
      if (hOpen <> INVALID_HANDLE_VALUE) then
        begin
          try
            if (StartPos > 0) then
              begin
                SP := LARGE_INTEGER(StartPos);
                SetFilePointer(hOpen, Longint(SP.LowPart), @SP.HighPart, FILE_BEGIN);
                pBar.Progress := StartPos;
                Application.ProcessMessages;
              end;
            cChar:=Search[dwSearch];
            FillChar(lpCompTable, SizeOf(lpCompTable), 255);
            lpCompTable[cChar]:=0;
            case cChar of
              'A'..'Z' :  lpCompTable[Char(Ord(cChar) + 32)]:=0;
              'a'..'z' :  lpCompTable[UpCase(cChar)]:=0;
            end;
            for cChar:=#0 to #255 do
              lpShiftTable[cChar]:=dwSearch;
            for dwIndex:=1 to Pred(dwSearch) do
              begin
                lpShiftTable[Search[dwIndex]]:=dwSearch - dwIndex;
                case Search[dwIndex] of
                  'A'..'Z' :  lpShiftTable[Char(Byte(Search[dwIndex]) + 32)]:=dwSearch - dwIndex;
                  'a'..'z' :  lpShiftTable[Char(Byte(Search[dwIndex]) - 32)]:=dwSearch - dwIndex;
                end;
              end;
            lpBuffer:=AllocMem(BLOCK_SIZE + Succ(dwSearch));
            try
              dwSearch := dwSearch - 1;
              if ReadFile(hOpen, lpBuffer^, BLOCK_SIZE, dwRead, nil) then
                begin
                  dwHold:=StartPos;
                  pBar.Progress := pBar.Progress + dwRead;
                  Application.ProcessMessages;
                  while (dwRead > 0) do
                    begin
                      dwBlock:=dwSearch;
                      while (dwBlock < dwRead) do
                        begin
                          dwIndex:=lpCompTable[lpBuffer[dwBlock]];
                          if (dwIndex = 0) then
                            begin
                              bMatch:=(StrLIComp(Pointer(Search), @lpBuffer[dwBlock - dwSearch], dwSearch) = 0);
                              if bMatch then
                                begin
                                  Result:=dwHold + dwBlock;
                                  Exit;
                                end;
                            end;
                          dwBlock := dwBlock + lpShiftTable[lpBuffer[dwBlock]];
                        end;
                        if (dwRead < BLOCK_SIZE) then
                          Break
                        else
                          begin
                            dwHold := dwHold + (dwRead - dwSearch);
                            Move(lpBuffer[dwRead - dwSearch], lpBuffer[0], dwSearch);
                            if not(ReadFile(hOpen, lpBuffer[dwSearch], BLOCK_SIZE, dwRead, nil)) then
                              Break
                            else
                              pBar.Progress := pBar.Progress + dwRead;
                            Application.ProcessMessages;
                            dwRead := dwRead + dwSearch;
                          end;
                    end;
                end;
            finally
              FreeMem(lpBuffer);
            end;
          finally
            CloseHandle(hOpen);
          end;
        end;
    end;
end;

Open in new window

oh... and the position returned is just behind the search string.