Link to home
Start Free TrialLog in
Avatar of Paer Toernell
Paer ToernellFlag for Thailand

asked on

How to search in a TmemoryStream

I have a quite big tmemorystream, about 250 mb of text. I want to search for a substring in this text and get the ofset, but i cant find any information about how this is done. I have a 64 bit system.
Avatar of jimyX
jimyX

You can use TStringStream instead, and use the property DataString to read the data.
procedure TForm1.Button1Click(Sender: TObject);
var
  S:TStringStream;
begin
  S:=TStringStream.Create;
  S.LoadFromFile('C:\test.txt');
  // handle the string through the property DataString
  S.DataString;
  ...
  S.Free;
end;

Open in new window

Avatar of Paer Toernell

ASKER

Thanx, but how do i search?
How can i find a string in a TMemoryStream var?
http://www.delphipages.com/forum/showthread.php?t=20735
You can use pos() to find the substring.
Can you clarify more please, what are you looking for in that string?
Here is an example of how to find sub-string:
procedure TForm1.Button1Click(Sender: TObject);
var
  M:tmemorystream;
  S:TStringStream;
begin
  M:=tmemorystream.Create;
  S:=TStringStream.Create;
  M.LoadFromFile('C:\test.txt');
  M.Position := 0;
  S.LoadFromStream(M);
  if pos('unit',S.DataString) > 0 then
    begin
      showmessage(IntToStr(Pos('unit', S.DataString)));
      S.SaveToFile('C:\unit.txt');
    end
  else
    showmessage('Not found');
  M.Free;
  S.Free;
end;

Open in new window

P.S: the TMemoryStream is used here just to demonstrate the conversion, but it could be eliminated as per my first post.
JimyX,

(1)
Is the answer includes pos, pls clarify how pos is implemented to search in a tMemoryStream.

(2)
I am looking for a substring in a tMemoryStream.
(sorry should be:)
(1)
If the answer includes pos, pls clarify how pos is implemented to search in a tMemoryStream.
Pos will be used with TStringStream not TMemoryStream.
If you are looking for sub-string in the TMemoryStream then you need to search (seek) by position:
procedure TForm1.Button2Click(Sender: TObject);
var
  M:tmemorystream;
  S:TStringStream;
begin
  M:=tmemorystream.Create;
  S:=TStringStream.Create;
  M.LoadFromFile('C:\test.txt');
  M.Position := 0;

  // copy 4 bytes starting from the location 123
  M.Seek(123, soFromBeginning);
  S.CopyFrom(M,4);

  showmessage(S.DataString);
  M.Free;
  S.Free;
end;

Open in new window

jimyX,

I still don't see any implementation of POS. I guess i can load the Stream into a TStringStream, but can pos really handle a 200 mb string?
Pos function is used to search a substring in a string.
That is the only way to easily locate a string within a text, so you have to first load your whole stream into a string variable.

let's say it is an ANSI string and not a Wide String :

 
Function LocateInStream(M:TMemoryString;SubStr:ANSIString):Integer;
Var
 S:ANSIString;
begin
 SetLength(S,M.Size);
 M.Position:=0;
 M.ReadBuffer(S[1],M.Size); // That loads the stream into a (ANSI)string variable
 Result:=Pos(SubStr,S);
end;

Open in new window

Of course, that will put some load in the memory, and you could well read the stream piece by piece and search using Pos in fragment of the stream instead of the whole string. It would be a bit more complex, while the general idea is exactly the same as above, and the above code is the best in terms of performance provided you can afford the temporary load of 250Mb in system memory
epasquier,

Im not realy happy - because i need to duplicate all 200 mb, to create a a string.

I made up some code that is a start - it finds all "A" in the text.

 
var 
 mem : TMemoryStream;
 ch : char;
 a:integer;

begin

"ReadMemFromDisk"....

memSize:=mem.Size;
    for i := 0 to memSize do  begin

     mem.read(ch,1);
     if ch='A' then inc(a);

    end;

Open in new window


But can i make it faster? Now i copy everything to the char CH - can i not search for a straight into the memory?
What happens if you use Pos with TStringStream?

There is a PAQ about finding text in TMemoryStream:
https://www.experts-exchange.com/questions/23979572/find-a-pattern-in-a-binary-stream.html#23157461
Have you already tried this from http://www.delphipages.com/forum/showthread.php?t=20735

function FindInMemStream(Stream: TMemoryStream; What: String):Integer;
var
  bufBuffer, bufBuffer2: array[0..254] of Char;
  i: Integer;
begin
  Result := 0;
  FillChar(bufBuffer, 255, #0);
  FillChar(bufBuffer2, 255, #0);
  StrPCopy(@bufBuffer2, What);
  Stream.Position:=0;
  while Stream.Position <> Stream.Size do
  begin
    Stream.Read(bufBuffer[0],Length(What));
    if CompareMem(@bufBuffer,@bufBuffer2,Length(What)) then
    begin
      Result := Stream.Position-Length(What);
      Exit;
    end;
  end;
end;

Open in new window

> Im not realy happy - because i need to duplicate all 200 mb, to create a a string.
that can hardly be helped if you want to re-use standard Delphi functions, that can operate only on String type.

>What happens if you use Pos with TStringStream?
You mean using Pos on the DataString property ? That would mean
a) copy the MemoryStream into the StringStream
b) copy the StringStream into a string (returned as DataString property)
so it would be even less optimized

> But can i make it faster?
> Now i copy everything to the char CH - can i not search for a straight into the memory?
Your only option would be to duplicate the Pos algorithm on the memory directly, you would then avoid the memory duplication but the search would be slower, unless you want to go ASM

One question : where does your MemoryStream comes from ? if it handles only text, then you'd better store that data into an ANSIString type in the first place, and use that in all your code
Here is the code of Pos (Delphi code) modified to use MemoryStream instead of ANSIString as source
It is about sure to be slower than Delphi Pos function that is implemented in ASM using extensively
REPNE SCASB

that instruction alone will perform the search of the initial character of the substring, so you can bet that modern CPU will take no more than one CPU tick per byte in your stream until it finds the substring. I'm not sure the code below can beat that, even if the memory has to be duplicated in memory first.
The answer to that question is merely : do you have 250Mo of extra memory to avoid caching the little time you do that copy ? if yes, then use the first code I provided. If no, then use the code below
function StreamPos(const SubStr:ANSIString; MemStr: TMemoryStream): Integer; 
var
  SubLen, SrcLen, Len, I, J: Integer;
  C1: AnsiChar;
  Str:pANSIChar;
begin
 Result := 0;
 if (Pointer(SubStr) = nil) or Not Assigned(MemStr) or (MemStr.Size=0) then Exit;
 SrcLen := Length(SubStr);
 SubLen := MemStr.Size;
 if (SubLen <= 0) or (SrcLen <= 0) or (SrcLen < SubLen) then Exit;
 // find SubStr[1] in Str[1 .. SrcLen - SubLen + 1]
 Len := SrcLen - SubLen + 1;
 C1 := SubStr[1];
 Str:=pANSIChar(MemStr.Memory);
 for I := 0 to Len - 1 do
  begin
   if Str[I] = C1 then
    begin
     Result := I + 1;
     for J := 1 to SubLen-1 do
      begin
       if Str[I+J] <> PAnsiChar(SubStr)[J] then
        begin
         Result := 0;
         break;
        end;
      end;
     if Result <> 0 then break;
    end;
  end;
end;

Open in new window

all measures have been done with performance counters, multiple times, so are accurate enough to the 1/100s. Harware : Core i7 920 @ 2,7 GHz

* time to generate 250Mb of random ANSIString, making sure the last 5 are unique : 1,50 s

* method 1 :
time to search the last 5 substring in it using Pos (ASM) : 0,120 s
time to copy 250Mb of data in another string : 0,138  s
total time : 0,26 s , eating up 250 Mb more memory

* method 2 (non ASM code) : 0,585 s
only advantage : no extra memory usage, twice the time
#erratum in StreamPos code : Line 9&10 , I inverted the SrcLen & SubLen

 SubLen := Length(SubStr);
 SrcLen := MemStr.Size;

Open in new window

here is the code I used to make the tests above (simplified, I remove all performance counting and display)
You'll see I used a StreamPos function that needs pANSIChar and length as source entry, instead of string or memorystream, that allows the use of that function for both cases, while being exactly the same inner working and performance
procedure TForm1.FormCreate(Sender: TObject);
Var
 i,P:integer;
begin
 P:=250000000;
 SetLength(S,P);
 for I := 1 to P-1 do S[i]:=ANSIChar(Random(100)+32);
 S[P]:=#140;
 Sub:=Copy(S,P-5,6); 
end;

procedure TForm1.btnPosClick(Sender: TObject);
Var
 i,P:integer;
 S2:ANSIString;
begin
 for i:=0 to 9 do
  begin
   S2:=S; S2[1]:=#100; // Delphi only duplicate strings if one change is made
  end;
 for i:=0 to 9 do P:=Pos(Sub,S2);
end;

function StreamPos(const SubStr:ANSIString; Str: pANSIChar; SrcLen:Integer ): Integer;
var
  SubLen, Len, I, J: Integer;
  C1: AnsiChar;
begin
 Result := 0;
 if (Pointer(SubStr) = nil) or (SrcLen<0) then Exit;
 SubLen := Length(SubStr);
 if (SrcLen < SubLen) then Exit;
 // find SubStr[1] in Str[1 .. SrcLen - SubLen + 1]
 Len := SrcLen - SubLen + 1;
 C1 := SubStr[1];
 for I := 0 to Len - 1 do
  begin
   if Str[I] = C1 then
    begin
     Result := I + 1;
     for J := 1 to SubLen-1 do
      begin
       if Str[I+J] <> PAnsiChar(SubStr)[J] then
        begin
         Result := 0;
         break;
        end;
      end;
     if Result <> 0 then break;
    end;
  end;
end;

procedure TForm1.btnStreamClick(Sender: TObject);
Var
 i,P:integer;
begin
 for i:=0 to 9 do P:=StreamPos(Sub,pANSIChar(S),Length(S));
end;

Open in new window

epasquier,

Looks very nice, but of what type is S and Sub ?
I guess Ansistring
It looks very nice, but can it handle case insensetive? I really need that.
ansi strings or unicode strings?
or you need both? (eg search for an ansi string into an unicode stream, or vice versa)
I guess from what i learned here, ansi is the fastest so il guess il go for that
ASKER CERTIFIED SOLUTION
Avatar of lomo74
lomo74
Flag of Italy image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
tried it myself --
just download zip file at http://cc.embarcadero.com/Item/18628
then in FastStrings.pas replace all occurrence of
string   ->  ansistring
char   ->  ansichar
pchar    ->  pansichar

and in initialization section replace call to Chr function with AnsiChar(I)

then try the sample below


program Project1;

{$APPTYPE CONSOLE}

uses
  SysUtils,
  FastStrings,
  Classes;

var
  s, f: AnsiString;
  m: TMemoryStream;
  jt: TBMJumpTable;
  p: Pointer;

begin
  try
    s := 'my very long text';
    f := 'long';
    m := TMemoryStream.Create;
    try
      //fill memory stream with something
      m.Size := Length(s);
      m.Write(PAnsiChar(s)^, Length(s));
      m.Position := 0;
      //prepare boyle-moore table and make call
      MakeBMTableNoCase(PAnsiChar(f), Length(f), jt);
      p := BMPosNoCase(m.Memory, PAnsiChar(f), m.Size, Length(f), jt);
      //found?
      if p = nil then
        Writeln('Match not found')
      else
        Writeln('Found match @ position ', Integer(p) - Integer(m.Memory) + 1);
    finally
      m.Free;
    end;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.

Open in new window

yes, ANSIString for all. Since you didn't tell and you use stream, it is only logical that the inner coding in the stream is ANSI, so you MUST not use Unicode => Conversions all over, twice the memory used...

> It looks very nice, but can it handle case insensetive? I really need that.
no, it does not, but that could be done easilyby changing the tests
C1 := ANSILowerCase(SubStr)[1];
C2 := UpCase (C1);
if (Str[I] = C1) Or (Str[I] = C2) then
...
and the same for the inner checking

Open in new window

Hi my friend,
There is a long discussion, but maybe you can solve your problem in another way. why you don't use TStrings ie:

var
  Txt:TStrings;
begin
  Txt:= TStringList.Create;
  Txt.LoadFromFile('C:\test.txt'); // assume your file is in 'C:\test.txt
  ShowMessage(IntToStr(Pos('SearchStr', Txt.Text))); // assume SearchStr is your search string
  Txt.Free;
end;
Hello and thanx all a lot. I will try and go for the Boyer-Moore search, its many times faster than the pos. Does anyone know if it can be modified so it also can handle NonCaseInsentive search with Swedish "ÅÄÖåäö" ?
yes, what about the following modification to FastStrings...

var
  I: Integer;
initialization
  for I := 0 to 255 do GUpcaseTable[I] := AnsiChar(I);
  CharUpperBuffA(@GUpcaseTable[0], 256);
  GUpcaseTable[Ord('å')] := 'Å';
  GUpcaseTable[Ord('ä')] := 'Ä';
  GUpcaseTable[Ord('ö')] := 'Ö';
  // and so on ...
  GUpcaseLUT := @GUpcaseTable[0];

also accent-insensitive search can be done:

var
  I: Integer;
initialization
  for I := 0 to 255 do GUpcaseTable[I] := AnsiChar(I);
  CharUpperBuffA(@GUpcaseTable[0], 256);
  GUpcaseTable[Ord('å')] := 'A';
  GUpcaseTable[Ord('ä')] := 'A';
  GUpcaseTable[Ord('ö')] := 'O';
  // and so on ...
  GUpcaseLUT := @GUpcaseTable[0];


Open in new window


instead of modifying GUpcaseTable you can define tables for each combination of accent and case sensitiveness e.g.
GCSAITable   case sensitive accent insensitive
GCIASTable   case insensitive accent sensitive
GCIAITable   case insensitive accent insensitive

then initialize them as shown above,

then create clones of the various functions, and replace references to GUpcaseLUT with appropriate table

eg.

clone procedure MakeBMTableNoCase to MakeBMTableNoCaseNoAccent, then replace
mov     EDX, GUpcaseLUT
with
mov     EDX, GCIAITable

and so on for every function you need...

HTH - Lorenzo -