Solved

How to search in a TmemoryStream

Posted on 2011-03-22
29
1,662 Views
Last Modified: 2012-05-11
I have a quite big tmemorystream, about 250 mb of text. I want to search for a substring in this text and get the ofset, but i cant find any information about how this is done. I have a 64 bit system.
0
Comment
Question by:Sunsales
  • 10
  • 7
  • 5
  • +3
29 Comments
 
LVL 24

Expert Comment

by:jimyX
ID: 35190693
You can use TStringStream instead, and use the property DataString to read the data.
procedure TForm1.Button1Click(Sender: TObject);
var
  S:TStringStream;
begin
  S:=TStringStream.Create;
  S.LoadFromFile('C:\test.txt');
  // handle the string through the property DataString
  S.DataString;
  ...
  S.Free;
end;

Open in new window

0
 

Author Comment

by:Sunsales
ID: 35190914
Thanx, but how do i search?
0
 
LVL 19

Expert Comment

by:Thommy
ID: 35191016
How can i find a string in a TMemoryStream var?
http://www.delphipages.com/forum/showthread.php?t=20735
0
 
LVL 24

Expert Comment

by:jimyX
ID: 35191068
You can use pos() to find the substring.
Can you clarify more please, what are you looking for in that string?
0
 
LVL 24

Expert Comment

by:jimyX
ID: 35191182
Here is an example of how to find sub-string:
procedure TForm1.Button1Click(Sender: TObject);
var
  M:tmemorystream;
  S:TStringStream;
begin
  M:=tmemorystream.Create;
  S:=TStringStream.Create;
  M.LoadFromFile('C:\test.txt');
  M.Position := 0;
  S.LoadFromStream(M);
  if pos('unit',S.DataString) > 0 then
    begin
      showmessage(IntToStr(Pos('unit', S.DataString)));
      S.SaveToFile('C:\unit.txt');
    end
  else
    showmessage('Not found');
  M.Free;
  S.Free;
end;

Open in new window

P.S: the TMemoryStream is used here just to demonstrate the conversion, but it could be eliminated as per my first post.
0
 

Author Comment

by:Sunsales
ID: 35191231
JimyX,

(1)
Is the answer includes pos, pls clarify how pos is implemented to search in a tMemoryStream.

(2)
I am looking for a substring in a tMemoryStream.
0
 

Author Comment

by:Sunsales
ID: 35191239
(sorry should be:)
(1)
If the answer includes pos, pls clarify how pos is implemented to search in a tMemoryStream.
0
 
LVL 24

Expert Comment

by:jimyX
ID: 35191444
Pos will be used with TStringStream not TMemoryStream.
If you are looking for sub-string in the TMemoryStream then you need to search (seek) by position:
procedure TForm1.Button2Click(Sender: TObject);
var
  M:tmemorystream;
  S:TStringStream;
begin
  M:=tmemorystream.Create;
  S:=TStringStream.Create;
  M.LoadFromFile('C:\test.txt');
  M.Position := 0;

  // copy 4 bytes starting from the location 123
  M.Seek(123, soFromBeginning);
  S.CopyFrom(M,4);

  showmessage(S.DataString);
  M.Free;
  S.Free;
end;

Open in new window

0
 

Author Comment

by:Sunsales
ID: 35191544
jimyX,

I still don't see any implementation of POS. I guess i can load the Stream into a TStringStream, but can pos really handle a 200 mb string?
0
 
LVL 25

Expert Comment

by:epasquier
ID: 35191668
Pos function is used to search a substring in a string.
That is the only way to easily locate a string within a text, so you have to first load your whole stream into a string variable.

let's say it is an ANSI string and not a Wide String :

 
Function LocateInStream(M:TMemoryString;SubStr:ANSIString):Integer;
Var
 S:ANSIString;
begin
 SetLength(S,M.Size);
 M.Position:=0;
 M.ReadBuffer(S[1],M.Size); // That loads the stream into a (ANSI)string variable
 Result:=Pos(SubStr,S);
end;

Open in new window

0
 
LVL 25

Expert Comment

by:epasquier
ID: 35191714
Of course, that will put some load in the memory, and you could well read the stream piece by piece and search using Pos in fragment of the stream instead of the whole string. It would be a bit more complex, while the general idea is exactly the same as above, and the above code is the best in terms of performance provided you can afford the temporary load of 250Mb in system memory
0
 

Author Comment

by:Sunsales
ID: 35191980
epasquier,

Im not realy happy - because i need to duplicate all 200 mb, to create a a string.

I made up some code that is a start - it finds all "A" in the text.

 
var 
 mem : TMemoryStream;
 ch : char;
 a:integer;

begin

"ReadMemFromDisk"....

memSize:=mem.Size;
    for i := 0 to memSize do  begin

     mem.read(ch,1);
     if ch='A' then inc(a);

    end;

Open in new window


But can i make it faster? Now i copy everything to the char CH - can i not search for a straight into the memory?
0
 
LVL 24

Expert Comment

by:jimyX
ID: 35192037
What happens if you use Pos with TStringStream?

There is a PAQ about finding text in TMemoryStream:
http://www.experts-exchange.com/Programming/Editors_IDEs/Delphi/Q_23979572.html#23157461
0
 
LVL 19

Expert Comment

by:Thommy
ID: 35196535
Have you already tried this from http://www.delphipages.com/forum/showthread.php?t=20735

function FindInMemStream(Stream: TMemoryStream; What: String):Integer;
var
  bufBuffer, bufBuffer2: array[0..254] of Char;
  i: Integer;
begin
  Result := 0;
  FillChar(bufBuffer, 255, #0);
  FillChar(bufBuffer2, 255, #0);
  StrPCopy(@bufBuffer2, What);
  Stream.Position:=0;
  while Stream.Position <> Stream.Size do
  begin
    Stream.Read(bufBuffer[0],Length(What));
    if CompareMem(@bufBuffer,@bufBuffer2,Length(What)) then
    begin
      Result := Stream.Position-Length(What);
      Exit;
    end;
  end;
end;

Open in new window

0
What Should I Do With This Threat Intelligence?

Are you wondering if you actually need threat intelligence? The answer is yes. We explain the basics for creating useful threat intelligence.

 
LVL 25

Expert Comment

by:epasquier
ID: 35197042
> Im not realy happy - because i need to duplicate all 200 mb, to create a a string.
that can hardly be helped if you want to re-use standard Delphi functions, that can operate only on String type.

>What happens if you use Pos with TStringStream?
You mean using Pos on the DataString property ? That would mean
a) copy the MemoryStream into the StringStream
b) copy the StringStream into a string (returned as DataString property)
so it would be even less optimized

> But can i make it faster?
> Now i copy everything to the char CH - can i not search for a straight into the memory?
Your only option would be to duplicate the Pos algorithm on the memory directly, you would then avoid the memory duplication but the search would be slower, unless you want to go ASM

One question : where does your MemoryStream comes from ? if it handles only text, then you'd better store that data into an ANSIString type in the first place, and use that in all your code
0
 
LVL 25

Expert Comment

by:epasquier
ID: 35197140
Here is the code of Pos (Delphi code) modified to use MemoryStream instead of ANSIString as source
It is about sure to be slower than Delphi Pos function that is implemented in ASM using extensively
REPNE SCASB

that instruction alone will perform the search of the initial character of the substring, so you can bet that modern CPU will take no more than one CPU tick per byte in your stream until it finds the substring. I'm not sure the code below can beat that, even if the memory has to be duplicated in memory first.
The answer to that question is merely : do you have 250Mo of extra memory to avoid caching the little time you do that copy ? if yes, then use the first code I provided. If no, then use the code below
function StreamPos(const SubStr:ANSIString; MemStr: TMemoryStream): Integer; 
var
  SubLen, SrcLen, Len, I, J: Integer;
  C1: AnsiChar;
  Str:pANSIChar;
begin
 Result := 0;
 if (Pointer(SubStr) = nil) or Not Assigned(MemStr) or (MemStr.Size=0) then Exit;
 SrcLen := Length(SubStr);
 SubLen := MemStr.Size;
 if (SubLen <= 0) or (SrcLen <= 0) or (SrcLen < SubLen) then Exit;
 // find SubStr[1] in Str[1 .. SrcLen - SubLen + 1]
 Len := SrcLen - SubLen + 1;
 C1 := SubStr[1];
 Str:=pANSIChar(MemStr.Memory);
 for I := 0 to Len - 1 do
  begin
   if Str[I] = C1 then
    begin
     Result := I + 1;
     for J := 1 to SubLen-1 do
      begin
       if Str[I+J] <> PAnsiChar(SubStr)[J] then
        begin
         Result := 0;
         break;
        end;
      end;
     if Result <> 0 then break;
    end;
  end;
end;

Open in new window

0
 
LVL 25

Expert Comment

by:epasquier
ID: 35197366
all measures have been done with performance counters, multiple times, so are accurate enough to the 1/100s. Harware : Core i7 920 @ 2,7 GHz

* time to generate 250Mb of random ANSIString, making sure the last 5 are unique : 1,50 s

* method 1 :
time to search the last 5 substring in it using Pos (ASM) : 0,120 s
time to copy 250Mb of data in another string : 0,138  s
total time : 0,26 s , eating up 250 Mb more memory

* method 2 (non ASM code) : 0,585 s
only advantage : no extra memory usage, twice the time
#erratum in StreamPos code : Line 9&10 , I inverted the SrcLen & SubLen

 SubLen := Length(SubStr);
 SrcLen := MemStr.Size;

Open in new window

0
 
LVL 25

Expert Comment

by:epasquier
ID: 35197375
here is the code I used to make the tests above (simplified, I remove all performance counting and display)
You'll see I used a StreamPos function that needs pANSIChar and length as source entry, instead of string or memorystream, that allows the use of that function for both cases, while being exactly the same inner working and performance
procedure TForm1.FormCreate(Sender: TObject);
Var
 i,P:integer;
begin
 P:=250000000;
 SetLength(S,P);
 for I := 1 to P-1 do S[i]:=ANSIChar(Random(100)+32);
 S[P]:=#140;
 Sub:=Copy(S,P-5,6); 
end;

procedure TForm1.btnPosClick(Sender: TObject);
Var
 i,P:integer;
 S2:ANSIString;
begin
 for i:=0 to 9 do
  begin
   S2:=S; S2[1]:=#100; // Delphi only duplicate strings if one change is made
  end;
 for i:=0 to 9 do P:=Pos(Sub,S2);
end;

function StreamPos(const SubStr:ANSIString; Str: pANSIChar; SrcLen:Integer ): Integer;
var
  SubLen, Len, I, J: Integer;
  C1: AnsiChar;
begin
 Result := 0;
 if (Pointer(SubStr) = nil) or (SrcLen<0) then Exit;
 SubLen := Length(SubStr);
 if (SrcLen < SubLen) then Exit;
 // find SubStr[1] in Str[1 .. SrcLen - SubLen + 1]
 Len := SrcLen - SubLen + 1;
 C1 := SubStr[1];
 for I := 0 to Len - 1 do
  begin
   if Str[I] = C1 then
    begin
     Result := I + 1;
     for J := 1 to SubLen-1 do
      begin
       if Str[I+J] <> PAnsiChar(SubStr)[J] then
        begin
         Result := 0;
         break;
        end;
      end;
     if Result <> 0 then break;
    end;
  end;
end;

procedure TForm1.btnStreamClick(Sender: TObject);
Var
 i,P:integer;
begin
 for i:=0 to 9 do P:=StreamPos(Sub,pANSIChar(S),Length(S));
end;

Open in new window

0
 

Author Comment

by:Sunsales
ID: 35198225
epasquier,

Looks very nice, but of what type is S and Sub ?
0
 

Author Comment

by:Sunsales
ID: 35198251
I guess Ansistring
0
 

Author Comment

by:Sunsales
ID: 35198284
It looks very nice, but can it handle case insensetive? I really need that.
0
 
LVL 8

Expert Comment

by:lomo74
ID: 35199565
ansi strings or unicode strings?
or you need both? (eg search for an ansi string into an unicode stream, or vice versa)
0
 

Author Comment

by:Sunsales
ID: 35199608
I guess from what i learned here, ansi is the fastest so il guess il go for that
0
 
LVL 8

Accepted Solution

by:
lomo74 earned 500 total points
ID: 35199784
ok, so what about a modified version of Boyer-Moore.
it's known to be the fastest text search algorithm out there

it should be easy to port this one on D2009+ (which work with unicode by default) so it works on ansistrings instead:
http://cc.embarcadero.com/Item/18628

0
 
LVL 8

Expert Comment

by:lomo74
ID: 35200297
tried it myself --
just download zip file at http://cc.embarcadero.com/Item/18628
then in FastStrings.pas replace all occurrence of
string   ->  ansistring
char   ->  ansichar
pchar    ->  pansichar

and in initialization section replace call to Chr function with AnsiChar(I)

then try the sample below


program Project1;

{$APPTYPE CONSOLE}

uses
  SysUtils,
  FastStrings,
  Classes;

var
  s, f: AnsiString;
  m: TMemoryStream;
  jt: TBMJumpTable;
  p: Pointer;

begin
  try
    s := 'my very long text';
    f := 'long';
    m := TMemoryStream.Create;
    try
      //fill memory stream with something
      m.Size := Length(s);
      m.Write(PAnsiChar(s)^, Length(s));
      m.Position := 0;
      //prepare boyle-moore table and make call
      MakeBMTableNoCase(PAnsiChar(f), Length(f), jt);
      p := BMPosNoCase(m.Memory, PAnsiChar(f), m.Size, Length(f), jt);
      //found?
      if p = nil then
        Writeln('Match not found')
      else
        Writeln('Found match @ position ', Integer(p) - Integer(m.Memory) + 1);
    finally
      m.Free;
    end;
  except
    on E: Exception do
      Writeln(E.ClassName, ': ', E.Message);
  end;
end.

Open in new window

0
 
LVL 25

Expert Comment

by:epasquier
ID: 35200708
yes, ANSIString for all. Since you didn't tell and you use stream, it is only logical that the inner coding in the stream is ANSI, so you MUST not use Unicode => Conversions all over, twice the memory used...

> It looks very nice, but can it handle case insensetive? I really need that.
no, it does not, but that could be done easilyby changing the tests
C1 := ANSILowerCase(SubStr)[1];
C2 := UpCase (C1);
if (Str[I] = C1) Or (Str[I] = C2) then
...
and the same for the inner checking

Open in new window

0
 
LVL 2

Expert Comment

by:RezaSadigh
ID: 35220765
Hi my friend,
There is a long discussion, but maybe you can solve your problem in another way. why you don't use TStrings ie:

var
  Txt:TStrings;
begin
  Txt:= TStringList.Create;
  Txt.LoadFromFile('C:\test.txt'); // assume your file is in 'C:\test.txt
  ShowMessage(IntToStr(Pos('SearchStr', Txt.Text))); // assume SearchStr is your search string
  Txt.Free;
end;
0
 

Author Comment

by:Sunsales
ID: 35223352
Hello and thanx all a lot. I will try and go for the Boyer-Moore search, its many times faster than the pos. Does anyone know if it can be modified so it also can handle NonCaseInsentive search with Swedish "ÅÄÖåäö" ?
0
 
LVL 8

Expert Comment

by:lomo74
ID: 35238767
yes, what about the following modification to FastStrings...

var
  I: Integer;
initialization
  for I := 0 to 255 do GUpcaseTable[I] := AnsiChar(I);
  CharUpperBuffA(@GUpcaseTable[0], 256);
  GUpcaseTable[Ord('å')] := 'Å';
  GUpcaseTable[Ord('ä')] := 'Ä';
  GUpcaseTable[Ord('ö')] := 'Ö';
  // and so on ...
  GUpcaseLUT := @GUpcaseTable[0];

also accent-insensitive search can be done:

var
  I: Integer;
initialization
  for I := 0 to 255 do GUpcaseTable[I] := AnsiChar(I);
  CharUpperBuffA(@GUpcaseTable[0], 256);
  GUpcaseTable[Ord('å')] := 'A';
  GUpcaseTable[Ord('ä')] := 'A';
  GUpcaseTable[Ord('ö')] := 'O';
  // and so on ...
  GUpcaseLUT := @GUpcaseTable[0];


Open in new window


instead of modifying GUpcaseTable you can define tables for each combination of accent and case sensitiveness e.g.
GCSAITable   case sensitive accent insensitive
GCIASTable   case insensitive accent sensitive
GCIAITable   case insensitive accent insensitive

then initialize them as shown above,

then create clones of the various functions, and replace references to GUpcaseLUT with appropriate table

eg.

clone procedure MakeBMTableNoCase to MakeBMTableNoCaseNoAccent, then replace
mov     EDX, GUpcaseLUT
with
mov     EDX, GCIAITable

and so on for every function you need...

HTH - Lorenzo -
0

Featured Post

Why You Should Analyze Threat Actor TTPs

After years of analyzing threat actor behavior, it’s become clear that at any given time there are specific tactics, techniques, and procedures (TTPs) that are particularly prevalent. By analyzing and understanding these TTPs, you can dramatically enhance your security program.

Join & Write a Comment

Introduction The parallel port is a very commonly known port, it was widely used to connect a printer to the PC, if you look at the back of your computer, for those who don't have newer computers, there will be a port with 25 pins and a small print…
Introduction I have seen many questions in this Delphi topic area where queries in threads are needed or suggested. I know bumped into a similar need. This article will address some of the concepts when dealing with a multithreaded delphi database…
In this seventh video of the Xpdf series, we discuss and demonstrate the PDFfonts utility, which lists all the fonts used in a PDF file. It does this via a command line interface, making it suitable for use in programs, scripts, batch files — any pl…
This demo shows you how to set up the containerized NetScaler CPX with NetScaler Management and Analytics System in a non-routable Mesos/Marathon environment for use with Micro-Services applications.

746 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

11 Experts available now in Live!

Get 1:1 Help Now