Paer Toernell
asked on
How to search in a TmemoryStream
I have a quite big tmemorystream, about 250 mb of text. I want to search for a substring in this text and get the ofset, but i cant find any information about how this is done. I have a 64 bit system.
ASKER
Thanx, but how do i search?
How can i find a string in a TMemoryStream var?
http://www.delphipages.com/forum/showthread.php?t=20735
http://www.delphipages.com/forum/showthread.php?t=20735
You can use pos() to find the substring.
Can you clarify more please, what are you looking for in that string?
Can you clarify more please, what are you looking for in that string?
Here is an example of how to find sub-string:
procedure TForm1.Button1Click(Sender: TObject);
var
M:tmemorystream;
S:TStringStream;
begin
M:=tmemorystream.Create;
S:=TStringStream.Create;
M.LoadFromFile('C:\test.txt');
M.Position := 0;
S.LoadFromStream(M);
if pos('unit',S.DataString) > 0 then
begin
showmessage(IntToStr(Pos('unit', S.DataString)));
S.SaveToFile('C:\unit.txt');
end
else
showmessage('Not found');
M.Free;
S.Free;
end;
P.S: the TMemoryStream is used here just to demonstrate the conversion, but it could be eliminated as per my first post.
ASKER
JimyX,
(1)
Is the answer includes pos, pls clarify how pos is implemented to search in a tMemoryStream.
(2)
I am looking for a substring in a tMemoryStream.
(1)
Is the answer includes pos, pls clarify how pos is implemented to search in a tMemoryStream.
(2)
I am looking for a substring in a tMemoryStream.
ASKER
(sorry should be:)
(1)
If the answer includes pos, pls clarify how pos is implemented to search in a tMemoryStream.
(1)
If the answer includes pos, pls clarify how pos is implemented to search in a tMemoryStream.
Pos will be used with TStringStream not TMemoryStream.
If you are looking for sub-string in the TMemoryStream then you need to search (seek) by position:
If you are looking for sub-string in the TMemoryStream then you need to search (seek) by position:
procedure TForm1.Button2Click(Sender: TObject);
var
M:tmemorystream;
S:TStringStream;
begin
M:=tmemorystream.Create;
S:=TStringStream.Create;
M.LoadFromFile('C:\test.txt');
M.Position := 0;
// copy 4 bytes starting from the location 123
M.Seek(123, soFromBeginning);
S.CopyFrom(M,4);
showmessage(S.DataString);
M.Free;
S.Free;
end;
ASKER
jimyX,
I still don't see any implementation of POS. I guess i can load the Stream into a TStringStream, but can pos really handle a 200 mb string?
I still don't see any implementation of POS. I guess i can load the Stream into a TStringStream, but can pos really handle a 200 mb string?
Pos function is used to search a substring in a string.
That is the only way to easily locate a string within a text, so you have to first load your whole stream into a string variable.
let's say it is an ANSI string and not a Wide String :
That is the only way to easily locate a string within a text, so you have to first load your whole stream into a string variable.
let's say it is an ANSI string and not a Wide String :
Function LocateInStream(M:TMemoryString;SubStr:ANSIString):Integer;
Var
S:ANSIString;
begin
SetLength(S,M.Size);
M.Position:=0;
M.ReadBuffer(S[1],M.Size); // That loads the stream into a (ANSI)string variable
Result:=Pos(SubStr,S);
end;
Of course, that will put some load in the memory, and you could well read the stream piece by piece and search using Pos in fragment of the stream instead of the whole string. It would be a bit more complex, while the general idea is exactly the same as above, and the above code is the best in terms of performance provided you can afford the temporary load of 250Mb in system memory
ASKER
epasquier,
Im not realy happy - because i need to duplicate all 200 mb, to create a a string.
I made up some code that is a start - it finds all "A" in the text.
But can i make it faster? Now i copy everything to the char CH - can i not search for a straight into the memory?
Im not realy happy - because i need to duplicate all 200 mb, to create a a string.
I made up some code that is a start - it finds all "A" in the text.
var
mem : TMemoryStream;
ch : char;
a:integer;
begin
"ReadMemFromDisk"....
memSize:=mem.Size;
for i := 0 to memSize do begin
mem.read(ch,1);
if ch='A' then inc(a);
end;
But can i make it faster? Now i copy everything to the char CH - can i not search for a straight into the memory?
What happens if you use Pos with TStringStream?
There is a PAQ about finding text in TMemoryStream:
https://www.experts-exchange.com/questions/23979572/find-a-pattern-in-a-binary-stream.html#23157461
There is a PAQ about finding text in TMemoryStream:
https://www.experts-exchange.com/questions/23979572/find-a-pattern-in-a-binary-stream.html#23157461
Have you already tried this from http://www.delphipages.com/forum/showthread.php?t=20735
function FindInMemStream(Stream: TMemoryStream; What: String):Integer;
var
bufBuffer, bufBuffer2: array[0..254] of Char;
i: Integer;
begin
Result := 0;
FillChar(bufBuffer, 255, #0);
FillChar(bufBuffer2, 255, #0);
StrPCopy(@bufBuffer2, What);
Stream.Position:=0;
while Stream.Position <> Stream.Size do
begin
Stream.Read(bufBuffer[0],Length(What));
if CompareMem(@bufBuffer,@bufBuffer2,Length(What)) then
begin
Result := Stream.Position-Length(What);
Exit;
end;
end;
end;
> Im not realy happy - because i need to duplicate all 200 mb, to create a a string.
that can hardly be helped if you want to re-use standard Delphi functions, that can operate only on String type.
>What happens if you use Pos with TStringStream?
You mean using Pos on the DataString property ? That would mean
a) copy the MemoryStream into the StringStream
b) copy the StringStream into a string (returned as DataString property)
so it would be even less optimized
> But can i make it faster?
> Now i copy everything to the char CH - can i not search for a straight into the memory?
Your only option would be to duplicate the Pos algorithm on the memory directly, you would then avoid the memory duplication but the search would be slower, unless you want to go ASM
One question : where does your MemoryStream comes from ? if it handles only text, then you'd better store that data into an ANSIString type in the first place, and use that in all your code
that can hardly be helped if you want to re-use standard Delphi functions, that can operate only on String type.
>What happens if you use Pos with TStringStream?
You mean using Pos on the DataString property ? That would mean
a) copy the MemoryStream into the StringStream
b) copy the StringStream into a string (returned as DataString property)
so it would be even less optimized
> But can i make it faster?
> Now i copy everything to the char CH - can i not search for a straight into the memory?
Your only option would be to duplicate the Pos algorithm on the memory directly, you would then avoid the memory duplication but the search would be slower, unless you want to go ASM
One question : where does your MemoryStream comes from ? if it handles only text, then you'd better store that data into an ANSIString type in the first place, and use that in all your code
Here is the code of Pos (Delphi code) modified to use MemoryStream instead of ANSIString as source
It is about sure to be slower than Delphi Pos function that is implemented in ASM using extensively
REPNE SCASB
that instruction alone will perform the search of the initial character of the substring, so you can bet that modern CPU will take no more than one CPU tick per byte in your stream until it finds the substring. I'm not sure the code below can beat that, even if the memory has to be duplicated in memory first.
The answer to that question is merely : do you have 250Mo of extra memory to avoid caching the little time you do that copy ? if yes, then use the first code I provided. If no, then use the code below
It is about sure to be slower than Delphi Pos function that is implemented in ASM using extensively
REPNE SCASB
that instruction alone will perform the search of the initial character of the substring, so you can bet that modern CPU will take no more than one CPU tick per byte in your stream until it finds the substring. I'm not sure the code below can beat that, even if the memory has to be duplicated in memory first.
The answer to that question is merely : do you have 250Mo of extra memory to avoid caching the little time you do that copy ? if yes, then use the first code I provided. If no, then use the code below
function StreamPos(const SubStr:ANSIString; MemStr: TMemoryStream): Integer;
var
SubLen, SrcLen, Len, I, J: Integer;
C1: AnsiChar;
Str:pANSIChar;
begin
Result := 0;
if (Pointer(SubStr) = nil) or Not Assigned(MemStr) or (MemStr.Size=0) then Exit;
SrcLen := Length(SubStr);
SubLen := MemStr.Size;
if (SubLen <= 0) or (SrcLen <= 0) or (SrcLen < SubLen) then Exit;
// find SubStr[1] in Str[1 .. SrcLen - SubLen + 1]
Len := SrcLen - SubLen + 1;
C1 := SubStr[1];
Str:=pANSIChar(MemStr.Memory);
for I := 0 to Len - 1 do
begin
if Str[I] = C1 then
begin
Result := I + 1;
for J := 1 to SubLen-1 do
begin
if Str[I+J] <> PAnsiChar(SubStr)[J] then
begin
Result := 0;
break;
end;
end;
if Result <> 0 then break;
end;
end;
end;
all measures have been done with performance counters, multiple times, so are accurate enough to the 1/100s. Harware : Core i7 920 @ 2,7 GHz
* time to generate 250Mb of random ANSIString, making sure the last 5 are unique : 1,50 s
* method 1 :
time to search the last 5 substring in it using Pos (ASM) : 0,120 s
time to copy 250Mb of data in another string : 0,138 s
total time : 0,26 s , eating up 250 Mb more memory
* method 2 (non ASM code) : 0,585 s
only advantage : no extra memory usage, twice the time
* time to generate 250Mb of random ANSIString, making sure the last 5 are unique : 1,50 s
* method 1 :
time to search the last 5 substring in it using Pos (ASM) : 0,120 s
time to copy 250Mb of data in another string : 0,138 s
total time : 0,26 s , eating up 250 Mb more memory
* method 2 (non ASM code) : 0,585 s
only advantage : no extra memory usage, twice the time
#erratum in StreamPos code : Line 9&10 , I inverted the SrcLen & SubLen
SubLen := Length(SubStr);
SrcLen := MemStr.Size;
here is the code I used to make the tests above (simplified, I remove all performance counting and display)
You'll see I used a StreamPos function that needs pANSIChar and length as source entry, instead of string or memorystream, that allows the use of that function for both cases, while being exactly the same inner working and performance
You'll see I used a StreamPos function that needs pANSIChar and length as source entry, instead of string or memorystream, that allows the use of that function for both cases, while being exactly the same inner working and performance
procedure TForm1.FormCreate(Sender: TObject);
Var
i,P:integer;
begin
P:=250000000;
SetLength(S,P);
for I := 1 to P-1 do S[i]:=ANSIChar(Random(100)+32);
S[P]:=#140;
Sub:=Copy(S,P-5,6);
end;
procedure TForm1.btnPosClick(Sender: TObject);
Var
i,P:integer;
S2:ANSIString;
begin
for i:=0 to 9 do
begin
S2:=S; S2[1]:=#100; // Delphi only duplicate strings if one change is made
end;
for i:=0 to 9 do P:=Pos(Sub,S2);
end;
function StreamPos(const SubStr:ANSIString; Str: pANSIChar; SrcLen:Integer ): Integer;
var
SubLen, Len, I, J: Integer;
C1: AnsiChar;
begin
Result := 0;
if (Pointer(SubStr) = nil) or (SrcLen<0) then Exit;
SubLen := Length(SubStr);
if (SrcLen < SubLen) then Exit;
// find SubStr[1] in Str[1 .. SrcLen - SubLen + 1]
Len := SrcLen - SubLen + 1;
C1 := SubStr[1];
for I := 0 to Len - 1 do
begin
if Str[I] = C1 then
begin
Result := I + 1;
for J := 1 to SubLen-1 do
begin
if Str[I+J] <> PAnsiChar(SubStr)[J] then
begin
Result := 0;
break;
end;
end;
if Result <> 0 then break;
end;
end;
end;
procedure TForm1.btnStreamClick(Sender: TObject);
Var
i,P:integer;
begin
for i:=0 to 9 do P:=StreamPos(Sub,pANSIChar(S),Length(S));
end;
ASKER
epasquier,
Looks very nice, but of what type is S and Sub ?
Looks very nice, but of what type is S and Sub ?
ASKER
I guess Ansistring
ASKER
It looks very nice, but can it handle case insensetive? I really need that.
ansi strings or unicode strings?
or you need both? (eg search for an ansi string into an unicode stream, or vice versa)
or you need both? (eg search for an ansi string into an unicode stream, or vice versa)
ASKER
I guess from what i learned here, ansi is the fastest so il guess il go for that
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
tried it myself --
just download zip file at http://cc.embarcadero.com/Item/18628
then in FastStrings.pas replace all occurrence of
string -> ansistring
char -> ansichar
pchar -> pansichar
and in initialization section replace call to Chr function with AnsiChar(I)
then try the sample below
just download zip file at http://cc.embarcadero.com/Item/18628
then in FastStrings.pas replace all occurrence of
string -> ansistring
char -> ansichar
pchar -> pansichar
and in initialization section replace call to Chr function with AnsiChar(I)
then try the sample below
program Project1;
{$APPTYPE CONSOLE}
uses
SysUtils,
FastStrings,
Classes;
var
s, f: AnsiString;
m: TMemoryStream;
jt: TBMJumpTable;
p: Pointer;
begin
try
s := 'my very long text';
f := 'long';
m := TMemoryStream.Create;
try
//fill memory stream with something
m.Size := Length(s);
m.Write(PAnsiChar(s)^, Length(s));
m.Position := 0;
//prepare boyle-moore table and make call
MakeBMTableNoCase(PAnsiChar(f), Length(f), jt);
p := BMPosNoCase(m.Memory, PAnsiChar(f), m.Size, Length(f), jt);
//found?
if p = nil then
Writeln('Match not found')
else
Writeln('Found match @ position ', Integer(p) - Integer(m.Memory) + 1);
finally
m.Free;
end;
except
on E: Exception do
Writeln(E.ClassName, ': ', E.Message);
end;
end.
yes, ANSIString for all. Since you didn't tell and you use stream, it is only logical that the inner coding in the stream is ANSI, so you MUST not use Unicode => Conversions all over, twice the memory used...
> It looks very nice, but can it handle case insensetive? I really need that.
no, it does not, but that could be done easilyby changing the tests
> It looks very nice, but can it handle case insensetive? I really need that.
no, it does not, but that could be done easilyby changing the tests
C1 := ANSILowerCase(SubStr)[1];
C2 := UpCase (C1);
if (Str[I] = C1) Or (Str[I] = C2) then
...
and the same for the inner checking
Hi my friend,
There is a long discussion, but maybe you can solve your problem in another way. why you don't use TStrings ie:
var
Txt:TStrings;
begin
Txt:= TStringList.Create;
Txt.LoadFromFile('C:\test. txt'); // assume your file is in 'C:\test.txt
ShowMessage(IntToStr(Pos(' SearchStr' , Txt.Text))); // assume SearchStr is your search string
Txt.Free;
end;
There is a long discussion, but maybe you can solve your problem in another way. why you don't use TStrings ie:
var
Txt:TStrings;
begin
Txt:= TStringList.Create;
Txt.LoadFromFile('C:\test.
ShowMessage(IntToStr(Pos('
Txt.Free;
end;
ASKER
Hello and thanx all a lot. I will try and go for the Boyer-Moore search, its many times faster than the pos. Does anyone know if it can be modified so it also can handle NonCaseInsentive search with Swedish "ÅÄÖåäö" ?
yes, what about the following modification to FastStrings...
instead of modifying GUpcaseTable you can define tables for each combination of accent and case sensitiveness e.g.
GCSAITable case sensitive accent insensitive
GCIASTable case insensitive accent sensitive
GCIAITable case insensitive accent insensitive
then initialize them as shown above,
then create clones of the various functions, and replace references to GUpcaseLUT with appropriate table
eg.
clone procedure MakeBMTableNoCase to MakeBMTableNoCaseNoAccent, then replace
mov EDX, GUpcaseLUT
with
mov EDX, GCIAITable
and so on for every function you need...
HTH - Lorenzo -
var
I: Integer;
initialization
for I := 0 to 255 do GUpcaseTable[I] := AnsiChar(I);
CharUpperBuffA(@GUpcaseTable[0], 256);
GUpcaseTable[Ord('å')] := 'Å';
GUpcaseTable[Ord('ä')] := 'Ä';
GUpcaseTable[Ord('ö')] := 'Ö';
// and so on ...
GUpcaseLUT := @GUpcaseTable[0];
also accent-insensitive search can be done:
var
I: Integer;
initialization
for I := 0 to 255 do GUpcaseTable[I] := AnsiChar(I);
CharUpperBuffA(@GUpcaseTable[0], 256);
GUpcaseTable[Ord('å')] := 'A';
GUpcaseTable[Ord('ä')] := 'A';
GUpcaseTable[Ord('ö')] := 'O';
// and so on ...
GUpcaseLUT := @GUpcaseTable[0];
instead of modifying GUpcaseTable you can define tables for each combination of accent and case sensitiveness e.g.
GCSAITable case sensitive accent insensitive
GCIASTable case insensitive accent sensitive
GCIAITable case insensitive accent insensitive
then initialize them as shown above,
then create clones of the various functions, and replace references to GUpcaseLUT with appropriate table
eg.
clone procedure MakeBMTableNoCase to MakeBMTableNoCaseNoAccent,
mov EDX, GUpcaseLUT
with
mov EDX, GCIAITable
and so on for every function you need...
HTH - Lorenzo -
Open in new window