Solved: need a fast search and replace in a very large file (4mb)

string manipulaton is verys low because each time you modify a string memory is reallocated. when you add to the string it is the most consuming.

I suggest rewriting your code to do something like this:
- read from file a small buffer (of type string, for being easy. since it's small, the speed is not a big problem. lets say the minimum size of a buffer is 1024 characters)
- if the buffer does not contains enough data for one operation (like in your case replace a stockholder, or whatever), read some more into the buffer
- if you reached end of file, break (no sense in reading nothing forever :P)
- do your operation (reaplce, whatever)
- write the modified data (not entire buffer) to file (not same file, anotehr one whichyou create empty)

the idea is this: you basically copy the file, but while doing so, you also operate the modifications you want ;)

if you need to keep going through that 7000+ whatevr file, then you should definetly:
- either load it in memory (if it's not too big)
- import it into some database-like system (for example access, or sql lite, or whatever) so that you can do sql selects, which will be faster then you liniarly searching for stuff.

now, in case you just need to take the first element from the 7000+ file and reaplce every occurance in the 4 MB file, then you don't do the above loading/importing, but instead read it one record at a time and :

- load the entire 4 mb file in memory
- while loading it, save into a list the index in the file of the data that needs to be replaced, and what it is
- after this is done, in another list you will save the actual modified data.

example.
at index 10 you need to change A
at index 15 you need to change B
at index 23 you need to change A
in the list you have ((10, A), (15, B), (23, A)) (use a TList and a record for example. or use 2 lists, one for integers and one for strings. which ever is easier for you.
then you load the 7000+ file and see that
A needs to be changed into Y. so you iterate through the list (or the string list) and change A to Y). same for B.
then you save the file like this:

repeat
copy from original file to new file every byte until the next index from the list
save into new file the modified value
until end of file

I don't have time right now to make a complete demo, but if you are having trouble with some specific point, just ask.

veaceslavz

Does you need application or just combine these 2 files?
Try to open txt file in MSExcel2007 which limitation is 1mln rows and use VLOOKUP to combine these 2 files

stuayre

ASKER

ciuly: thanks, I will try your suggestions, I had the large txt file loaded in a memo and was reading each line out and doing the replace but it was still slow.

veaceslavz: I could do that but I need to write something programmaticly, it needs to sit in the background on my pc and do the job repeatedly.

2266180

memo is visual it eats up unnecessary resources. use a TStringList instead which is non-visual. if you want to modify directly in the stringlist, you will aoso have to do soemthing like this:

list.beginupdate;
try
operate many operations on the list
finally
list.endupdate;
end;

developmentguru

I would recommend using a TFileStream to open and read your file. While reading from the stream make your changes saving to another TFileStream. I would use parsing to locate your replacement then instead of writing it out as is write out the replacement value.

A file stream takes care of buffering a set amount of the file read in memory so it is fast. This also allows you to do a parsing technique without worrying about the speed impact.

If you are in control of the replacement strings I would make a change there as well. Instead of using what you showed %%%stock_DJB400%%% use something that better indicates the start and stop of the replacement area such as XML <TagA></TagA> or just <<<stock_DJB400>>> The indicators then show begin and end. Just a suggestion.

If you want to give me an example replacement file and an example file that contains the replacements to be done, I will do an example for you.

2266180

>> A file stream takes care of buffering a set amount of the file read in memory so it is fast.

do you have anything to back this up? because I looked at the sources (up until delphi 7) and there is no buffering in any of the streams, including file stream.

stuayre

ASKER

thanks developmentguru & ciuly, im going to try the TFileStream

can anyone help me with the tFileStream? I have this code but how do i read the stream into a var so i can do my search replace?

var
FileS: TFilestream;
I: Integer;
Buff: string;
Teststring: String;
begin
FileS := TFilestream.Create('c:\mytextfile.txt',fmOpenRead);
try
for I := 0 to FileS.Size do
begin

//some code here?

end;
finally
FileS.Free;
end;
end;

developmentguru

@ciuly I was quoting from memory and seem to have been mistaken. I can't find any automatic buffering either. Of course it is not a terrible problem to handle the buffering.

@stuayre I am working on an example now.

stuayre

ASKER

ok thanks

developmentguru

I generated a file that I could test the replace on. The file is roughly 4.5 MB I had another file that contained my list of search tags versus replacement values (SearchTag=ReplacementValue on each line). I used this file to call the TFileReplace.AddCriteria. I then called replace and got the expected results in about 26.82 seconds.

The class I made here is FAR from optimized.

1) I started with the idea of using an out buffer, never used it.
2) When creating the strings to compare for start and end tag, tag name, etc. This could also be optimized to not recreate the string several times.

Just a couple of things off the top of my head.

It does show you the use of a file stream with a parsing replacement method.

Let me know if you need more.

unit uFileReplace;
 
interface
uses
  SysUtils, Classes;
 
type
  //replace array of PChar with array of PWideChar for unicode
  TFileReplace = class
  private
    fInBuffer : PChar;
    fOutBuffer : PChar;
    fInFile : TFileStream;
    fOutFile : TFileStream;
    fInFileSector : integer;
    fInFileIndex : integer;
    fOutFileSector : integer;
    fOutFileIndex : integer;
    fSearchList : TStringList;
    fReplaceList : TStringList;
    fStartDelimiter : string;
    fEndDelimiter : string;
    fBufferSize : integer;
    function GetChar(Index: integer): char;
  protected
    procedure SortSearchCriteria;
    property Chars[Index : integer] : char read GetChar;
  public
    constructor Create(InFileName, OutFileName, StartDelim, EndDelim : string;
      BufferSize : integer);
    destructor Destroy; override;
    procedure AddCriteria(SearchFor, ReplaceWith : string);
    procedure Replace;
  end;
 
implementation
var
  SearchList : TStringList;
 
{ TFileReplace }
 
function CompareSearchList(List: TStringList; Index1,
  Index2: Integer): Integer;
var
  A, B : string;
 
begin
  Result := 0;
  A := '';
  B := '';
  if Index1 < SearchList.Count then
    A := SearchList[Index1];
 
  if Index2 < SearchList.Count then
    B := SearchList[Index2];
 
  if (Index1 < SearchList.Count) and (Index2 < SearchList.Count) then
    if A > B then
      Result := 1
    else
      if A < B then
        Result := -1;
end;
 
procedure TFileReplace.AddCriteria(SearchFor, ReplaceWith: string);
begin
  fSearchList.Add(SearchFor);
  fReplaceList.Add(ReplaceWith);
end;
 
constructor TFileReplace.Create(InFileName, OutFileName, StartDelim,
  EndDelim : string; BufferSize: integer);
begin
  fBufferSize := BufferSize;
  GetMem(fInBuffer, BufferSize);
  GetMem(fOutBuffer, BufferSize);
  fInFile := TFileStream.Create(InFileName, fmOpenRead);
  fOutFile := TFileStream.Create(OutFileName, fmCreate);
  fSearchList := TStringList.Create;
  fReplaceList := TStringList.Create;
  fStartDelimiter := StartDelim;
  fEndDelimiter := EndDelim;
  fInFileSector := -1; //points nowhere
end;
 
destructor TFileReplace.Destroy;
begin
  fReplaceList.Free;
  fSearchList.Free;
  fOutFile.Free;
  fInFile.Free;
  FreeMem(fOutBuffer);
  FreeMem(fInBuffer);
  inherited;
end;
 
function TFileReplace.GetChar(Index: integer): char;
var
  TargetSector : integer;
  Ptr : integer;
 
begin
  TargetSector := Index div fBufferSize;
  if TargetSector <> fInFileSector then
    begin
      fInFile.Seek(TargetSector * fBufferSize, soFromBeginning);
      fInFile.Read(fInBuffer^, fBufferSize);
      fInFileSector := TargetSector;
    end;
  Ptr := Index mod fBufferSize;
  Result := Char(PByteArray(fInBuffer)^[Ptr]);
end;
 
procedure TFileReplace.Replace;
var
  Last, SIndex : integer;
  Comp : string;
  I : integer;
  FoundEnd : boolean;
  TagName : string;
  TBuf : PChar;
  Value : string;
  TagIndex : integer;
  ChunkSize : integer;
 
begin
  Assert(fStartDelimiter <> '', 'Start delimiter not set');
  Assert(fEndDelimiter <> '', 'End delimiter not set');
 
  SortSearchCriteria;
  Last := 0;
  repeat
    //find a possible start
    while (fInFileIndex < fInFile.Size) and
      (Chars[fInFileIndex] <> fStartDelimiter[1]) do
      inc(fInFileIndex);
 
    if Chars[fInFileIndex] = fStartDelimiter[1] then
      begin
        //is this the full delimiter?
        Comp := '';
        for I := 0 to Length(fStartDelimiter) - 1 do
          Comp := Comp + Chars[fInFileIndex + I];
 
        if Comp = fStartDelimiter then
          begin
            inc(fInFileIndex, Length(fStartDelimiter));
            //search for end delimiter
            sIndex := fInFileIndex;
            repeat
              //find a possible end
              while (fInFileIndex < fInFile.Size) and
                (Chars[fInFileIndex] <> fStartDelimiter[1]) do
                inc(fInFileIndex);
 
                Comp := '';
                for I := 0 to Length(fStartDelimiter) - 1 do
                  Comp := Comp + Chars[fInFileIndex + I];
 
                FoundEnd := Comp = fEndDelimiter;
            until (fInFileIndex >= fInFile.Size) or FoundEnd;
            if FoundEnd then
              begin
                TagName := '';
                for I := sIndex to fInFileIndex - 1 do
                  TagName := TagName + Chars[I];
 
                //time to do the replace
                //first, copy the file from the original to the new from Last to
                //current
                ChunkSize := SIndex - Last - Length(fStartDelimiter);
                GetMem(TBuf, ChunkSize);
                try
                  fInFile.Seek(Last, soFromBeginning);
                  fInFile.Read(TBuf^, ChunkSize);
                  fOutFile.WriteBuffer(TBuf^, ChunkSize);
                finally
                  FreeMem(TBuf);
                end;
 
                //copy in the replacement
                TagIndex := fSearchList.IndexOf(TagName);
                Value := '';
                if TagIndex > -1 then
                  Value := fReplaceList[TagIndex];
                if Value <> '' then                
                  fOutFile.Write(Value[1], Length(Value));
 
                //advance past the end of the tag
                inc(fInFileIndex, Length(fEndDelimiter));
                Last := fInFileIndex;
              end;
          end;
      end;    
  until fInFileIndex >= fInFile.Size;
end;
 
procedure TFileReplace.SortSearchCriteria;
var
  SL, Values : TstringList;
  I : Integer;
 
begin
  SL := TStringList.Create;
  try
    SL.Assign(fSearchList);
    SL.Sort;
    Values := TStringList.Create;
    try
      for I := 0 to SL.Count - 1 do
        Values.Add(fReplaceList[fSearcHlist.IndexOf(SL[I])]);
 
      fSearchList.Assign(SL);
      fReplaceList.Assign(Values);
    finally
      Values.Free;
    end;
  finally
    SL.Free;
  end;
end;
 
end.

Open in new window

developmentguru

The first comment should not have the word array in it...

developmentguru

Just to put things in perspective, my test was the following.

10,000 randomly generated text lines of varying length from 6 characters to 1025 characters built of random words including the replacement tags. The example file size was 4,760,311 bytes. There were over 157,500 replacements to be done. I did not bother with an exact count, I let the IDE search find all instances of ::: and it found more than 315,000 (two per tag). Processed into a new file with the replacements in 26.82 seconds. Is that still too slow?

stuayre

ASKER

Hey guys sorry for the delay!

first thanks for the massive effort you both put in.

i've had a look at both soultions and WOW rllibby's wins it does it in about 1 second!
amazing, i think Borland should use your functions!

developmentguru yours was good too and i want to thank you for the effort and time you put in to post a solution.

cheers

Stu

developmentguru

I think one of the key differences between the two solutions is that my solution does all replacements in one pass. Using rlibby's solution will need to do a separate pass for every replacement.

developmentguru

I am going to play with rlibby's solution and check it out. My previous statement was based on assumptions and not on watching the code run.

Russell Libby

Actually, the replace is done in memory using a 2 pass approach. First pass performs the matching logic, and maintains the following:

- source index of match
- length of match
- substitution string

This allows the replacement string to be allocated once, and the second pass does the work of moving the source string over to the destination. It is extremely efficient, at the cost of memory. But regardless, it does NOT perform a pass for every replacement match.

Russell

developmentguru

I copied the code out as is and ran it. Very fast, but where exactly is the output?

Russell Libby

Same file...

// Resource protection
try
// Load the test file
regTest.LoadFromFile(FILE_TEST); <--- load test file
// Perform replacement callback
regTest.ReplaceCallback(ReplaceTag); <--- manage replacements
// Save back to original file
regTest.SaveToFile(FILE_TEST); <---- save back to file (see comment above)
finally
// Free regex
regTest.Free;
end;

developmentguru

The only other thing... I thought that the person who wrote the question wanted a streaming solution so memory would not be an issue. I will want to try mine against your sample data and see how fast it is in memory. Sorry if this is annoying to anyone, I am having fun.

Russell Libby

Its a tradeoff; memory / speed....

Imho, if you are dealing with a known size (in this case 4MB), then it is perfectly acceptable to load the data into memory to process. Also for this size, there should be no issues if the dest result is handled in memory before writing to disk.

You could process the source in memory, and generate the results directly to the disk file. But this tends to be slightly slower due to all the small IO writes to disk. Below is code that handles the replacements manually, and depending on the $define statement, will dump the result directly to disk, or a memory stream (fast version) which will then write the final result to disk.

Using the memory stream I was seeing run times around 500ms. File based stream runs in 850ms

Russell

-----

unit Unit1;

interface
////////////////////////////////////////////////////////////////////////////////
//
// Download RegExprEx from:
//
// http://users.adelphia.net/~rllibby/downloads/regexprex.zip
//
////////////////////////////////////////////////////////////////////////////////
uses
Windows, Messages, SysUtils, Classes, Graphics, Controls, Forms, Dialogs,
RegExprEx, StdCtrls;

////////////////////////////////////////////////////////////////////////////////
// Node definitions for hashing elements
////////////////////////////////////////////////////////////////////////////////
type
PHashNode = ^THashNode;
THashNode = packed record
Item: PChar;
Data: Pointer;
Next: PHashNode;
end;

////////////////////////////////////////////////////////////////////////////////
// Hashing constants
////////////////////////////////////////////////////////////////////////////////
const
HASH_SIZE = 12583;

////////////////////////////////////////////////////////////////////////////////
// Callback for optional freeing of the data items
////////////////////////////////////////////////////////////////////////////////
type
THashDeallocator = procedure(Sender: TObject; P: Pointer) of object;

////////////////////////////////////////////////////////////////////////////////
// Hash class definition
////////////////////////////////////////////////////////////////////////////////
type
THash = class(TObject)
private
// Private declarations
FHash: Array [0..Pred(HASH_SIZE)] of PHashNode;
FDeallocator: THashDeallocator;
FCareCase: Boolean;
FCount: Integer;
protected
// Protected declarations
function NewNode(Item: PChar; Data: Pointer): PHashNode;
public
// Public declarations
constructor Create(Deallocator: THashDeallocator = nil);
destructor Destroy; override;
procedure Clear;
function Delete(Item: PChar): Boolean;
function Extract(Item: PChar; out Data: Pointer): Boolean;
function Add(Item: PChar; Data: Pointer): Boolean;
function Find(Item: PChar; out Data: Pointer): Boolean;
property Count: Integer read FCount;
property CaseSensitive: Boolean read FCareCase write FCareCase;
end;

////////////////////////////////////////////////////////////////////////////////
// TFastMemStream
////////////////////////////////////////////////////////////////////////////////
type
TFastMemStream = class(TMemoryStream)
protected
// Protected declarations
function Realloc(var NewCapacity: Longint): Pointer; override;
end;

////////////////////////////////////////////////////////////////////////////////
// Hashing functions
////////////////////////////////////////////////////////////////////////////////
function HashFunc(Key: PChar; CareCase: Boolean): Integer;
function HashCompare(Item1, Item2: PChar; CareCase: Boolean): Boolean;

////////////////////////////////////////////////////////////////////////////////
// TForm1
////////////////////////////////////////////////////////////////////////////////
type
TForm1 = class(TForm)
Button1: TButton;
btnRun: TButton;
procedure FormCreate(Sender: TObject);
procedure FormDestroy(Sender: TObject);
procedure Button1Click(Sender: TObject);
procedure btnRunClick(Sender: TObject);
private
// Private declarations
FTestCodes: TStringList;
FHashCodeQty: THash;
protected
// Protected declarations
procedure GenerateSampleCodes;
procedure GenerateSampleCodeQtyFile;
procedure GenerateSampleTestFile;
procedure GenerateSampleFiles;
public
// Public declarations
end;

////////////////////////////////////////////////////////////////////////////////
// File constants
////////////////////////////////////////////////////////////////////////////////
const
// File names for testing
FILE_TEST = 'testing.txt';
FILE_CODEQTY = 'codeqty.txt';

////////////////////////////////////////////////////////////////////////////////
// RegEx constants
////////////////////////////////////////////////////////////////////////////////
const
// Eg StockItem1 = 103
REG_TAGQTY = '\s*(:<Tag>[^=\s]+)\s*=\s*(:<Value>\d+)';
// Eg %%%StockItem1%%%
REG_TAG = '%%%(:<Tag>!(%%%)+)%%%';

////////////////////////////////////////////////////////////////////////////////
// Global variables
////////////////////////////////////////////////////////////////////////////////
var
Form1: TForm1;

implementation
{$R *.DFM}

//// Example of what you would need to do to perform the replacement logic /////
procedure TForm1.btnRunClick(Sender: TObject);
var strmBase: TStream;
regCodeQty: TRegExpr;
regTest: TRegExpr;
ptrValue: Pointer;
dwBackPos: Integer;
dwSize: Integer;
dwMark: LongWord;
szReplace: String;
lpszSource: PChar;
begin

// Mark time
dwMark:=GetTickCount;

// Resource protection
try
// Clear hash table
FHashCodeQty.Clear;
// Create regex to load hash table
regCodeQty:=TRegExpr.CreatePattern(REG_TAGQTY);
// Resource protection
try
// Load the data from file
regCodeQty.LoadFromFile(FILE_CODEQTY);
// While match
if regCodeQty.MatchFirst then
begin
// Repeat
repeat
// Add hash item
FHashCodeQty.Add(PChar(regCodeQty.NamedBackReference['Tag']), Pointer(StrToInt(regCodeQty.NamedBackReference['Value'])));
// Stop when we fail to match next
until not(regCodeQty.MatchNext);
end;
finally
// Free regex
regCodeQty.Free;
end;

//
// Undef to use file based destination stream
//
{$DEFINE USE_MEM}

// Create regex to load the test file
regTest:=TRegExpr.CreatePattern(REG_TAG);

// Resource protection
try
// Load the test file
regTest.LoadFromFile(FILE_TEST);

// Create stream to dump the results
{$IFDEF USE_MEM}
strmBase:=TFastMemStream.Create;
{$ELSE}
strmBase:=TFileStream.Create(FILE_TEST, fmCreate);
{$ENDIF}

// Resource protection
try
// Get source pointer from regex parser
lpszSource:=regTest.Source;
// Set back marker
dwBackPos:=0;
// Start match
if regTest.MatchFirst then
begin
// Repeat while match
repeat
// Get size of data up to match
dwSize:=regTest.MatchPos - dwBackPos;
// write preceding text over
if (dwSize > 0) then strmBase.Write(lpszSource^, dwSize);
// Push source pointer past the match
Inc(lpszSource, dwSize + regTest.MatchLen);
// Lookup the tag in the hash to get the quantitiy
if FHashCodeQty.Find(PChar(regTest.NamedBackReference['Tag']), ptrValue) then
begin
// Pointer is really numeric value, convert to string
Str(Integer(ptrValue), szReplace);
// Write replacement value to file
strmBase.Write(Pointer(szReplace)^, Length(szReplace));
end;
// Update the back marker
dwBackPos:=regTest.MatchPos + regTest.MatchLen;
// Match next
until not(regTest.MatchNext);
end;
// Write remaining block of text
strmBase.Write(lpszSource^, StrLen(lpszSource));
finally
// Save to file (if not a file stream)
if not(strmBase is THandleStream) then
begin
TCustomMemoryStream(strmBase).SaveToFile(FILE_TEST);
end;
// Free the stream
strmBase.Free;
end;
finally
// Free regex
regTest.Free;
end;

finally
// Set caption
Caption:=IntToStr(GetTickCount - dwMark);
end;

end;

procedure TForm1.GenerateSampleFiles;
begin

// Generate the sample codes
GenerateSampleCodes;

// Generate sample code = qty file
GenerateSampleCodeQtyFile;

// Generate the sample test file
GenerateSampleTestFile;

end;

procedure TForm1.GenerateSampleTestFile;
var listSample: TStringList;
dwIndex: Integer;
begin

// Generate a sample file with the following attributes:
//
// - 50000 total lines
// - 3 replacement tags per line
// - 150000 total replacement tags in the file

// Create list
listSample:=TStringList.Create;

// Resource protection
try
// Walk sample codes
for dwIndex:=1 to 50000 do
begin
// Create sample line
listSample.Add(
StringOfChar('a', Random(100)) + '(%%%' +
FTestCodes[Random(FTestCodes.Count)] + '%%%)' +
StringOfChar('b', Random(100)) + '(%%%' +
FTestCodes[Random(FTestCodes.Count)] + '%%%)' +
StringOfChar('c', Random(100)) + '(%%%' +
FTestCodes[Random(FTestCodes.Count)] + '%%%)' +
StringOfChar('d', Random(100))
);

end;
listSample.Add('zzz(%%%' + FTestCodes[Random(FTestCodes.Count)] + '%%%)Q');
// Save to file
listSample.SaveToFile(FILE_TEST);
finally
// Free the list
listSample.Free;
end;

end;

procedure TForm1.GenerateSampleCodeQtyFile;
var listCodeQty: TStringList;
dwIndex: Integer;
begin

// Create list
listCodeQty:=TStringList.Create;

// Resource protection
try
// Walk sample codes
for dwIndex:=0 to Pred(FTestCodes.Count) do
begin
// Generate random quantity values
listCodeQty.Add(Format('%s = %d', [FTestCodes[dwIndex], Random(100000)]));
end;
// Save to file
listCodeQty.SaveToFile(FILE_CODEQTY);
finally
// Free the list
listCodeQty.Free;
end;

end;

procedure TForm1.GenerateSampleCodes;
var dwIndex: Integer;
begin

// Lock list
FTestCodes.BeginUpdate;

// Resource protection
try
// Clear list
FTestCodes.Clear;
// Create set of stock codes
for dwIndex:=1 to 10000 do
begin
// Add sample stock code
FTestCodes.Add(Format('stock_DJB%d', [dwIndex]));
end;
finally
// Unlock list
FTestCodes.EndUpdate;
end;

end;

procedure TForm1.FormCreate(Sender: TObject);
begin

// Create string list for codes
FTestCodes:=TStringList.Create;

// Create hash for loading codes and quantities
FHashCodeQty:=THash.Create;

end;

procedure TForm1.FormDestroy(Sender: TObject);
begin

// Free objects
FTestCodes.Free;
FHashCodeQty.Free;

end;

//// Generate sample data //////////////////////////////////////////////////////
procedure TForm1.Button1Click(Sender: TObject);
var dwMark: LongWord;
begin

// Mark time
dwMark:=GetTickCount;

// Resource protection
try
// Generate the sample data files
GenerateSampleFiles;
finally
// Set caption
Caption:=IntToStr(GetTickCount - dwMark);
end;

end;

//// TFastMemStream ////////////////////////////////////////////////////////////
function TFastMemStream.Realloc(var NewCapacity: Longint): Pointer;
var dwDelta: Integer;
lpMemory: Pointer;
begin

// Get current memory pointer
lpMemory:=Memory;

// Resource protection
try
// Calculate the delta to be applied to the capacity
if (NewCapacity > 0) then
begin
// Check new capacity
if (NewCapacity > MaxWord) then
// Delta is 1/4 of desired capacity
dwDelta:=NewCapacity div 4
else
// Minimum allocation of 64 KB
dwDelta:=MaxWord;
// Update by delta
Inc(NewCapacity, dwDelta);
end;
// Determine if capacity has changed
if not(NewCapacity = Capacity) then
begin
// Check for nil alloc
if (NewCapacity = 0) then
begin
// Release the memory
FreeMem(lpMemory);
// Clear result
lpMemory:=nil;
end
else
begin
// Check current capacity
if (Capacity = 0) then
// Allocate memory
lpMemory:=AllocMem(NewCapacity)
else
// Reallocate memory
ReallocMem(lpMemory, NewCapacity);
end;
end;
finally
// Return modified pointer
result:=lpMemory;
end;

end;

//// Hashing functions /////////////////////////////////////////////////////////
function HashCompare(Item1, Item2: PChar; CareCase: Boolean): Boolean;
begin

// Null assignment checks
if (Item1 = nil) then
// Check for Item2 being nil
result:=(Item2 = nil)
else if (Item2 = nil) then
// Item1 is not null, so no possible match
result:=False
// Check case
else if CareCase then
// Case sensitive compare
result:=(StrComp(Item1, Item2) = 0)
else
// In-case sensitive compare
result:=(StrIComp(Item1, Item2) = 0);

end;

function HashFunc(Key: PChar; CareCase: Boolean): Integer;
var bChar: Byte;
begin

// Set starting result
result:=0;

// Generate hash index for key
while (Key^ > #0) do
begin
if CareCase then
bChar:=Byte(Key^)
else if (Key^ in ['A'..'Z']) then
bChar:=Byte(Key^) + 32
else
bChar:=Byte(Key^);
Inc(result, (result shl 5) + (bChar shl 3));
Inc(Key);
end;

// Keep result in bounds of array
result:=LongWord(result) mod HASH_SIZE;

end;

//// THash /////////////////////////////////////////////////////////////////////
constructor THash.Create(Deallocator: THashDeallocator = nil);
begin

// Perform inherited
inherited Create;

// Set default values
ZeroMemory(@FHash, SizeOf(FHash));
FDeallocator:=Deallocator;
FCareCase:=False;
FCount:=0;

end;

destructor THash.Destroy;
begin

// Resource protection
try
// Clear the hash
Clear;
finally
// Perform inherited
inherited Destroy;
end;

end;

procedure THash.Clear;
var phNode1: PHashNode;
phNode2: PHashNode;
dwIndex: Integer;
begin

// Resource protection
try
// Iterate the array and clear the hash nodes
for dwIndex:=0 to Pred(HASH_SIZE) do
begin
// Get bucket node
phNode1:=FHash[dwIndex];
// Walk the nodes
while Assigned(phNode1) do
begin
// Get pointer to next item
phNode2:=phNode1^.Next;
// Free memory for node name
if Assigned(phNode1^.Item) then FreeMem(phNode1^.Item);
// Callback
if (Assigned(phNode1^.Data) and Assigned(FDeallocator)) then FDeallocator(Self, phNode1^.Data);
// Free node item
FreeMem(phNode1);
// Set iterator to next item
phNode1:=phNode2;
end;
end;
finally
// Clear all top level buckets
for dwIndex:=0 to Pred(HASH_SIZE) do FHash[dwIndex]:=nil;
// Reset the count
FCount:=0;
end;

end;

function THash.Extract(Item: PChar; out Data: Pointer): Boolean;
var phNode1: PHashNode;
phNode2: PHashNode;
dwIndex: Integer;
begin

// Get the hash index
dwIndex:=HashFunc(Item, FCareCase);

// Get the hash bucket item
phNode1:=FHash[dwIndex];

// Did top level item exist?
if Assigned(phNode1) then
begin
// Prepare for loop
phNode2:=phNode1;
// Walk the nodes
while Assigned(phNode2) do
begin
// Match key
if HashCompare(phNode2^.Item, Item, FCareCase) then break;
// Save current node
phNode1:=phNode2;
// Move to the next node in the chain
phNode2:=phNode2^.Next;
end;
// Check to see if the node is still set
if Assigned(phNode2) then
begin
// Set out param data value
Data:=phNode2^.Data;
// Check to see if this is the top level item
if (phNode2 = phNode1) then
// Link next node into the bucket
FHash[dwIndex]:=phNode2^.Next
else
// Link over this node
phNode1^.Next:=phNode2^.Next;
// Free memory for node name
if Assigned(phNode2^.Item) then FreeMem(phNode2^.Item);
// Free the node
FreeMem(phNode2);
// Decrement the node count
Dec(FCount);
// Success
result:=True;
end
else
// Did not find the node
result:=False;
end
else
// No nodes in bucket
result:=False;

end;

function THash.Delete(Item: PChar): Boolean;
var lpData: Pointer;
begin

// Extract the item
result:=Extract(Item, lpData);

// Check result, perform callback if needed
if (result and Assigned(lpData) and Assigned(FDeallocator)) then FDeallocator(Self, lpData);

end;

function THash.Add(Item: PChar; Data: Pointer): Boolean;
var phNode1: PHashNode;
phNode2: PHashNode;
dwIndex: Integer;
begin

// Get the hash bucket item index
dwIndex:=HashFunc(Item, FCareCase);

// Resource protection
try
// Get the hash bucket item
phNode1:=FHash[dwIndex];
// Is the bucket empty
if (phNode1 = nil) then
// Add new cell item
FHash[dwIndex]:=NewNode(Item, Data)
else
begin
// Save current node
phNode2:=phNode1;
// Walk nodes
while Assigned(phNode2) do
begin
// Match the key
if HashCompare(phNode2^.Item, Item, FCareCase) then
begin
// Check for data change
if (phNode2^.Data <> Data) then
begin
// Callback
if (Assigned(phNode2^.Data) and Assigned(FDeallocator)) then FDeallocator(Self, phNode2^.Data);
// Set new data item
phNode2^.Data:=Data;
end;
// Break loop
break;
end;
// Save current node
phNode1:=phNode2;
// Walk next node
phNode2:=phNode2^.Next;
end;
// Do we need to add a new item to the end of the chain?
if (phNode2 = nil) then
begin
// Create hash node
phNode2:=NewNode(Item, Data);
// Link the node in
phNode1^.Next:=phNode2;
end;
end;
finally
// Always success
result:=True;
end;

end;

function THash.Find(Item: PChar; out Data: Pointer): Boolean;
var phNode: PHashNode;
begin

// Get the hash bucket item
phNode:=FHash[HashFunc(Item, FCareCase)];

// Resource protection
try
// Walk the items
while Assigned(phNode) do
begin
// Compare the key
if HashCompare(phNode^.Item, Item, FCareCase) then
begin
// Key exists, set out return data
Data:=phNode^.Data;
break;
end;
// Walk the next item
phNode:=phNode^.Next;
end;
finally
// Success if node is assigned
result:=Assigned(phNode);
end;

end;

function THash.NewNode(Item: PChar; Data: Pointer): PHashNode;
begin

// Allocate memory for new node
result:=AllocMem(SizeOf(THashNode));

// Set the structure fields
result^.Item:=StrCopy(AllocMem(Succ(StrLen(Item))), Item);
result^.Data:=Data;

// Increment the count
Inc(FCount);

end;

initialization

// Seed random number generator
Randomize;

end.

-- dfm ---
object Form1: TForm1
Left = 351
Top = 319
Width = 453
Height = 177
Caption = 'Form1'
Color = clBtnFace
Font.Charset = ANSI_CHARSET
Font.Color = clWindowText
Font.Height = -11
Font.Name = 'Tahoma'
Font.Style = []
OldCreateOrder = False
OnCreate = FormCreate
OnDestroy = FormDestroy
PixelsPerInch = 96
TextHeight = 13
object Button1: TButton
Left = 12
Top = 12
Width = 165
Height = 25
Caption = 'Generate Sample Data'
TabOrder = 0
OnClick = Button1Click
end
object btnRun: TButton
Left = 12
Top = 44
Width = 165
Height = 25
Caption = 'Run Test'
TabOrder = 1
OnClick = btnRunClick
end
end

stuayre

ASKER

Russell is there a way to mod the code so if the qty doesn't exist it will replace it with 0?

I can't work out how your code works :)

cheers

Stu

Russell Libby

Stu,

Yes, its very easy. All you need to do is check the hash find; if it fails, set ptrValue to nil (which will convert to zero when cast as int).

Eg (first example I gave)
----------------------------------
procedure TForm1.ReplaceTag(Sender: TObject; var Replace: String);
var ptrValue: Pointer;
begin

// Lookup the tag in the hash to get the quantitiy
if not(FHashCodeQty.Find(PChar(TRegExpr(Sender).NamedBackReference['Tag']), ptrValue)) then ptrValue:=nil;

// Pointer is really numeric value, set replacement text
Str(Integer(ptrValue), Replace);

end;

Eg (second example I gave, mod the code in TForm1.btnRunClick)
----------------------------------

// Lookup the tag in the hash to get the quantitiy
if not(FHashCodeQty.Find(PChar(regTest.NamedBackReference['Tag']), ptrValue)) then ptrValue:=nil;

// Pointer is really numeric value, convert to string
Str(Integer(ptrValue), szReplace);

// Write replacement value to file
strmBase.Write(Pointer(szReplace)^, Length(szReplace));

developmentguru

In case you were interested I was able to significantly increase the speed of what I had previously done by doing a couple of optimizations. The process should be far easier to understand although it is still not quite as fast (the hashing algorithm is roughly2-3 times as fast). I timed the StrPos routine that comes with Delphi and it was taking 15ms or more per search on my machine which is ridiculous. I made my own and it routinely tests as 0ms. Before anyone comments, I know such routines exist. It took me less time to write than to look through the available libraries and pick one (let alone test them all). I used the same data generator as rlibby created with one important difference. I changed the GenerateSampleCodes to generate them this way:

procedure TForm1.GenerateSampleCodes;
var dwIndex: Integer;
begin

// Lock list
FTestCodes.BeginUpdate;

// Resource protection
try
// Clear list
FTestCodes.Clear;
// Create set of stock codes
for dwIndex:=1 to 10000 do
begin
// Add sample stock code
FTestCodes.Add(Format('stock_DJB%.5d', [dwIndex]));
end;
finally
// Unlock list
FTestCodes.EndUpdate;
end;

end;

This allowed me to presort them and do a simple binary search for the name within a regular string list.
I plan to work on it more as time permits. Let me know if you have questions.

unit uFileReplace;
 
interface
uses
  Windows, SysUtils, Classes;
 
type
  //replace array of PChar with array of PWideChar for unicode
  TFileReplace = class
  private
    fInFile : TMemoryStream;
    fOutFile : TMemoryStream;
    fStartDelimiter : string;
    fEndDelimiter : string;
    fSearchList : TStringList;
    fOutFileName : string;
  protected
  public
    constructor Create(InFileName, OutFileName, ReplaceFileName, StartDelim,
      EndDelim : string);
    destructor Destroy; override;
    procedure Replace;
    function PCFind(Source, Value : pchar) : PChar;
    function IndexOfName(Name : string) : integer;
  end;
 
implementation
{ TFileReplace }
 
constructor TFileReplace.Create(InFileName, OutFileName, ReplaceFileName,
  StartDelim, EndDelim : string);
var
  FS : TFileStream;
 
begin
  FS := TFileStream.Create(InFileName, fmOpenRead);
  try
    fInFile := TMemoryStream.Create;
    fInFile.LoadFromStream(FS);
  finally
    FS.Free;
  end;
 
  fSearchList := TStringList.Create;
  fSearchList.LoadFromFile(ReplaceFileName);
 
  fOutFile := TMemoryStream.Create;
  fOutFile.SetSize(fInFile.Size * 2);
  fStartDelimiter := StartDelim;
  fEndDelimiter := EndDelim;
  fOutFileName := OutFileName;
end;
 
destructor TFileReplace.Destroy;
begin
  fOutFile.Free;
  fSearchList.Free;
  fInFile.Free;
  inherited;
end;
 
function TFileReplace.IndexOfName(Name: string): integer;
var
  I, L, H, G : integer;
  GName : string;
 
begin
  Result := -1;
  L := 0;
  H := fSearchList.Count - 1;
  repeat
    G := ((H - L) shr 1) + L;
    GName := fSearchList.Names[G];
    if Name > GName then
      L := G
    else
      if Name < GName then
        H := G;
  until (H - L < 3) or (Name = GName);
 
  if Name = GName then
    begin
      Result := G;
      exit;
    end;
 
  I := L;
  while (Name <> fSearchList.Names[I]) and (I < H) do
    inc(I);
  if Name = fSearchList.Names[I] then
    Result := I;
end;
 
function TFileReplace.PCFind(Source, Value: pchar): PChar;
begin
  Result := nil;
 
  if not Assigned(Source) then
    exit;
 
  if not Assigned(Value) then
    exit;
 
  try
    if Source[0] = #0 then
      exit;
  except
 
  end;
 
  if Value[0] = #0 then
    exit;
 
  asm
    //ecx Value
    //edx Source
      Mov ESI, Source
 
    @ResetValue:
      Mov EDI, Value; //set the pointer to the value
 
    @Loop:
      Mov BL, [ESI] //read in a byte of the source
      Cmp BL, [EDI] //compare to what is being searched for
      JE @TestFullValue //if found then continue
 
    @NextChar:
      inc ESI       //otherwise increment the source pointer
      CMP [ESI], 0
      JE @NotFound
      JMP @Loop     //and search further
 
    @TestFullValue:
      //this is where we need to see if we found the entire value
      Mov EDX, ESI
 
    @LoopTestFullValue:
      inc ESI
      inc EDI
 
      Mov BL, [ESI]
      Cmp BL, [EDI]
      JE  @LoopTestFullValue
 
      Mov BL, [EDI]
      Cmp BL, 0
      JE @Found
      JMP @ResetValue
 
    @Found:
      mov Result, EDX  //set the result
 
    @NotFound:
  end;
end;
 
procedure TFileReplace.Replace;
var
  Bookmark, SIndex, EIndex : PChar;
  LInt : integer absolute Bookmark;
  TagName : string;
  Value : string;
  TagIndex : integer;
  ChunkSize : integer;
  P : integer;
 
begin
  Assert(fStartDelimiter <> '', 'Start delimiter not set');
  Assert(fEndDelimiter <> '', 'End delimiter not set');
 
  Bookmark := fInFile.Memory;
  repeat
//    SIndex := StrPos(Bookmark, PChar(fStartDelimiter));
    SIndex := PCFind(Bookmark, PChar(fStartDelimiter));
 
    if Assigned(SIndex) then
      begin
        //figure out how much can be copied RAW
        ChunkSize := integer(SIndex) - integer(Bookmark);
 
        //copy it
        if ChunkSize > 0 then
          fOutFile.WriteBuffer(Bookmark^, ChunkSize);
 
        //skip the delimiter
        inc(SIndex, Length(fStartDelimiter));
 
        //find the ending delimiter
    //    EIndex := StrPos(SIndex, PChar(fEndDelimiter));
 
        EIndex := PCFind(SIndex, PChar(fEndDelimiter));
 
        //up to 100% to this point
 
        //figure out how big the tag is
        ChunkSize := integer(EIndex) - integer(SIndex);
 
        //set the tagname
        SetLength(TagName, ChunkSize);
        StrLCopy(@TagName[1], SIndex, ChunkSize);
 
        //move past the end delimiter
        inc(EIndex, Length(fEndDelimiter));
 
        //update Bookmark
        Bookmark := EIndex;
 
        //up to 100% to this point
 
        //replace the tag
        TagIndex := IndexOfName(TagName);
        Value := '';
        if TagIndex > -1 then
          begin
            Value := fSearchList[TagIndex];
            P := Pos('=', Value);
            Value := Copy(Value, P + 1, Length(Value));
          end;
        if Value <> '' then
          fOutFile.Write(Value[1], Length(Value));
      end;
 
  until not Assigned(SIndex);
 
  fOutFile.Seek(0, soFromBeginning);
  fOutFile.SaveToFile(fOutFileName);
end;
 
end.

Open in new window

Russell Libby

Any chance that this question will get closed?

Russell

stuayre

ASKER

Sorry!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Russle you're a genius! thanks for all your help :)

kind regards

Stu

stuayre

ASKER

p.s thanks to developmentguru too :o)

Russell Libby

Stu,
No problem. Figured I would throw a reminder out there (its easy to lose track sometimes)

Russell