Link to home
Start Free TrialLog in
Avatar of stuayre
stuayre

asked on

need a fast search and replace in a very large file (4mb)

Hi,

I have a very large 4mb text file with stock place holders like this %%%stock_DJB400%%%

and then i have a large list of stock codes and qtys (7000+) in a csv file that i need to replace in the large txt file.

I wrote an application to load both the files into a string and then search and replace in the other but it is really sssllloww.

how can i make things faster?

can anyone help?

cheers

Stu
Avatar of 2266180
2266180
Flag of United States of America image

string manipulaton is verys low because each time you modify a string  memory is reallocated. when you add to the string it is the most consuming.

I suggest rewriting your code to do something like this:
- read from file a small buffer (of type string, for being easy. since it's small, the speed is not a big problem. lets say the minimum size of a buffer is 1024 characters)
- if the buffer does not contains enough data for one operation (like in your case replace a stockholder, or whatever), read some more into the buffer
- if you reached end of file, break (no sense in reading nothing forever :P)
- do your operation (reaplce, whatever)
- write the modified data (not entire buffer) to file (not same file, anotehr one whichyou create empty)

the idea is this: you basically copy the file, but while doing so, you also operate the modifications you want ;)

if you need to keep going through that 7000+ whatevr file, then you should definetly:
- either load it in memory (if it's not too big)
- import it into some database-like system (for example access, or sql lite, or whatever) so that you can do sql selects, which will be faster then you liniarly searching for stuff.

now, in case you just need to take the first element from the 7000+ file and reaplce every occurance in the 4 MB file, then you don't do the above loading/importing, but instead read it one record at a time and :

- load the entire 4 mb file in memory
- while loading it, save into a list the index in the file of the data that needs to be replaced, and what it is
- after this is done, in another list you will save the actual modified data.

example.
at index 10 you need to change A
at index 15 you need to change B
at index 23 you need to change A
in the list you have ((10, A), (15, B), (23, A)) (use a TList and a record for example. or use 2 lists, one for integers and one for strings. which ever is easier for you.
then you load the 7000+ file and see that
A needs to be changed into Y. so you iterate through the list (or the string list) and change A to Y). same for B.
then you save the file like this:

repeat
  copy from original file to new file every byte until the next index from the list
  save into new file the modified value
until end of file

I don't have time right now to make a complete demo, but if you are having trouble with some specific point, just ask.
Does you need application or just combine these 2 files?
Try to open txt file in MSExcel2007 which limitation is 1mln rows and use VLOOKUP to combine these 2 files
Avatar of stuayre
stuayre

ASKER

ciuly: thanks, I will try your suggestions, I had the large txt file loaded in a memo and was reading each line out and doing the replace but it was still slow.

veaceslavz: I could do that but I need to write something programmaticly, it needs to sit in the background on my pc and do the job repeatedly.

memo is visual it eats up unnecessary resources. use a TStringList instead which is non-visual. if you want to modify directly in the stringlist, you will aoso have to do soemthing like this:

list.beginupdate;
try
  operate many operations on the list
finally
  list.endupdate;
end;
 I would recommend using a TFileStream to open and read your file.  While reading from the stream make your changes saving to another TFileStream.  I would use parsing to locate your replacement then instead of writing it out as is write out the replacement value.

  A file stream takes care of buffering a set amount of the file read in memory so it is fast.  This also allows you to do a parsing technique without worrying about the speed impact.

  If you are in control of the replacement strings I would make a change there as well.  Instead of using what you showed %%%stock_DJB400%%% use something that better indicates the start and stop of the replacement area such as XML <TagA></TagA> or just <<<stock_DJB400>>>  The indicators then show begin and end.  Just a suggestion.

  If you want to give me an example replacement file and an example file that contains the replacements to be done, I will do an example for you.
>>   A file stream takes care of buffering a set amount of the file read in memory so it is fast.

do you have anything to back this up? because I looked at the sources (up until delphi 7) and there is no buffering in any of the streams, including file stream.
Avatar of stuayre

ASKER

thanks developmentguru & ciuly, im going to try the TFileStream

can anyone help me with the tFileStream? I have this code but how do i read the stream into a var so i can do my search replace?

var
FileS: TFilestream;
I: Integer;
Buff: string;
Teststring: String;
begin
FileS := TFilestream.Create('c:\mytextfile.txt',fmOpenRead);
try
  for I := 0 to FileS.Size do
    begin

     //some code here?

    end;
finally
  FileS.Free;
  end;
end;
@ciuly I was quoting from memory and seem to have been mistaken.  I can't find any automatic buffering either.  Of course it is not a terrible problem to handle the buffering.

@stuayre I am working on an example now.
Avatar of stuayre

ASKER

ok thanks
I generated a file that I could test the replace on.  The file is roughly 4.5 MB I had another file that contained my list of search tags versus replacement values (SearchTag=ReplacementValue on each line).  I used this file to call the TFileReplace.AddCriteria.  I then called replace and got the expected results in about 26.82 seconds.

The class I made here is FAR from optimized.  

1) I started with the idea of using an out buffer, never used it.
2) When creating the strings to compare for start and end tag, tag name, etc.  This could also be optimized to not recreate the string several times.

Just a couple of things off the top of my head.

It does show you the use of a file stream with a parsing replacement method.

Let me know if you need more.
unit uFileReplace;
 
interface
uses
  SysUtils, Classes;
 
type
  //replace array of PChar with array of PWideChar for unicode
  TFileReplace = class
  private
    fInBuffer : PChar;
    fOutBuffer : PChar;
    fInFile : TFileStream;
    fOutFile : TFileStream;
    fInFileSector : integer;
    fInFileIndex : integer;
    fOutFileSector : integer;
    fOutFileIndex : integer;
    fSearchList : TStringList;
    fReplaceList : TStringList;
    fStartDelimiter : string;
    fEndDelimiter : string;
    fBufferSize : integer;
    function GetChar(Index: integer): char;
  protected
    procedure SortSearchCriteria;
    property Chars[Index : integer] : char read GetChar;
  public
    constructor Create(InFileName, OutFileName, StartDelim, EndDelim : string;
      BufferSize : integer);
    destructor Destroy; override;
    procedure AddCriteria(SearchFor, ReplaceWith : string);
    procedure Replace;
  end;
 
implementation
var
  SearchList : TStringList;
 
{ TFileReplace }
 
function CompareSearchList(List: TStringList; Index1,
  Index2: Integer): Integer;
var
  A, B : string;
 
begin
  Result := 0;
  A := '';
  B := '';
  if Index1 < SearchList.Count then
    A := SearchList[Index1];
 
  if Index2 < SearchList.Count then
    B := SearchList[Index2];
 
  if (Index1 < SearchList.Count) and (Index2 < SearchList.Count) then
    if A > B then
      Result := 1
    else
      if A < B then
        Result := -1;
end;
 
procedure TFileReplace.AddCriteria(SearchFor, ReplaceWith: string);
begin
  fSearchList.Add(SearchFor);
  fReplaceList.Add(ReplaceWith);
end;
 
constructor TFileReplace.Create(InFileName, OutFileName, StartDelim,
  EndDelim : string; BufferSize: integer);
begin
  fBufferSize := BufferSize;
  GetMem(fInBuffer, BufferSize);
  GetMem(fOutBuffer, BufferSize);
  fInFile := TFileStream.Create(InFileName, fmOpenRead);
  fOutFile := TFileStream.Create(OutFileName, fmCreate);
  fSearchList := TStringList.Create;
  fReplaceList := TStringList.Create;
  fStartDelimiter := StartDelim;
  fEndDelimiter := EndDelim;
  fInFileSector := -1; //points nowhere
end;
 
destructor TFileReplace.Destroy;
begin
  fReplaceList.Free;
  fSearchList.Free;
  fOutFile.Free;
  fInFile.Free;
  FreeMem(fOutBuffer);
  FreeMem(fInBuffer);
  inherited;
end;
 
function TFileReplace.GetChar(Index: integer): char;
var
  TargetSector : integer;
  Ptr : integer;
 
begin
  TargetSector := Index div fBufferSize;
  if TargetSector <> fInFileSector then
    begin
      fInFile.Seek(TargetSector * fBufferSize, soFromBeginning);
      fInFile.Read(fInBuffer^, fBufferSize);
      fInFileSector := TargetSector;
    end;
  Ptr := Index mod fBufferSize;
  Result := Char(PByteArray(fInBuffer)^[Ptr]);
end;
 
procedure TFileReplace.Replace;
var
  Last, SIndex : integer;
  Comp : string;
  I : integer;
  FoundEnd : boolean;
  TagName : string;
  TBuf : PChar;
  Value : string;
  TagIndex : integer;
  ChunkSize : integer;
 
begin
  Assert(fStartDelimiter <> '', 'Start delimiter not set');
  Assert(fEndDelimiter <> '', 'End delimiter not set');
 
  SortSearchCriteria;
  Last := 0;
  repeat
    //find a possible start
    while (fInFileIndex < fInFile.Size) and
      (Chars[fInFileIndex] <> fStartDelimiter[1]) do
      inc(fInFileIndex);
 
    if Chars[fInFileIndex] = fStartDelimiter[1] then
      begin
        //is this the full delimiter?
        Comp := '';
        for I := 0 to Length(fStartDelimiter) - 1 do
          Comp := Comp + Chars[fInFileIndex + I];
 
        if Comp = fStartDelimiter then
          begin
            inc(fInFileIndex, Length(fStartDelimiter));
            //search for end delimiter
            sIndex := fInFileIndex;
            repeat
              //find a possible end
              while (fInFileIndex < fInFile.Size) and
                (Chars[fInFileIndex] <> fStartDelimiter[1]) do
                inc(fInFileIndex);
 
                Comp := '';
                for I := 0 to Length(fStartDelimiter) - 1 do
                  Comp := Comp + Chars[fInFileIndex + I];
 
                FoundEnd := Comp = fEndDelimiter;
            until (fInFileIndex >= fInFile.Size) or FoundEnd;
            if FoundEnd then
              begin
                TagName := '';
                for I := sIndex to fInFileIndex - 1 do
                  TagName := TagName + Chars[I];
 
                //time to do the replace
                //first, copy the file from the original to the new from Last to
                //current
                ChunkSize := SIndex - Last - Length(fStartDelimiter);
                GetMem(TBuf, ChunkSize);
                try
                  fInFile.Seek(Last, soFromBeginning);
                  fInFile.Read(TBuf^, ChunkSize);
                  fOutFile.WriteBuffer(TBuf^, ChunkSize);
                finally
                  FreeMem(TBuf);
                end;
 
                //copy in the replacement
                TagIndex := fSearchList.IndexOf(TagName);
                Value := '';
                if TagIndex > -1 then
                  Value := fReplaceList[TagIndex];
                if Value <> '' then                
                  fOutFile.Write(Value[1], Length(Value));
 
                //advance past the end of the tag
                inc(fInFileIndex, Length(fEndDelimiter));
                Last := fInFileIndex;
              end;
          end;
      end;    
  until fInFileIndex >= fInFile.Size;
end;
 
procedure TFileReplace.SortSearchCriteria;
var
  SL, Values : TstringList;
  I : Integer;
 
begin
  SL := TStringList.Create;
  try
    SL.Assign(fSearchList);
    SL.Sort;
    Values := TStringList.Create;
    try
      for I := 0 to SL.Count - 1 do
        Values.Add(fReplaceList[fSearcHlist.IndexOf(SL[I])]);
 
      fSearchList.Assign(SL);
      fReplaceList.Assign(Values);
    finally
      Values.Free;
    end;
  finally
    SL.Free;
  end;
end;
 
end.

Open in new window

The first comment should not have the word array in it...
Just to put things in perspective, my test was the following.

10,000 randomly generated text lines of varying length from 6 characters to 1025 characters built of random words including the replacement tags.  The example file size was 4,760,311 bytes.  There were over 157,500 replacements to be done.  I did not bother with an exact count, I let the IDE search find all instances of ::: and it found more than 315,000 (two per tag).  Processed into a new file with the replacements in 26.82 seconds.  Is that still too slow?
ASKER CERTIFIED SOLUTION
Avatar of Russell Libby
Russell Libby
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of stuayre

ASKER

Hey guys sorry for the delay!

first thanks for the massive effort you both put in.

i've had a look at both soultions and WOW rllibby's wins it does it in about 1 second!
amazing, i think Borland should use your functions!

developmentguru yours was good too and i want to thank you for the effort and time you put in to post a solution.

cheers

Stu
I think one of the key differences between the two solutions is that my solution does all replacements in one pass.  Using rlibby's solution will need to do a separate pass for every replacement.
I am going to play with rlibby's solution and check it out.  My previous statement was based on assumptions and not on watching the code run.
Actually, the replace is done in memory using a 2 pass approach. First pass performs the matching logic, and maintains the following:

- source index of match
- length of match
- substitution string

This allows the replacement string to be allocated once, and the second pass does the work of moving the source string over to the destination. It is extremely efficient, at the cost of memory. But regardless, it does NOT perform a pass for every replacement match.

Russell
I copied the code out as is and ran it.  Very fast, but where exactly is the output?
Same file...

     // Resource protection
     try
        // Load the  test file
        regTest.LoadFromFile(FILE_TEST); <--- load test file
        // Perform replacement callback
        regTest.ReplaceCallback(ReplaceTag); <--- manage replacements
        // Save back to original file
        regTest.SaveToFile(FILE_TEST); <---- save back to file (see comment above)
     finally
        // Free regex
        regTest.Free;
     end;
The only other thing... I thought that the person who wrote the question wanted a streaming solution so memory would not be an issue.  I will want to try mine against your sample data and see how fast it is in memory.  Sorry if this is annoying to anyone, I am having fun.
Its a tradeoff; memory / speed....

Imho, if you are dealing with a known size (in this case 4MB), then it is perfectly acceptable to load the data into memory to process. Also for this size, there should be no issues if the dest result is handled in memory before writing to disk.

You could process the source in memory, and generate the results directly to the disk file. But this tends to be slightly slower due to all the small IO writes to disk. Below is code that handles the replacements manually, and depending on the $define statement, will dump the result directly to disk, or a memory stream (fast version) which will then write the final result to disk.

Using the memory stream I was seeing run times around 500ms. File based stream runs in 850ms

Russell

-----

unit Unit1;

interface
////////////////////////////////////////////////////////////////////////////////
//
//   Download RegExprEx from:
//
//      http://users.adelphia.net/~rllibby/downloads/regexprex.zip
//
////////////////////////////////////////////////////////////////////////////////
uses
  Windows, Messages, SysUtils, Classes, Graphics, Controls, Forms, Dialogs,
  RegExprEx, StdCtrls;

////////////////////////////////////////////////////////////////////////////////
//   Node definitions for hashing elements
////////////////////////////////////////////////////////////////////////////////
type
  PHashNode         =  ^THashNode;
  THashNode         =  packed record
     Item:          PChar;
     Data:          Pointer;
     Next:          PHashNode;
  end;

////////////////////////////////////////////////////////////////////////////////
//   Hashing constants
////////////////////////////////////////////////////////////////////////////////
const
  HASH_SIZE         =  12583;

////////////////////////////////////////////////////////////////////////////////
//   Callback for optional freeing of the data items
////////////////////////////////////////////////////////////////////////////////
type
  THashDeallocator  =  procedure(Sender: TObject; P: Pointer) of object;

////////////////////////////////////////////////////////////////////////////////
//   Hash class definition
////////////////////////////////////////////////////////////////////////////////
type
  THash             =  class(TObject)
  private
     // Private declarations
     FHash:         Array [0..Pred(HASH_SIZE)] of PHashNode;
     FDeallocator:  THashDeallocator;
     FCareCase:     Boolean;
     FCount:        Integer;
  protected
     // Protected declarations
     function       NewNode(Item: PChar; Data: Pointer): PHashNode;
  public
     // Public declarations
     constructor    Create(Deallocator: THashDeallocator = nil);
     destructor     Destroy; override;
     procedure      Clear;
     function       Delete(Item: PChar): Boolean;
     function       Extract(Item: PChar; out Data: Pointer): Boolean;
     function       Add(Item: PChar; Data: Pointer): Boolean;
     function       Find(Item: PChar; out Data: Pointer): Boolean;
     property       Count: Integer read FCount;
     property       CaseSensitive: Boolean read FCareCase write FCareCase;
  end;

////////////////////////////////////////////////////////////////////////////////
//   TFastMemStream
////////////////////////////////////////////////////////////////////////////////
type
  TFastMemStream    =  class(TMemoryStream)
  protected
     // Protected declarations
     function       Realloc(var NewCapacity: Longint): Pointer; override;
  end;

////////////////////////////////////////////////////////////////////////////////
//   Hashing functions
////////////////////////////////////////////////////////////////////////////////
function   HashFunc(Key: PChar; CareCase: Boolean): Integer;
function   HashCompare(Item1, Item2: PChar; CareCase: Boolean): Boolean;

////////////////////////////////////////////////////////////////////////////////
//  TForm1
////////////////////////////////////////////////////////////////////////////////
type
  TForm1            = class(TForm)
     Button1:       TButton;
     btnRun:        TButton;
     procedure      FormCreate(Sender: TObject);
     procedure      FormDestroy(Sender: TObject);
     procedure      Button1Click(Sender: TObject);
     procedure      btnRunClick(Sender: TObject);
  private
     // Private declarations
     FTestCodes:    TStringList;
     FHashCodeQty:  THash;
  protected
     // Protected declarations
     procedure      GenerateSampleCodes;
     procedure      GenerateSampleCodeQtyFile;
     procedure      GenerateSampleTestFile;
     procedure      GenerateSampleFiles;
  public
     // Public declarations
  end;

////////////////////////////////////////////////////////////////////////////////
//   File constants
////////////////////////////////////////////////////////////////////////////////
const
  // File names for testing
  FILE_TEST         =  'testing.txt';
  FILE_CODEQTY      =  'codeqty.txt';

////////////////////////////////////////////////////////////////////////////////
//   RegEx constants
////////////////////////////////////////////////////////////////////////////////
const
  // Eg StockItem1 = 103
  REG_TAGQTY        =  '\s*(:<Tag>[^=\s]+)\s*=\s*(:<Value>\d+)';
  // Eg %%%StockItem1%%%
  REG_TAG           =  '%%%(:<Tag>!(%%%)+)%%%';

////////////////////////////////////////////////////////////////////////////////
//   Global variables
////////////////////////////////////////////////////////////////////////////////
var
  Form1:            TForm1;

implementation
{$R *.DFM}

//// Example of what you would need to do to perform the replacement logic /////
procedure TForm1.btnRunClick(Sender: TObject);
var  strmBase:      TStream;
     regCodeQty:    TRegExpr;
     regTest:       TRegExpr;
     ptrValue:      Pointer;
     dwBackPos:     Integer;
     dwSize:        Integer;
     dwMark:        LongWord;
     szReplace:     String;
     lpszSource:    PChar;
begin

  // Mark time
  dwMark:=GetTickCount;

  // Resource protection
  try
     // Clear hash table
     FHashCodeQty.Clear;
     // Create regex to load hash table
     regCodeQty:=TRegExpr.CreatePattern(REG_TAGQTY);
     // Resource protection
     try
        // Load the data from file
        regCodeQty.LoadFromFile(FILE_CODEQTY);
        // While match
        if regCodeQty.MatchFirst then
        begin
           // Repeat
           repeat
              // Add hash item
              FHashCodeQty.Add(PChar(regCodeQty.NamedBackReference['Tag']), Pointer(StrToInt(regCodeQty.NamedBackReference['Value'])));
           // Stop when we fail to match next
           until not(regCodeQty.MatchNext);
        end;
     finally
        // Free regex
        regCodeQty.Free;
     end;

     //
     // Undef to use file based destination stream
     //
     {$DEFINE USE_MEM}

     // Create regex to load the test file
     regTest:=TRegExpr.CreatePattern(REG_TAG);

     // Resource protection
     try
        // Load the  test file
        regTest.LoadFromFile(FILE_TEST);

        // Create stream to dump the results
        {$IFDEF USE_MEM}
        strmBase:=TFastMemStream.Create;
        {$ELSE}
        strmBase:=TFileStream.Create(FILE_TEST, fmCreate);
        {$ENDIF}

        // Resource protection
        try
           // Get source pointer from regex parser
           lpszSource:=regTest.Source;
           // Set back marker
           dwBackPos:=0;
           // Start match
           if regTest.MatchFirst then
           begin
              // Repeat while match
              repeat
                 // Get size of data up to match
                 dwSize:=regTest.MatchPos - dwBackPos;
                 // write preceding text over
                 if (dwSize > 0) then strmBase.Write(lpszSource^, dwSize);
                 // Push source pointer past the match
                 Inc(lpszSource, dwSize + regTest.MatchLen);
                 // Lookup the tag in the hash to get the quantitiy
                 if FHashCodeQty.Find(PChar(regTest.NamedBackReference['Tag']), ptrValue) then
                 begin
                    // Pointer is really numeric value, convert to string
                    Str(Integer(ptrValue), szReplace);
                    // Write replacement value to file
                    strmBase.Write(Pointer(szReplace)^, Length(szReplace));
                 end;
                 // Update the back marker
                 dwBackPos:=regTest.MatchPos + regTest.MatchLen;
              // Match next
              until not(regTest.MatchNext);
           end;
           // Write remaining block of text
           strmBase.Write(lpszSource^, StrLen(lpszSource));
        finally
           // Save to file (if not a file stream)
           if not(strmBase is THandleStream) then
           begin
              TCustomMemoryStream(strmBase).SaveToFile(FILE_TEST);
           end;
           // Free the stream
           strmBase.Free;
        end;
     finally
        // Free regex
        regTest.Free;
     end;

  finally
     // Set caption
     Caption:=IntToStr(GetTickCount - dwMark);
  end;

end;

procedure TForm1.GenerateSampleFiles;
begin

  // Generate the sample codes
  GenerateSampleCodes;

  // Generate sample code = qty file
  GenerateSampleCodeQtyFile;

  // Generate the sample test file
  GenerateSampleTestFile;

end;

procedure TForm1.GenerateSampleTestFile;
var  listSample:    TStringList;
     dwIndex:       Integer;
begin

  // Generate a sample file with the following attributes:
  //
  //    - 50000 total lines
  //    - 3 replacement tags per line
  //    - 150000 total replacement tags in the file

  // Create list
  listSample:=TStringList.Create;

  // Resource protection
  try
     // Walk sample codes
     for dwIndex:=1 to 50000 do
     begin
        // Create sample line
        listSample.Add(
                          StringOfChar('a', Random(100)) + '(%%%' +
                          FTestCodes[Random(FTestCodes.Count)] + '%%%)' +
                          StringOfChar('b', Random(100)) + '(%%%' +
                          FTestCodes[Random(FTestCodes.Count)] + '%%%)' +
                          StringOfChar('c', Random(100)) +  '(%%%' +
                          FTestCodes[Random(FTestCodes.Count)] + '%%%)' +
                          StringOfChar('d', Random(100))
                       );

     end;
     listSample.Add('zzz(%%%' + FTestCodes[Random(FTestCodes.Count)] + '%%%)Q');
     // Save to file
     listSample.SaveToFile(FILE_TEST);
  finally
     // Free the list
     listSample.Free;
  end;

end;

procedure TForm1.GenerateSampleCodeQtyFile;
var  listCodeQty:   TStringList;
     dwIndex:       Integer;
begin

  // Create list
  listCodeQty:=TStringList.Create;

  // Resource protection
  try
     // Walk sample codes
     for dwIndex:=0 to Pred(FTestCodes.Count) do
     begin
        // Generate random quantity values
        listCodeQty.Add(Format('%s = %d', [FTestCodes[dwIndex], Random(100000)]));
     end;
     // Save to file
     listCodeQty.SaveToFile(FILE_CODEQTY);
  finally
     // Free the list
     listCodeQty.Free;
  end;

end;

procedure TForm1.GenerateSampleCodes;
var  dwIndex:       Integer;
begin

  // Lock list
  FTestCodes.BeginUpdate;

  // Resource protection
  try
     // Clear list
     FTestCodes.Clear;
     // Create set of stock codes
     for dwIndex:=1 to 10000 do
     begin
        // Add sample stock code
        FTestCodes.Add(Format('stock_DJB%d', [dwIndex]));
     end;
  finally
     // Unlock list
     FTestCodes.EndUpdate;
  end;

end;

procedure TForm1.FormCreate(Sender: TObject);
begin

  // Create string list for codes
  FTestCodes:=TStringList.Create;

  // Create hash for loading codes and quantities
  FHashCodeQty:=THash.Create;

end;

procedure TForm1.FormDestroy(Sender: TObject);
begin

  // Free objects
  FTestCodes.Free;
  FHashCodeQty.Free;

end;

//// Generate sample data //////////////////////////////////////////////////////
procedure TForm1.Button1Click(Sender: TObject);
var  dwMark:        LongWord;
begin

  // Mark time
  dwMark:=GetTickCount;

  // Resource protection
  try
     // Generate the sample data files
     GenerateSampleFiles;
  finally
     // Set caption
     Caption:=IntToStr(GetTickCount - dwMark);
  end;

end;

//// TFastMemStream ////////////////////////////////////////////////////////////
function TFastMemStream.Realloc(var NewCapacity: Longint): Pointer;
var  dwDelta:       Integer;
     lpMemory:      Pointer;
begin

  // Get current memory pointer
  lpMemory:=Memory;

  // Resource protection
  try
     // Calculate the delta to be applied to the capacity
     if (NewCapacity > 0) then
     begin
        // Check new capacity
        if (NewCapacity > MaxWord) then
           // Delta is 1/4 of desired capacity
           dwDelta:=NewCapacity div 4
        else
           // Minimum allocation of 64 KB
           dwDelta:=MaxWord;
        // Update by delta
        Inc(NewCapacity, dwDelta);
     end;
     // Determine if capacity has changed
     if not(NewCapacity = Capacity) then
     begin
        // Check for nil alloc
        if (NewCapacity = 0) then
        begin
           // Release the memory
           FreeMem(lpMemory);
           // Clear result
           lpMemory:=nil;
        end
        else
        begin
           // Check current capacity
           if (Capacity = 0) then
              // Allocate memory
              lpMemory:=AllocMem(NewCapacity)
           else
              // Reallocate memory
              ReallocMem(lpMemory, NewCapacity);
        end;
     end;
  finally
     // Return modified pointer
     result:=lpMemory;
  end;

end;

//// Hashing functions /////////////////////////////////////////////////////////
function HashCompare(Item1, Item2: PChar; CareCase: Boolean): Boolean;
begin

  // Null assignment checks
  if (Item1 = nil) then
     // Check for Item2 being nil
     result:=(Item2 = nil)
  else if (Item2 = nil) then
     // Item1 is not null, so no possible match
     result:=False
  // Check case
  else if CareCase then
     // Case sensitive compare
     result:=(StrComp(Item1, Item2) = 0)
  else
     // In-case sensitive compare
     result:=(StrIComp(Item1, Item2) = 0);

end;

function HashFunc(Key: PChar; CareCase: Boolean): Integer;
var  bChar:         Byte;
begin

  // Set starting result
  result:=0;

  // Generate hash index for key
  while (Key^ > #0) do
  begin
     if CareCase then
        bChar:=Byte(Key^)
     else if (Key^ in ['A'..'Z']) then
        bChar:=Byte(Key^) + 32
     else
        bChar:=Byte(Key^);
     Inc(result, (result shl 5) + (bChar shl 3));
     Inc(Key);
  end;

  // Keep result in bounds of array
  result:=LongWord(result) mod HASH_SIZE;

end;

//// THash /////////////////////////////////////////////////////////////////////
constructor THash.Create(Deallocator: THashDeallocator = nil);
begin

  // Perform inherited
  inherited Create;

  // Set default values
  ZeroMemory(@FHash, SizeOf(FHash));
  FDeallocator:=Deallocator;
  FCareCase:=False;
  FCount:=0;

end;

destructor THash.Destroy;
begin

  // Resource protection
  try
     // Clear the hash
     Clear;
  finally
     // Perform inherited
     inherited Destroy;
  end;

end;

procedure THash.Clear;
var  phNode1:       PHashNode;
     phNode2:       PHashNode;
     dwIndex:       Integer;
begin

  // Resource protection
  try
     // Iterate the array and clear the hash nodes
     for dwIndex:=0 to Pred(HASH_SIZE) do
     begin
        // Get bucket node
        phNode1:=FHash[dwIndex];
        // Walk the nodes
        while Assigned(phNode1) do
        begin
           // Get pointer to next item
           phNode2:=phNode1^.Next;
           // Free memory for node name
           if Assigned(phNode1^.Item) then FreeMem(phNode1^.Item);
           // Callback
           if (Assigned(phNode1^.Data) and Assigned(FDeallocator)) then FDeallocator(Self, phNode1^.Data);
           // Free node item
           FreeMem(phNode1);
           // Set iterator to next item
           phNode1:=phNode2;
        end;
     end;
  finally
     // Clear all top level buckets
     for dwIndex:=0 to Pred(HASH_SIZE) do FHash[dwIndex]:=nil;
     // Reset the count
     FCount:=0;
  end;

end;

function THash.Extract(Item: PChar; out Data: Pointer): Boolean;
var  phNode1:       PHashNode;
     phNode2:       PHashNode;
     dwIndex:       Integer;
begin

  // Get the hash index
  dwIndex:=HashFunc(Item, FCareCase);

  // Get the hash bucket item
  phNode1:=FHash[dwIndex];

  // Did top level item exist?
  if Assigned(phNode1) then
  begin
     // Prepare for loop
     phNode2:=phNode1;
     // Walk the nodes
     while Assigned(phNode2) do
     begin
        // Match key
        if HashCompare(phNode2^.Item, Item, FCareCase) then break;
        // Save current node
        phNode1:=phNode2;
        // Move to the next node in the chain
        phNode2:=phNode2^.Next;
     end;
     // Check to see if the node is still set
     if Assigned(phNode2) then
     begin
        // Set out param data value
        Data:=phNode2^.Data;
        // Check to see if this is the top level item
        if (phNode2 = phNode1) then
           // Link next node into the bucket
           FHash[dwIndex]:=phNode2^.Next
        else
           // Link over this node
           phNode1^.Next:=phNode2^.Next;
        // Free memory for node name
        if Assigned(phNode2^.Item) then FreeMem(phNode2^.Item);
        // Free the node
        FreeMem(phNode2);
        // Decrement the node count
        Dec(FCount);
        // Success
        result:=True;
     end
     else
        // Did not find the node
        result:=False;
  end
  else
     // No nodes in bucket
     result:=False;

end;

function THash.Delete(Item: PChar): Boolean;
var  lpData:        Pointer;
begin

  // Extract the item
  result:=Extract(Item, lpData);

  // Check result, perform callback if needed
  if (result and Assigned(lpData) and Assigned(FDeallocator)) then FDeallocator(Self, lpData);

end;

function THash.Add(Item: PChar; Data: Pointer): Boolean;
var  phNode1:          PHashNode;
     phNode2:          PHashNode;
     dwIndex:          Integer;
begin

  // Get the hash bucket item index
  dwIndex:=HashFunc(Item, FCareCase);

  // Resource protection
  try
     // Get the hash bucket item
     phNode1:=FHash[dwIndex];
     // Is the bucket empty
     if (phNode1 = nil) then
        // Add new cell item
        FHash[dwIndex]:=NewNode(Item, Data)
     else
     begin
        // Save current node
        phNode2:=phNode1;
        // Walk nodes
        while Assigned(phNode2) do
        begin
           // Match the key
           if HashCompare(phNode2^.Item, Item, FCareCase) then
           begin
              // Check for data change
              if (phNode2^.Data <> Data) then
              begin
                 // Callback
                 if (Assigned(phNode2^.Data) and Assigned(FDeallocator)) then FDeallocator(Self, phNode2^.Data);
                 // Set new data item
                 phNode2^.Data:=Data;
              end;
              // Break loop
              break;
           end;
           // Save current node
           phNode1:=phNode2;
           // Walk next node
           phNode2:=phNode2^.Next;
        end;
        // Do we need to add a new item to the end of the chain?
        if (phNode2 = nil) then
        begin
           // Create hash node
           phNode2:=NewNode(Item, Data);
           // Link the node in
           phNode1^.Next:=phNode2;
        end;
     end;
  finally
     // Always success
     result:=True;
  end;

end;

function THash.Find(Item: PChar; out Data: Pointer): Boolean;
var  phNode:        PHashNode;
begin

  // Get the hash bucket item
  phNode:=FHash[HashFunc(Item, FCareCase)];

  // Resource protection
  try
     // Walk the items
     while Assigned(phNode) do
     begin
        // Compare the key
        if HashCompare(phNode^.Item, Item, FCareCase) then
        begin
           // Key exists, set out return data
           Data:=phNode^.Data;
           break;
        end;
        // Walk the next item
        phNode:=phNode^.Next;
     end;
  finally
     // Success if node is assigned
     result:=Assigned(phNode);
  end;

end;

function THash.NewNode(Item: PChar; Data: Pointer): PHashNode;
begin

  // Allocate memory for new node
  result:=AllocMem(SizeOf(THashNode));

  // Set the structure fields
  result^.Item:=StrCopy(AllocMem(Succ(StrLen(Item))), Item);
  result^.Data:=Data;

  // Increment the count
  Inc(FCount);

end;

initialization

  // Seed random number generator
  Randomize;

end.

-- dfm ---
object Form1: TForm1
  Left = 351
  Top = 319
  Width = 453
  Height = 177
  Caption = 'Form1'
  Color = clBtnFace
  Font.Charset = ANSI_CHARSET
  Font.Color = clWindowText
  Font.Height = -11
  Font.Name = 'Tahoma'
  Font.Style = []
  OldCreateOrder = False
  OnCreate = FormCreate
  OnDestroy = FormDestroy
  PixelsPerInch = 96
  TextHeight = 13
  object Button1: TButton
    Left = 12
    Top = 12
    Width = 165
    Height = 25
    Caption = 'Generate Sample Data'
    TabOrder = 0
    OnClick = Button1Click
  end
  object btnRun: TButton
    Left = 12
    Top = 44
    Width = 165
    Height = 25
    Caption = 'Run Test'
    TabOrder = 1
    OnClick = btnRunClick
  end
end


Avatar of stuayre

ASKER

Russell is there a way to mod the code so if the qty doesn't exist it will replace it with 0?

I can't work out how your code works :)

cheers

Stu
Stu,

Yes, its very easy. All you need to do is check the hash find; if it fails, set ptrValue to nil (which will convert to zero when cast as int).

Eg (first example I gave)
----------------------------------
procedure TForm1.ReplaceTag(Sender: TObject; var Replace: String);
var  ptrValue:      Pointer;
begin

  // Lookup the tag in the hash to get the quantitiy
  if not(FHashCodeQty.Find(PChar(TRegExpr(Sender).NamedBackReference['Tag']), ptrValue)) then ptrValue:=nil;

  // Pointer is really numeric value, set replacement text
  Str(Integer(ptrValue), Replace);

end;

Eg (second example I gave, mod the code in TForm1.btnRunClick)
----------------------------------
       
        // Lookup the tag in the hash to get the quantitiy
        if not(FHashCodeQty.Find(PChar(regTest.NamedBackReference['Tag']), ptrValue)) then ptrValue:=nil;

        // Pointer is really numeric value, convert to string
        Str(Integer(ptrValue), szReplace);
       
        // Write replacement value to file
        strmBase.Write(Pointer(szReplace)^, Length(szReplace));
 In case you were interested I was able to significantly increase the speed of what I had previously done by doing a couple of optimizations.  The process should be far easier to understand although it is still not quite as fast (the hashing algorithm is roughly2-3 times as fast).  I timed the StrPos routine that comes with Delphi and it was taking 15ms or more per search on my machine which is ridiculous.  I made my own and it routinely tests as 0ms.  Before anyone comments, I know such routines exist.  It took me less time to write than to look through the available libraries and pick one (let alone test them all).   I used the same data generator as rlibby created with one important difference.  I changed the GenerateSampleCodes to generate them this way:

procedure TForm1.GenerateSampleCodes;
var  dwIndex:       Integer;
begin

  // Lock list
  FTestCodes.BeginUpdate;

  // Resource protection
  try
     // Clear list
     FTestCodes.Clear;
     // Create set of stock codes
     for dwIndex:=1 to 10000 do
     begin
        // Add sample stock code
        FTestCodes.Add(Format('stock_DJB%.5d', [dwIndex]));
     end;
  finally
     // Unlock list
     FTestCodes.EndUpdate;
  end;

end;

This allowed me to presort them and do a simple binary search for the name within a regular string list.
I plan to work on it more as time permits.  Let me know if you have questions.
unit uFileReplace;
 
interface
uses
  Windows, SysUtils, Classes;
 
type
  //replace array of PChar with array of PWideChar for unicode
  TFileReplace = class
  private
    fInFile : TMemoryStream;
    fOutFile : TMemoryStream;
    fStartDelimiter : string;
    fEndDelimiter : string;
    fSearchList : TStringList;
    fOutFileName : string;
  protected
  public
    constructor Create(InFileName, OutFileName, ReplaceFileName, StartDelim,
      EndDelim : string);
    destructor Destroy; override;
    procedure Replace;
    function PCFind(Source, Value : pchar) : PChar;
    function IndexOfName(Name : string) : integer;
  end;
 
implementation
{ TFileReplace }
 
constructor TFileReplace.Create(InFileName, OutFileName, ReplaceFileName,
  StartDelim, EndDelim : string);
var
  FS : TFileStream;
 
begin
  FS := TFileStream.Create(InFileName, fmOpenRead);
  try
    fInFile := TMemoryStream.Create;
    fInFile.LoadFromStream(FS);
  finally
    FS.Free;
  end;
 
  fSearchList := TStringList.Create;
  fSearchList.LoadFromFile(ReplaceFileName);
 
  fOutFile := TMemoryStream.Create;
  fOutFile.SetSize(fInFile.Size * 2);
  fStartDelimiter := StartDelim;
  fEndDelimiter := EndDelim;
  fOutFileName := OutFileName;
end;
 
destructor TFileReplace.Destroy;
begin
  fOutFile.Free;
  fSearchList.Free;
  fInFile.Free;
  inherited;
end;
 
function TFileReplace.IndexOfName(Name: string): integer;
var
  I, L, H, G : integer;
  GName : string;
 
begin
  Result := -1;
  L := 0;
  H := fSearchList.Count - 1;
  repeat
    G := ((H - L) shr 1) + L;
    GName := fSearchList.Names[G];
    if Name > GName then
      L := G
    else
      if Name < GName then
        H := G;
  until (H - L < 3) or (Name = GName);
 
  if Name = GName then
    begin
      Result := G;
      exit;
    end;
 
  I := L;
  while (Name <> fSearchList.Names[I]) and (I < H) do
    inc(I);
  if Name = fSearchList.Names[I] then
    Result := I;
end;
 
function TFileReplace.PCFind(Source, Value: pchar): PChar;
begin
  Result := nil;
 
  if not Assigned(Source) then
    exit;
 
  if not Assigned(Value) then
    exit;
 
  try
    if Source[0] = #0 then
      exit;
  except
 
  end;
 
  if Value[0] = #0 then
    exit;
 
  asm
    //ecx Value
    //edx Source
      Mov ESI, Source
 
    @ResetValue:
      Mov EDI, Value; //set the pointer to the value
 
    @Loop:
      Mov BL, [ESI] //read in a byte of the source
      Cmp BL, [EDI] //compare to what is being searched for
      JE @TestFullValue //if found then continue
 
    @NextChar:
      inc ESI       //otherwise increment the source pointer
      CMP [ESI], 0
      JE @NotFound
      JMP @Loop     //and search further
 
    @TestFullValue:
      //this is where we need to see if we found the entire value
      Mov EDX, ESI
 
    @LoopTestFullValue:
      inc ESI
      inc EDI
 
      Mov BL, [ESI]
      Cmp BL, [EDI]
      JE  @LoopTestFullValue
 
      Mov BL, [EDI]
      Cmp BL, 0
      JE @Found
      JMP @ResetValue
 
    @Found:
      mov Result, EDX  //set the result
 
    @NotFound:
  end;
end;
 
procedure TFileReplace.Replace;
var
  Bookmark, SIndex, EIndex : PChar;
  LInt : integer absolute Bookmark;
  TagName : string;
  Value : string;
  TagIndex : integer;
  ChunkSize : integer;
  P : integer;
 
begin
  Assert(fStartDelimiter <> '', 'Start delimiter not set');
  Assert(fEndDelimiter <> '', 'End delimiter not set');
 
  Bookmark := fInFile.Memory;
  repeat
//    SIndex := StrPos(Bookmark, PChar(fStartDelimiter));
    SIndex := PCFind(Bookmark, PChar(fStartDelimiter));
 
    if Assigned(SIndex) then
      begin
        //figure out how much can be copied RAW
        ChunkSize := integer(SIndex) - integer(Bookmark);
 
        //copy it
        if ChunkSize > 0 then
          fOutFile.WriteBuffer(Bookmark^, ChunkSize);
 
        //skip the delimiter
        inc(SIndex, Length(fStartDelimiter));
 
        //find the ending delimiter
    //    EIndex := StrPos(SIndex, PChar(fEndDelimiter));
 
        EIndex := PCFind(SIndex, PChar(fEndDelimiter));
 
        //up to 100% to this point
 
        //figure out how big the tag is
        ChunkSize := integer(EIndex) - integer(SIndex);
 
        //set the tagname
        SetLength(TagName, ChunkSize);
        StrLCopy(@TagName[1], SIndex, ChunkSize);
 
        //move past the end delimiter
        inc(EIndex, Length(fEndDelimiter));
 
        //update Bookmark
        Bookmark := EIndex;
 
        //up to 100% to this point
 
        //replace the tag
        TagIndex := IndexOfName(TagName);
        Value := '';
        if TagIndex > -1 then
          begin
            Value := fSearchList[TagIndex];
            P := Pos('=', Value);
            Value := Copy(Value, P + 1, Length(Value));
          end;
        if Value <> '' then
          fOutFile.Write(Value[1], Length(Value));
      end;
 
  until not Assigned(SIndex);
 
  fOutFile.Seek(0, soFromBeginning);
  fOutFile.SaveToFile(fOutFileName);
end;
 
end.

Open in new window

Any chance that this question will get closed?

Russell
Avatar of stuayre

ASKER

Sorry!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Russle you're a genius! thanks for all your help :)

kind regards

Stu
Avatar of stuayre

ASKER

p.s thanks to developmentguru too :o)
Stu,
No problem. Figured I would throw a reminder out there (its easy to lose track sometimes)

Russell