Solved

Reading FAST from a HUDGE TextFile

Posted on 2006-10-24
35
261 Views
Last Modified: 2010-04-05
I need a function or a way to read (and process) a BIG Textfile (around 1mega). At the moment I read line after line but it takes a lifetime! I've tried to load the file into a TStringList but when I'm processing all the text it also takes a life time. I have an application made by a friend in C# who processes the same text file in a couple of seconds!
Please advice
0
Comment
Question by:crystyan
  • 11
  • 10
  • 8
  • +2
35 Comments
 
LVL 28

Assisted Solution

by:ciuly
ciuly earned 70 total points
ID: 17794843
well that depends how you are doing the processing. if you are using functions like copy, delete, etc on the string, then it will take a long time because delphi reallocates the string at every call of such function.
see this thread on some tips on performance on string text manipulation (there are a few links to some other threads on the same thing): http://www.experts-exchange.com/Programming/Programming_Languages/Delphi/Q_21980883.html (it's long, I know, but you can skip the code ;)

also, using something like:
function readup(filename:string):string;
begin
  with tfilestream.create(filename,fmopenread) do
  begin
    setlength(result,size);
    read(pansichar(result)^,size);
    free;
  end;
end;
or using teh stringlist.loadfromfile should be fast enough.
0
 
LVL 4

Expert Comment

by:tobjectpascal
ID: 17796686
procedure TForm1.Button1Click(Sender: TObject);
Var
 T: Textfile;
 Tmp: String;
 C: Integer;
begin
  System.Assign(T,'c:\sample.txt');
  System.Reset(T);
  While Not Eof(T) Do
   Begin
     Readln(T,Tmp);
     Inc(C);
   End;
  ShowMessage(IntToStr(C)+' Lines Loaded');
  System.Close(T);
end;

1.6 meg Text file in about 1/2 second or faster, seemed almost instant..

0
 
LVL 4

Expert Comment

by:tobjectpascal
ID: 17796741
Time Taken to Load 12548928 Lines 0.03 Seconds

procedure TForm1.Button1Click(Sender: TObject);
Var
 T: Textfile;
 Tmp: String;
 TT: Real;
 GT: Longint;
 C: Integer;
begin
  Gt:=GettickCount;
  System.Assign(T,'c:\sample.txt');
  System.Reset(T);
  While Not Eof(T) Do
   Begin
     Readln(T,Tmp);
     Inc(C);
   End;
  TT:=(GetTickCount-Gt) / 1000;
  Edit1.Text:='Time Taken to Load '+IntToStr(C)+' Lines '+FloatToStr(TT)+' Seconds';
  System.Close(T);
end;


How fast do you want it?

Compiled/Written in Delphi 5 CPU 1gig AMD Duron, 512 meg of memory...

So what's cauing the slow load time in your program? must be a string manipluation problem
0
 
LVL 4

Expert Comment

by:tobjectpascal
ID: 17796752
My Bad

Time Taken to Load 18820 Lines 0.04 Seconds

C:=0; //even after all these years i still forget to initilize variables sometimes.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 17798374
@crystyan

When you say you are using a TStringList, does that mean you are invoking the LosdFromFile method?  It is VERY FAST.

Since 1MB isn't terribly large, we will need some additional information from you:

1. What kind of hardware are you using? (CPU speed, RAM, laptop/desktop/server, etc.)
2. What is your runtime environment? (OS, competing programs, etc.)
3. What kind of "(and process)" actions are you taking on the lines in this file?
4. What's happening with the rest of your application (forms, controls, mouse/keyboard inputs, etc.) while you are reading and processing this file?
0
 
LVL 17

Expert Comment

by:TheRealLoki
ID: 17800658
I agree, TStringList should be fast enough for a 1 meg file
Were you perhaps using a TMemo (or similar) to load it?
The visual aspect of components will slow things down a lot
you can wrap it like :-

mystringlist.BeginUpdate;
try
  mystringlist.LoadFromFile('c:\blah.txt');
finally
  mystringlist.Endupdate;
end;
to speed it up
0
 
LVL 45

Expert Comment

by:aikimark
ID: 17801063
also, setting the TStringList.CAPACITY property to an estimate of the number of lines you will be loading will also make a noticable improvement on the performance of the LoadFromFile method.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 17801072
typo correction:
LosdFromFile  should be LoadFromFile
0
 

Author Comment

by:crystyan
ID: 17806715
hmm ... I think the problem is somewhere else then! I want to count all the 2 letter combination and keep them stored in a TList or if you have a better ideea (a faster one it would be great !)
Please, if you have time to write me a small class wich should have something like : Combination: String; count: Cardinal and it would count all the combination from that BIG text file it would be great.
For example if I would have a text file with 2 words:
word1 word2

then I would have
wo, 2
or, 2
rd, 2
d1, 1
d2, 1

Ignore ALL what is not between ['a' .. 'z']  you need to make them lower first :|

THANK YOU !
0
 
LVL 17

Expert Comment

by:TheRealLoki
ID: 17807744
This should be simple enough to understand
I have left it open so you can d "any length" words
eg. find 2 letter words, then also find 3 letter words..
I blatantly used ciuly's pAnsiChar loading of the file. Feed him some points if you like it

unit Unit1;

interface

uses
  Windows, Messages, SysUtils, Classes, Graphics, Controls, Forms, Dialogs,
  StdCtrls;

type TBigFileParser = class(TObject)
public
  Results: TStringList;
  BigText: string;
  Constructor Create;
  Destructor Destroy; override;
  procedure LoadFile(filename: string);
  procedure FindxLetterWords(wordlength: integer);
end;

type
  TForm1 = class(TForm)
    Button1: TButton;
    Memo1: TMemo;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.DFM}

{ TBigFileParser }

constructor TBigFileParser.Create;
begin
  inherited Create;
  Results := TStringList.Create;
end;

destructor TBigFileParser.Destroy;
begin
  Results.Clear;
  Results.Free;
  inherited Destroy;
end;

procedure TBigFileParser.LoadFile(filename: string);
  var
    FS: TFileStream;
begin
  FS := TFileStream.create(filename, fmOpenRead);
  try
    SetLength(BigText, FS.Size);
    FS.Read(PAnsiChar(BigText)^, FS.Size);
  finally
    FS.Free;
  end;
end;

procedure TBigFileParser.FindxLetterWords(wordlength: integer);
var
  i: integer;
  btlen: integer;
  partialword: string;
  c: char;
begin
  Results.Clear; // clear our counts first
  partialword := '';
  BigText := lowercase(BigText);
  btlen := length(BigText);
  i := 1;
  while (i <= btlen) do
  begin
    c := BigText[i];

    if not(c in ['a'..'z']) then
    begin // assume any char not in 'a'..'z' is a delimter (space, etc)
      if length(partialword) = wordlength then
      begin // we have found a 'x' length word
        Results.Values[partialword] := IntToStr(StrToIntDef(Results.Values[partialword], 0) + 1);
        partialword := '';
      end
      else if length(partialword) > wordlength then
      begin
        partialword := '';
      end;
    end
    else
      partialword := partialword + c; // add this 'a'..z' char to our string
    inc(i);
  end;
  if length(partialword) = wordlength then
  begin // we have found a x length word in the last section
    Results.Values[partialword] := IntToStr(StrToIntDef(Results.Values[partialword], 0) + 1);
  end
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
  with TBigFileParser.Create do
  try
    LoadFile('test.txt');
    FindxLetterWords(2); // we want to find the "2" letter words
    memo1.lines.assign(Results);
    ShowMessage('The word "my" appears ' + Results.Values['my'] + ' times');
  finally
    free;
  end;
end;

end.
0
 

Author Comment

by:crystyan
ID: 17808132
AWESOME !!! Thank you SO MUCH ! Another thing and this is all. Could you tell me how could I sort them by the number of apparition ?

THANK YOU !
0
 
LVL 17

Expert Comment

by:TheRealLoki
ID: 17808189
add these bits of code

var
  Form1: TForm1; // place CustomSortByCount below this part in your existing unit
  function CustomSortByCount(List: TStringList; Index1, Index2: Integer): Integer;
implementation

{$R *.DFM}

function CustomSortByCount(List: TStringList; Index1, Index2: Integer): Integer;
begin
  if List.Values[List.Names[Index1]] < List.Values[List.Names[Index2]] then
    result := -1
  else if List.Values[List.Names[Index1]] < List.Values[List.Names[Index2]] then
    result := 1
  else
    result := AnsiCompareStr(List.Names[Index1], List.Names[Index2]);
end;

then change the button1 click to be ...

procedure TForm1.Button1Click(Sender: TObject);
begin
  with TBigFileParser.Create do
  try
    LoadFile('test.txt');
    FindxLetterWords(2); // we want to find the "2" letter words
    memo1.lines.assign(Results);
// now sort the results
    Results.CustomSort(CustomSortByCount);
    memo2.lines.assign(Results);
    ShowMessage('The word "my" appears ' + Results.Values['my'] + ' times');
  finally
    free;
  end;
end;
0
 

Author Comment

by:crystyan
ID: 17808223
oh, another thing. I've just tested and it doesn't seem to get the right results ... try on this file www.cristianignat.com/diverse/source.zip
Please try searching for "mm" or "th" you'll see only 1 result and if you would manually open the file in an editor you'll get much more results.
0
 

Author Comment

by:crystyan
ID: 17808244
and how could I sort the results when I' calling Results.CustomSort(CustomSortByCount); I get [Error] Unit1.pas(350): Incompatible types: 'regular procedure and method pointer'

Sorry for stressing you :-|
0
 
LVL 17

Expert Comment

by:TheRealLoki
ID: 17808275
oops, slight bug
change
      else if length(partialword) > wordlength then
to
      else if length(partialword) <> wordlength then
0
 

Author Comment

by:crystyan
ID: 17808297
nope, still not. Run it for "mm" you'll see that the application show you only 1 and it appears more than one time in the file I've uploaded it for you.
0
 
LVL 17

Expert Comment

by:TheRealLoki
ID: 17808358
Incompatible types: 'regular procedure and method pointer'
THis means you have the function declaration
  function CustomSortByCount(List: TStringList; Index1, Index2: Integer): Integer;
in teh wrong place.
place it just above the word
"implementation"
and put the corrected procedure (below) just below the word
"implementation"

I also was not converting teh count to numbers (so the order was alphanumeric instead of integer)

function CustomSortByCount(List: TStringList; Index1, Index2: Integer): Integer;
begin
  if StrToInt(List.Values[List.Names[Index1]]) < StrToInt(List.Values[List.Names[Index2]]) then
    result := -1
  else if StrToInt(List.Values[List.Names[Index1]]) > StrToInt(List.Values[List.Names[Index2]]) then
    result := 1
  else
    result := AnsiCompareStr(List.Names[Index1], List.Names[Index2]);
end;

and don't forget to change the line
      else if length(partialword) > wordlength then
to
      else if length(partialword) <> wordlength then

that'll teach me not to fully test my code before releasing it :-)
0
How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

 
LVL 17

Expert Comment

by:TheRealLoki
ID: 17808376
I get 23  "th"
16th, 28th, 30th, 4th, 19th, 20th, 19th, 5th, 19th, 20th, 6th, 7th, 19th, 7th, 19th, 19th, 20th, 7th, 20th, 69th, 70th, 7th, 13th
and 1 "mm"
there is only 1 "mm being "approximately 38 mm a year. "
every other instance of "mm" is part of another word (eg. "programmed", "common")
have I not understood your request correctly?
0
 

Author Comment

by:crystyan
ID: 17808388
I want EVERY instance of "mm" so if it appears in programming, I want that too. But this is good too that it's searching only for the 2 letter words. could u do that as another function ?

This is the last thing. Promisse!
0
 

Author Comment

by:crystyan
ID: 17808414
and I still didn't get it about the sort function. :-(
0
 
LVL 17

Expert Comment

by:TheRealLoki
ID: 17808569
the top of my demo looksl ike this :-

unit Unit1;

interface

uses
  Windows, Messages, SysUtils, Classes, Graphics, Controls, Forms, Dialogs,
  StdCtrls;

type TBigFileParser = class(TObject)
public
  Results: TStringList;
  BigText: string;
  Constructor Create;
  Destructor Destroy; override;
  procedure LoadFile(filename: string);
  procedure FindxLetterWords(wordlength: integer);
end;

type
  TForm1 = class(TForm)
    Button1: TButton;
    Memo1: TMemo;
    Memo2: TMemo;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;
  function CustomSortByCount(List: TStringList; Index1, Index2: Integer): Integer; // <-- NOTE POSITION HERE

implementation

{$R *.DFM}

function CustomSortByCount(List: TStringList; Index1, Index2: Integer): Integer;
begin
  if StrToInt(List.Values[List.Names[Index1]]) < StrToInt(List.Values[List.Names[Index2]]) then
    result := -1
  else if StrToInt(List.Values[List.Names[Index1]]) > StrToInt(List.Values[List.Names[Index2]]) then
    result := 1
  else
    result := AnsiCompareStr(List.Names[Index1], List.Names[Index2]);
end;


you will need to clarify that mm thing a bit more

I assume you want it for every "2 characters" not just mm

so in the following text
"abcdefg hijk"
would you expect to get
"ab" "cd" "ef" "hi" "jk"
or
"ab" "bc" "cd" "de" "ef" "fg" "hi" "ij" "jk"
or something else?
0
 

Author Comment

by:crystyan
ID: 17809649
I expect ab, bc, cd, de, ef, fg, hi, jp
0
 
LVL 45

Expert Comment

by:aikimark
ID: 17812032
@crystyan

what about "ij"?
0
 
LVL 45

Expert Comment

by:aikimark
ID: 17812090
What should happen for a word like "isn't" or "can't"?
0
 

Author Comment

by:crystyan
ID: 17812168
"ab" "bc" "cd" "de" "ef" "fg" "hi" "ij" "jk"

missed that.

so basicly I want ALL the 2 (or 3) consecutive letters to be counted but not those separated by space or other character different than 'a'..'z'
in this case I don't want to appear "gh" in the counting.

for isn't it will count for "is", "sn". that's all
0
 
LVL 17

Expert Comment

by:TheRealLoki
ID: 17814793
add this procedure to the class

procedure TBigFileParser.FindxConsecutiveLetters(wordlength: integer);
var
  i: integer;
  btlen: integer;
  partialword: string;
  c: char;
begin
//  Results.Clear; // clear our counts first - up to you if you want to do this
  partialword := '';
  BigText := lowercase(BigText);
  btlen := length(BigText);
  i := 1;
  while (i <= btlen) do
  begin
    c := BigText[i];

    if not(c in ['a'..'z']) then
    begin // assume any char not in 'a'..'z' is a delimter (space, etc)
      if length(partialword) = wordlength then
      begin // we have found 'x' length letters in a row, add it to our list
        Results.Values[partialword] := IntToStr(StrToIntDef(Results.Values[partialword], 0) + 1);
        // remove the first letter of our "found text" so we can grab the next letter
        Delete(partialword, 1, 1);
      end
      else if length(partialword) <> wordlength then
      begin
        partialword := '';
      end;
    end
    else
    begin // we have not reached the end of the current text (word)
      partialword := partialword + c; // add this 'a'..z' char to our string
      if length(partialword) = wordlength then
      begin // we have found 'x' length letters in a row, add it to our list
        Results.Values[partialword] := IntToStr(StrToIntDef(Results.Values[partialword], 0) + 1);
        // remove the first letter of our "found text" so we can grab the next letter
        Delete(partialword, 1, 1);
      end
    end;
    inc(i);
  end;
  if length(partialword) = wordlength then
  begin // we have found 'x' length letters in a row, add it to our list
    Results.Values[partialword] := IntToStr(StrToIntDef(Results.Values[partialword], 0) + 1);
  end
end;

and call it like

    FindxConsecutiveLetters(2); // we want to find strings of "2" letters in a row

It takes a little longer to run than the first procedure. Let me know if the speed is unacceptable, and i'll change the way it stores it
0
 

Author Comment

by:crystyan
ID: 17815260
Did you try it on the file I've sent you ? It freezes somewhere :-( it never ends ... :-| Give it a run.

  with TBigFileParser.Create do
  try
    LoadFile('C:\Documents and Settings\Cristian\Desktop\source.txt');
    FindxConsecutiveLetters(2); // we want to find the "2" letter words
//    Results.CustomSort(CustomSortByCount);
    memo1.lines.assign(Results);
    ShowMessage('The word "mm" appears ' + Results.Values['mm'] + ' times');
  finally
    free;
  end;
0
 
LVL 17

Expert Comment

by:TheRealLoki
ID: 17815727
I've been testing on your file the whole time. like i said, it takes a while longer due to the number of results, but does not lock up.
I have increased the speed dramatically by using Binary Tree searching, I'll upload it after I give it a once over
0
 
LVL 17

Accepted Solution

by:
TheRealLoki earned 380 total points
ID: 17815831
here's my faster demo. only takes a couple of seconds including the sort by value
It's using a binary tree and insertion sort routine, which should give your C#-coding buddy a run for his money
:-) could make it a little faster if I had to, but got a job to do here as well


unit Unit1;

interface

uses
  Windows, Messages, SysUtils, Classes, Graphics, Controls, Forms, Dialogs,
  StdCtrls, math;

type TBigFileParser = class(TObject)
private
{$IFDEF DEBUG}  iterations: integer; {$ENDIF}
  nearest: integer;
{ the following 4 functions are used internally when parsing the string.
  they cannot be used after you have done a "CustomSort".
  So do not do a CustomSort until you have finished calling
  FindxLetterWords and FindxConsecutiveLetters}
  function BinTreeSearch(left, middle, right: integer; s: string): integer;
  function BinTreeSearchNearest(left, middle, right: integer; s: string): integer;
  function BinaryIndexOfText(s: string): integer; // can only be used before you do a customsort
  function BinaryInsertOfText(s: string): integer; // can only be used before you do a customsort
public
  Results: TStringList;
  BigText: string;
  Constructor Create;
  Destructor Destroy; override;
  procedure LoadFile(filename: string);
  procedure FindxLetterWords(wordlength: integer);
  procedure FindxConsecutiveLetters(wordlength: integer);
  function CountOf(s: string): integer; overload;
  function CountOf(s: string; canusebinarysearch: boolean): integer; overload;
end;

type
  TForm1 = class(TForm)
    Button1: TButton;
    Memo1: TMemo;
    Memo2: TMemo;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;
  function CustomSortByCount(List: TStringList; Index1, Index2: Integer): Integer;

implementation

{$R *.DFM}

function CustomSortByCount(List: TStringList; Index1, Index2: Integer): Integer;
begin
  if Integer(List.Objects[Index1]) < Integer(List.Objects[Index2]) then
    result := -1
  else if Integer(List.Objects[Index1]) > Integer(List.Objects[Index2]) then
    result := 1
  else
    result := AnsiCompareStr(List[Index1], List[Index2]);
end;

(*
function CustomSortByCount(List: TStringList; Index1, Index2: Integer): Integer;
begin
  if StrToInt(List.Values[List.Names[Index1]]) < StrToInt(List.Values[List.Names[Index2]]) then
    result := -1
  else if StrToInt(List.Values[List.Names[Index1]]) > StrToInt(List.Values[List.Names[Index2]]) then
    result := 1
  else
    result := AnsiCompareStr(List.Names[Index1], List.Names[Index2]);
end;
*)

{ TBigFileParser }

constructor TBigFileParser.Create;
begin
  inherited Create;
  Results := TStringList.Create;
end;

destructor TBigFileParser.Destroy;
begin
  Results.Clear;
  Results.Free;
  inherited Destroy;
end;

procedure TBigFileParser.LoadFile(filename: string);
  var
    FS: TFileStream;
begin
  FS := TFileStream.create(filename, fmOpenRead);
  try
    SetLength(BigText, FS.Size);
    FS.Read(PAnsiChar(BigText)^, FS.Size);
  finally
    FS.Free;
  end;
end;

procedure TBigFileParser.FindxLetterWords(wordlength: integer);
var
  i: integer;
  btlen: integer;
  partialword: string;
  c: char;
begin
//  Results.Clear; // clear our counts first - up to you if you want to do this
  partialword := '';
  BigText := lowercase(BigText);
  btlen := length(BigText);
  i := 1;
  while (i <= btlen) do
  begin
    c := BigText[i];

    if not(c in ['a'..'z']) then
    begin // assume any char not in 'a'..'z' is a delimter (space, etc)
      if length(partialword) = wordlength then
      begin // we have found a 'x' length word
        BinaryInsertOfText(partialword);
        partialword := '';
      end
      else if length(partialword) <> wordlength then
      begin
        partialword := '';
      end;
    end
    else
      partialword := partialword + c; // add this 'a'..z' char to our string
    inc(i);
  end;
  if length(partialword) = wordlength then
  begin // we have found a x length word in the last section
    BinaryInsertOfText(partialword);
  end
end;

procedure TBigFileParser.FindxConsecutiveLetters(wordlength: integer);
var
  i: integer;
  btlen: integer;
  partialword: string;
  c: char;
begin
//  Results.Clear; // clear our counts first - up to you if you want to do this
  partialword := '';
  BigText := lowercase(BigText);
  btlen := length(BigText);
  i := 1;
  while (i <= btlen) do
  begin
    c := BigText[i];

    if not(c in ['a'..'z']) then
    begin // assume any char not in 'a'..'z' is a delimter (space, etc)
      if length(partialword) = wordlength then
      begin // we have found 'x' length letters in a row, add it to our list
        BinaryInsertOfText(partialword);
        // remove the first letter of our "found text" so we can grab the next letter
        Delete(partialword, 1, 1);
      end
      else if length(partialword) <> wordlength then
      begin
        partialword := '';
      end;
    end
    else
    begin // we have not reached the end of the current text (word)
      partialword := partialword + c; // add this 'a'..z' char to our string
      if length(partialword) = wordlength then
      begin // we have found 'x' length letters in a row, add it to our list
        BinaryInsertOfText(partialword);
        // remove the first letter of our "found text" so we can grab the next letter
        Delete(partialword, 1, 1);
      end
    end;
    inc(i);
  end;
  if length(partialword) = wordlength then
  begin // we have found 'x' length letters in a row, add it to our list
    BinaryInsertOfText(partialword);
  end
end;

function TBigFileParser.BinTreeSearch(left, middle, right: integer; s: string): integer;
var
  midvalue: string;
begin
{$IFDEF DEBUG}  inc(iterations);
 Form1.memo1.lines.add(Format('%d: %d %d %d', [iterations, left, middle, right])); {$ENDIF}

  midvalue := Results[middle];
  if s < midvalue then
  begin
    if (left <> middle) and (right <> middle) then
      result := BinTreeSearch(left, left + trunc((middle - left)/2), middle, s)
    else
    begin
      if s = Results[left] then
        result := left
      else if s = results[right] then
        result := right
      else
        result := -1;
    end;
  end
  else if s > midvalue then
  begin
    if (left <> middle) and (right <> middle) then
      result := BinTreeSearch(middle, middle + trunc((right - middle)/2), right, s)
    else
    begin
      if s = Results[left] then
        result := left
      else if s = results[right] then
        result := right
      else
        result := -1;
    end;
  end
  else
    result := middle;
end;

function TBigFileParser.BinTreeSearchNearest(left, middle, right: integer; s: string): integer;
var
  midvalue: string;
begin
{$IFDEF DEBUG}  inc(iterations);
  Form1.memo1.lines.add(Format('%d: %d %d %d', [iterations, left, middle, right])); {$ENDIF}

  midvalue := Results[middle];
  if s < midvalue then
  begin
    if (left <> middle) and (right <> middle) then
      result := BinTreeSearchNearest(left, left + trunc((middle - left)/2), middle, s)
    else
    begin
      if s = Results[left] then
        result := left
      else if s = results[right] then
        result := right
      else
      begin
        result := -1;
        if s < Results[left] then Nearest := max(0, left)
        else if s > Results[right] then Nearest := right + 1
        else Nearest := right;
      end;
    end;
  end
  else if s > midvalue then
  begin
    if (left <> middle) and (right <> middle) then
      result := BinTreeSearchNearest(middle, middle + trunc((right - middle)/2), right, s)
    else
    begin
      if s = Results[left] then
        result := left
      else if s = results[right] then
        result := right
      else
      begin
        result := -1;
        if s < Results[left] then Nearest := max(0, left)
        else if s > Results[right] then Nearest := right + 1
        else Nearest := right;
      end;
    end;
  end
  else
    result := middle;
end;

function TBigFileParser.BinaryIndexOfText(s: string): integer;
begin
{$IFDEF DEBUG}  iterations := 0;{$ENDIF}
  Nearest := -1;
  result := BinTreeSearch(0, Min(trunc(results.Count div 2), pred(results.Count)), pred(results.Count), s);
end;

function TBigFileParser.BinaryInsertOfText(s: string): integer;
begin
{$IFDEF DEBUG}  iterations := 0; {$ENDIF}
  Nearest := -1;
  if Results.Count = 0 then
  begin
    Results.AddObject(s, pointer(1));
    result := 0;
  end
  else
  begin
    result := BinTreeSearchNearest(0, Min(trunc(results.Count div 2), pred(results.Count)), pred(results.Count), s);

    if result = -1 then
    begin
      if nearest = results.count then
        Results.AddObject(s, pointer(1))
      else
        Results.InsertObject(nearest, s, pointer(1));
    end
    else
    begin
      Results.Objects[result] := Pointer( Integer(Results.Objects[result]) +1 );
    end;
  end;
end;

function TBigFileParser.CountOf(s: string): integer;
begin
  result := CountOf(s, false); // assume the user has done a customsearch
end;

function TBigFileParser.CountOf(s: string; canusebinarysearch: boolean): integer;
var
  i: integer;
begin
  result := 0;
  if canusebinarysearch then i := self.BinaryIndexOfText(s)
  else i := Results.indexOf(s);
  if i <> -1 then
    result := integer(Results.Objects[i]);
end;

// *** FORM STUFF

procedure TForm1.Button1Click(Sender: TObject);
var
  i: integer;
begin
  memo1.lines.beginupdate;
  memo2.lines.beginupdate;
  with TBigFileParser.Create do
  try
    memo1.lines.Clear;
    memo2.lines.Clear;
    LoadFile('source.txt');
//    FindxLetterWords(2); // we want to find the "2" letter words
//    FindxLetterWords(4); // we want to find the "4" letter words
    FindxConsecutiveLetters(2); // we want to find strings of "2" letters in a row
    for i := 0 to pred(Results.Count) do
      memo1.lines.add(Results[i] + '=' + IntToStr(integer(Results.Objects[i])));
// now sort the results
    Results.CustomSort(CustomSortByCount);
    for i := 0 to pred(Results.Count) do
      memo2.lines.add(Results[i] + ' =' + IntToStr(integer(Results.Objects[i])));
    ShowMessage('The word "an" appears ' + IntToStr(CountOf('an')) + ' times');
  finally
    free;
    memo1.lines.endupdate;
    memo2.lines.endupdate;
  end;
end;

end.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 17816730
@TheRealLoki

Please pardon my ignorance in asking you these questions, but I'm taking it as a learning opportunity.

1. Wouldn't it be fastest to use a simple Integer array (or two dimension array) where the character values are used as indexes into the array?

2. Do you think the hashing of character values would make a dictionary/associative object (?Hashtable equivalent?) slower than #1?
Example: http://cc.borland.com/Item.aspx?id=15171

3. We're only talking about 676 (=26*26) integer items needed for an array to keep the count of occurances.  Is this too much memory to throw at this? (I don't think so)

0
 
LVL 17

Expert Comment

by:TheRealLoki
ID: 17816852
Since speed was his request, I'm guessing memory would not be a problem (especially for just a 1 meg file), so a hash table would be fine.
For this same reason, I chose the "divide and conquer" approach and the insertion sort, because of the speed benefits associated with both. I know there are classes that already do this, but I felt a touch of nostalgia at trying to rewrite one from memory.

Since TStringlist has an objects() property, it is handling the keeping of "text" and values together, so it was just the quickest implementation for me to use. writing your own would be better, without the extra bloat of TStringlist.

26*26 is only valid for his first request (2), he has since mentioned
"so basicly I want ALL the 2 (or 3) consecutive letters"
so he may eventually want to change to go all the way up to 20

Given time, I'd write a better implementation, but this is an open question, and anyone else is welcome to offer a better solution. I can only spare a few minutes here and there between work, and my answer is by no means the best possible one :-)
0
 
LVL 28

Expert Comment

by:ciuly
ID: 17817487
b-trees :)
level 1 will be the first char
level 2 wil be the second char.
so for the x consecutive letters you will have x levels each with 26 chars (what about unicode :P )

best thing at this approach is that it can store and find very qucly (I would add the fastest, but am not totally sure) all the cases from 1 char to x chars. so let's say if the asker wants all the 2 AND 3 consecutive characters this b-tree will have it. of cours ethe algorithm needs to be changed to find them as well but this is getting a little too generic :)
0
 
LVL 45

Expert Comment

by:aikimark
ID: 17818021
@ciuly

Good call on the b-tree suggest for the general case.

As the unique combinations of letters increases an array solution will become less sparse.  However, an array will be a faster solution, trading off wasted memory locations for speed.

For this particular EE question, I think there will be some practical limit on the length of character sequences we can count.  Longer character sequences may be found in non-English languages, such as German.  In this specific case, I think a practical upper limit would be three or four character sequences (not 20).

What an interesting criterion unicode compliance would place on this solution.  hmmmmm.  "What constitues a single character, versus two characters?",  "Should ae be a single character or two?"
0
 
LVL 28

Expert Comment

by:ciuly
ID: 17818659
actually, when in unicode, all characters are 2 bytes. so basically ae will be 1 char, if it is the 1 unicode char representing ae, or 2 chars, if it's teh 2 unicode chars sequence a and e. the only issue I see in unicode is that you actually have 65536 teoretical characters. one will have to go through all and get the printable characters (non-control, non-punctuation, etc). and that would be the list of single characters that get into account for this issue.
bottom line: too much headache, if you ask me :)
0
 
LVL 45

Assisted Solution

by:aikimark
aikimark earned 50 total points
ID: 17819480
@crystyan

You've certainly got a lot of solutions to try.  If you need to squeeze the utmost performance out of this process, I would recommend:

1. two dimension array defined as follows.
SequenceCounts : Array['a'..'z', 'a'..'z'] of integer;

Notes: The character values actually index the correct array position.  You can quickly iterate through this array using the LOW and HIGH functions as value ranges to gather non-zero counts to sort.  After testing and debugging this code, you can get an additional performance boost by disabling bounds-checking.

2. Since you don't care about the file content as separate lines, the fastest possible way to get the file into memory is through the filestream example posted earlier.  The CrLf characters will separate the words and simplify the code.

3. If you keep track of the min and max count values during the gathering of the non-zero counts, you might be able to programmatically determine whether a Counting Sort was feasible. Suitability depends on the range of values between the lowest and highest numbers.  For relatively small (50-100) distinct count values, Loki's Insert Sort should suffice.  A good compromise would be to store the character pair/count values in a list and create a custom compare for the list's Sort method, which uses a Quicksort algorithm.
0

Featured Post

Highfive Gives IT Their Time Back

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

Join & Write a Comment

Have you ever had your Delphi form/application just hanging while waiting for data to load? This is the article to read if you want to learn some things about adding threads for data loading in the background. First, I'll setup a general applica…
Introduction Raise your hands if you were as upset with FireMonkey as I was when I discovered that there was no TListview.  I use TListView in almost all of my applications I've written, and I was not going to compromise by resorting to TStringGrid…
Access reports are powerful and flexible. Learn how to create a query and then a grouped report using the wizard. Modify the report design after the wizard is done to make it look better. There will be another video to explain how to put the final p…
This video gives you a great overview about bandwidth monitoring with SNMP and WMI with our network monitoring solution PRTG Network Monitor (https://www.paessler.com/prtg). If you're looking for how to monitor bandwidth using netflow or packet s…

706 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

23 Experts available now in Live!

Get 1:1 Help Now