Go Premium for a chance to win a PS4. Enter to Win

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 271
  • Last Modified:

Reading FAST from a HUDGE TextFile

I need a function or a way to read (and process) a BIG Textfile (around 1mega). At the moment I read line after line but it takes a lifetime! I've tried to load the file into a TStringList but when I'm processing all the text it also takes a life time. I have an application made by a friend in C# who processes the same text file in a couple of seconds!
Please advice
0
crystyan
Asked:
crystyan
  • 11
  • 10
  • 8
  • +2
3 Solutions
 
2266180Commented:
well that depends how you are doing the processing. if you are using functions like copy, delete, etc on the string, then it will take a long time because delphi reallocates the string at every call of such function.
see this thread on some tips on performance on string text manipulation (there are a few links to some other threads on the same thing): http://www.experts-exchange.com/Programming/Programming_Languages/Delphi/Q_21980883.html (it's long, I know, but you can skip the code ;)

also, using something like:
function readup(filename:string):string;
begin
  with tfilestream.create(filename,fmopenread) do
  begin
    setlength(result,size);
    read(pansichar(result)^,size);
    free;
  end;
end;
or using teh stringlist.loadfromfile should be fast enough.
0
 
tobjectpascalCommented:
procedure TForm1.Button1Click(Sender: TObject);
Var
 T: Textfile;
 Tmp: String;
 C: Integer;
begin
  System.Assign(T,'c:\sample.txt');
  System.Reset(T);
  While Not Eof(T) Do
   Begin
     Readln(T,Tmp);
     Inc(C);
   End;
  ShowMessage(IntToStr(C)+' Lines Loaded');
  System.Close(T);
end;

1.6 meg Text file in about 1/2 second or faster, seemed almost instant..

0
 
tobjectpascalCommented:
Time Taken to Load 12548928 Lines 0.03 Seconds

procedure TForm1.Button1Click(Sender: TObject);
Var
 T: Textfile;
 Tmp: String;
 TT: Real;
 GT: Longint;
 C: Integer;
begin
  Gt:=GettickCount;
  System.Assign(T,'c:\sample.txt');
  System.Reset(T);
  While Not Eof(T) Do
   Begin
     Readln(T,Tmp);
     Inc(C);
   End;
  TT:=(GetTickCount-Gt) / 1000;
  Edit1.Text:='Time Taken to Load '+IntToStr(C)+' Lines '+FloatToStr(TT)+' Seconds';
  System.Close(T);
end;


How fast do you want it?

Compiled/Written in Delphi 5 CPU 1gig AMD Duron, 512 meg of memory...

So what's cauing the slow load time in your program? must be a string manipluation problem
0
Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

 
tobjectpascalCommented:
My Bad

Time Taken to Load 18820 Lines 0.04 Seconds

C:=0; //even after all these years i still forget to initilize variables sometimes.
0
 
aikimarkCommented:
@crystyan

When you say you are using a TStringList, does that mean you are invoking the LosdFromFile method?  It is VERY FAST.

Since 1MB isn't terribly large, we will need some additional information from you:

1. What kind of hardware are you using? (CPU speed, RAM, laptop/desktop/server, etc.)
2. What is your runtime environment? (OS, competing programs, etc.)
3. What kind of "(and process)" actions are you taking on the lines in this file?
4. What's happening with the rest of your application (forms, controls, mouse/keyboard inputs, etc.) while you are reading and processing this file?
0
 
TheRealLokiSenior DeveloperCommented:
I agree, TStringList should be fast enough for a 1 meg file
Were you perhaps using a TMemo (or similar) to load it?
The visual aspect of components will slow things down a lot
you can wrap it like :-

mystringlist.BeginUpdate;
try
  mystringlist.LoadFromFile('c:\blah.txt');
finally
  mystringlist.Endupdate;
end;
to speed it up
0
 
aikimarkCommented:
also, setting the TStringList.CAPACITY property to an estimate of the number of lines you will be loading will also make a noticable improvement on the performance of the LoadFromFile method.
0
 
aikimarkCommented:
typo correction:
LosdFromFile  should be LoadFromFile
0
 
crystyanAuthor Commented:
hmm ... I think the problem is somewhere else then! I want to count all the 2 letter combination and keep them stored in a TList or if you have a better ideea (a faster one it would be great !)
Please, if you have time to write me a small class wich should have something like : Combination: String; count: Cardinal and it would count all the combination from that BIG text file it would be great.
For example if I would have a text file with 2 words:
word1 word2

then I would have
wo, 2
or, 2
rd, 2
d1, 1
d2, 1

Ignore ALL what is not between ['a' .. 'z']  you need to make them lower first :|

THANK YOU !
0
 
TheRealLokiSenior DeveloperCommented:
This should be simple enough to understand
I have left it open so you can d "any length" words
eg. find 2 letter words, then also find 3 letter words..
I blatantly used ciuly's pAnsiChar loading of the file. Feed him some points if you like it

unit Unit1;

interface

uses
  Windows, Messages, SysUtils, Classes, Graphics, Controls, Forms, Dialogs,
  StdCtrls;

type TBigFileParser = class(TObject)
public
  Results: TStringList;
  BigText: string;
  Constructor Create;
  Destructor Destroy; override;
  procedure LoadFile(filename: string);
  procedure FindxLetterWords(wordlength: integer);
end;

type
  TForm1 = class(TForm)
    Button1: TButton;
    Memo1: TMemo;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.DFM}

{ TBigFileParser }

constructor TBigFileParser.Create;
begin
  inherited Create;
  Results := TStringList.Create;
end;

destructor TBigFileParser.Destroy;
begin
  Results.Clear;
  Results.Free;
  inherited Destroy;
end;

procedure TBigFileParser.LoadFile(filename: string);
  var
    FS: TFileStream;
begin
  FS := TFileStream.create(filename, fmOpenRead);
  try
    SetLength(BigText, FS.Size);
    FS.Read(PAnsiChar(BigText)^, FS.Size);
  finally
    FS.Free;
  end;
end;

procedure TBigFileParser.FindxLetterWords(wordlength: integer);
var
  i: integer;
  btlen: integer;
  partialword: string;
  c: char;
begin
  Results.Clear; // clear our counts first
  partialword := '';
  BigText := lowercase(BigText);
  btlen := length(BigText);
  i := 1;
  while (i <= btlen) do
  begin
    c := BigText[i];

    if not(c in ['a'..'z']) then
    begin // assume any char not in 'a'..'z' is a delimter (space, etc)
      if length(partialword) = wordlength then
      begin // we have found a 'x' length word
        Results.Values[partialword] := IntToStr(StrToIntDef(Results.Values[partialword], 0) + 1);
        partialword := '';
      end
      else if length(partialword) > wordlength then
      begin
        partialword := '';
      end;
    end
    else
      partialword := partialword + c; // add this 'a'..z' char to our string
    inc(i);
  end;
  if length(partialword) = wordlength then
  begin // we have found a x length word in the last section
    Results.Values[partialword] := IntToStr(StrToIntDef(Results.Values[partialword], 0) + 1);
  end
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
  with TBigFileParser.Create do
  try
    LoadFile('test.txt');
    FindxLetterWords(2); // we want to find the "2" letter words
    memo1.lines.assign(Results);
    ShowMessage('The word "my" appears ' + Results.Values['my'] + ' times');
  finally
    free;
  end;
end;

end.
0
 
crystyanAuthor Commented:
AWESOME !!! Thank you SO MUCH ! Another thing and this is all. Could you tell me how could I sort them by the number of apparition ?

THANK YOU !
0
 
TheRealLokiSenior DeveloperCommented:
add these bits of code

var
  Form1: TForm1; // place CustomSortByCount below this part in your existing unit
  function CustomSortByCount(List: TStringList; Index1, Index2: Integer): Integer;
implementation

{$R *.DFM}

function CustomSortByCount(List: TStringList; Index1, Index2: Integer): Integer;
begin
  if List.Values[List.Names[Index1]] < List.Values[List.Names[Index2]] then
    result := -1
  else if List.Values[List.Names[Index1]] < List.Values[List.Names[Index2]] then
    result := 1
  else
    result := AnsiCompareStr(List.Names[Index1], List.Names[Index2]);
end;

then change the button1 click to be ...

procedure TForm1.Button1Click(Sender: TObject);
begin
  with TBigFileParser.Create do
  try
    LoadFile('test.txt');
    FindxLetterWords(2); // we want to find the "2" letter words
    memo1.lines.assign(Results);
// now sort the results
    Results.CustomSort(CustomSortByCount);
    memo2.lines.assign(Results);
    ShowMessage('The word "my" appears ' + Results.Values['my'] + ' times');
  finally
    free;
  end;
end;
0
 
crystyanAuthor Commented:
oh, another thing. I've just tested and it doesn't seem to get the right results ... try on this file www.cristianignat.com/diverse/source.zip
Please try searching for "mm" or "th" you'll see only 1 result and if you would manually open the file in an editor you'll get much more results.
0
 
crystyanAuthor Commented:
and how could I sort the results when I' calling Results.CustomSort(CustomSortByCount); I get [Error] Unit1.pas(350): Incompatible types: 'regular procedure and method pointer'

Sorry for stressing you :-|
0
 
TheRealLokiSenior DeveloperCommented:
oops, slight bug
change
      else if length(partialword) > wordlength then
to
      else if length(partialword) <> wordlength then
0
 
crystyanAuthor Commented:
nope, still not. Run it for "mm" you'll see that the application show you only 1 and it appears more than one time in the file I've uploaded it for you.
0
 
TheRealLokiSenior DeveloperCommented:
Incompatible types: 'regular procedure and method pointer'
THis means you have the function declaration
  function CustomSortByCount(List: TStringList; Index1, Index2: Integer): Integer;
in teh wrong place.
place it just above the word
"implementation"
and put the corrected procedure (below) just below the word
"implementation"

I also was not converting teh count to numbers (so the order was alphanumeric instead of integer)

function CustomSortByCount(List: TStringList; Index1, Index2: Integer): Integer;
begin
  if StrToInt(List.Values[List.Names[Index1]]) < StrToInt(List.Values[List.Names[Index2]]) then
    result := -1
  else if StrToInt(List.Values[List.Names[Index1]]) > StrToInt(List.Values[List.Names[Index2]]) then
    result := 1
  else
    result := AnsiCompareStr(List.Names[Index1], List.Names[Index2]);
end;

and don't forget to change the line
      else if length(partialword) > wordlength then
to
      else if length(partialword) <> wordlength then

that'll teach me not to fully test my code before releasing it :-)
0
 
TheRealLokiSenior DeveloperCommented:
I get 23  "th"
16th, 28th, 30th, 4th, 19th, 20th, 19th, 5th, 19th, 20th, 6th, 7th, 19th, 7th, 19th, 19th, 20th, 7th, 20th, 69th, 70th, 7th, 13th
and 1 "mm"
there is only 1 "mm being "approximately 38 mm a year. "
every other instance of "mm" is part of another word (eg. "programmed", "common")
have I not understood your request correctly?
0
 
crystyanAuthor Commented:
I want EVERY instance of "mm" so if it appears in programming, I want that too. But this is good too that it's searching only for the 2 letter words. could u do that as another function ?

This is the last thing. Promisse!
0
 
crystyanAuthor Commented:
and I still didn't get it about the sort function. :-(
0
 
TheRealLokiSenior DeveloperCommented:
the top of my demo looksl ike this :-

unit Unit1;

interface

uses
  Windows, Messages, SysUtils, Classes, Graphics, Controls, Forms, Dialogs,
  StdCtrls;

type TBigFileParser = class(TObject)
public
  Results: TStringList;
  BigText: string;
  Constructor Create;
  Destructor Destroy; override;
  procedure LoadFile(filename: string);
  procedure FindxLetterWords(wordlength: integer);
end;

type
  TForm1 = class(TForm)
    Button1: TButton;
    Memo1: TMemo;
    Memo2: TMemo;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;
  function CustomSortByCount(List: TStringList; Index1, Index2: Integer): Integer; // <-- NOTE POSITION HERE

implementation

{$R *.DFM}

function CustomSortByCount(List: TStringList; Index1, Index2: Integer): Integer;
begin
  if StrToInt(List.Values[List.Names[Index1]]) < StrToInt(List.Values[List.Names[Index2]]) then
    result := -1
  else if StrToInt(List.Values[List.Names[Index1]]) > StrToInt(List.Values[List.Names[Index2]]) then
    result := 1
  else
    result := AnsiCompareStr(List.Names[Index1], List.Names[Index2]);
end;


you will need to clarify that mm thing a bit more

I assume you want it for every "2 characters" not just mm

so in the following text
"abcdefg hijk"
would you expect to get
"ab" "cd" "ef" "hi" "jk"
or
"ab" "bc" "cd" "de" "ef" "fg" "hi" "ij" "jk"
or something else?
0
 
crystyanAuthor Commented:
I expect ab, bc, cd, de, ef, fg, hi, jp
0
 
aikimarkCommented:
@crystyan

what about "ij"?
0
 
aikimarkCommented:
What should happen for a word like "isn't" or "can't"?
0
 
crystyanAuthor Commented:
"ab" "bc" "cd" "de" "ef" "fg" "hi" "ij" "jk"

missed that.

so basicly I want ALL the 2 (or 3) consecutive letters to be counted but not those separated by space or other character different than 'a'..'z'
in this case I don't want to appear "gh" in the counting.

for isn't it will count for "is", "sn". that's all
0
 
TheRealLokiSenior DeveloperCommented:
add this procedure to the class

procedure TBigFileParser.FindxConsecutiveLetters(wordlength: integer);
var
  i: integer;
  btlen: integer;
  partialword: string;
  c: char;
begin
//  Results.Clear; // clear our counts first - up to you if you want to do this
  partialword := '';
  BigText := lowercase(BigText);
  btlen := length(BigText);
  i := 1;
  while (i <= btlen) do
  begin
    c := BigText[i];

    if not(c in ['a'..'z']) then
    begin // assume any char not in 'a'..'z' is a delimter (space, etc)
      if length(partialword) = wordlength then
      begin // we have found 'x' length letters in a row, add it to our list
        Results.Values[partialword] := IntToStr(StrToIntDef(Results.Values[partialword], 0) + 1);
        // remove the first letter of our "found text" so we can grab the next letter
        Delete(partialword, 1, 1);
      end
      else if length(partialword) <> wordlength then
      begin
        partialword := '';
      end;
    end
    else
    begin // we have not reached the end of the current text (word)
      partialword := partialword + c; // add this 'a'..z' char to our string
      if length(partialword) = wordlength then
      begin // we have found 'x' length letters in a row, add it to our list
        Results.Values[partialword] := IntToStr(StrToIntDef(Results.Values[partialword], 0) + 1);
        // remove the first letter of our "found text" so we can grab the next letter
        Delete(partialword, 1, 1);
      end
    end;
    inc(i);
  end;
  if length(partialword) = wordlength then
  begin // we have found 'x' length letters in a row, add it to our list
    Results.Values[partialword] := IntToStr(StrToIntDef(Results.Values[partialword], 0) + 1);
  end
end;

and call it like

    FindxConsecutiveLetters(2); // we want to find strings of "2" letters in a row

It takes a little longer to run than the first procedure. Let me know if the speed is unacceptable, and i'll change the way it stores it
0
 
crystyanAuthor Commented:
Did you try it on the file I've sent you ? It freezes somewhere :-( it never ends ... :-| Give it a run.

  with TBigFileParser.Create do
  try
    LoadFile('C:\Documents and Settings\Cristian\Desktop\source.txt');
    FindxConsecutiveLetters(2); // we want to find the "2" letter words
//    Results.CustomSort(CustomSortByCount);
    memo1.lines.assign(Results);
    ShowMessage('The word "mm" appears ' + Results.Values['mm'] + ' times');
  finally
    free;
  end;
0
 
TheRealLokiSenior DeveloperCommented:
I've been testing on your file the whole time. like i said, it takes a while longer due to the number of results, but does not lock up.
I have increased the speed dramatically by using Binary Tree searching, I'll upload it after I give it a once over
0
 
TheRealLokiSenior DeveloperCommented:
here's my faster demo. only takes a couple of seconds including the sort by value
It's using a binary tree and insertion sort routine, which should give your C#-coding buddy a run for his money
:-) could make it a little faster if I had to, but got a job to do here as well


unit Unit1;

interface

uses
  Windows, Messages, SysUtils, Classes, Graphics, Controls, Forms, Dialogs,
  StdCtrls, math;

type TBigFileParser = class(TObject)
private
{$IFDEF DEBUG}  iterations: integer; {$ENDIF}
  nearest: integer;
{ the following 4 functions are used internally when parsing the string.
  they cannot be used after you have done a "CustomSort".
  So do not do a CustomSort until you have finished calling
  FindxLetterWords and FindxConsecutiveLetters}
  function BinTreeSearch(left, middle, right: integer; s: string): integer;
  function BinTreeSearchNearest(left, middle, right: integer; s: string): integer;
  function BinaryIndexOfText(s: string): integer; // can only be used before you do a customsort
  function BinaryInsertOfText(s: string): integer; // can only be used before you do a customsort
public
  Results: TStringList;
  BigText: string;
  Constructor Create;
  Destructor Destroy; override;
  procedure LoadFile(filename: string);
  procedure FindxLetterWords(wordlength: integer);
  procedure FindxConsecutiveLetters(wordlength: integer);
  function CountOf(s: string): integer; overload;
  function CountOf(s: string; canusebinarysearch: boolean): integer; overload;
end;

type
  TForm1 = class(TForm)
    Button1: TButton;
    Memo1: TMemo;
    Memo2: TMemo;
    procedure Button1Click(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  Form1: TForm1;
  function CustomSortByCount(List: TStringList; Index1, Index2: Integer): Integer;

implementation

{$R *.DFM}

function CustomSortByCount(List: TStringList; Index1, Index2: Integer): Integer;
begin
  if Integer(List.Objects[Index1]) < Integer(List.Objects[Index2]) then
    result := -1
  else if Integer(List.Objects[Index1]) > Integer(List.Objects[Index2]) then
    result := 1
  else
    result := AnsiCompareStr(List[Index1], List[Index2]);
end;

(*
function CustomSortByCount(List: TStringList; Index1, Index2: Integer): Integer;
begin
  if StrToInt(List.Values[List.Names[Index1]]) < StrToInt(List.Values[List.Names[Index2]]) then
    result := -1
  else if StrToInt(List.Values[List.Names[Index1]]) > StrToInt(List.Values[List.Names[Index2]]) then
    result := 1
  else
    result := AnsiCompareStr(List.Names[Index1], List.Names[Index2]);
end;
*)

{ TBigFileParser }

constructor TBigFileParser.Create;
begin
  inherited Create;
  Results := TStringList.Create;
end;

destructor TBigFileParser.Destroy;
begin
  Results.Clear;
  Results.Free;
  inherited Destroy;
end;

procedure TBigFileParser.LoadFile(filename: string);
  var
    FS: TFileStream;
begin
  FS := TFileStream.create(filename, fmOpenRead);
  try
    SetLength(BigText, FS.Size);
    FS.Read(PAnsiChar(BigText)^, FS.Size);
  finally
    FS.Free;
  end;
end;

procedure TBigFileParser.FindxLetterWords(wordlength: integer);
var
  i: integer;
  btlen: integer;
  partialword: string;
  c: char;
begin
//  Results.Clear; // clear our counts first - up to you if you want to do this
  partialword := '';
  BigText := lowercase(BigText);
  btlen := length(BigText);
  i := 1;
  while (i <= btlen) do
  begin
    c := BigText[i];

    if not(c in ['a'..'z']) then
    begin // assume any char not in 'a'..'z' is a delimter (space, etc)
      if length(partialword) = wordlength then
      begin // we have found a 'x' length word
        BinaryInsertOfText(partialword);
        partialword := '';
      end
      else if length(partialword) <> wordlength then
      begin
        partialword := '';
      end;
    end
    else
      partialword := partialword + c; // add this 'a'..z' char to our string
    inc(i);
  end;
  if length(partialword) = wordlength then
  begin // we have found a x length word in the last section
    BinaryInsertOfText(partialword);
  end
end;

procedure TBigFileParser.FindxConsecutiveLetters(wordlength: integer);
var
  i: integer;
  btlen: integer;
  partialword: string;
  c: char;
begin
//  Results.Clear; // clear our counts first - up to you if you want to do this
  partialword := '';
  BigText := lowercase(BigText);
  btlen := length(BigText);
  i := 1;
  while (i <= btlen) do
  begin
    c := BigText[i];

    if not(c in ['a'..'z']) then
    begin // assume any char not in 'a'..'z' is a delimter (space, etc)
      if length(partialword) = wordlength then
      begin // we have found 'x' length letters in a row, add it to our list
        BinaryInsertOfText(partialword);
        // remove the first letter of our "found text" so we can grab the next letter
        Delete(partialword, 1, 1);
      end
      else if length(partialword) <> wordlength then
      begin
        partialword := '';
      end;
    end
    else
    begin // we have not reached the end of the current text (word)
      partialword := partialword + c; // add this 'a'..z' char to our string
      if length(partialword) = wordlength then
      begin // we have found 'x' length letters in a row, add it to our list
        BinaryInsertOfText(partialword);
        // remove the first letter of our "found text" so we can grab the next letter
        Delete(partialword, 1, 1);
      end
    end;
    inc(i);
  end;
  if length(partialword) = wordlength then
  begin // we have found 'x' length letters in a row, add it to our list
    BinaryInsertOfText(partialword);
  end
end;

function TBigFileParser.BinTreeSearch(left, middle, right: integer; s: string): integer;
var
  midvalue: string;
begin
{$IFDEF DEBUG}  inc(iterations);
 Form1.memo1.lines.add(Format('%d: %d %d %d', [iterations, left, middle, right])); {$ENDIF}

  midvalue := Results[middle];
  if s < midvalue then
  begin
    if (left <> middle) and (right <> middle) then
      result := BinTreeSearch(left, left + trunc((middle - left)/2), middle, s)
    else
    begin
      if s = Results[left] then
        result := left
      else if s = results[right] then
        result := right
      else
        result := -1;
    end;
  end
  else if s > midvalue then
  begin
    if (left <> middle) and (right <> middle) then
      result := BinTreeSearch(middle, middle + trunc((right - middle)/2), right, s)
    else
    begin
      if s = Results[left] then
        result := left
      else if s = results[right] then
        result := right
      else
        result := -1;
    end;
  end
  else
    result := middle;
end;

function TBigFileParser.BinTreeSearchNearest(left, middle, right: integer; s: string): integer;
var
  midvalue: string;
begin
{$IFDEF DEBUG}  inc(iterations);
  Form1.memo1.lines.add(Format('%d: %d %d %d', [iterations, left, middle, right])); {$ENDIF}

  midvalue := Results[middle];
  if s < midvalue then
  begin
    if (left <> middle) and (right <> middle) then
      result := BinTreeSearchNearest(left, left + trunc((middle - left)/2), middle, s)
    else
    begin
      if s = Results[left] then
        result := left
      else if s = results[right] then
        result := right
      else
      begin
        result := -1;
        if s < Results[left] then Nearest := max(0, left)
        else if s > Results[right] then Nearest := right + 1
        else Nearest := right;
      end;
    end;
  end
  else if s > midvalue then
  begin
    if (left <> middle) and (right <> middle) then
      result := BinTreeSearchNearest(middle, middle + trunc((right - middle)/2), right, s)
    else
    begin
      if s = Results[left] then
        result := left
      else if s = results[right] then
        result := right
      else
      begin
        result := -1;
        if s < Results[left] then Nearest := max(0, left)
        else if s > Results[right] then Nearest := right + 1
        else Nearest := right;
      end;
    end;
  end
  else
    result := middle;
end;

function TBigFileParser.BinaryIndexOfText(s: string): integer;
begin
{$IFDEF DEBUG}  iterations := 0;{$ENDIF}
  Nearest := -1;
  result := BinTreeSearch(0, Min(trunc(results.Count div 2), pred(results.Count)), pred(results.Count), s);
end;

function TBigFileParser.BinaryInsertOfText(s: string): integer;
begin
{$IFDEF DEBUG}  iterations := 0; {$ENDIF}
  Nearest := -1;
  if Results.Count = 0 then
  begin
    Results.AddObject(s, pointer(1));
    result := 0;
  end
  else
  begin
    result := BinTreeSearchNearest(0, Min(trunc(results.Count div 2), pred(results.Count)), pred(results.Count), s);

    if result = -1 then
    begin
      if nearest = results.count then
        Results.AddObject(s, pointer(1))
      else
        Results.InsertObject(nearest, s, pointer(1));
    end
    else
    begin
      Results.Objects[result] := Pointer( Integer(Results.Objects[result]) +1 );
    end;
  end;
end;

function TBigFileParser.CountOf(s: string): integer;
begin
  result := CountOf(s, false); // assume the user has done a customsearch
end;

function TBigFileParser.CountOf(s: string; canusebinarysearch: boolean): integer;
var
  i: integer;
begin
  result := 0;
  if canusebinarysearch then i := self.BinaryIndexOfText(s)
  else i := Results.indexOf(s);
  if i <> -1 then
    result := integer(Results.Objects[i]);
end;

// *** FORM STUFF

procedure TForm1.Button1Click(Sender: TObject);
var
  i: integer;
begin
  memo1.lines.beginupdate;
  memo2.lines.beginupdate;
  with TBigFileParser.Create do
  try
    memo1.lines.Clear;
    memo2.lines.Clear;
    LoadFile('source.txt');
//    FindxLetterWords(2); // we want to find the "2" letter words
//    FindxLetterWords(4); // we want to find the "4" letter words
    FindxConsecutiveLetters(2); // we want to find strings of "2" letters in a row
    for i := 0 to pred(Results.Count) do
      memo1.lines.add(Results[i] + '=' + IntToStr(integer(Results.Objects[i])));
// now sort the results
    Results.CustomSort(CustomSortByCount);
    for i := 0 to pred(Results.Count) do
      memo2.lines.add(Results[i] + ' =' + IntToStr(integer(Results.Objects[i])));
    ShowMessage('The word "an" appears ' + IntToStr(CountOf('an')) + ' times');
  finally
    free;
    memo1.lines.endupdate;
    memo2.lines.endupdate;
  end;
end;

end.
0
 
aikimarkCommented:
@TheRealLoki

Please pardon my ignorance in asking you these questions, but I'm taking it as a learning opportunity.

1. Wouldn't it be fastest to use a simple Integer array (or two dimension array) where the character values are used as indexes into the array?

2. Do you think the hashing of character values would make a dictionary/associative object (?Hashtable equivalent?) slower than #1?
Example: http://cc.borland.com/Item.aspx?id=15171

3. We're only talking about 676 (=26*26) integer items needed for an array to keep the count of occurances.  Is this too much memory to throw at this? (I don't think so)

0
 
TheRealLokiSenior DeveloperCommented:
Since speed was his request, I'm guessing memory would not be a problem (especially for just a 1 meg file), so a hash table would be fine.
For this same reason, I chose the "divide and conquer" approach and the insertion sort, because of the speed benefits associated with both. I know there are classes that already do this, but I felt a touch of nostalgia at trying to rewrite one from memory.

Since TStringlist has an objects() property, it is handling the keeping of "text" and values together, so it was just the quickest implementation for me to use. writing your own would be better, without the extra bloat of TStringlist.

26*26 is only valid for his first request (2), he has since mentioned
"so basicly I want ALL the 2 (or 3) consecutive letters"
so he may eventually want to change to go all the way up to 20

Given time, I'd write a better implementation, but this is an open question, and anyone else is welcome to offer a better solution. I can only spare a few minutes here and there between work, and my answer is by no means the best possible one :-)
0
 
2266180Commented:
b-trees :)
level 1 will be the first char
level 2 wil be the second char.
so for the x consecutive letters you will have x levels each with 26 chars (what about unicode :P )

best thing at this approach is that it can store and find very qucly (I would add the fastest, but am not totally sure) all the cases from 1 char to x chars. so let's say if the asker wants all the 2 AND 3 consecutive characters this b-tree will have it. of cours ethe algorithm needs to be changed to find them as well but this is getting a little too generic :)
0
 
aikimarkCommented:
@ciuly

Good call on the b-tree suggest for the general case.

As the unique combinations of letters increases an array solution will become less sparse.  However, an array will be a faster solution, trading off wasted memory locations for speed.

For this particular EE question, I think there will be some practical limit on the length of character sequences we can count.  Longer character sequences may be found in non-English languages, such as German.  In this specific case, I think a practical upper limit would be three or four character sequences (not 20).

What an interesting criterion unicode compliance would place on this solution.  hmmmmm.  "What constitues a single character, versus two characters?",  "Should ae be a single character or two?"
0
 
2266180Commented:
actually, when in unicode, all characters are 2 bytes. so basically ae will be 1 char, if it is the 1 unicode char representing ae, or 2 chars, if it's teh 2 unicode chars sequence a and e. the only issue I see in unicode is that you actually have 65536 teoretical characters. one will have to go through all and get the printable characters (non-control, non-punctuation, etc). and that would be the list of single characters that get into account for this issue.
bottom line: too much headache, if you ask me :)
0
 
aikimarkCommented:
@crystyan

You've certainly got a lot of solutions to try.  If you need to squeeze the utmost performance out of this process, I would recommend:

1. two dimension array defined as follows.
SequenceCounts : Array['a'..'z', 'a'..'z'] of integer;

Notes: The character values actually index the correct array position.  You can quickly iterate through this array using the LOW and HIGH functions as value ranges to gather non-zero counts to sort.  After testing and debugging this code, you can get an additional performance boost by disabling bounds-checking.

2. Since you don't care about the file content as separate lines, the fastest possible way to get the file into memory is through the filestream example posted earlier.  The CrLf characters will separate the words and simplify the code.

3. If you keep track of the min and max count values during the gathering of the non-zero counts, you might be able to programmatically determine whether a Counting Sort was feasible. Suitability depends on the range of values between the lowest and highest numbers.  For relatively small (50-100) distinct count values, Loki's Insert Sort should suffice.  A good compromise would be to store the character pair/count values in a list and create a custom compare for the list's Sort method, which uses a Quicksort algorithm.
0

Featured Post

Ask an Anonymous Question!

Don't feel intimidated by what you don't know. Ask your question anonymously. It's easy! Learn more and upgrade.

  • 11
  • 10
  • 8
  • +2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now