tommykjensen asked:

Fast search

I am writing a log analysis program. The input file will usually contain 500,000+ lines that need to be analysed.

I have everything working. It's just not fast enough in certain areas.

Here is an extract from the log
************************************************************************************
 29264 03/17/2003 07:25:07 User skjba has logged in and is using a Floating license
 29264 03/17/2003 14:08:17 DBQUERY^B^msgclass(P4)^3*^0.000000^F^2^0.001000^"class.name="system startup/shutdown""^{"type", "class.name"}^0.000000^0.000000 ( [ 0] message.pro select.msg.class )
************************************************************************************

The first column is a process id. I want to save this id together with the user associated with it as indicated in the first line.

Currently, for each line I read from the file I check whether it is a line that indicates a user logon. If it is, I store the id and user in an array of records. Then, as I traverse the log, I look up the process id in the user array in order to match lines like the second one with a user.

This is taking a long time and the time increases for each new user that is found.

I also use this method to count occurrences of other events in the log. I use two nested for-loops, and this is very inefficient.

Currently the code looks like this:

Search for the user:

function FindUser(pid2: string) : string;
VAR
 n,fundet : integer;
begin
  fundet:=-1;
  For n:=0 to UserCount do with Users[n] do begin
    if trim(pid2)=trim(Pid) then begin
      fundet:=n;
      break;
    end;
  end;
  if fundet>-1 then
    FindUser:=Users[fundet].User
  else
    FindUser:='UNKNOWN';
end;


Code for matching other events:

 For n:=0 to TransCount-1 do with DBquery[n] do begin
  fundet:=-1;
  pointer:=n;
    For m:=0 to SortTrans-1 do with DBquerystats[m] do begin
      if (DBquery[n].query=DBquerystats[m].query) and (DBquery[n].keytype=DBquerystats[m].Keytype) and (DBquery[n].key=DBquerystats[m].Key) then begin
       fundet:=m;
       break;
      end;
    end;
  if fundet>-1 then
   DBqueryStats[fundet].Antal:=DBqueryStats[fundet].Antal+1
   
  else begin
   DBqueryStats[SortTrans].Antal:=1;
   DBqueryStats[SortTrans].PID:=DBquery[n].PID;
   DBqueryStats[SortTrans].Query:=DBquery[n].query;
   DBqueryStats[SortTrans].Filename:=DBquery[n].filename;
   DBqueryStats[SortTrans].Keytype:=DBquery[n].keytype;
   DBqueryStats[SortTrans].Key:=DBquery[n].key;
   DBqueryStats[SortTrans].Seconds:=DBquery[n].seconds;
   SortTrans:=SortTrans+1;
  end;
 end;



How can I improve this?
emadat:

When it comes to searching, nothing beats assembly language.

Look at the article "Optimizing String Searches in Delphi" at
http://www.delphidevelopernewsletter.com/dd/DDMag.nsf/WebIndexByIssue/75EDEE6470ADB1BF852568F2005E2269?opendocument 

The author used assembly language to implement some of the fastest searching algorithms I've heard of. He also provides a downloadable library that contains his code.

What you can do with your log file is load the whole thing into a single string (Delphi allows strings up to 2 GB in size) and use the search routine found in the library to scan the whole file, now cached in that string, for you.
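
For illustration, a minimal sketch of the load-into-one-string idea, using plain PosEx from StrUtils (available in newer Delphi versions) as a stand-in for the faster library routine; the file name and the search pattern are only placeholders:

uses
  SysUtils, Classes, StrUtils;

{ Read the whole log file into one long string. }
function LoadFileAsString(const FileName: string): string;
var
  fs: TFileStream;
begin
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Result, fs.Size);
    if fs.Size > 0 then
      fs.ReadBuffer(Result[1], fs.Size);
  finally
    fs.Free;
  end;
end;

procedure CountLogins;
var
  buf: string;
  p, count: Integer;
begin
  buf := LoadFileAsString('trace.log');       { placeholder file name }
  count := 0;
  p := PosEx('has logged in', buf, 1);
  while p > 0 do
  begin
    Inc(count);
    p := PosEx('has logged in', buf, p + 1);  { continue after the last hit }
  end;
  WriteLn(count, ' logins found');
end;
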
billious:

My suggestions will be with a mind to retaining your current methods.
No doubt you can clean them up later, if required.

1. Declare a new item iPid in Users and DBquery (and possibly DBqueryStats)
   as integer
2. On adding a new User, iPid := strtoint(trim(Pid));
   (a small sketch of steps 1 and 2 follows the explanation after step 3)
3. Revised Finduser :

function FindUser(pid2: string) : string;
VAR
  n,fundet : integer;
begin
 n := strtoint(trim(pid2));
 fundet:=Usercount;
 while fundet>=0 do with Users[fundet] do
   if n=iPid then
     break
   else
     dec(fundet);
 if fundet>-1 then
   FindUser:=Users[fundet].User
 else
   FindUser:='UNKNOWN';
end;

 This does an INTEGER comparison, rather than a string, and does the TRIM
 once, rather than TWICE for each entry until found.
 Furthermore, it looks from the END of the array (where more recent sessions
 are recorded) to the beginning (where possibly-defunct processes are
 recorded) and thus detects more recent (and thus possibly more active)
 process IDs first.
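
A minimal sketch of steps 1 and 2, assuming a Users record roughly like the one implied by the original FindUser; the record layout, MaxUsers bound and AddUser routine are guesses, not the asker's actual declarations:

uses
  SysUtils;                    { StrToInt, Trim }

const
  MaxUsers = 10000;            { placeholder upper bound }

type
  TUserRec = record
    Pid  : string;             { process id exactly as it appears in the log }
    iPid : Integer;            { numeric copy, converted once on insertion }
    User : string;
  end;

var
  Users     : array[0..MaxUsers] of TUserRec;
  UserCount : Integer;         { treated as a count here; the original loop
                                 'For n:=0 to UserCount' may use it as the
                                 last index instead }

procedure AddUser(const APid, AUser: string);
begin
  Users[UserCount].Pid  := APid;
  Users[UserCount].iPid := StrToInt(Trim(APid));   { trim/convert once, here }
  Users[UserCount].User := AUser;
  Inc(UserCount);
end;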

4. Similarly,
For n:=0 to pred(TransCount) do with DBquery[n] do
begin
 fundet := pred(SortTrans);
 while fundet >= 0 do
 with DBquerystats[fundet] do
   if (DBquery[n].query=DBquerystats[fundet].query) and
      (DBquery[n].keytype=DBquerystats[fundet].Keytype) and
      (DBquery[n].key=DBquerystats[fundet].Key) then
     break
   else
     dec(fundet);
 if fundet>-1 then
  DBqueryStats[fundet].Antal:=inc(DBqueryStats[fundet].Antal)
 else
 begin
  DBqueryStats[SortTrans].Antal:=1;
  DBqueryStats[SortTrans].PID:=DBquery[n].PID;
{  DBqueryStats[SortTrans].iPID:=DBquery[n].iPID;}  // if you need it
  DBqueryStats[SortTrans].Query:=DBquery[n].query;
  DBqueryStats[SortTrans].Filename:=DBquery[n].filename;
  DBqueryStats[SortTrans].Keytype:=DBquery[n].keytype;
  DBqueryStats[SortTrans].Key:=DBquery[n].key;
  DBqueryStats[SortTrans].Seconds:=DBquery[n].seconds;
  SortTrans:=inc(SortTrans);
 end;
end;

5. examine
   if (DBquery[n].query=DBquerystats[fundet].query) and
      (DBquery[n].keytype=DBquerystats[fundet].Keytype) and
      (DBquery[n].key=DBquerystats[fundet].Key) then

   and put the three comparisons in a sequence such that the most likely
   to be UNEQUAL is first. Note that the 'if a=b and c=d and e=f'
   structure is the same as
    'if a=b then if c=d then if e=f'
   so that if the final comparison produces most fails, then the first
   two still have to be executed every time around. Changing the sequence
   to 'if e=f and a=b and c=d' will thus perform only ONE comparison for
   these majority-fails, rather than 3.
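
Point 5 relies on Delphi's short-circuit Boolean evaluation, which is the default ({$B-}); with complete Boolean evaluation switched on ({$B+}), reordering the terms gains nothing. A tiny, contrived illustration:

{$B-}  { short-circuit ("lazy") Boolean evaluation - the Delphi default }

function RarelyTrue(const s: string): Boolean;
begin
  Result := (s = 'seldom');
end;

function UsuallyTrue(const s: string): Boolean;
begin
  Result := (Length(s) > 0);
end;

procedure Demo(const a, b: string);
begin
  { With the most-often-false test first, the second function is only
    called when the first one has not already decided the outcome. }
  if RarelyTrue(a) and UsuallyTrue(b) then
    WriteLn('both matched');
end;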

HTH

...Bill

tommykjensen (Asker):

Thank You for the suggestions.

As for suggestion 4:

This actually made the routine significantly slower?


As for suggestion 5:

The first comparison is the one most likely to fail, so that doesn't need to change.

But which is faster:

one if statement with all three expressions,

or three nested ifs?
Regarding 4:

The text that is compared looks like the following 3 examples:


name="datadict"

table.name="schedule" and column.name="schedule.id"

problem.status="Waiting for supplier" or problem.status="Waiting for customer" or problem.status="Wait" or problem.status="Recovered" or problem.status="Reassigned" or problem.status="Open" or problem.status="Closed" or problem.status="Assigned" or problem.status="Accept") and (priority.code="4" or priority.code="3" or priority.code="2" or priority.code="1") and open.time>='02/01/03 00:00:00' and open.time<'03/01/03 00:00:01' and company="zzzzz"
tommykjensen:

I'm a little surprised that (4) failed, and conclude that patented theory
#1 (later transactions refer to later process ids) is probably suspect.

(I'm even more surprised that SortTrans:=inc(SortTrans); worked, since
it's syntactically invalid, so I suppose you're interpreting... In older
versions of Turbo-Pascal, 'inc' was considerably faster than 'x:=x+1', but
it looks as though the optimiser has taken care of that one...old habits
die hard!  Hmmm...seems 'for loop:= low to high' also now detects
high<low ==> no action, so maybe I'll start using that, too!)

Revised (4) [which should be familiar]

For n:=0 to pred(TransCount) do with DBquery[n] do
begin
 fundet := 0;
 while fundet < SortTrans do
 begin
   with DBquerystats[fundet] do
   if (DBquery[n].query=query) and
      (DBquery[n].keytype=Keytype) and
      (DBquery[n].key=Key) then
     break;
   inc(fundet);
 end;
 if fundet < SortTrans then
  inc(DBqueryStats[fundet].Antal)
 else
 begin
  DBqueryStats[SortTrans].Antal:=1;
  DBqueryStats[SortTrans].PID:=DBquery[n].PID;
{  DBqueryStats[SortTrans].iPID:=DBquery[n].iPID;}  // if you need it
  DBqueryStats[SortTrans].Query:=DBquery[n].query;
  DBqueryStats[SortTrans].Filename:=DBquery[n].filename;
  DBqueryStats[SortTrans].Keytype:=DBquery[n].keytype;
  DBqueryStats[SortTrans].Key:=DBquery[n].key;
  DBqueryStats[SortTrans].Seconds:=DBquery[n].seconds;
  inc(SortTrans);
 end;
end;

since this is fundamentally a very small change to your original, I'd
expect only a marginal improvement...but an improvement!

----------------

I presume that your code
   if (DBquery[n].query=query) and
      (DBquery[n].keytype=Keytype) and
      (DBquery[n].key=Key) then

is actually
   if (DBquery[n].name=name) and
      (DBquery[n].table.name=table.name and
          DBquery[n].column.name=column.name) and
      (DBquery[n].key=Key) then //which I'm not going to rewrite!!

where "datadict", "schedule", "schedule.id"

are typical values for name, table.name and column.name??

I'd suggest a series of steps, and expect an improvement with each.

4a.
  for each of name, table.name, column.name, problem.status and
  company, set up a table of values encountered, and when the
  DBquery is established, record the index found to each (in the
  same way as iPid was set up, but by scanning the table) in
  DBquery. Similarly, add iName, itable_name, iColumn_name, etc.
  to DBquerystats, and transfer in your establishment of the
  DBquerystats record.

  In your comparison, compare DBquery.iname,DBquerystats.iname;
  DBquery.itable_name,DBquerystats.itable_name etc. This reduces
  the number of string-compares and should thus improve speed.

  Since priority.code and open.time appear to be char and tdatetime,
  using this technique on those fields would probably hurt rather than
  help performance. (A sketch of the string-to-index mapping follows.)
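
One way the value-to-index mapping of 4a could be done, sketched with a sorted TStringList so the lookup is a binary search; the function name and the id-in-Objects trick are just one possible arrangement, not anything from the code above:

uses
  Classes, SysUtils;

var
  TableNames: TStringList;   { create once with Sorted := True }

{ Map a field value to a small, stable integer id.  The id is kept in
  Objects[] (an old 32-bit Delphi idiom) rather than taken from the list
  position, because positions shift as new strings are inserted in order. }
function NameToId(List: TStringList; const AName: string): Integer;
var
  i: Integer;
begin
  if List.Find(AName, i) then
    Result := Integer(List.Objects[i])        { id assigned on first sight }
  else
  begin
    Result := List.Count;                     { next free id, never reused }
    List.AddObject(AName, TObject(Result));
  end;
end;

Something like DBquery[n].itable_name := NameToId(TableNames, ...) would then be done once per record, and the inner comparison becomes a plain integer compare.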

4b.
  IF - AND - OR theory:

  If you have an expression
    if a and b and c and d and e  //the 'outer' expression

  where a,b,c,d,e are each Boolean expressions, then Delphi will evaluate
  each in turn until it finds one is false (and therefore the whole
  'outer' expression is false) and it will have to evaluate ALL of the
  expressions to determine that the 'outer' expression is true.

  It follows then that if e is true only 10% of the time, but a..d are
  true 90% of the time, then

    if a and b and c and d and e  //the 'outer' expression

of the total,
  10% will evaluate just a
  9%  will evaluate a and b
  8.1% will evaluate a,b,c
  7.29% will evaluate a,b,c,d
  65.61% will evaluate a,b,c,d,e  (100-10-9-8.1-7.29)
  and the result will be 6.561% True (10% of a,b,c,d,e)

  on the other hand,

    if e and a and b and c and d  //revised 'outer' expression
       ^--------<<<-------------

of the total,
  90% will evaluate just e
  1%  will evaluate e and a
  0.9% will evaluate e,a,b
  0.81% will evaluate e,a,b,c
  7.29% will evaluate e,a,b,c,d  (100-90-1-0.9-0.81)
  and the result will be 6.561% true (90% of 7.29)

The OR situation is precisely the opposite :-
  If you have an expression
    if a or b or c or d or e  //the 'outer' expression

  where a,b,c,d,e are each Boolean expressions, then Delphi will evaluate
  each in turn until it finds one is TRUE (and therefore the whole
  'outer' expression is TRUE) and it will have to evaluate ALL of the
  expressions to determine that the 'outer' expression is FALSE.

  So, for obvious reasons, the term most likely to be TRUE should be
  placed first.



SO, in your case, which I conclude would be expressed as:

if
(name="datadict")  //first expression

AND

(table.name="schedule" and column.name="schedule.id")  //second expr

AND  //third expr follows

(
problem.status="Waiting for supplier" or
problem.status="Waiting for customer" or
problem.status="Wait" or
problem.status="Recovered" or
problem.status="Reassigned" or
problem.status="Open" or
problem.status="Closed" or
problem.status="Assigned" or
problem.status="Accept")

and

(priority.code="4" or
priority.code="3" or
priority.code="2" or
priority.code="1")

and

open.time>='02/01/03 00:00:00'

and

open.time<'03/01/03 00:00:01'

and

company="zzzzz"
)


I'd suggest that you'd see an improvement even WITHOUT implementing 4a by
changing the order to the following (I believe that the improvement from
4a would probably mask that from 4b):

if open.time >= .....
and
   open.time < .....
and
   company...
and
   name....
and
   table.name.....
and
   column.name.....
and
   (priority.code... or priority.code...)
and
   (problem.status ... or problem.status...)


where
a) the order company, name, table.name, column.name
   is placed in INCREASING order of probability of a match of the
   expression (ie. most-probable last)
b) the order of sub-expressions in priority.code and problem.status
   should be placed in DECREASING order of probability of a match of the
   expression (ie. most-probable first)
   

HOWEVER
-------

Since your code appears to be detecting certain criteria about a log
record and then counting the NUMBER of records that match those criteria,
the use of an 'or' expression would indicate that you are interested in
(records with priority code "1".."4") versus (records with other priority
codes). So if the first record encountered in the log had priority code
"3", and the next otherwise-identical record had priority code "2", your
program would count 2 records, and all that could be said about those 2
records is that they had priority codes "1".."4". An identical argument
could be made about problem.status.

I don't know whether this is intentional, but if it is (as I suspect) then
I would suggest that you set up yet another value (or values) in DBquery and
DBquerystats, e.g. bpriority_code (and bproblem_status), as Booleans indicating
whether the record is or is not in the priority.code range / problem.status
set of interest.

It might even be that you could eliminate entries that are of no interest
from DBquery at that point. Naturally, if you are reporting both in-priority-
code-set and not-in-priority-code-set, then the complex OR expression in the
long comparison-for-adding-to-DBquerystats simply becomes a nice Boolean
(eg. bpriority_code), since the expression was evaluated at entry to DBquery.
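
A rough sketch of that kind of pre-computation, done once while DBquery is being filled; the PriorityCodeOf helper and the bpriority_code field are invented for illustration, and the real way priority.code is pulled out of the log line will differ:

{ Invented helper: the single character following priority.code=" in the
  raw query text, or #0 if the query does not mention priority.code. }
function PriorityCodeOf(const QueryText: string): Char;
var
  p: Integer;
begin
  Result := #0;
  p := Pos('priority.code="', QueryText);
  if (p > 0) and (p + Length('priority.code="') <= Length(QueryText)) then
    Result := QueryText[p + Length('priority.code="')];
end;

{ once, while DBquery[n] is being built }
DBquery[n].bpriority_code :=
  PriorityCodeOf(DBquery[n].query) in ['1'..'4'];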

HTH

...Bill
First, thank You for a very thorough answer; it's clear You have spent quite some time helping me.

Actually this did not work

SortTrans:=inc(SortTrans);

I misread Your suggestion so I used the

SortTrans:=SortTrans+1

I just changed it to the correct syntax but the speed did not change.

I will try the new suggestion.

As for the query: it should not be interpreted, as it is a query done in an application and I want to count the number of exact (identical) queries.

The DBquery array only contains the log lines I am interested in, and the query field should just be treated as a string of characters in which I want to count exact matches.


As a side note, the purpose of my program is to show the queries that users of a helpdesk application issue against the database. I need this to set up correct keys. So the log lines are the result of a trace on that helpdesk application.
 
tommykjensen:

AHA!

The object I have in mind is to reduce the number of string-comparisons
performed.

Hence,

type
  tquerystats = record
  ipid                  : integer;
  iproblem_status       : integer;
  cpriority_code        : char;
  end;

var
  aquerystats           : array of tquerystats;
  bcalc_stats           : Boolean;
  listofproblemstatuses : tstringlist;

...
procedure buildreportquery;
...
bcalc_stats := (aquerystats = nil);  { true only on the first call for this DBquery load }
if bcalc_stats then
begin
  setlength(aquerystats,transcount); // allocate dynamic-array
  //also clear any required stringlists here
  listofproblemstatuses.clear;
end;

For n:=0 to pred(TransCount) do with DBquery[n] do
begin
  if bcalc_stats then
  begin
    {code to set aquerystats[n].ipid, .iproblem_status and .cpriority_code}
  end;

  fundet := 0;
  while fundet < SortTrans do
  begin
    with DBquerystats[fundet] do
    if ({match as required}) then
    begin
      inc(DBqueryStats[fundet].Antal);
      break;
    end;
    inc(fundet);
  end;

  if fundet=sorttrans then
  begin
    DBqueryStats[SortTrans].Antal:=1;
    DBqueryStats[SortTrans].dbq := n;

(* rest not required

    DBqueryStats[SortTrans].PID:=DBquery[n].PID;
{  DBqueryStats[SortTrans].iPID:=DBquery[n].iPID;}  // if you need it
    DBqueryStats[SortTrans].Query:=DBquery[n].query;
    DBqueryStats[SortTrans].Filename:=DBquery[n].filename;
    DBqueryStats[SortTrans].Keytype:=DBquery[n].keytype;
    DBqueryStats[SortTrans].Key:=DBquery[n].key;
    DBqueryStats[SortTrans].Seconds:=DBquery[n].seconds;

*)

    inc(SortTrans);
  end;
end;


Notes :
1. {code to set aquerystats[n].ipid, .iproblem_status and .cpriority_code}
  It's difficult to determine what query, keytype and key (which are the
  three elements you originally posted) are.

  I would expect something like:
  aquerystats[n].ipid := strtoint(trim(dbquery[n].pid));
  aquerystats[n].iproblem_status := listofproblemstatuses.indexof(dbquery[n].problem_status);
  if aquerystats[n].iproblem_status < 0 then
    aquerystats[n].iproblem_status := listofproblemstatuses.add(dbquery[n].problem_status);
  ...


1a) Note that the above is only established ONCE, the first time
    buildreportquery is executed. You can thus build as many reports as
    you like on one set of DBquery returns.

    All that is required to work with the next set of DBQuery returns is
    aquerystats := nil;

1b) Note that DBQueryStats is much simpler, being only 2 elements, the
    count, and an index into DBquery.
    Guess what
    'with DBQuery[DBQuerystats[reportline].dbq]' is!

1c) (Repeated caution)
    DBQuery[DBQuerystats[reportline].dbq].Seconds, for instance, will return
    the Seconds field from the FIRST DBQuery encountered matching the other
    criteria. This complies with the DBqueryStats[SortTrans].Seconds field
    that would have been established - and may not be what was expected
    (perhaps you intended to accumulate??)

2)  {match as required}
    should be some Boolean operation on aquerystats[n] against
    aquerystats[DBqueryStats[fundet].dbq], established in accordance with
    the IF - AND - OR principles posted earlier
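
As a sketch only, the placeholder might be filled in along these lines, comparing whichever integer-ized fields actually define a group (the field choice here is illustrative, not a recommendation):

{ in place of ({match as required}), inside the while loop }
(aquerystats[n].iproblem_status =
   aquerystats[DBqueryStats[fundet].dbq].iproblem_status) and
(aquerystats[n].cpriority_code =
   aquerystats[DBqueryStats[fundet].dbq].cpriority_code)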

HTH

...Bill
Bill, I see Your point, but this idea is not realistic because it would mean that I have to check for more than 10,000 different field names. And since my intent is that this program should be available to all administrators of that helpdesk application, it would not be realistic anyway: any administrator can add his own fields and tables to the database, and I would not know of them.

The queries I have listed are just examples of what could appear in the log.

The application has 500 database files with anywhere between 10 and 500 fields each.

Good idea with the pointer into the original array; that will at least save memory.

I just got a simple idea, having Your explanation of IFs in mind.

What if I add a new expression as the first one in the if statement? I want only complete matches of the query field, so what if I change:

if (DBquery[n].query=DBquerystats[m].query) and (DBquery[n].keytype=DBquerystats[m].Keytype) and (DBquery[n].key=DBquerystats[m].Key) then begin

to

if (length(DBquery[n].query)=length(DBquerystats[m].query)) and (DBquery[n].query=DBquerystats[m].query) and (DBquery[n].keytype=DBquerystats[m].Keytype) and (DBquery[n].key=DBquerystats[m].Key) then begin



I believe that this would NOT improve the timing; if anything it would
make it worse.

Strings are stored by Delphi either as Pascal-style or C-style.

Pascal-style strings are stored with a length-byte first, and C-style with
a terminating null (0).

If I was implementing a string-compare (for equality) on a Pascal string,
I would start at byte 0, which is the length byte. Therefore, if ...query
is a Pascal string, finding the length of each will simply compare the
first byte TWICE, once for the length term and then again in the string-
compare term. If the length is different, the string comparison will fail
after comparing the first (length) byte, so the added length-compare makes
no difference for non-matching strings (they fail on the first byte-compare
anyway), and you are stuck with the repeated overhead of the length-compare
on matching strings.

On a C-style string, calculating the length of the string means scanning
the string, counting characters until a null byte is found. In this case,
adding the length() element will thus cause each string to be scanned for
a null byte. It would be quicker to simply do the string compare on the
...query elements, since it's simply a matter of comparing each character
in the string until a mismatch is found (fail) or a null byte is found
(success.)

======

What you appear to be doing (from your original code) is :-

1) extract data from the database to DBQuery array
2) in the procedure under discussion,
    a) Store certain fields from DBQuery in DBQueryStats for the FIRST
       record that occurs with certain field values
    b) Increment 'Antal' for each extra record found that matches those
       field values
3) Evidently then reporting from DBQueryStats


Without knowing what the meaning of each field is, or how they relate to
the field values used in the matching or reporting criteria, it's a little
difficult to comment constructively.

Take 'seconds' for instance. This was specifically being extracted into
DBQueryStats in the original code. If this is a DURATION of an event,
then the report generated will show only the DURATION of the first
event encountered, and not report any duration of any subsequent
event matching the same selection criteria (such that 'Antal' is
incremented.)

Of course, it may be that this is merely a typical routine used to
construct DBQueryStats - it's not possible to tell.

And it's also not clear whether you intend to produce a number of
different reports from a single extraction of data to DBQuery (ie.
perform the extraction once, then produce many different reports
from the data loaded into DBQuery) - which would tend to imply that
there are many different selection criteria combinations, or perhaps
the selection criteria are controlled by some external (unshown) switch
using the one routine. OR you may be performing ONE report from an
extraction from the log, and performing a new extraction for the
next report.

The 'quickest' may depend on these factors. If you are producing multiple
reports after a single extraction for instance, if the routine was SLOWED
by 50% for the first report, but then was 90% faster for subsequent
reports, then producing 3 reports would take 150 + 10 + 10 time units,
or run nearly twice as fast as the original (170 time units rather than
300.)

---

It would seem that the number of records in DBQueryStats dictates the
length of the report being generated, the content of the report depends
on the fields extracted to DBQueryStats (but now accessed by the index
into DBQuery), and the grouping of records depends on the selection
criteria within the routine published. I can't see how the number of
fieldnames or tables is in any way relevant to the matter.


...Bill
The array DBquerystats is used in two ways.

1. A count run that counts the occurrences; this one does not make use of the seconds field, that is correct.

2. A display run that displays all the extracted lines with individual information such as the seconds field and the user field.


I don't understand why You would break up the query field and analyse the actual contents. I.e., why extract problem.status and compare it to the different values for this field (and the same for priority.code)? These 2 fields are just 2 out of 10,000+ in the application in question.

As I explained, the query can be anything, and I can't make a complete list of all available values, as the user inputs those in the application.

I'll think about this for a while - I've got a little theory buzzing around that I'll commit to writing once I've thought it through.

Meanwhile, here's a little thought that MAY be of use.

if you add a NEXT field to DBQuery, initialised to zero, and a LAST field to DBQueryStats, then

 if fundet=sorttrans then
 begin
   DBqueryStats[SortTrans].Antal:=1;
   DBqueryStats[SortTrans].dbq := n;
   inc(SortTrans);
 end
 else
  DBqueryStats[fundet].Antal:=
    DBqueryStats[fundet].Antal + 1; {omitted in last code}


can be modified to
 if fundet=sorttrans then
 begin
   DBqueryStats[SortTrans].Antal:=1;
   DBqueryStats[SortTrans].dbq := n;
   DBqueryStats[SortTrans].last := n; {additional}
   inc(SortTrans);
 end
 else
 begin
  DBqueryStats[fundet].Antal:=
    DBqueryStats[fundet].Antal + 1;
  DBQuery[DBquerystats[fundet].last].next := n; {+ chain previous hit to this one}
  DBqueryStats[fundet].last := n;               {+ remember the newest hit}
 end;

will set up a nice little linked list through DBQuery[].next - which is really easy to do, and may prove useful.
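
A sketch of how such a chain might be walked when reporting one group; it follows the .next links for Antal steps, so the zero-initialised .next of the final record never needs to act as a terminator (the procedure name and the report line are placeholders):

{ m indexes DBqueryStats; DBquery/DBqueryStats are the existing globals }
procedure WalkGroup(m: Integer);
var
  i, idx: Integer;
begin
  idx := DBqueryStats[m].dbq;            { first DBquery record of the group }
  for i := 1 to DBqueryStats[m].Antal do
  begin
    { ...report fields of DBquery[idx] here, e.g. its seconds value... }
    idx := DBquery[idx].next;            { follow the chain to the next hit }
  end;
end;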

...Bill




ASKER CERTIFIED SOLUTION
billious
Bill, I am giving You the points and closing the question.

Without posting the entire source code, I don't think we will get much closer to a solution than this.

Thank You for all your thoughts.

I have ordered a book "Algorithms and datastructures" which hopefully will help me optimise my code.