tommykjensen asked:

Fast search

I am writing a log analysis program. The input file will usually contain 500,000+ lines that need to be analysed.

I have everything working. It's just not fast enough in certain areas.

Here is an extract from the log
************************************************************************************
 29264 03/17/2003 07:25:07 User skjba has logged in and is using a Floating license
 29264 03/17/2003 14:08:17 DBQUERY^B^msgclass(P4)^3*^0.000000^F^2^0.001000^"class.name="system startup/shutdown""^{"type", "class.name"}^0.000000^0.000000 ( [ 0] message.pro select.msg.class )
************************************************************************************

The first column is a process id. I want to save this id together with the user associated with it as indicated in the first line.

Currently, for each line I read from the file I check whether it is a line that indicates a user logon. If it is, I store the id and user in an array of records. Then, as I traverse the log, I look up the process id in the user array in order to match lines like the second one with a user.

This is taking a long time and the time increases for each new user that is found.

I also use this method to count occurrences of other events in the log. I use two nested for-loops, and this is very inefficient.

Currently the code looks like this:

Search for the user:

function FindUser(pid2: string) : string;
VAR
 n,fundet : integer;
begin
  fundet:=-1;
  For n:=0 to UserCount do with Users[n] do begin
    if trim(pid2)=trim(Pid) then begin
      fundet:=n;
      break;
    end;
  end;
  if fundet>-1 then
    FindUser:=Users[fundet].User
  else
    FindUser:='UNKNOWN';
end;


Code for matching other events:

 For n:=0 to TransCount-1 do with DBquery[n] do begin
  fundet:=-1;
  pointer:=n;
    For m:=0 to SortTrans-1 do with DBquerystats[m] do begin
      if (DBquery[n].query=DBquerystats[m].query) and (DBquery[n].keytype=DBquerystats[m].Keytype) and (DBquery[n].key=DBquerystats[m].Key) then begin
       fundet:=m;
       break;
      end;
    end;
  if fundet>-1 then
   DBqueryStats[fundet].Antal:=DBqueryStats[fundet].Antal+1
   
  else begin
   DBqueryStats[SortTrans].Antal:=1;
   DBqueryStats[SortTrans].PID:=DBquery[n].PID;
   DBqueryStats[SortTrans].Query:=DBquery[n].query;
   DBqueryStats[SortTrans].Filename:=DBquery[n].filename;
   DBqueryStats[SortTrans].Keytype:=DBquery[n].keytype;
   DBqueryStats[SortTrans].Key:=DBquery[n].key;
   DBqueryStats[SortTrans].Seconds:=DBquery[n].seconds;
   SortTrans:=SortTrans+1;
  end;
 end;



How can I improve this?
emadat:

When it comes to searching, nothing beats assembly language.

Look at the article "Optimizing String Searches in Delphi" at
http://www.delphidevelopernewsletter.com/dd/DDMag.nsf/WebIndexByIssue/75EDEE6470ADB1BF852568F2005E2269?opendocument 

The author used assembly language to implement some of the fastest searching algorithms I've heard of. He also provides a downloadable library that contains his code.

What you can do with your log file is load the whole thing into a single string (Delphi allows strings up to 2 GB in size) and use the search routine found in the library to scan the whole file, now cached in that string, for you.
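
For illustration, a minimal sketch of the load-into-one-string idea, using plain PosEx from StrUtils (available in newer Delphi versions) as a stand-in for the faster library routine; the file name and the search pattern are only placeholders:

uses
  SysUtils, Classes, StrUtils;

{ Read the whole log file into one long string. }
function LoadFileAsString(const FileName: string): string;
var
  fs: TFileStream;
begin
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Result, fs.Size);
    if fs.Size > 0 then
      fs.ReadBuffer(Result[1], fs.Size);
  finally
    fs.Free;
  end;
end;

procedure CountLogins;
var
  buf: string;
  p, count: Integer;
begin
  buf := LoadFileAsString('trace.log');       { placeholder file name }
  count := 0;
  p := PosEx('has logged in', buf, 1);
  while p > 0 do
  begin
    Inc(count);
    p := PosEx('has logged in', buf, p + 1);  { continue after the last hit }
  end;
  WriteLn(count, ' logins found');
end;
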
billious:

My suggestions will be with a mind to retaining your current methods.
No doubt you can clean them up later, if required.

1. Declare a new item iPid in Users and DBquery (and possibly DBqueryStats)
   as integer
2. On adding a new User, iPid := strtoint(trim(Pid));
   (a small sketch of steps 1 and 2 follows the explanation after step 3)
3. Revised Finduser :

function FindUser(pid2: string) : string;
VAR
  n,fundet : integer;
begin
 n := strtoint(trim(pid2));
 fundet:=Usercount;
 while fundet>=0 do with Users[fundet] do
   if n=iPid then
     break
   else
     dec(fundet);
 if fundet>-1 then
   FindUser:=Users[fundet].User
 else
   FindUser:='UNKNOWN';
end;

 This does an INTEGER comparison, rather than a string, and does the TRIM
 once, rather than TWICE for each entry until found.
 Furthermore, it looks from the END of the array (where more recent sessions
 are recorded) to the beginning (where possibly-defunct processes are
 recorded) and thus detects more recent (and thus possibly more active)
 process IDs first.
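
A minimal sketch of steps 1 and 2, assuming a Users record roughly like the one implied by the original FindUser; the record layout, MaxUsers bound and AddUser routine are guesses, not the asker's actual declarations:

uses
  SysUtils;                    { StrToInt, Trim }

const
  MaxUsers = 10000;            { placeholder upper bound }

type
  TUserRec = record
    Pid  : string;             { process id exactly as it appears in the log }
    iPid : Integer;            { numeric copy, converted once on insertion }
    User : string;
  end;

var
  Users     : array[0..MaxUsers] of TUserRec;
  UserCount : Integer;         { treated as a count here; the original loop
                                 'For n:=0 to UserCount' may use it as the
                                 last index instead }

procedure AddUser(const APid, AUser: string);
begin
  Users[UserCount].Pid  := APid;
  Users[UserCount].iPid := StrToInt(Trim(APid));   { trim/convert once, here }
  Users[UserCount].User := AUser;
  Inc(UserCount);
end;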

4. Similarly,
For n:=0 to pred(TransCount) do with DBquery[n] do
begin
 fundet := pred(SortTrans);
 while fundet >= 0 do
 with DBquerystats[fundet] do
   if (DBquery[n].query=DBquerystats[fundet].query) and
      (DBquery[n].keytype=DBquerystats[fundet].Keytype) and
      (DBquery[n].key=DBquerystats[fundet].Key) then
     break
   else
     dec(fundet);
 if fundet>-1 then
  DBqueryStats[fundet].Antal:=inc(DBqueryStats[fundet].Antal)
 else
 begin
  DBqueryStats[SortTrans].Antal:=1;
  DBqueryStats[SortTrans].PID:=DBquery[n].PID;
{  DBqueryStats[SortTrans].iPID:=DBquery[n].iPID;}  // if you need it
  DBqueryStats[SortTrans].Query:=DBquery[n].query;
  DBqueryStats[SortTrans].Filename:=DBquery[n].filename;
  DBqueryStats[SortTrans].Keytype:=DBquery[n].keytype;
  DBqueryStats[SortTrans].Key:=DBquery[n].key;
  DBqueryStats[SortTrans].Seconds:=DBquery[n].seconds;
  SortTrans:=inc(SortTrans);
 end;
end;

5. examine
   if (DBquery[n].query=DBquerystats[fundet].query) and
      (DBquery[n].keytype=DBquerystats[fundet].Keytype) and
      (DBquery[n].key=DBquerystats[fundet].Key) then

   and put the three comparisons in a sequence such that the most likely
   to be UNEQUAL is first. Note that the 'if a=b and c=d and e=f'
   structure is the same as
    'if a=b then if c=d then if e=f'
   so that if the final comparison produces most fails, then the first
   two still have to be executed every time around. Changing the sequence
   to 'if e=f and a=b and c=d' will thus perform only ONE comparison for
   these majority-fails, rather than 3.
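
Point 5 relies on Delphi's short-circuit Boolean evaluation, which is the default ({$B-}); with complete Boolean evaluation switched on ({$B+}), reordering the terms gains nothing. A tiny, contrived illustration:

{$B-}  { short-circuit ("lazy") Boolean evaluation - the Delphi default }

function RarelyTrue(const s: string): Boolean;
begin
  Result := (s = 'seldom');
end;

function UsuallyTrue(const s: string): Boolean;
begin
  Result := (Length(s) > 0);
end;

procedure Demo(const a, b: string);
begin
  { With the most-often-false test first, the second function is only
    called when the first one has not already decided the outcome. }
  if RarelyTrue(a) and UsuallyTrue(b) then
    WriteLn('both matched');
end;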

HTH

...Bill

tommykjensen (Asker):

Thank You for the suggestions.

As for suggestion 4:

This actually made the routine significantly slower?


As for suggestion 5:

The first comparison is the one most likely to fail, so that doesn't need to change.

But which is faster:

one if statement with all three expressions,

or three nested ifs?
Regarding 4:

The text that is compared looks like the following 3 examples:


name="datadict"

table.name="schedule" and column.name="schedule.id"

problem.status="Waiting for supplier" or problem.status="Waiting for customer" or problem.status="Wait" or problem.status="Recovered" or problem.status="Reassigned" or problem.status="Open" or problem.status="Closed" or problem.status="Assigned" or problem.status="Accept") and (priority.code="4" or priority.code="3" or priority.code="2" or priority.code="1") and open.time>='02/01/03 00:00:00' and open.time<'03/01/03 00:00:01' and company="zzzzz"
tommykjensen:

I'm a little surprised that (4) failed, and conclude that patented theory
#1 (later transactions refer to later process ids) is probably suspect.

(I'm even more surprised that SortTrans:=inc(SortTrans); worked, since
it's syntactically invalid, so I suppose you're interpreting... In older
versions of Turbo-Pascal, 'inc' was considerably faster than 'x:=x+1', but
it looks as though the optimiser has taken care of that one...old habits
die hard!  Hmmm...seems 'for loop:= low to high' also now detects
high<low ==> no action, so maybe I'll start using that, too!)

Revised (4) [which should be familiar]

For n:=0 to pred(TransCount) do with DBquery[n] do
begin
 fundet := 0;
 while fundet < SortTrans do
 begin
   with DBquerystats[fundet] do
   if (DBquery[n].query=query) and
      (DBquery[n].keytype=Keytype) and
      (DBquery[n].key=Key) then
     break;
   inc(fundet);
 end;
 if fundet < SortTrans then
  inc(DBqueryStats[fundet].Antal)
 else
 begin
  DBqueryStats[SortTrans].Antal:=1;
  DBqueryStats[SortTrans].PID:=DBquery[n].PID;
{  DBqueryStats[SortTrans].iPID:=DBquery[n].iPID;}  // if you need it
  DBqueryStats[SortTrans].Query:=DBquery[n].query;
  DBqueryStats[SortTrans].Filename:=DBquery[n].filename;
  DBqueryStats[SortTrans].Keytype:=DBquery[n].keytype;
  DBqueryStats[SortTrans].Key:=DBquery[n].key;
  DBqueryStats[SortTrans].Seconds:=DBquery[n].seconds;
  inc(SortTrans);
 end;
end;

since this is fundamentally a very small change to your original, I'd
expect only a marginal improvement...but an improvement!

----------------

I presume that your code
   if (DBquery[n].query=query) and
      (DBquery[n].keytype=Keytype) and
      (DBquery[n].key=Key) then

is actually
   if (DBquery[n].name=name) and
      (DBquery[n].table.name=table.name and
          DBquery[n].column.name=column.name) and
      (DBquery[n].key=Key) then //which I'm not going to rewrite!!

where "datadict", "schedule", "schedule.id"

are typical values for name, table.name and column.name??

I'd suggest a series of steps, and expect an improvement with each.

4a.
  for each of name, table.name, column.name, problem.status and
  company, set up a table of values encountered, and when the
  DBquery is established, record the index found to each (in the
  same way as iPid was set up, but by scanning the table) in
  DBquery. Similarly, add iName, itable_name, iColumn_name, etc.
  to DBquerystats, and transfer in your establishment of the
  DBquerystats record.

  In your comparison, compare DBquery.iname,DBquerystats.iname;
  DBquery.itable_name,DBquerystats.itable_name etc. This reduces
  the number of string-compares and should thus improve speed.

  Since priority.code and open.time appear to be char and tdatetime,
  using this technique on those fields would probably hurt rather than
  help performance. (A sketch of the string-to-index mapping follows.)
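
One way the value-to-index mapping of 4a could be done, sketched with a sorted TStringList so the lookup is a binary search; the function name and the id-in-Objects trick are just one possible arrangement, not anything from the code above:

uses
  Classes, SysUtils;

var
  TableNames: TStringList;   { create once with Sorted := True }

{ Map a field value to a small, stable integer id.  The id is kept in
  Objects[] (an old 32-bit Delphi idiom) rather than taken from the list
  position, because positions shift as new strings are inserted in order. }
function NameToId(List: TStringList; const AName: string): Integer;
var
  i: Integer;
begin
  if List.Find(AName, i) then
    Result := Integer(List.Objects[i])        { id assigned on first sight }
  else
  begin
    Result := List.Count;                     { next free id, never reused }
    List.AddObject(AName, TObject(Result));
  end;
end;

Something like DBquery[n].itable_name := NameToId(TableNames, ...) would then be done once per record, and the inner comparison becomes a plain integer compare.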

4b.
  IF - AND - OR theory:

  If you have an expression
    if a and b and c and d and e  //the 'outer' expression

  where a,b,c,d,e are each Boolean expressions, then Delphi will evaluate
  each in turn until it finds one is false (and therefore the whole
  'outer' expression is false) and it will have to evaluate ALL of the
  expressions to determine that the 'outer' expression is true.

  It follows then that if e is true only 10% of the time, but a..d are
  true 90% of the time, then

    if a and b and c and d and e  //the 'outer' expression

of the total,
  10% will evaluate just a
  9%  will evaluate a and b
  8.1% will evaluate a,b,c
  7.29% will evaluate a,b,c,d
  65.61% will evaluate a,b,c,d,e  (100-10-9-8.1-7.29)
  and the result will be 6.561% True (10% of a,b,c,d,e)

  on the other hand,

    if e and a and b and c and d  //revised 'outer' expression
       ^--------<<<-------------

of the total,
  90% will evaluate just e
  1%  will evaluate e and a
  0.9% will evaluate e,a,b
  0.81% will evaluate e,a,b,c
  7.29% will evaluate e,a,b,c,d  (100-90-1-0.9-0.81)
  and the result will be 6.561% true (90% of 7.29)

The OR situation is precisely the opposite :-
  If you have an expression
    if a or b or c or d or e  //the 'outer' expression

  where a,b,c,d,e are each Boolean expressions, then Delphi will evaluate
  each in turn until it finds one is TRUE (and therefore the whole
  'outer' expression is TRUE) and it will have to evaluate ALL of the
  expressions to determine that the 'outer' expression is FALSE.

  So, for obvious reasons, the term most likely to be TRUE should be
  placed first.



SO, in your case, which I conclude would be expressed as:

if
(name="datadict")  //first expression

AND

(table.name="schedule" and column.name="schedule.id")  //second expr

AND  //third expr follows

(
problem.status="Waiting for supplier" or
problem.status="Waiting for customer" or
problem.status="Wait" or
problem.status="Recovered" or
problem.status="Reassigned" or
problem.status="Open" or
problem.status="Closed" or
problem.status="Assigned" or
problem.status="Accept")

and

(priority.code="4" or
priority.code="3" or
priority.code="2" or
priority.code="1")

and

open.time>='02/01/03 00:00:00'

and

open.time<'03/01/03 00:00:01'

and

company="zzzzz"
)


I'd suggest that you'd see an improvement even WITHOUT implementing 4a by
changing the order to the following (I believe that the improvement from
4a would probably mask that from 4b):

if open.time >= .....
and
   open.time < .....
and
   company...
and
   name....
and
   table.name.....
and
   column.name.....
and
   (priority.code... or priority.code...)
and
   (problem.status ... or problem.status...)


where
a) the order company, name, table.name, column.name
   is placed in INCREASING order of probability of a match of the
   expression (ie. most-probable last)
b) the order of sub-expressions in priority.code and problem.status
   should be placed in DECREASING order of probability of a match of the
   expression (ie. most-probable first)
   

HOWEVER
-------

Since your code appears to be detecting certain criteria about a log
record and then counting the NUMBER of records that match those criteria,
the use of an 'or' expression would indicate that you are interested in
(records with priority code "1".."4") versus (records with other priority
codes). So if the first record encountered in the log had priority code
"3", and the next otherwise-identical record had priority code "2", your
program would count 2 records, and all that could be said about those 2
records is that they had priority codes "1".."4". An identical argument
could be made about problem.status.

I don't know whether this is intentional, but if it is (as I suspect) then
I would suggest that you set up yet another value (or values) in DBquery and
DBquerystats, e.g. bpriority_code (and bproblem_status), as Booleans indicating
whether the record is or is not in the priority.code range / problem.status
set of interest.

It might even be that you could eliminate entries that are of no interest
from DBquery at that point. Naturally, if you are reporting both in-priority-
code-set and not-in-priority-code-set, then the complex OR expression in the
long comparison-for-adding-to-DBquerystats simply becomes a nice Boolean
(eg. bpriority_code), since the expression was evaluated at entry to DBquery.
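
A rough sketch of that kind of pre-computation, done once while DBquery is being filled; the PriorityCodeOf helper and the bpriority_code field are invented for illustration, and the real way priority.code is pulled out of the log line will differ:

{ Invented helper: the single character following priority.code=" in the
  raw query text, or #0 if the query does not mention priority.code. }
function PriorityCodeOf(const QueryText: string): Char;
var
  p: Integer;
begin
  Result := #0;
  p := Pos('priority.code="', QueryText);
  if (p > 0) and (p + Length('priority.code="') <= Length(QueryText)) then
    Result := QueryText[p + Length('priority.code="')];
end;

{ once, while DBquery[n] is being built }
DBquery[n].bpriority_code :=
  PriorityCodeOf(DBquery[n].query) in ['1'..'4'];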

HTH

...Bill
First, thank You for a very thorough answer; it's clear You have spent quite some time helping me.

Actually this did not work

SortTrans:=inc(SortTrans);

I misread Your suggestion so I used the

SortTrans:=SortTrans+1

I just changed it to the correct syntax but the speed did not change.

I will try the new suggestion.

As for the query: it should not be interpreted, as it is a query done in an application and I want to count the number of exact (identical) queries.

The DBquery array only contains the log lines I am interested in, and the query field should just be treated as a string of characters in which I want to count exact matches.


As a side note, the purpose of my program is to show the queries that users of a helpdesk application issue against the database. I need this to set up correct keys. So the log lines are the result of a trace on that helpdesk application.
 
tommykjensen:

AHA!

The object I have in mind is to reduce the number of string-comparisons
performed.

Hence,

type
  tquerystats = record
  ipid                  : integer;
  iproblem_status       : integer;
  cpriority_code        : char;
  end;

var
  aquerystats           : array of tquerystats;
  bcalc_stats           : Boolean;
  listofproblemstatuses : tstringlist;

...
procedure buildreportquery;
...
bcalc_stats := (aquerystats = nil);  { true only on the first call for this DBquery load }
if bcalc_stats then
begin
  setlength(aquerystats,transcount); // allocate dynamic-array
  //also clear any required stringlists here
  listofproblemstatuses.clear;
end;

For n:=0 to pred(TransCount) do with DBquery[n] do
begin
  if bcalc_stats then
  begin
    {code to set aquerystats[n].ipid, .iproblem_status and .cpriority_code}
  end;

  fundet := 0;
  while fundet < SortTrans do
  begin
    with DBquerystats[fundet] do
    if ({match as required}) then
    begin
      inc(DBqueryStats[fundet].Antal);
      break;
    end;
    inc(fundet);
  end;

  if fundet=sorttrans then
  begin
    DBqueryStats[SortTrans].Antal:=1;
    DBqueryStats[SortTrans].dbq := n;

(* rest not required

    DBqueryStats[SortTrans].PID:=DBquery[n].PID;
{  DBqueryStats[SortTrans].iPID:=DBquery[n].iPID;}  // if you need it
    DBqueryStats[SortTrans].Query:=DBquery[n].query;
    DBqueryStats[SortTrans].Filename:=DBquery[n].filename;
    DBqueryStats[SortTrans].Keytype:=DBquery[n].keytype;
    DBqueryStats[SortTrans].Key:=DBquery[n].key;
    DBqueryStats[SortTrans].Seconds:=DBquery[n].seconds;

*)

    inc(SortTrans);
  end;
end;


Notes :
1. {code to set aquerystats[n].ipid, .iproblem_status and .cpriority_code}
  It's difficult to determine what query, keytype and key (which are the
  three elements you originally posted) are.

  I would expect something like:
  aquerystats[n].ipid := strtoint(trim(dbquery[n].pid));
  aquerystats[n].iproblem_status := listofproblemstatuses.indexof(dbquery[n].problem_status);
  if aquerystats[n].iproblem_status < 0 then
    aquerystats[n].iproblem_status := listofproblemstatuses.add(dbquery[n].problem_status);
  ...


1a) Note that the above is only established ONCE, the first time
    buildreportquery is executed. You can thus build as many reports as
    you like on one set of DBquery returns.

    All that is required to work with the next set of DBQuery returns is
    aquerystats := nil;

1b) Note that DBQueryStats is much simpler, being only 2 elements, the
    count, and an index into DBquery.
    Guess what
    'with DBQuery[DBQuerystats[reportline].dbq]' is!

1c) (Repeated caution)
    DBQuery[DBQuerystats[reportline].dbq].Seconds, for instance, will return
    the Seconds field from the FIRST DBQuery encountered matching the other
    criteria. This complies with the DBqueryStats[SortTrans].Seconds field
    that would have been established - and may not be what was expected
    (perhaps you intended to accumulate??)

2)  {match as required}
    should be some Boolean operation on aquerystats[n] against
    aquerystats[DBqueryStats[fundet].dbq], established in accordance with
    the IF - AND - OR principles posted earlier
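
As a sketch only, the placeholder might be filled in along these lines, comparing whichever integer-ized fields actually define a group (the field choice here is illustrative, not a recommendation):

{ in place of ({match as required}), inside the while loop }
(aquerystats[n].iproblem_status =
   aquerystats[DBqueryStats[fundet].dbq].iproblem_status) and
(aquerystats[n].cpriority_code =
   aquerystats[DBqueryStats[fundet].dbq].cpriority_code)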

HTH

...Bill
Bill, I see Your point, but this idea is not realistic because it would mean that I have to check for more than 10,000 different field names. And since my intent is that this program should be available to all administrators of that helpdesk application, it would not be realistic anyway: any administrator can add his own fields and tables to the database, and I would not know of them.

The queries I have listed are just examples of what could appear in the log.

The application has 500 database files with anywhere between 10 and 500 fields each.

Good idea with the pointer into the original array; that will at least save memory.

I just got a simple idea, having Your explanation of IFs in mind.

What if I add a new expression as the first one in the if statement? I want only complete matches of the query field, so what if I change:

if (DBquery[n].query=DBquerystats[m].query) and (DBquery[n].keytype=DBquerystats[m].Keytype) and (DBquery[n].key=DBquerystats[m].Key) then begin

to

if (length(DBquery[n].query)=length(DBquerystats[m].query)) and (DBquery[n].query=DBquerystats[m].query) and (DBquery[n].keytype=DBquerystats[m].Keytype) and (DBquery[n].key=DBquerystats[m].Key) then begin



I believe that this would NOT improve the timing; if anything it would
make it worse.

Strings are stored by Delphi either as Pascal-style or C-style.

Pascal-style strings are stored with a length-byte first, and C-style with
a terminating null (0).

If I was implementing a string-compare (for equality) on a Pascal string,
I would start at byte 0, which is the length byte. Therefore, if ...query
is a Pascal string, finding the length of each will simply compare the
first byte TWICE, once for the length term and then again in the string-
compare term. If the length is different, the string comparison will fail
after comparing the first (length) byte, so the added length-compare makes
no difference for non-matching strings (they fail on the first byte-compare
anyway), and you are stuck with the repeated overhead of the length-compare
on matching strings.

On a C-style string, calculating the length of the string means scanning
the string, counting characters until a null byte is found. In this case,
adding the length() element will thus cause each string to be scanned for
a null byte. It would be quicker to simply do the string compare on the
...query elements, since it's simply a matter of comparing each character
in the string until a mismatch is found (fail) or a null byte is found
(success.)

======

What you appear to be doing (from your original code) is :-

1) extract data from the database to DBQuery array
2) in the procedure under discussion,
    a) Store certain fields from DBQuery in DBQueryStats for the FIRST
       record that occurs with certain field values
    b) Increment 'Antal' for each extra record found that matches those
       field values
3) Evidently then reporting from DBQueryStats


Without knowing what the meaning of each field is, or how they relate to
the field values used in the matching or reporting criteria, it's a little
difficult to comment constructively.

Take 'seconds' for instance. This was specifically being extracted into
DBQueryStats in the original code. If this is a DURATION of an event,
then the report generated will show only the DURATION of the first
event encountered, and not report any duration of any subsequent
event matching the same selection criteria (such that 'Antal' is
incremented.)

Of course, it may be that this is merely a typical routine used to
construct DBQueryStats - it's not possible to tell.

And it's also not clear whether you intend to produce a number of
different reports from a single extraction of data to DBQuery (ie.
perform the extraction once, then produce many different reports
from the data loaded into DBQuery) - which would tend to imply that
there are many different selection criteria combinations, or perhaps
the selection criteria are controlled by some external (unshown) switch
using the one routine. OR you may be performing ONE report from an
extraction from the log, and performing a new extraction for the
next report.

The 'quickest' may depend on these factors. If you are producing multiple
reports after a single extraction for instance, if the routine was SLOWED
by 50% for the first report, but then was 90% faster for subsequent
reports, then producing 3 reports would take 150 + 10 + 10 time units,
or run nearly twice as fast as the original (170 time units rather than
300.)

---

It would seem that the number of records in DBQueryStats dictates the
length of the report being generated, the content of the report depends
on the fields extracted to DBQueryStats (but now accessed by the index
into DBQuery), and the grouping of records depends on the selection
criteria within the routine published. I can't see how the number of
fieldnames or tables is in any way relevant to the matter.


...Bill
The array DBquerystats is used in two ways.

1. A count run that counts the occurrences; this one does not make use of the seconds field, that is correct.

2. A display run that displays all the extracted lines with individual information such as the seconds field and the user field.


I don't understand why You would break up the query field and analyse the actual contents. I.e., why extract problem.status and compare it to the different values for this field (and the same for priority.code)? These 2 fields are just 2 out of 10,000+ in the application in question.

As I explained, the query can be anything, and I can't make a complete list of all available values, as the user inputs those in the application.

I'll think about this for a while - I've got a little theory buzzing around that I'll commit to writing once I've thought it through.

Meanwhile, here's a little thought that MAY be of use.

if you add a NEXT field to DBQuery, initialised to zero, and a LAST field to DBQueryStats, then

 if fundet=sorttrans then
 begin
   DBqueryStats[SortTrans].Antal:=1;
   DBqueryStats[SortTrans].dbq := n;
   inc(SortTrans);
 end
 else
  DBqueryStats[fundet].Antal:=
    DBqueryStats[fundet].Antal + 1; {omitted in last code}


can be modified to
 if fundet=sorttrans then
 begin
   DBqueryStats[SortTrans].Antal:=1;
   DBqueryStats[SortTrans].dbq := n;
   DBqueryStats[SortTrans].last := n; {additional}
   inc(SortTrans);
 end
 else
 begin
  DBqueryStats[fundet].Antal:=
    DBqueryStats[fundet].Antal + 1;
  DBQuery[DBquerystats[fundet].last].next := n; {+ chain previous hit to this one}
  DBqueryStats[fundet].last := n;               {+ remember the newest hit}
 end;

will set up a nice little linked list through DBQuery[].next - which is really easy to do, and may prove useful.
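
A sketch of how such a chain might be walked when reporting one group; it follows the .next links for Antal steps, so the zero-initialised .next of the final record never needs to act as a terminator (the procedure name and the report line are placeholders):

{ m indexes DBqueryStats; DBquery/DBqueryStats are the existing globals }
procedure WalkGroup(m: Integer);
var
  i, idx: Integer;
begin
  idx := DBqueryStats[m].dbq;            { first DBquery record of the group }
  for i := 1 to DBqueryStats[m].Antal do
  begin
    { ...report fields of DBquery[idx] here, e.g. its seconds value... }
    idx := DBquery[idx].next;            { follow the chain to the next hit }
  end;
end;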

...Bill




ASKER CERTIFIED SOLUTION
billious
Bill, I am giving You the points and closing the question.

Without posting the entire source code, I don't think we will get much closer to a solution than this.

Thank You for all your thoughts.

I have ordered a book "Algorithms and datastructures" which hopefully will help me optimise my code.