Solved

Threading algorithm - room for optimization

Posted on 2004-10-10
787 Views
Last Modified: 2007-12-19
I have a threading algorithm and would like to optimize its performance a bit
currently I use two hash lists for lookups
it works quite well, but I believe there's room for optimization .. and I don't really have time for it :)

the slowest part in the test project is the treeview - I would've used Mike's VTV but not everybody has it
simple benchmarking is already done in the project ..
below is the code for the algorithm
the complete test project is here:
http://leenover.homeip.net/isapi/pas2html.dll/pas2html?File=/delphi/Threadable/
(zip file for everything)


This is the threading method I'd like optimized:

function Threaden(const AThreadables: IThreadables): Integer;
var I, Tmp, Idx: Integer;
    MsgSH, IRTSH: TStringHash;
    LRoot, LItem: IThreadable;
begin
     Result:=AThreadables.Count;
     MsgSH:=TStringHash.Create(Result);
     IRTSH:=TStringHash.Create(Result);
     CreateThreadable.GetInterface(IThreadable, LRoot);
     try
        for I:=0 to Result-1 do
            with AThreadables.Item[I] do
            begin
              MsgSH.Add(ID, I);
              // check if we're a parent to a msg
              Idx:=IRTSH.ValueOf(ID);
              if Idx > -1 then
              begin
                // LItem is our temp reference
                LItem:=LRoot.Items.Item[Idx];
                for Tmp:=0 to LItem.Items.Count-1 do
                begin
                  LItem.Items.Item[Tmp].Parent:=AThreadables.Item[I];
                  // add the messages we're a parent to
                  Items.Add(LItem.Items.Item[Tmp]);
                  LItem.Items.Item[Tmp]:=nil;
                end;
              end;

              // handle InReplyTo
              if ParentID <> '' then
              begin
                // search for the parent msg
                Idx:=MsgSH.ValueOf(ParentID);
                // if we find it then we add the current msg to it
                if Idx > -1 then
                begin
                  AThreadables.Item[I].Parent:=AThreadables.Item[Idx];
                  AThreadables.Item[Idx].Items.Add(AThreadables.Item[I]);
                end
                else
                begin
                  // if we don't find the parent then we add it to the list:

                  // first check if something is already searching for our parent
                  Idx:=IRTSH.ValueOf(ParentID);
                  if Idx > -1 then
                  begin
                    LItem:=LRoot.Items.Item[Idx];
                  end
                  else
                  begin
                    // .. if not then we add the parent for searching
                    LItem:=LRoot.Items.Add;
                    LItem.ID:=ParentID;
                    IRTSH.Add(ParentID, LRoot.Items.Count-1);
                  end;
                  LItem.Items.Add(AThreadables.Item[I]);
                end;
              end;
            end;

        // second pass: drop everything that found a parent, leaving only roots
        for I:=Result-1 downto 0 do
           with AThreadables.Item[I] do
             if Assigned(Parent) then
             begin
               AThreadables.Remove(I);
             end;
     finally
        MsgSH.Free;
        IRTSH.Free;
        LRoot:=nil;
     end;
end;
Question by:Lee_Nover
22 Comments
 
LVL 12

Author Comment

by:Lee_Nover
every improvement will get 500p A :)
 
LVL 7

Expert Comment

by:sftweng
I've been trying occasionally for several hours to download from your site - is it down?
 
LVL 12

Author Comment

by:Lee_Nover
it was down for a minute or so .. it's working now
sorry for the inconvenience

btw. I was expecting you :) and maybe rllibby :) (from the uniform hash distribution thread :))
 
LVL 7

Expert Comment

by:sftweng
I'm not sure I'll be able to live up to the expectation but I'll take a shot at it. ;-)
 
LVL 7

Expert Comment

by:sftweng
Both IE and Firefox are failing to get through to the site:

 The page cannot be displayed
The page you are looking for is currently unavailable. The Web site might be experiencing technical difficulties, or you may need to adjust your browser settings.
 
LVL 12

Author Comment

by:Lee_Nover
is leenover.homeip.net reachable? if not, drop me a mail and I'll send you the zip
(still works from here - remote location as well)
lee_nover at delphi-si dot com :)
 
LVL 7

Expert Comment

by:sftweng
Tracert is getting timeouts at 212.152.192.162. Sending email soon.
 
LVL 12

Author Comment

by:Lee_Nover
ah .. routing problems with ISPs :)
 
LVL 17

Expert Comment

by:Wim ten Brink
The TStringHash class might be the one you want to be optimizing. :-) I know a hash table can be fast, but I've also seen implementations of hash components that are actually quite slow. I just wonder where that TStringHash came from. Can you influence the size of the hash table? If so, increase it to a very large size. (Over a million elements.) It will eat lots of memory but will increase speed too. In a hash table function that I've used, I had room for about 2.5 million strings and would only be using about 1% of this space in most cases. But it was lightning fast, because the larger the table, the fewer conflicts you get from different strings ending up with the same hash value. (In which case the strings are put in a slower 'bucket'.)
It took me over 10 MB for the empty hash table, though. So it's costly in memory, but speed? Wow... I used the hash table to find duplicate records in a text file that I needed to import - a text file with between 500 and 250,000 records. Even with the largest files I found the duplicate lines within seconds, just because the hash table was big enough to avoid many collisions.

Another optimization would be to reverse the direction of the for loops, if possible. Thus, instead of using:
    for I := 0 to Result - 1 do begin
use:
    for I := Pred(Result) downto 0 do begin
The reason is that at machine-code level it is a lot easier to compare a value with zero than with another value. Also, this way the comparison value only needs to be calculated once, at the start of the loop. I'm just not sure if you can walk backwards through your list, though.
 
LVL 12

Author Comment

by:Lee_Nover
I don't want the hash list that big .. "this" code runs in a web service and will get quite a few requests
the reordering takes about 7x the time of the initial insert
considering all the comparing it does when reordering, I don't think it's that much slower
things are also a bit slower because of the use of interfaces
anyway, a lightweight hash list is also welcome :)
a quick look at Delphi's hash list doesn't seem that bad

it was "Result-1 downto 0" at first .. when I changed to "0 to Result-1" it got 20% faster :)

> Also, this way the comparison value only needs to be calculated once, at the start of the loop.
Delphi does that whichever way you do it, that's why I prefer for..do over while..do :)
 
LVL 12

Author Comment

by:Lee_Nover
oh, another thing I would like to avoid is traversing the complete list again at the end to throw non-root elements out
it's quite possibly faster this way, as it doesn't reorder the lists and recalculate the hashes (which is IMO unavoidable)

 
LVL 17

Expert Comment

by:Wim ten Brink
Well, a hash table is basically an array[0..Size-1] of strings. Whenever you add a string, it calculates a hash and then sets HashTable[Hash(string)] to the string value. To support a bucket, it would have e.g. a stringlist at each position instead of just a single string. Thus if you add two strings to the hash table where hash('Apple') = hash('Banana'), you get a collision. First 'Apple' is added to the stringlist at HashTable[x]; then it checks if 'Banana' exists in the stringlist at HashTable[x], and since it doesn't, 'Banana' is added too.
You might already get a speed improvement by making the hash table about 3 to 5 times bigger, since that decreases the chance of collisions considerably. A hash table that is 5 times bigger roughly cuts the number of collisions to a fifth, so you need fewer accesses to the buckets.

Unfortunately, to gain more speed you often have to sacrifice a bit more memory.

Even if it's a web service you might consider increasing the size of the hash table to increase speed. If this means you run out of memory a bit sooner than you hoped for, it is just easier to add more memory than to install a faster processor.
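To make the table/bucket idea concrete, here is a minimal sketch in Delphi (all names such as TBucketHash and HashOf are mine, purely for illustration; a production class would also handle Remove, Modify and rehashing):

type
  THashEntry = record
    Key: string;
    Value: Integer;
  end;
  // one dynamic array per slot acts as the 'bucket'
  TBucket = array of THashEntry;

  TBucketHash = class
  private
    FTable: array of TBucket;
    function HashOf(const Key: string): Integer;
  public
    constructor Create(Size: Integer);
    procedure Add(const Key: string; Value: Integer);
    function ValueOf(const Key: string): Integer;
  end;

constructor TBucketHash.Create(Size: Integer);
begin
  inherited Create;
  SetLength(FTable, Size); // bigger table = fewer collisions, more memory
end;

function TBucketHash.HashOf(const Key: string): Integer;
var I: Integer;
begin
  Result := 0;
  for I := 1 to Length(Key) do
    Result := (Result * 31 + Ord(Key[I])) and $7FFFFFFF; // keep non-negative
  Result := Result mod Length(FTable);
end;

procedure TBucketHash.Add(const Key: string; Value: Integer);
var H, N: Integer;
begin
  H := HashOf(Key);
  N := Length(FTable[H]);
  SetLength(FTable[H], N + 1); // a collision just grows this one bucket
  FTable[H][N].Key := Key;
  FTable[H][N].Value := Value;
end;

function TBucketHash.ValueOf(const Key: string): Integer;
var H, I: Integer;
begin
  H := HashOf(Key);
  // on lookup only this one bucket is scanned, not the whole table
  for I := 0 to High(FTable[H]) do
    if FTable[H][I].Key = Key then
    begin
      Result := FTable[H][I].Value;
      Exit;
    end;
  Result := -1; // same 'not found' convention as TStringHash
end;

The trade-off Wim describes is visible in Create: a large Size costs memory up front but keeps each bucket short.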

> it was "Result-1 downto 0" at first .. when I changed to "0 to Result-1" it got 20% faster :)
Funny. I would have expected the opposite result. :-)

> Delphi does that whichever way you do it, that's why I prefer for..do over while..do :)
Oops. You're right about that. I use too many while loops anyway. (In which case I try to walk back to 0 all the time.)
 
LVL 12

Author Comment

by:Lee_Nover
humz .. creating a larger hash table than it already is would be quite useless in my case
if you check the code you'll see I create the hash tables the size of the record count :)
     Result:=AThreadables.Count;
     MsgSH:=TStringHash.Create(Result);
so there should be no conflicts here

but tnx for the suggestion .. got me thinking for a while :)
 
LVL 17

Expert Comment

by:Wim ten Brink
Ehm, wrong. There could be more conflicts than you think! It depends a bit on how the hash table is created, though. There's the simple hash table and the table/bucket combination.

The problem is that the hash might produce an identical hash value for two different values. Thus Hash(A) = Hash(B) where A <> B. This can happen, and it happens more often than you think.
One solution to handle these collisions: when you calculate Hash(B) and notice that the location is occupied with value A, you look whether Hash(B)+1 is available. If it is, you store B at position Hash(B)+1. Now if we also have Hash(C) = Hash(B) = Hash(A), even more work needs to be done: position Hash(C) is occupied with A, Hash(C)+1 is occupied with B, so all you have left is position Hash(C)+2.
Now imagine your hash table contains 1000 records. When you add the last item, its location may well already be occupied. You would then have to walk sequentially through the table until you find a free location, which takes about 500 comparisons on average in a nearly full table.
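In code, that first scheme (linear probing) looks roughly like this. HashOf stands in for whatever hash function the table uses; this is only a sketch of the probing idea:

// any hash will do for the example; a trivial multiply-and-add one
function HashOf(const S: string): Integer;
var I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    Result := (Result * 31 + Ord(S[I])) and $7FFFFFFF; // keep non-negative
end;

// Linear-probing insert: on a collision, walk forward until a free slot
// turns up. The table needs more slots than items, and the fuller it
// gets, the longer these probe sequences become.
procedure ProbeInsert(var Table: array of string; const Value: string);
var
  Slot: Integer;
begin
  Slot := HashOf(Value) mod Length(Table);
  while Table[Slot] <> '' do                // occupied by another value
    Slot := (Slot + 1) mod Length(Table);   // try the next slot, wrapping
  Table[Slot] := Value;
end;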

The other solution works with buckets and is a bit faster, but requires more lists to be maintained. When Hash(A) is added, the position is empty, so a new list is created at that position and A is added to it. Then Hash(B) arrives at the same address and checks whether B is already in the list; it isn't, so B is added to the same list. The same goes for all other hashes that happen to collide. The advantage of the table/bucket method is that the hash table can actually be a lot smaller than the number of records you want to store. It's just that the number of items in each bucket may grow quite large if many records happen to collide on the same value.

Say that you have 1000 records. The chance of a collision in a hash table of size 1000 is roughly 1 in 1000 for every record past the first one, so you should expect at least one collision. With a table size of 2000 the chance drops to 1 in 2000, so there's a fair chance you avoid any collisions. Increase the table size even more and the chance of collisions decreases further.

Just try to see if "MsgSH:=TStringHash.Create(Result * 5);" increases the performance, just to humor me. ;-)

And as I said, the performance depends heavily on the hash table class that you're using. For the project I worked on, I just created my own hash table with an enormous size because I knew I had lots of memory available in the first place. It improved the speed considerably. I could have made it smaller, but I prepared the system so it could handle huge imports. And it's awesome to see how it extracted all duplicate records from a file of 50 MB and more in just a matter of seconds. And then all non-duplicate records were inserted into a database, and that took between 30 and 90 minutes... :-(
But okay, I did find the duplicates lightning-fast... :-)
 
LVL 12

Author Comment

by:Lee_Nover
>Just try to see if "MsgSH:=TStringHash.Create(Result * 5);" increases the performance, just to humor me. ;-)
I actually did that .. when testing with 100,000 records there were no collisions
with 1 million there were quite a few .. multiplying by 5, there were no collisions
creating the flat list took 2 secs and threading it took 7 secs ..
surprisingly the test returned about the same speed results in both cases .. leaning towards the smaller sized bucket !
I'll play around a bit more .. tnx so far :)
 
LVL 12

Author Comment

by:Lee_Nover
the smaller sized list, that is!
 
LVL 17

Assisted Solution

by:Wim ten Brink
Wim ten Brink earned 400 total points
Weird. More collisions when you increase the hash table? That would mean the hashing algorithm used in your code doesn't generate hashes that are random enough. A good hash algorithm would not favor any value over others. The hash algorithm I used in my project never generated many collisions. If you can change the algorithm, you might try this one:

function TPJWHashChecker.HashValue(const Value: ShortString): Integer;
var
  I, G: Integer;
  Len: Integer;
begin
  Len := Length(Value);
  // Initialize the hash value.
  Result := 0;
  // Now fold every character into the hash, PJW-style.
  for I := 1 to Len do begin
    Result := (Result shl 4) + Ord(Value[I]);
    // Fold the top nibble back in and clear it, so the intermediate
    // value stays within 28 bits and never overflows.
    G := Result and $F0000000;
    if (G <> 0) then Result := Result xor (G shr 24) xor G;
  end;
  // Fit result in preferred range.
  Result := Result mod 2499997;
end;

The value 2499997 is the one I used in my code. Also important: this number is a prime number! That helps to avoid collisions too. :-)

Oh, well... All I'm saying is that you might get a performance increase by improving your TStringHash class.
 
LVL 12

Author Comment

by:Lee_Nover
no no .. misunderstood
there are fewer collisions with a larger list, of course
but the speed didn't increase that much .. when I created a list of 256 buckets the performance got a bit worse (<10% worse than a 1M list :))
TStringHash is pretty generic and works well .. it's in the IniFiles unit if you care to check :)
the hashing algo looks similar to yours .. it also uses the list size in the calculation of the hash
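For anyone reading along: the TStringHash surface in IniFiles is tiny, and usage looks like this (Delphi 6/7; the key and table size here are made up for the demo):

program StringHashDemo;
{$APPTYPE CONSOLE}

uses
  IniFiles;

var
  SH: TStringHash;
begin
  // the bucket count is fixed at Create time; a prime spreads keys better
  SH := TStringHash.Create(786433);
  try
    SH.Add('some-msg-id', 42);           // maps a string key to an Integer
    WriteLn(SH.ValueOf('some-msg-id'));  // prints 42
    WriteLn(SH.ValueOf('unknown-id'));   // prints -1, i.e. 'not found'
  finally
    SH.Free;
  end;
end.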
 
LVL 7

Accepted Solution

by:sftweng
sftweng earned 100 total points
I support Workshop_Alex's recommendation to make the hash table size a prime number. Sorry I haven't been contributing here; we've been putting the final touches on release 2 of our product for unveiling at a conference this week.
 
LVL 17

Expert Comment

by:Wim ten Brink
It's in the IniFiles unit? Not in Delphi 5... :-( Oh, well. I'll have a look when I'm back home, using Delphi 7 or 8...
However, if you use a prime number for the list size, you will have fewer collisions than with an arbitrary size. An even table size might even give more collisions, and a multiple of 16 or 256 could be worse still.

And one other tip: if you want to improve performance, avoid modifying strings as much as possible.
 
LVL 12

Author Comment

by:Lee_Nover
now I'm using ListSize:=GetNextPrime(RecordCount); - no collisions in 2M records .. so I guess it'll be fine :) (a sketch of such a helper is below)
and no, I don't modify any strings :) .. just rearranging the parent-child relations (threading the flat list)
btw .. comment in this topic: http://www.experts-exchange.com/Programming/Programming_Languages/Delphi/Q_21164814.html :)
tnx for your time and the help
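GetNextPrime isn't in the posted code, so here is a rough sketch of what such a helper could look like (plain trial division; not necessarily Lee's actual implementation):

// Returns the smallest prime >= N. Trial division is plenty fast here,
// since it runs once per table, not once per record.
function GetNextPrime(N: Integer): Integer;

  function IsPrime(Value: Integer): Boolean;
  var
    D: Integer;
  begin
    Result := False;
    if Value < 2 then Exit;
    if Value mod 2 = 0 then
    begin
      Result := Value = 2;
      Exit;
    end;
    D := 3;
    while D * D <= Value do
    begin
      if Value mod D = 0 then Exit;  // found a divisor: not prime
      Inc(D, 2);                     // only odd candidates
    end;
    Result := True;
  end;

begin
  Result := N;
  if Result <= 2 then
  begin
    Result := 2;
    Exit;
  end;
  if Result mod 2 = 0 then Inc(Result); // primes above 2 are odd
  while not IsPrime(Result) do
    Inc(Result, 2);
end;

Usage as above: ListSize := GetNextPrime(RecordCount);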
 
LVL 7

Expert Comment

by:sftweng
Another strategy might be to use a B-tree (balanced tree) index rather than a hash table. Depending on the complexity of the hashing algorithm, this might be a faster approach, given that lookups are essentially logarithmic.
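A middle ground that's already in the VCL: a sorted TStringList does its lookups with a binary search, so it behaves a bit like a tree index without writing one. A quick sketch (the message IDs are made up; Objects[] carries the integer value, the way the hash carries it):

program SortedLookup;
{$APPTYPE CONSOLE}

uses
  Classes;

var
  List: TStringList;
  Idx: Integer;
begin
  List := TStringList.Create;
  try
    List.Sorted := True;  // Find/IndexOf now use binary search
    // the Objects[] slot can carry the record index, like the hash's value
    List.AddObject('msg-id-0001', TObject(0));
    List.AddObject('msg-id-0002', TObject(1));
    if List.Find('msg-id-0002', Idx) then
      WriteLn('found at ', Idx, ', value = ', Integer(List.Objects[Idx]));
  finally
    List.Free;
  end;
end.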
