Algorithm

Hi there,

I'm about to implement a map structure that holds unique IDs (integers) as keys. Well, there are several ways to do this, but I'm looking for the fastest one (with respect to lookups).
I think that a binary tree would be fast, but I'm not sure if it is the fastest - does anyone have a better idea or another solution?
I was thinking about a hash table too...

The structure is assumed to hold around 25000 entries.
The IDs are NOT sequential (so plain indexed TLists etc. won't do)

Regards
KE Asked:

 
inter Commented:
Thanks and c.u.
igor
 
Lischke Commented:
Hi KE,

nothing beats binary search in this regard (except direct indices, but those are extremely memory intensive).

I'd suggest that you use a simple integer set (simulated by an array of Integer). Put a class around it to ease inserts and deletes, or use simple support functions. Insert new values in order, so you can find a specific value extremely fast (1 million items are looked up in at most about 20 comparisons).

Ciao, Mike
 
KE (Author) Commented:
Hi there Mike,

Well, I do not understand your suggestion completely. How do you expect to search the integer set?

Regards

 
inter Commented:
The standard data structure for this kind of problem is generally the m-way search tree, which is a generalized binary tree. The order m speeds up the search; let's roughly look at the complexity of the search:

binary tree : log2(N)
m-way tree : logm(N), where N is the item count

let's look at the memory consumption by examining the nodes:

PBinNode = ^TBinNode;
TBinNode = packed record
   Left,Right : PBinNode;
   Data : Integer;
end;

PMwayNode = ^TMwayNode;
TMwayNode = packed record
  n1: PMwayNode;
  data1 : integer;
  n2: PMwayNode;
  data2 : integer;
....
  nm: PMwayNode;
end;

the sizes of the binary and m-way nodes are 12 and 8m-4 bytes respectively (note that the binary tree is the case m=2)

forgetting about the head pointers etc., 25000 items require

binary : 25000*12/1024 = 293K
m-way : about 25000/(m-1) nodes of 8m-4 bytes each = 98*(2m-1)/(m-1) K (for m=2 this again gives 293K)

however, as Lischke pointed out, there is a compromise between speed and memory, and you should decide on that. One other factor is implementation complexity: binary trees can be constructed without much effort, but general m-way trees are considerably more complicated to implement than binary (2-way) trees.
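To make the lookup concrete, here is a rough, hypothetical sketch of the m-way descent (not from the original thread; for readability it assumes a node layout with the keys and children in small arrays instead of the named fields above):

```pascal
// Hypothetical sketch of an m-way search. The record layout here is an
// assumption made for clarity; it replaces the n1/data1/n2/... fields.
const
  m = 4; // order of the tree; m = 2 gives an ordinary binary tree

type
  PMwayNode = ^TMwayNode;
  TMwayNode = record
    Count: Integer;                          // keys in use (at most m - 1)
    Keys: array[0..m - 2] of Integer;        // sorted keys
    Children: array[0..m - 1] of PMwayNode;  // subtrees around the keys
  end;

function MwaySearch(Node: PMwayNode; const Key: Integer): Boolean;
var
  I: Integer;
begin
  Result := False;
  while Node <> nil do
  begin
    // find the first key that is >= Key
    I := 0;
    while (I < Node.Count) and (Node.Keys[I] < Key) do
      Inc(I);
    if (I < Node.Count) and (Node.Keys[I] = Key) then
    begin
      Result := True;
      Exit;
    end;
    Node := Node.Children[I]; // descend; about log_m(N) levels are visited
  end;
end;
```

Each node filters the search into one of m subtrees, which is where the logm(N) depth comes from.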

regards, igor
 
Lischke Commented:
Mmmh, I wasn't thinking of trees but of simple arrays, as this is enough for the case we discuss here:

type
  TIntegerSet = array of Integer;

This one consumes a fraction of the memory of the trees above and should be equally fast as the m-way approach. On the other hand it is a bit slower on inserts, as you may have to move a large amount of memory around to insert an item, say at position 0 of an array that already holds 2.000.000 items. For 25.000 items this shouldn't be a problem, though.

The search function is:

function Find(const List: TIntegerSet; const Entry: Integer; var Index: Integer): Boolean;

var
  L, H, I, C: Integer;

begin
  Result := False;
  L := 0;
  H := High(List);
  while L <= H do
  begin
    I := (L + H) shr 1;
    C := List[I] - Entry;
    if C < 0 then L := I + 1
             else
    begin
      H := I - 1;
      if C = 0 then
      begin
        Result := True;
        L := I; // the IDs are unique, so stop at the first match
      end;
    end;
  end;
  Index := L;
end;

This function is also very common in the VCL for various tasks (see the source of TStringList). There you can also see what to do to insert an item (despite the memory moves involved, this shouldn't really be a problem...).
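A sketch of such an insert, built on the Find function above, might look like this (hypothetical; the name InsertSorted and the tail-shifting loop are assumptions, not from the thread):

```pascal
// Hypothetical sketch: insert Entry into the sorted array, keeping order.
// When Find fails, its Index parameter reports the insertion position.
procedure InsertSorted(var List: TIntegerSet; const Entry: Integer);
var
  Index, I: Integer;
begin
  if Find(List, Entry, Index) then
    Exit; // the IDs are unique, so duplicates are simply ignored
  SetLength(List, Length(List) + 1);
  // shift the tail one slot up to make room at Index
  for I := High(List) downto Index + 1 do
    List[I] := List[I - 1];
  List[Index] := Entry;
end;
```

The shifting loop is the memory move discussed above; on average half the array is copied per insert.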

Ciao, Mike
 
inter Commented:
a little, tiny note, if you do not mind: for trees, the search gets faster as m grows - going from a binary tree to a 4-way tree, for example, halves the number of levels visited.
I am by no means justifying the use of such a thing here, just a note.

;-)
regards, igor
 
KE (Author) Commented:
Mike,

OK, I understand you now... - you are suggesting a "TList" with an enhanced search routine that halves its search range until the value is found (this search scheme actually has a name, but I don't recall it).
As you mention, it's slow to insert and delete objects, because of the memory moves involved. Although you say that it shouldn't be a problem to insert a single item among 25000 others (I agree), it would be a problem when you choose to insert 5000 items into 20000...
I don't think I will be satisfied with this approach... but I may give it a shot.


Igor,

This is more what I was thinking of. I see that the multiple way tree has some seeking advantages, but unfortunately I don't have any code that implements it. Do you know of a public Delphi implementation (alternatively in C++) ?


Regards
 
inter Commented:
The name of the scheme is binary search. Actually, whether the list alternative is good enough depends on the minimum target computer power. Before talking about the search tree, I'd better lay down the solution to the problem you stated above. If the number of elements to be inserted is fairly large, another standard sorting method comes to help. The method is as follows:
1 - You have your original sorted list L1
2 - Form a second temp list L2 of the new items and sort it with quicksort, yielding N*log2(N) complexity (no memmoves, just swap the pointers)
3 - Combine L1 and L2 into L3 with a MERGE step (the same step merge sort uses to combine already sorted lists); its time complexity is linear, O(n + m) (where n and m are the item counts of L1 and L2 respectively).
So it can still solve your problem.
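A rough sketch of the merge step in point 3 (hypothetical; it assumes the TIntegerSet type from Mike's earlier comment):

```pascal
// Hypothetical sketch: merge two already sorted arrays in linear time.
// Each loop iteration copies exactly one element, so the total work is
// proportional to Length(L1) + Length(L2).
function MergeSorted(const L1, L2: TIntegerSet): TIntegerSet;
var
  I, J, K: Integer;
begin
  SetLength(Result, Length(L1) + Length(L2));
  I := 0; J := 0; K := 0;
  // take the smaller head element of the two lists each time
  while (I < Length(L1)) and (J < Length(L2)) do
  begin
    if L1[I] <= L2[J] then
    begin
      Result[K] := L1[I];
      Inc(I);
    end
    else
    begin
      Result[K] := L2[J];
      Inc(J);
    end;
    Inc(K);
  end;
  // copy whatever remains of the list that is not yet exhausted
  while I < Length(L1) do
  begin
    Result[K] := L1[I]; Inc(I); Inc(K);
  end;
  while J < Length(L2) do
  begin
    Result[K] := L2[J]; Inc(J); Inc(K);
  end;
end;
```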
Regarding the trees, yes, there is code for Pascal and especially for C. In fact old Turbo Pascal shipped with b-tree source code, but I must search to find it.
So my humble advice for the problem is

1 - If you plan to run those things on a computer faster than 200MHz, do not worry yourself with the trees. The list approach can be constructed easily and fast.
2 - Else we may give the list a try, and if it is slow, we may try the tree approach.

by the way, what is your core data structure, i.e. what should be linked with the key fields? Can memory hold enough records, or do we need to develop lists on files etc.?
igor
 
KE (Author) Commented:
Well, I've done some tests on the "list" approach. The searching is indeed fast - no problems. The insertions are not so good, though.
I'm running on a 500MHz PIII system and did some test:

1. 25000 items inserted at random positions - 1420ms
2. 25000 items inserted in order (by call to find) 1450 ms
3. 50000 items inserted at random positions - 5940ms
4. 50000 items inserted ordered - 5990ms

The 25000 items is an estimate that I've made, which should reflect the number of files on a given filesystem.
In some cases there may be 50000 or more files (especially with today's storage prices ;-)
So I'm a little concerned that the "list" structure will "take forever" on a large file system (consider a worst case: a 200MHz PPro with 80000 files - then we are counting in minutes).

I've tested the same amounts with a red/black binary tree:
1. 25000 items inserted "in order" - 125 ms
2. 50000 items inserted "in order" - 250 ms

The binary tree is of course faster, since it doesn't need to reallocate and move the memory. With the list, I've tried to set the capacity before the insertions, with no result (as the memory needs to be moved anyway). If I set the capacity, simply add the items, and then sort the list afterwards, I can get down to 50ms (on 25000) - I will try to see if I can use this approach; I'm not sure right now, since the list is rather dynamic.
I need to build the list from scratch with a large number of items (not sure if this can be done "unsorted") - after that I will do random and (up to 5000) isolated inserts/deletes.

I'll get back with more details when I've tried to see whether I can use the Add/Sort approach to initially build the list.
It might turn out to be the fastest way overall...

Regards, and thanks for your good advice...

PS. I'll throw in some extra points later for both of you !
 
Lischke Commented:
Yes, quite what I expected. Notice also that reallocating the list memory each time a new item is inserted is considerably slower than allocating it in chunks and keeping a running index. In a control I'm currently working on I use:

procedure TCustomVirtualTree.AddToSelection(const Node: PVirtualNode);

var
  Len: Integer;
 
begin
  Assert(Assigned(Node), 'AddToSelection: Node must not be nil!');
  if not (vsSelected in Node.States) then
  begin
    Include(Node.States, vsSelected);
    Len := Length(FSelection);
    if FSelectionCount >= Len then
    begin
      if Len < 100 then SetLength(FSelection, 100)
                   else SetLength(FSelection, Len + Len div 10);
    end;
    FSelection[FSelectionCount] := Node;
    Inc(FSelectionCount);
  end;
end;

You see here that I enhance the array by 10% of its former size. This adds an overhead of 2087 bytes for 1.000.000 entries. Allocating memory the first time also costs a great deal of time under Delphi, but once it has been allocated you can free and reallocate it very quickly. As I don't insert in order into the selection array, the method above is very fast (~3000ms for the first-time allocation of 1M entries, ~1500ms for each following allocation).

Ciao, Mike
 
Lischke Commented:
Oh, I forgot: the times given above are on a dual P200MMX (a quite old and slow machine :-)

Ciao, Mike
 
AttarSoftware Commented:
Is it faster to add the new items to the back of the list, and then qSort the list again?

Just asking...

Tim.
 
Lischke Commented:
That's what KE meant when talking about 50 ms for the list approach.

Ciao, Mike
 
williams2 Commented:
Hi everybody, just for completeness (I know Lischke mentioned this earlier), I will post an example of the use of direct indices. I know this is not a good idea if the map has to hold quite large numbers, but the memory used is constant anyway.

Var
  Table: Array[0..9999] of Byte;
//Maximum number = (High(Table) + 1) * 8 = 80000
//If this is replaced by memory-aligned
//32bit integer values, the processor will
//get to this even faster.

Function LookupNumber(Const Number: Integer): Boolean;
Var
  Bit: Byte;
begin
  //Can handle 80.000 numbers
  Bit:= 1 shl (Number AND 7);
  Result:= Table[Number shr 3] AND Bit > 0;
End;

Procedure SetNumber(Const Number: Integer);
Var
  Bit: Byte;
Begin
  If not LookupNumber(Number) Then
  begin
    Bit:= 1 shl (Number AND 7);
    //OR in the new bit, so the other bits of the byte are preserved
    Table[Number shr 3]:= Table[Number shr 3] OR Bit;
  End;
End;
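A short, hypothetical usage example of the two routines above (the numbers are arbitrary):

```pascal
// Hypothetical usage of the bit table above.
SetNumber(12345);                  // mark 12345 as present
if LookupNumber(12345) then
  WriteLn('12345 is in the set');
if not LookupNumber(54321) then
  WriteLn('54321 was never set');
```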

This technique is especially good when sorting faces (using the Z-buffering technique) in a 3D rendering algorithm.

...So why am I posting this anyway? ...This *can* also be used in conjunction with the binary search method, allowing several maps to exist while caching only the most recent ones. So if one 10Kb map can handle 80.000 numbers, how often are we then going to change maps? Is this technique useful for anything but memory consumption?

Regards,
Williams
 
Lischke Commented:
Williams,

right, this is the old game of trading memory for speed. But it surely has its applications. The other approaches mentioned so far are more general, as they don't depend on predefined sizes. Your technique is often used together with hashing, btw.

Ciao, Mike
 
KE (Author) Commented:
Hi there,

Mike, Igor:
Well, I've tested the list approach, and I think I will use it. I may have to find another solution if my estimates are too optimistic, but for a start it will do.

What I've found:
Build 80000 items - 141ms
Random Inserts (maintain order) 5000 - 2015ms
Build 200000 items - 406ms
Random Inserts (maintain order) 5000 - 13141ms

It's quite a lot of processing time, even for the 80000 items. Two seconds is a lot when speaking of CPU time (especially when you know what it's "wasting" the time on ;-)...
The 200000 was just to get a feel for the progressive time consumption - I don't think I will get this high, at least not for the next couple of years...


Williams:

Very interesting bitset manipulation - but how do you map the number to a pointer this way?
I would say that an Array[0..x] of Pointer would be somewhat easier ;-)



Mike, I'll place another question for you to answer - Igor, place an answer on this one. You'll both get a 100pt grade A, is that ok with you?


Regards everyone, and thanks

PS. Isn't this stuff way more interesting than "How do I enable my mouse button bla bla"? I wish there were more questions about "the beauty of programming" - not to say that either kind is more important...

 
Lischke Commented:
KE, I agree. But questions like this one usually need much more work, while asking about a button press is answered in a few lines...

I assume your "How old are you" question is the one I should answer to get the mentioned 100 pts :-)

Ciao, Mike
0