std::map<string,string>

Solved

Posted on 2006-03-22

I'm researching in-memory data structures that provide fast lookup time, like Hashtables and T-trees. I'm trying to find out what would be the best data structure to use when I have a table composed of keyfields and datafields, where you can retrieve the datafield by looking up the keyfield. The keyfield would be a word, and the datafield would be the definition of the word.

Some questions I have:

Which data structure is better suited for this, a hashtable or a T-tree?

Where can I find source code examples for these structures (preferably in C++ or C)

Some questions I have:

Which data structure is better suited for this, a hashtable or a T-tree?

Where can I find source code examples for these structures (preferably in C++ or C)

11 Comments

Will the table be created once and then referenced only, or, will you be adding new entries often? If so, is it important to have rapid completion for every addition? Like, when using a tree-structure scheme, some additions will provoke a rebalancing of the tree and some of these re-balancings will involve a lot of work.

Will all keys have about equal likelihood of being requested, or will certain keys predominate? In English text, the word "the" is the most frequently appearing, while "zozzle" is rare.

Have you some ideas as to the likely number of keys? This is important for using a hashing system. Too small a hash table leads to sub-searches, too large wastes space.

I prefer an indexed hash with linked lists and have had satisfaction. Following Knuth, choose a prime number about double the number of the mostest entries you expect to have to deal with, or as big as seems convenient with regard to available storage as consumed; call this N. You will need an array HashIndex(0:N - 1), to point to stored entries. A stored entry contains a copy of the key, the data belonging to the key, and a link (probably zero) to a further entry, needed if more than one key, as encountered, hashed to the same index. The stash of stored entries grows linearly as new entries are added; the HashIndex array is used at random, and the birthday paradox applies for its occupancy.

Generally, the hash table approach delivers its result in one hash probe only, with one key comparison and sometimes two and rarely three should the linked-list have to be chased... Exact statistics depend on how many entries there are compared to N, but even if there were twice as many stored entries as N, average performance would only degrade to around two key comparisons per search. It is quite practical to have the HashIndex array in main memory (typically an array of a few thousand 32-bit numbers), while the stash is on disc storage.

A benefit of the indexed hash table (instead of open hashing) is that popular entries will tend to be together at the start of the entry store, because they would have been likely to have been encountered early as the stash was expanded. Depending on circumstances, you could pre-load the stash with known-to-be-popular entries, or, once the stash has grown to a worthy size, your system could pause and sort its entries according to popularity. If the keys to be stored were all known beforehand and fixed (as with the keywords of a computer language), you could try twiddling your hash function so that a variant is discovered that assigns each of the known keys a unique hash index number.

Of course, the only definitive answer as to which would be best for your problem would be to implement both and run a comparison...

To be more specific, I'd be creating two data structures, one with only about 30 keys, and the other with approximately 200 keys. Both of them would be created once at runtime, and then never added two or deleted from.

Perfect Minimal Hash Function.

SPEED:

The hashtable insert and search speed is O(1) (On average, it takes aproxamatly 1 compare operation to insert or search for an item). The binary tree (if it is balanced) has a insert/search speed of O(lg N). For a binary tree of size 200 that computes to approximatly O(6) (On average it will take approxamatly 6 compares to find and item in the tree).

Memory:

A hashtable intrinsicly uses more space than a binary tree. The binary tree of N elements will allocate space for N elements. A hashtable has a more complex algorithm, the size should be a prime number LARGER than the # of elements stored in it. The reason is that it decreases the probability that the hash function will compute the same location for to different keys (this is called a collision). When the collision occurs, the hashtable will use an alternate method for storing the element (it could be a linked list, or it could choose the next empty slot in the table). So for a hashtable of 200 items, you could choose a size of 401.

Here are some implementations of balanced Binary trees.

http://www.gamedev.net/ref

http://www.codeguru.com/cp

http://www.codeproject.com

As for hashtables, I would use the STL implementation (it has been tested and created by experts)

If you want to examine source implementations, here's one:http://burtleburtle.ne

If you want more just google for it.

With so few entries, speed is important only if you desire to effect many searches in a hurry. Likewise, unless space is at a premium (with gigabyte memories to splurge in?), a hashtable of 400 entries is surely trivial.

Will you know beforehand whether a key belongs to the 30 group or the 200 group? If not, and even if so, rather than have two hash stashes, having one combined would be OK, and a field in the entry would indicate which group it belonged to. The fun part of hash table is that one probe finds the entry (almost always, etc) thus if you dodn't know which group a key belongs to, searching one table and then the other if not found would be two probes and more effort when a single probe of a combined table would reveal all in one go.

Of course, you're right that memory is no big deal. I have 4 GB to spare here, but nonetheless the concept is interesting and it would be very satisfying to create a perfect hash function.

I have used a scheme somewhat as follows, using 32-bit 2's complement arithmetic as it is convenient, but unsigned arithmetic might be better as then no negative numbers can arise to confuse the mod operation. Whichever, here is the scheme:

h:=0; %Or, h:=length(string) just for variety.

for i:=1 to Length(string) do

h:=h*c1 + int(string[i]) + c2; %I don't care about integer overflow. Beware system overhead.

next i;

h:=h mod N; %Beware of mod's response to negative numbers. Ensure that h is in 0:N-1.

For c1, I might have chosen 493, for c2, zero. It doesn't much matter. The object is to thoroughly scramble the bits, and to do so without extensive computation and complex paramaterisation. (100+ values, forsooth!) Perhaps you can equivalence the string to an array of 16-bit integers and crunch double-byted - including the length is quite satisfactory also. Searching for perfection I suspect involves probabilities such as 1/N! which are rather thin.

Obviously, you could run a simple search for values of c1 and c2 that minimise the clashes for your choice of N and collection of keys. The functional form can be varied also, though one does seek to avoid extensive computation. Since these days functions such as ArcTan seem to be computed about as fast as a multiply, further opportunities present themselves beyond fast simple operations such as XOR.

An adequate goal would be a hash function that produces no collisions for your set of keys and choice of N. Then there would be no need for chaining, and thus no need for those fingers. With a N of 400, you still have no more fingers than would a binary tree for 200 entries.

For example if N is 200 which breaks down to 200=2*2*2*5*5. Assuming you have a good hash function, 50% of your numbers will be divisable by two and 20% will be divisable by 5. When the hash value is moded with N you will get the following:

N is even

N mod 200 == N mod (100*2) == N mod 100+N mod 2 == N mod 100 (since n mod 2==0)

Now if N/2 is a multiple of 5

N mod 200 == N mod (20 * 5) == N mod 20 + N mod 5 == N mod 20

If you hash 200 keys with 50% of them going to slots 0-99 and 20% of these 100 going to slots 0-19, then the probably of a collision is MUCH higher. (note: My knowlege of probablity and statistics is probably wrong, but I hope you get the idea).

When you are dealing with primes the probablity of a collision with the hash number and your N is minimized!

So the sizes of your hash tables should be at least 31 and 211 (the next primes after 30 and 200).

With regard to the hash function, one way ahead is to to think of the keys as being numbers in base 26 (if just A:Z, not distinguishing capital and lower case letters, etc.), thus a key "aaa" might correspond to .000, and "zzz" to .999 though if you include a space to allow keys such as "aa " and "nm " a slight rearrangement is needed, as would be needed for other additional symbols. One amusing ploy (and requiring no translation) is to have the storage area holding the text of the keys to be overlaid on an integer variable that swallows the first four bytes (or first two for 16-bit, if all your keys differ in the first two characters); an eighty-bit floating point number could hold ten bytes, but the numerical values would be rather wild, and you'd have to be sure that all key texts generated bit sequences that represent valid floating-point numbers: in this case, fun functions such as arctan could be invoked... More likely, you'll have to do some sort of conversion and obtain a better-behaved number thereby. In the calculation I offered above, this might be done by setting c1 to 26 and c2 to -(int("a")), or similar, and rescaling to bring the result into 0 to 1.

Whichever, your collection of keys are thus viewed as points scattered along a line stretching from zero to one, and a vital question is just how close is the closest pair of such keys. Suppose that stretching the scale from 0 to 1 by a factor of 300·37 is such that the closest pair of points fall to different integers (perhaps aided by cunning choice of using Trunc or Round with a view to splitting the closest pair) - happy days. Nothing more need be done as 301 is a reasonable size for the HashIndex array - annoying issues arise at to whether the end points of zero and one are attained, or merely approached, so need the array run from 0 or 1 to 299 or 300 or 301? Actually, the array need only store the range corresponding to the first and last key point, probably less than a span of 300 unless one key is really "aaa" translating to 0 and another is "zzz" translating to 1.

Matters may be less convenient, and a stretch by a factor of 666 (say) may be needed and that is too big for your taste. Suppose you like a table size of 337 (a prime); H mod 337 will fold points 337 to 666 back to 0:336 and it may be that none of the folded points will fall into slots 0:336 that are already occupied. That would be nice. If not, you could try slightly different stretch factors and table sizes, or, (ugh) a special case test for just one or two troublesome key points... So long as you find values such that there are no collisions for your keys and the hash calculation is swift, it does not matter if the table size is prime or not. Expediency rules.

I'd be curious to know how you get on with your keys. Can you name them?

By clicking you are agreeing to Experts Exchange's Terms of Use.

Title | # Comments | Views | Activity |
---|---|---|---|

Doc'in system (example?) BA | 7 | 63 | |

JQuery Date Time picker not showing | 29 | 82 | |

pre4 challenge | 19 | 73 | |

noX challenge | 17 | 51 |

Join the community of 500,000 technology professionals and ask your questions.

Connect with top rated Experts

**11** Experts available now in Live!