asked on

Hashtable vs. T-tree

I'm researching in-memory data structures that provide fast lookup time, like Hashtables and T-trees. I'm trying to find out what would be the best data structure to use when I have a table composed of keyfields and datafields, where you can retrieve the datafield by looking up the keyfield. The keyfield would be a word, and the datafield would be the definition of the word.

Some questions I have:

Which data structure is better suited for this, a hashtable or a T-tree?

Where can I find source code examples for these structures (preferably in C++ or C)

ozo

in C++ you could use
std::map<string,string>

ASKER CERTIFIED SOLUTION

RNMcLean

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

chsalvia

ASKER

RNMcLean,

To be more specific, I'd be creating two data structures, one with only about 30 keys, and the other with approximately 200 keys. Both of them would be created once at runtime, and then never added two or deleted from.

ozo

With that few keys, created only once, it should be easy enough to create a
Perfect Minimal Hash Function.

ozo

http://www.nist.gov/dads/HTML/minimalPerfectHash.html

SOLUTION

MogalManic

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

RNMcLean

Referring MogalManiac's remak to ozo's: if indeed you were able to contrive a perfect minimal hash function (where each of the 200 keys hashes to a unique integer in 1:200, an extraordinarily neat trick) for your problem, then, the HashIndex table would have one finger for each entry only, occupying less space than the finger collection of a binary tree, which must maintain two fingers for each entry. Thus, there is some slop available; the hash function need not be perfectly minimal so that a suitable one, easy enough to compute, should be more findable and yet require less store than a binary tree scheme. Remembering that if the hash function is perfect (each key goes to a distinct index) then there will be no need for the entries to have a link to other entries whose key hashes to the same index.
With so few entries, speed is important only if you desire to effect many searches in a hurry. Likewise, unless space is at a premium (with gigabyte memories to splurge in?), a hashtable of 400 entries is surely trivial.
Will you know beforehand whether a key belongs to the 30 group or the 200 group? If not, and even if so, rather than have two hash stashes, having one combined would be OK, and a field in the entry would indicate which group it belonged to. The fun part of hash table is that one probe finds the entry (almost always, etc) thus if you dodn't know which group a key belongs to, searching one table and then the other if not found would be two probes and more effort when a single probe of a combined table would reveal all in one go.

chsalvia

ASKER

I think a Hashtable would be a better way to go. But even though I know the key values beforehand, creating a perfect hash function seems extremely difficult. I downloaded a GNU program called "gperf" which is meant to generate a perfect hash function for a set of keys. Although, it never actually does - (for 30 keys there were 10 or so empty slots). Plus, what I really don't like is that gperf creates the hash function by generating a separate array populated with over 100 slots with various numbers, which it uses for the hash function. To me this seems like it's defeating the point of having a minimal perfect hash in the first place. Instead of wasting memory with a largely unpopulated hash table, they're wasting memory with another array.

Of course, you're right that memory is no big deal. I have 4 GB to spare here, but nonetheless the concept is interesting and it would be very satisfying to create a perfect hash function.

SOLUTION

RNMcLean

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial

MogalManic

You want to pick the size of your hash table to be a PRIME number. The reason is this will minimize the clustering of your hash codes.

For example if N is 200 which breaks down to 200=2*2*2*5*5. Assuming you have a good hash function, 50% of your numbers will be divisable by two and 20% will be divisable by 5. When the hash value is moded with N you will get the following:
N is even
N mod 200 == N mod (100*2) == N mod 100+N mod 2 == N mod 100 (since n mod 2==0)
Now if N/2 is a multiple of 5
N mod 200 == N mod (20 * 5) == N mod 20 + N mod 5 == N mod 20

If you hash 200 keys with 50% of them going to slots 0-99 and 20% of these 100 going to slots 0-19, then the probably of a collision is MUCH higher. (note: My knowlege of probablity and statistics is probably wrong, but I hope you get the idea).

When you are dealing with primes the probablity of a collision with the hash number and your N is minimized!

So the sizes of your hash tables should be at least 31 and 211 (the next primes after 30 and 200).

RNMcLean

Err... 150 mod 200 is 150, not 50 which would be 150 mod 100, but no matter, prime sizes are still in favour, though perhaps not so much as in the general case...

With regard to the hash function, one way ahead is to to think of the keys as being numbers in base 26 (if just A:Z, not distinguishing capital and lower case letters, etc.), thus a key "aaa" might correspond to .000, and "zzz" to .999 though if you include a space to allow keys such as "aa " and "nm " a slight rearrangement is needed, as would be needed for other additional symbols. One amusing ploy (and requiring no translation) is to have the storage area holding the text of the keys to be overlaid on an integer variable that swallows the first four bytes (or first two for 16-bit, if all your keys differ in the first two characters); an eighty-bit floating point number could hold ten bytes, but the numerical values would be rather wild, and you'd have to be sure that all key texts generated bit sequences that represent valid floating-point numbers: in this case, fun functions such as arctan could be invoked... More likely, you'll have to do some sort of conversion and obtain a better-behaved number thereby. In the calculation I offered above, this might be done by setting c1 to 26 and c2 to -(int("a")), or similar, and rescaling to bring the result into 0 to 1.
Whichever, your collection of keys are thus viewed as points scattered along a line stretching from zero to one, and a vital question is just how close is the closest pair of such keys. Suppose that stretching the scale from 0 to 1 by a factor of 300·37 is such that the closest pair of points fall to different integers (perhaps aided by cunning choice of using Trunc or Round with a view to splitting the closest pair) - happy days. Nothing more need be done as 301 is a reasonable size for the HashIndex array - annoying issues arise at to whether the end points of zero and one are attained, or merely approached, so need the array run from 0 or 1 to 299 or 300 or 301? Actually, the array need only store the range corresponding to the first and last key point, probably less than a span of 300 unless one key is really "aaa" translating to 0 and another is "zzz" translating to 1.
Matters may be less convenient, and a stretch by a factor of 666 (say) may be needed and that is too big for your taste. Suppose you like a table size of 337 (a prime); H mod 337 will fold points 337 to 666 back to 0:336 and it may be that none of the folded points will fall into slots 0:336 that are already occupied. That would be nice. If not, you could try slightly different stretch factors and table sizes, or, (ugh) a special case test for just one or two troublesome key points... So long as you find values such that there are no collisions for your keys and the hash calculation is swift, it does not matter if the table size is prime or not. Expediency rules.
I'd be curious to know how you get on with your keys. Can you name them?