  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1430
  • Last Modified:

Hashtable vs. T-tree

I'm researching in-memory data structures that provide fast lookup time, like Hashtables and T-trees.  I'm trying to find out what would be the best data structure to use when I have a table composed of keyfields and datafields, where you can retrieve the datafield by looking up the keyfield.  The keyfield would be a word, and the datafield would be the definition of the word.  

Some questions I have:

Which data structure is better suited for this, a hashtable or a T-tree?

Where can I find source code examples for these structures (preferably in C++ or C)?



chsalviaAsked:
3 Solutions
 
ozoCommented:
in C++ you could use
std::map<string,string>
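A minimal usage sketch of that suggestion for the word/definition table in the question (the words and definitions below are made-up placeholders):

#include <iostream>
#include <map>
#include <string>

int main() {
    // std::map keeps its keys in a balanced tree; lookup is O(log N).
    std::map<std::string, std::string> dict;
    dict["apple"]  = "a round fruit";           // hypothetical entries
    dict["banana"] = "a long yellow fruit";

    std::map<std::string, std::string>::iterator it = dict.find("apple");
    if (it != dict.end())
        std::cout << it->first << ": " << it->second << "\n";
    return 0;
}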
 
RNMcLeanCommented:
  Your operating system and language might offer you a random-key file record access service, in which case you could just use the system service. If however you want to write your own scheme, preferably taking advantage of the special behaviours of your problem, then it depends...
   Will the table be created once and then referenced only, or will you be adding new entries often? If so, is it important that every addition complete quickly? For example, when using a tree-structured scheme, some additions will provoke a rebalancing of the tree, and some of those rebalancings involve a lot of work.
   Will all keys have about equal likelihood of being requested, or will certain keys predominate? In English text, the word "the" is the most frequently appearing, while "zozzle" is rare.
   Have you some ideas as to the likely number of keys? This is important for using a hashing system. Too small a hash table leads to sub-searches, too large wastes space.

   I prefer an indexed hash with linked lists and have had satisfaction (a C++ sketch follows below). Following Knuth, choose a prime number about double the maximum number of entries you expect to deal with, or as big as seems convenient given the storage available; call this N. You will need an array HashIndex(0:N - 1) to point to stored entries. A stored entry contains a copy of the key, the data belonging to the key, and a link (probably zero) to a further entry, needed if more than one key hashes to the same index. The stash of stored entries grows linearly as new entries are added; the HashIndex array is used at random, and the birthday paradox applies to its occupancy.
  Generally, the hash table approach delivers its result in one hash probe only, with one key comparison and sometimes two and rarely three should the linked-list have to be chased... Exact statistics depend on how many entries there are compared to N, but even if there were twice as many stored entries as N, average performance would only degrade to around two key comparisons per search. It is quite practical to have the HashIndex array in main memory (typically an array of a few thousand 32-bit numbers), while the stash is on disc storage.
   A benefit of the indexed hash table (instead of open hashing) is that popular entries will tend to sit together at the start of the entry store, because they are likely to have been encountered early as the stash was expanded. Depending on circumstances, you could pre-load the stash with known-to-be-popular entries, or, once the stash has grown to a worthy size, your system could pause and sort its entries according to popularity. If the keys to be stored were all known beforehand and fixed (as with the keywords of a computer language), you could try tweaking your hash function until a variant is discovered that assigns each of the known keys a unique hash index number.
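   A minimal sketch of that indexed-hash-with-linked-lists scheme in C++; the table size, hash function and entry layout below are placeholders chosen for illustration, not anything fixed by the thread:

#include <string>
#include <vector>

// One stored entry: the key, its data, and a link to the next entry that
// hashed to the same index (-1 means "no further entry").
struct Entry {
    std::string key;
    std::string data;
    int next;
};

const int N = 401;                        // prime, roughly double the expected entry count
std::vector<int> hashIndex(N, -1);        // fingers into the stash; -1 = empty slot
std::vector<Entry> stash;                 // grows linearly as entries are added

unsigned hashKey(const std::string& k) {
    unsigned h = 0;
    for (size_t i = 0; i < k.size(); ++i)
        h = h * 31u + static_cast<unsigned char>(k[i]);   // simple multiplicative hash
    return h % N;
}

void add(const std::string& key, const std::string& data) {
    Entry e;
    e.key = key;
    e.data = data;
    unsigned h = hashKey(key);
    e.next = hashIndex[h];                // chain in front of any earlier entry at this slot
    stash.push_back(e);
    hashIndex[h] = static_cast<int>(stash.size()) - 1;
}

const std::string* find(const std::string& key) {
    for (int i = hashIndex[hashKey(key)]; i != -1; i = stash[i].next)
        if (stash[i].key == key)
            return &stash[i].data;        // valid as long as no further adds reallocate the stash
    return 0;                             // not found
}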

   Of course, the only definitive answer as to which would be best for your problem would be to implement both and run a comparison...
 
 
chsalviaAuthor Commented:
RNMcLean,

To be more specific, I'd be creating two data structures, one with only about 30 keys, and the other with approximately 200 keys.  Both of them would be created once at runtime, and then never added to or deleted from.
 
ozoCommented:
With that few keys, created only once, it should be easy enough to create a
Perfect Minimal Hash Function.
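A toy illustration of what "perfect and minimal" means (the three keys here are hypothetical, not from the question): every key lands in a distinct slot, and the table has exactly as many slots as there are keys.

#include <cassert>
#include <string>

// For the fixed key set {"apple", "banana", "cherry"}, the first letter alone
// happens to send each key to a distinct slot in 0..2: perfect (no collisions)
// and minimal (table size == number of keys).
int perfectHash(const std::string& key) {
    return (key[0] - 'a') % 3;
}

int main() {
    assert(perfectHash("apple")  == 0);
    assert(perfectHash("banana") == 1);
    assert(perfectHash("cherry") == 2);
    return 0;
}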
 
MogalManicCommented:
The issue between hashtable and tree is speed vs. memory.  

SPEED:
The hashtable insert and search speed is O(1) (on average, it takes approximately one compare operation to insert or search for an item).  The binary tree (if it is balanced) has an insert/search speed of O(lg N).  For a binary tree of size 200 that works out to lg 200 ≈ 7.6, so on average it will take roughly seven or eight compares to find an item in the tree.
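A small self-contained check of that figure, counting the comparisons a binary search makes over 200 sorted keys (the keys and probe value are arbitrary):

#include <iostream>
#include <vector>

int main() {
    // Binary search over 200 sorted keys, counting key comparisons,
    // to illustrate the O(lg N) figure quoted above (lg 200 is about 7.6).
    std::vector<int> keys;
    for (int i = 0; i < 200; ++i)
        keys.push_back(i * 2);           // arbitrary sorted keys

    int target = 137;                    // arbitrary probe
    int lo = 0, hi = static_cast<int>(keys.size()), comparisons = 0;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        ++comparisons;
        if (keys[mid] < target) lo = mid + 1; else hi = mid;
    }
    std::cout << "found position " << lo << " after "
              << comparisons << " comparisons\n";   // roughly 7 or 8 for N = 200
    return 0;
}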

Memory:
A hashtable intrinsically uses more space than a binary tree.   The binary tree of N elements will allocate space for N elements.  A hashtable has a more complex algorithm: its size should be a prime number LARGER than the number of elements stored in it.  The reason is that this decreases the probability that the hash function will compute the same location for two different keys (this is called a collision).  When a collision occurs, the hashtable will use an alternate method for storing the element (it could be a linked list, or it could choose the next empty slot in the table).  So for a hashtable of 200 items, you could choose a size of 401.


Here are some implementations of balanced Binary trees.
http://www.gamedev.net/reference/articles/article2119.asp
http://www.codeguru.com/cpp/cpp/algorithms/article.php/c5129/
http://www.codeproject.com/cpp/avl_tree.asp

As for hashtables, I would use the STL implementation (it has been tested and created by experts)
If you want to examine source implementations, here's one: http://burtleburtle.net/bob/hash/hashtab.html
If you want more just google for it.
 
RNMcLeanCommented:
  Referring MogalManic's remark back to ozo's: if indeed you were able to contrive a perfect minimal hash function (where each of the 200 keys hashes to a unique integer in 1:200, an extraordinarily neat trick) for your problem, then the HashIndex table would hold just one finger for each entry, occupying less space than the finger collection of a binary tree, which must maintain two fingers per entry. Thus there is some slop available; the hash function need not be perfectly minimal, so a suitable one, easy enough to compute, should be more findable and yet still require less store than a binary tree scheme. Remember that if the hash function is perfect (each key goes to a distinct index), there is no need for the entries to carry a link to other entries whose keys hash to the same index.
   With so few entries, speed is important only if you desire to effect many searches in a hurry. Likewise, unless space is at a premium (with gigabyte memories to splurge in?), a hashtable of 400 entries is surely trivial.
   Will you know beforehand whether a key belongs to the 30 group or the 200 group? If not, and even if so, rather than have two hash stashes, having one combined table would be fine, with a field in each entry indicating which group it belongs to (a sketch follows below). The fun part of a hash table is that one probe finds the entry (almost always, etc.), so if you don't know which group a key belongs to, searching one table and then the other if not found would mean two probes and more effort, when a single probe of a combined table would reveal all in one go.
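   A sketch of what such a combined entry might look like (the field names are illustrative, not anything specified in the thread):

#include <string>

// One entry in a combined table: the key, its data, and a tag recording
// which of the two original key sets (the ~30-key or the ~200-key one) it
// came from, so a single probe answers both questions at once.
struct CombinedEntry {
    std::string key;
    std::string data;
    int group;     // e.g. 0 = the ~30-key set, 1 = the ~200-key set
};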
 
chsalviaAuthor Commented:
I think a Hashtable would be a better way to go.  But even though I know the key values beforehand, creating a perfect hash function seems extremely difficult.  I downloaded a GNU program called "gperf" which is meant to generate a perfect hash function for a set of keys.  Although, it never actually produces a minimal one - for 30 keys there were 10 or so empty slots.  Plus, what I really don't like is that gperf builds its hash function around a separate lookup array of over 100 slots filled with various numbers.  To me this seems to defeat the point of having a minimal perfect hash in the first place.  Instead of wasting memory with a largely unpopulated hash table, they're wasting memory on another array.

Of course, you're right that memory is no big deal.  I have 4 GB to spare here, but nonetheless the concept is interesting and it would be very satisfying to create a perfect hash function.
 
RNMcLeanCommented:
  A Perfect Hash Function is obviously a rare and pure beast, a pleasure to behold, and I can tell you an easy way to devise such a thing, though not one of beauty. The function secretly records internally every distinct key presented to it, numbering it "i", and, on being presented with a key to hash, does a secret internal search (linear look-up? Who cares!) and returns the "i" as appropriate. Perhaps your keys all differ within the first two bytes? Who cares what lurks under the carpet. These schemes of course require auxiliary storage as you say, and are to be deprecated as not being proper hash functions.
   I have used a scheme somewhat as follows, using 32-bit 2's complement arithmetic as it is convenient, but unsigned arithmetic might be better as then no negative numbers can arise to confuse the mod operation. Whichever, here is the scheme:

   h := 0;                               %Or, h := Length(string), just for variety.
   for i := 1 to Length(string) do
      h := h*c1 + int(string[i]) + c2;   %I don't care about integer overflow. Beware system overhead.
   h := h mod N;                         %Beware of mod's response to negative numbers. Ensure that h is in 0:N-1.
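   The same scheme as a C++ function, using unsigned arithmetic so the final mod never has to cope with a negative value (the values for c1, c2 and N below are only examples; use whatever you settle on):

#include <string>

unsigned hashString(const std::string& s,
                    unsigned c1 = 493, unsigned c2 = 0, unsigned N = 401) {
    unsigned h = 0;                       // or start with s.size(), just for variety
    for (size_t i = 0; i < s.size(); ++i)
        h = h * c1 + static_cast<unsigned char>(s[i]) + c2;   // overflow simply wraps around
    return h % N;                         // result is in 0 .. N-1
}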

   For c1, I might have chosen 493; for c2, zero. It doesn't much matter. The object is to thoroughly scramble the bits, and to do so without extensive computation and complex parameterisation (100+ values, forsooth!). Perhaps you can overlay the string on an array of 16-bit integers and crunch it two bytes at a time - including the length is quite satisfactory also. Searching for perfection, I suspect, involves probabilities such as 1/N! which are rather thin.
   Obviously, you could run a simple search for values of c1 and c2 that minimise the clashes for your choice of N and collection of keys. The functional form can be varied also, though one does seek to avoid extensive computation. Since these days functions such as ArcTan seem to be computed about as fast as a multiply, further opportunities present themselves beyond fast simple operations such as XOR.
   An adequate goal would be a hash function that produces no collisions for your set of keys and choice of N. Then there would be no need for chaining, and thus no need for those fingers. With an N of 400, you still have no more fingers than would a binary tree for 200 entries.
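   A sketch of that parameter search: brute-force c1 and c2 until some pair hashes a given key list with no collisions (the key list, parameter ranges and table size below are placeholders to experiment with):

#include <iostream>
#include <set>
#include <string>
#include <vector>

// Hash with adjustable constants, as in the scheme above.
unsigned hashWith(const std::string& s, unsigned c1, unsigned c2, unsigned N) {
    unsigned h = 0;
    for (size_t i = 0; i < s.size(); ++i)
        h = h * c1 + static_cast<unsigned char>(s[i]) + c2;
    return h % N;
}

int main() {
    // Placeholder key list; substitute the real 30 or 200 keys.
    std::vector<std::string> keys;
    keys.push_back("alpha");
    keys.push_back("beta");
    keys.push_back("gamma");

    const unsigned N = 7;                      // chosen table size
    for (unsigned c1 = 1; c1 < 1000; ++c1) {
        for (unsigned c2 = 0; c2 < 100; ++c2) {
            std::set<unsigned> seen;
            bool ok = true;
            for (size_t i = 0; i < keys.size() && ok; ++i)
                ok = seen.insert(hashWith(keys[i], c1, c2, N)).second;
            if (ok) {                          // no slot used twice: collision-free
                std::cout << "c1=" << c1 << " c2=" << c2
                          << " gives no collisions for N=" << N << "\n";
                return 0;
            }
        }
    }
    std::cout << "no collision-free (c1, c2) found in the range tried\n";
    return 0;
}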
 
MogalManicCommented:
You want to pick the size of your hash table to be a PRIME number.  The reason is this will minimize the clustering of your hash codes.

For example, take a table size of 200, which breaks down to 200 = 2*2*2*5*5.  Assuming you have a good hash function, 50% of your hash values will be divisible by two and 20% will be divisible by 5.  When a hash value h is modded with 200 you get the following:
   h is even
   h mod 200  == h mod (100*2) == h mod 100 + h mod 2 == h mod 100 (since h mod 2 == 0)
   Now if h/2 is a multiple of 5
   h mod 200  == h mod (20 * 5) == h mod 20 + h mod 5 == h mod 20

If you hash 200 keys with 50% of them going to slots 0-99 and 20% of those 100 going to slots 0-19, then the probability of a collision is MUCH higher.  (Note: my knowledge of probability and statistics is probably wrong, but I hope you get the idea.)

When you are dealing with primes, the probability of a collision between hash values modulo your N is minimized!

So the sizes of your hash tables should be at least 31 and 211 (the next primes after 30 and 200).
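A small helper for that sizing rule, finding the first prime at or above a requested capacity; it is just a sketch, and trial division is plenty fast for table sizes in this range:

#include <iostream>

// True if n is prime, by trial division; fine for table-sizing purposes.
bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// First prime >= n, e.g. nextPrime(30) == 31, nextPrime(200) == 211.
int nextPrime(int n) {
    while (!isPrime(n)) ++n;
    return n;
}

int main() {
    std::cout << nextPrime(30) << " " << nextPrime(200) << "\n";   // prints 31 211
    return 0;
}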
 
RNMcLeanCommented:
  Err... 150 mod 200 is 150, not 50 which would be 150 mod 100, but no matter, prime sizes are still in favour, though perhaps not so much as in the general case...

   With regard to the hash function, one way ahead is to think of the keys as being numbers in base 26 (if just A:Z, not distinguishing capital and lower case letters, etc.), so that a key "aaa" might correspond to .000 and "zzz" to .999, though if you include a space to allow keys such as "aa " and "nm " a slight rearrangement is needed, as it would be for other additional symbols. One amusing ploy (and requiring no translation) is to have the storage area holding the text of the keys overlaid on an integer variable that swallows the first four bytes (or the first two for 16-bit, if all your keys differ in the first two characters); an eighty-bit floating point number could hold ten bytes, but the numerical values would be rather wild, and you'd have to be sure that all key texts generated bit sequences that represent valid floating-point numbers - in which case fun functions such as arctan could be invoked... More likely, you'll have to do some sort of conversion and obtain a better-behaved number thereby. In the calculation I offered above, this might be done by setting c1 to 26 and c2 to -(int("a")), or similar, and rescaling to bring the result into 0 to 1.
   Whichever way, your collection of keys is thus viewed as points scattered along a line stretching from zero to one, and a vital question is just how close the closest pair of such keys is. Suppose that stretching the scale from 0 to 1 by a factor of 300.37 is enough that the closest pair of points fall to different integers (perhaps aided by a cunning choice of Trunc or Round with a view to splitting the closest pair) - happy days. Nothing more need be done, as 301 is a reasonable size for the HashIndex array - annoying issues arise as to whether the end points of zero and one are attained or merely approached, so need the array run from 0 or 1 to 299 or 300 or 301? Actually, the array need only store the range corresponding to the first and last key point, probably less than a span of 300 unless one key is really "aaa" translating to 0 and another is "zzz" translating to 1.
   Matters may be less convenient, and a stretch by a factor of 666 (say) may be needed and that is too big for your taste. Suppose you like a table size of 337 (a prime); H mod 337 will fold points 337 to 666 back to 0:336 and it may be that none of the folded points will fall into slots 0:336 that are already occupied. That would be nice. If not, you could try slightly different stretch factors and table sizes, or, (ugh) a special case test for just one or two troublesome key points... So long as you find values such that there are no collisions for your keys and the hash calculation is swift, it does not matter if the table size is prime or not. Expediency rules.
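   A sketch of that experiment: map each key's first few letters to a base-26 fraction in [0, 1), stretch it, and check whether the chosen factor and table size separate every key (the keys, stretch factor and table size below are placeholders to play with):

#include <cmath>
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Interpret the first few letters (a..z only) as digits of a base-26
// fraction in [0, 1); e.g. "aaa" comes out near 0 and "zzz" near 1.
double keyAsFraction(const std::string& key, size_t letters = 4) {
    double value = 0.0, scale = 1.0 / 26.0;
    for (size_t i = 0; i < letters && i < key.size(); ++i) {
        value += (key[i] - 'a') * scale;
        scale /= 26.0;
    }
    return value;
}

int main() {
    // Placeholder keys; substitute the real ones.
    std::vector<std::string> keys;
    keys.push_back("apple");
    keys.push_back("berry");
    keys.push_back("cherry");

    double stretch = 300.37;                 // stretch factor to experiment with
    int tableSize = 337;                     // a prime, as discussed above
    std::set<int> slots;
    bool collisionFree = true;
    for (size_t i = 0; i < keys.size(); ++i) {
        int slot = static_cast<int>(std::floor(keyAsFraction(keys[i]) * stretch)) % tableSize;
        if (!slots.insert(slot).second) collisionFree = false;
    }
    std::cout << (collisionFree ? "no collisions" : "collisions")
              << " for stretch " << stretch << " and size " << tableSize << "\n";
    return 0;
}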
   I'd be curious to know how you get on with your keys. Can you name them?
