Improve company productivity with a Business Account.Sign Up

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1428
  • Last Modified:

Understanding Linear Hashing

I'm having difficulty understanding how linear hashing works.  I'd like an expert to try and clear up some of the difficulties I'm having understanding this.

I'm very familiar with the overall concept of hash tables, and have implemented them in both C and C++ multiple times, using various forms of collision resolution, so any expert who responds can assume I'll understand what you're talking about when it comes to how hash tables work.

What I'm having difficulty understanding is how Linear Hashing specifically works.  I read the paper on it, and also a summary at http://swig.stanford.edu/pub/summaries/database/linhash.html, but I don't understand exactly what they're saying.

What I've gotten so far about Linear Hashing:

When a collision happens, instead of doing a normal collision resolution method, (e.g. chaining or probing), a linear hash table actually changes the hash function dynamically, through something called a "split", which seems to modify the hash function so that instead of doing hashf(key) % N, you do hashf(key) % ((2^i) * N).  Then, from what I can tell, you add another bucket at the end of the table, and move half of the ith bucket to the end of the table.

I really don't understand what this accomplishes, or how this works.  So, can any expert explain to me how this works?  

I seem to learn best by example, so let's take the following example.  Suppose you have a hash table with 100 buckets, and you insert key K and it hashes to bucket 10.  But bucket 10 is already full, so we have a collision.  What precisely happens next?
0
chsalvia
Asked:
chsalvia
  • 8
  • 7
1 Solution
 
Infinity08Commented:
>> I seem to learn best by example, so let's take the following example.  Suppose you have a hash table with 100 buckets, and you insert key K and it hashes to bucket 10.  But bucket 10 is already full, so we have a collision.  What precisely happens next?

The following assumes that we're still in the first "stage" of linear hashing. Meaning that when splitting, we use h1 instead of h0.

If the current split address n is 10, then bucket 10 will be split, and part will be stored in bucket 10 and part in bucket 110.
If the current split address n is any other than 10 (the usual case), then a classic collision resolution will occur (chaining eg.) for the inserted value in bucket 10, and bucket n will be split up.


A very good explanation can be found here (especially paragraph 2.1 - basic schema) :

        http://www.cs.cmu.edu/afs/cs.cmu.edu/user/christos/www/courses/826-resources/PAPERS+BOOK/linear-hashing.PDF

This is btw the original paper by its inventor Witold Litwin.
0
 
chsalviaAuthor Commented:
So, split address begins at 0, and is incremented every time a split occurs, right?  And a split occurs every time there is a collision.

I'm confused though as to how lookup is supposed to work.

If you have a table of 100 buckets, and a split occurs, then we take half the values in bucket 0 and move them to bucket 100.  Then, buckets 1-99 are addressed using h0, but bucket 0 and bucket 100 are addressed using h1.  How does the lookup routine know which hash function to use?
0
What Kind of Coding Program is Right for You?

There are many ways to learn to code these days. From coding bootcamps like Flatiron School to online courses to totally free beginner resources. The best way to learn to code depends on many factors, but the most important one is you. See what course is best for you.

 
Infinity08Commented:
>> So, split address begins at 0, and is incremented every time a split occurs, right?

Right.


>> And a split occurs every time there is a collision.

Right.


>> If you have a table of 100 buckets, and a split occurs, then we take half the values in bucket 0 and move them to bucket 100.  Then, buckets 1-99 are addressed using h0, but bucket 0 and bucket 100 are addressed using h1.

Correct.


>> How does the lookup routine know which hash function to use?

Because of the current split address. Since the split address is incremented at each split, we know that the buckets under that address use h1, and the ones above it use h0.
0
 
chsalviaAuthor Commented:
Sorry if I'm being dense, but I'm still not seeing how the lookup actually works.  I understand that we have the split address, and that any bucket below the split address will use h1.  But in order to even find out what bucket we're in, we need to first perform a hash function.  So how do we know which hash function to use?  Do we just always start by using h0, and then perform h1 if we hash to a bucket below the current split address?
0
 
Infinity08Commented:
>> Do we just always start by using h0, and then perform h1 if we hash to a bucket below the current split address?

Indeed. That's what you do.
0
 
Infinity08Commented:
>> >> Do we just always start by using h0, and then perform h1 if we hash to a bucket below the current split address?
>> 
>> Indeed. That's what you do.

Of course : if you find the value in the bucket resulting from h0, you don't need to apply h1.
0
 
chsalviaAuthor Commented:
Okay, I see.  Now, what happens when the number of splits equals the original table size?

For example, if we have a starting table size of 10, and then we do 10 splits, we'll have 20 slots in total, and all of them will use h1.  I assume at this point we create another hash function, h2, and we just repeat this process ad infinitum.  So then, am I correct that every time the size of the table doubles, we must increment the hash function?
0
 
Infinity08Commented:
>> Now, what happens when the number of splits equals the original table size?

You go to the next level, and start using h2 for new splits. (the split address is also reset to 0)


>> So then, am I correct that every time the size of the table doubles, we must increment the hash function?

So, yes indeed. This shouldn't happen too often though, assuming that you properly dimensioned your hash table.
0
 
chsalviaAuthor Commented:
Okay, I think I understand now.  Two last questions:

I was confused earlier when I read the wikipedia article linear hashing because it says that:

"In linear hashing, the address calculation is always bounded by a size that is a power of two.
    * address(level,key) = hash(key) mod (2^level)"

But shouldn't they actually be saying: hash(key) mod (2^level) * N
where N = # of slots in the table at the beginning of the current stage?

Secondly, if an insertion occurs which causes a split, you're supposed to move half of the values in the split address bucket to the newly created end bucket.  But what if the split address bucket happens to be empty at the time, or has only 1 value in it.  Does that matter?
0
 
Infinity08Commented:
>> But shouldn't they actually be saying: hash(key) mod (2^level) * N

Probably. Unless they take N as a power of 2 too, and included that in level ...


>> But what if the split address bucket happens to be empty at the time, or has only 1 value in it.  Does that matter?

No, it doesn't matter. The split still occurs. If there are 0 elements in the bucket, then obviously none will be moved to the newly created bucket (but it still gets created and will be empty for now). If there is 1 element in the bucket, then it might or might not be moved to the new bucket (depending on the h1 hashing function). Etc.
0
 
chsalviaAuthor Commented:
Okay thanks.  One last thing that occurred to me that I wanted to ask:

Since most collisions will not occur in the split address bucket, it seems that a standard collision resolution technique would need to be employed to handle the majority of collisions.  If you use internal probing, e.g. linear/quadratic probing, to resolve collisions, what is the boundary of the probe?  Can you only probe within the original base size of the table, or, if previous splits have extended the total number of slots, can you probe into the extended part of the table as well?
0
 
Infinity08Commented:
If I understand your question correctly, I think this will answer it : The majority of collisions will indeed result in a split of a different bucket than the one that caused the collision (let's call that bucket B).

But, one of the next times there is a collision, bucket B will also be split, and at that moment, the standard collision handling is "undone", and the linear collision handling takes over. So, with a bit of delay, everything gets linearly hashed eventually.
0
 
chsalviaAuthor Commented:
Right, but what I'm saying is, when you have to perform a standard collision resolution method, if you happen to use internal probing, do you only probe within the original table size, or can you probe into the extended region?

For example, take again a table of size 10, and suppose that 5 splits occur, so we have a total table size of 15.

So, now our split address is 4.  Suppose then a collision happens at bucket 6 (for example).  Now we need to use internal probing to resolve the collision.  So, what is the range of the probe?  Can we probe all the way to bucket 14?  Or do we wrap-around after bucket 10? (Since 10 was the original size of the table)
0
 
Infinity08Commented:
It's not very handy to use probing as standard collision resolution in a linear hashing table. I'd rather use chaining or a variant of it.

But if you do use probing, then it's probably best to limit yourself to the current "base" table. So, at level 1, that's only the buckets 1 ... N-1, at level 2, the buckets 1 ... 2N-1, etc.
So, in your example, it's probably best to wrap around on bucket 10. Although, you're of course free to implement this any way you like, as it's not part of the linear hashing algorithm ;) You just have to make sure that it's compatible with the linear has halgorithm.
0
 
chsalviaAuthor Commented:
Okay thanks for all your answers.  You really helped me to understand this.  There's another thing I wanted to ask, but I'll start a new thread.
0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

Join & Write a Comment

Featured Post

What Kind of Coding Program is Right for You?

There are many ways to learn to code these days. From coding bootcamps like Flatiron School to online courses to totally free beginner resources. The best way to learn to code depends on many factors, but the most important one is you. See what course is best for you.

  • 8
  • 7
Tackle projects and never again get stuck behind a technical roadblock.
Join Now