Solved

# Understanding Linear Hashing

Posted on 2007-11-18
1,412 Views
I'm having difficulty understanding how linear hashing works.  I'd like an expert to try and clear up some of the difficulties I'm having understanding this.

I'm very familiar with the overall concept of hash tables, and have implemented them in both C and C++ multiple times, using various forms of collision resolution, so any expert who responds can assume I'll understand what you're talking about when it comes to how hash tables work.

What I'm having difficulty understanding is how Linear Hashing specifically works.  I read the paper on it, and also a summary at http://swig.stanford.edu/pub/summaries/database/linhash.html, but I don't understand exactly what they're saying.

What I've gotten so far about Linear Hashing:

When a collision happens, instead of doing a normal collision resolution method, (e.g. chaining or probing), a linear hash table actually changes the hash function dynamically, through something called a "split", which seems to modify the hash function so that instead of doing hashf(key) % N, you do hashf(key) % ((2^i) * N).  Then, from what I can tell, you add another bucket at the end of the table, and move half of the ith bucket to the end of the table.

I really don't understand what this accomplishes, or how this works.  So, can any expert explain to me how this works?

I seem to learn best by example, so let's take the following example.  Suppose you have a hash table with 100 buckets, and you insert key K and it hashes to bucket 10.  But bucket 10 is already full, so we have a collision.  What precisely happens next?
0
Question by:chsalvia
• 8
• 7

LVL 55

Expert Comment

ID: 20308284
0

LVL 53

Accepted Solution

Infinity08 earned 500 total points
ID: 20308864
>> I seem to learn best by example, so let's take the following example.  Suppose you have a hash table with 100 buckets, and you insert key K and it hashes to bucket 10.  But bucket 10 is already full, so we have a collision.  What precisely happens next?

The following assumes that we're still in the first "stage" of linear hashing. Meaning that when splitting, we use h1 instead of h0.

If the current split address n is 10, then bucket 10 will be split, and part will be stored in bucket 10 and part in bucket 110.
If the current split address n is any other than 10 (the usual case), then a classic collision resolution will occur (chaining eg.) for the inserted value in bucket 10, and bucket n will be split up.

A very good explanation can be found here (especially paragraph 2.1 - basic schema) :

http://www.cs.cmu.edu/afs/cs.cmu.edu/user/christos/www/courses/826-resources/PAPERS+BOOK/linear-hashing.PDF

This is btw the original paper by its inventor Witold Litwin.
0

Author Comment

ID: 20309068
So, split address begins at 0, and is incremented every time a split occurs, right?  And a split occurs every time there is a collision.

I'm confused though as to how lookup is supposed to work.

If you have a table of 100 buckets, and a split occurs, then we take half the values in bucket 0 and move them to bucket 100.  Then, buckets 1-99 are addressed using h0, but bucket 0 and bucket 100 are addressed using h1.  How does the lookup routine know which hash function to use?
0

LVL 53

Expert Comment

ID: 20309085
>> So, split address begins at 0, and is incremented every time a split occurs, right?

Right.

>> And a split occurs every time there is a collision.

Right.

>> If you have a table of 100 buckets, and a split occurs, then we take half the values in bucket 0 and move them to bucket 100.  Then, buckets 1-99 are addressed using h0, but bucket 0 and bucket 100 are addressed using h1.

Correct.

>> How does the lookup routine know which hash function to use?

Because of the current split address. Since the split address is incremented at each split, we know that the buckets under that address use h1, and the ones above it use h0.
0

Author Comment

ID: 20309149
Sorry if I'm being dense, but I'm still not seeing how the lookup actually works.  I understand that we have the split address, and that any bucket below the split address will use h1.  But in order to even find out what bucket we're in, we need to first perform a hash function.  So how do we know which hash function to use?  Do we just always start by using h0, and then perform h1 if we hash to a bucket below the current split address?
0

LVL 53

Expert Comment

ID: 20309155
>> Do we just always start by using h0, and then perform h1 if we hash to a bucket below the current split address?

Indeed. That's what you do.
0

LVL 53

Expert Comment

ID: 20309166
>> >> Do we just always start by using h0, and then perform h1 if we hash to a bucket below the current split address?
>>
>> Indeed. That's what you do.

Of course : if you find the value in the bucket resulting from h0, you don't need to apply h1.
0

Author Comment

ID: 20309178
Okay, I see.  Now, what happens when the number of splits equals the original table size?

For example, if we have a starting table size of 10, and then we do 10 splits, we'll have 20 slots in total, and all of them will use h1.  I assume at this point we create another hash function, h2, and we just repeat this process ad infinitum.  So then, am I correct that every time the size of the table doubles, we must increment the hash function?
0

LVL 53

Expert Comment

ID: 20309182
>> Now, what happens when the number of splits equals the original table size?

You go to the next level, and start using h2 for new splits. (the split address is also reset to 0)

>> So then, am I correct that every time the size of the table doubles, we must increment the hash function?

So, yes indeed. This shouldn't happen too often though, assuming that you properly dimensioned your hash table.
0

Author Comment

ID: 20309205
Okay, I think I understand now.  Two last questions:

I was confused earlier when I read the wikipedia article linear hashing because it says that:

"In linear hashing, the address calculation is always bounded by a size that is a power of two.
* address(level,key) = hash(key) mod (2^level)"

But shouldn't they actually be saying: hash(key) mod (2^level) * N
where N = # of slots in the table at the beginning of the current stage?

Secondly, if an insertion occurs which causes a split, you're supposed to move half of the values in the split address bucket to the newly created end bucket.  But what if the split address bucket happens to be empty at the time, or has only 1 value in it.  Does that matter?
0

LVL 53

Expert Comment

ID: 20309268
>> But shouldn't they actually be saying: hash(key) mod (2^level) * N

Probably. Unless they take N as a power of 2 too, and included that in level ...

>> But what if the split address bucket happens to be empty at the time, or has only 1 value in it.  Does that matter?

No, it doesn't matter. The split still occurs. If there are 0 elements in the bucket, then obviously none will be moved to the newly created bucket (but it still gets created and will be empty for now). If there is 1 element in the bucket, then it might or might not be moved to the new bucket (depending on the h1 hashing function). Etc.
0

Author Comment

ID: 20309292
Okay thanks.  One last thing that occurred to me that I wanted to ask:

Since most collisions will not occur in the split address bucket, it seems that a standard collision resolution technique would need to be employed to handle the majority of collisions.  If you use internal probing, e.g. linear/quadratic probing, to resolve collisions, what is the boundary of the probe?  Can you only probe within the original base size of the table, or, if previous splits have extended the total number of slots, can you probe into the extended part of the table as well?
0

LVL 53

Expert Comment

ID: 20309308
If I understand your question correctly, I think this will answer it : The majority of collisions will indeed result in a split of a different bucket than the one that caused the collision (let's call that bucket B).

But, one of the next times there is a collision, bucket B will also be split, and at that moment, the standard collision handling is "undone", and the linear collision handling takes over. So, with a bit of delay, everything gets linearly hashed eventually.
0

Author Comment

ID: 20309331
Right, but what I'm saying is, when you have to perform a standard collision resolution method, if you happen to use internal probing, do you only probe within the original table size, or can you probe into the extended region?

For example, take again a table of size 10, and suppose that 5 splits occur, so we have a total table size of 15.

So, now our split address is 4.  Suppose then a collision happens at bucket 6 (for example).  Now we need to use internal probing to resolve the collision.  So, what is the range of the probe?  Can we probe all the way to bucket 14?  Or do we wrap-around after bucket 10? (Since 10 was the original size of the table)
0

LVL 53

Expert Comment

ID: 20309358
It's not very handy to use probing as standard collision resolution in a linear hashing table. I'd rather use chaining or a variant of it.

But if you do use probing, then it's probably best to limit yourself to the current "base" table. So, at level 1, that's only the buckets 1 ... N-1, at level 2, the buckets 1 ... 2N-1, etc.
So, in your example, it's probably best to wrap around on bucket 10. Although, you're of course free to implement this any way you like, as it's not part of the linear hashing algorithm ;) You just have to make sure that it's compatible with the linear has halgorithm.
0

Author Comment

ID: 20309370
Okay thanks for all your answers.  You really helped me to understand this.  There's another thing I wanted to ask, but I'll start a new thread.
0

## Featured Post

### Suggested Solutions

There is an easy way, in .NET, to centralize the treatment of all unexpected errors. First of all, instead of launching the application directly in a Form, you need first to write a Sub called Main, in a module. Then, set the Startup Object to thâ€¦
This is a short and sweet, but (hopefully) to the point article. There seems to be some fundamental misunderstanding about the function prototype for the "main" function in C and C++, more specifically what type this function should return. I see soâ€¦
This tutorial covers a step-by-step guide to install VisualVM launcher in eclipse.
The viewer will be introduced to the member functions push_back and pop_back of the vector class. The video will teach the difference between the two as well as how to use each one along with its functionality.