Solved

Understanding Linear Hashing

Posted on 2007-11-18
16
1,412 Views
Last Modified: 2012-05-05
I'm having difficulty understanding how linear hashing works.  I'd like an expert to try and clear up some of the difficulties I'm having understanding this.

I'm very familiar with the overall concept of hash tables, and have implemented them in both C and C++ multiple times, using various forms of collision resolution, so any expert who responds can assume I'll understand what you're talking about when it comes to how hash tables work.

What I'm having difficulty understanding is how Linear Hashing specifically works.  I read the paper on it, and also a summary at http://swig.stanford.edu/pub/summaries/database/linhash.html, but I don't understand exactly what they're saying.

What I've gotten so far about Linear Hashing:

When a collision happens, instead of doing a normal collision resolution method, (e.g. chaining or probing), a linear hash table actually changes the hash function dynamically, through something called a "split", which seems to modify the hash function so that instead of doing hashf(key) % N, you do hashf(key) % ((2^i) * N).  Then, from what I can tell, you add another bucket at the end of the table, and move half of the ith bucket to the end of the table.

I really don't understand what this accomplishes, or how this works.  So, can any expert explain to me how this works?  

I seem to learn best by example, so let's take the following example.  Suppose you have a hash table with 100 buckets, and you insert key K and it hashes to bucket 10.  But bucket 10 is already full, so we have a collision.  What precisely happens next?
0
Comment
Question by:chsalvia
  • 8
  • 7
16 Comments
 
LVL 55

Expert Comment

by:Jaime Olivares
ID: 20308284
0
 
LVL 53

Accepted Solution

by:
Infinity08 earned 500 total points
ID: 20308864
>> I seem to learn best by example, so let's take the following example.  Suppose you have a hash table with 100 buckets, and you insert key K and it hashes to bucket 10.  But bucket 10 is already full, so we have a collision.  What precisely happens next?

The following assumes that we're still in the first "stage" of linear hashing. Meaning that when splitting, we use h1 instead of h0.

If the current split address n is 10, then bucket 10 will be split, and part will be stored in bucket 10 and part in bucket 110.
If the current split address n is any other than 10 (the usual case), then a classic collision resolution will occur (chaining eg.) for the inserted value in bucket 10, and bucket n will be split up.


A very good explanation can be found here (especially paragraph 2.1 - basic schema) :

        http://www.cs.cmu.edu/afs/cs.cmu.edu/user/christos/www/courses/826-resources/PAPERS+BOOK/linear-hashing.PDF

This is btw the original paper by its inventor Witold Litwin.
0
 

Author Comment

by:chsalvia
ID: 20309068
So, split address begins at 0, and is incremented every time a split occurs, right?  And a split occurs every time there is a collision.

I'm confused though as to how lookup is supposed to work.

If you have a table of 100 buckets, and a split occurs, then we take half the values in bucket 0 and move them to bucket 100.  Then, buckets 1-99 are addressed using h0, but bucket 0 and bucket 100 are addressed using h1.  How does the lookup routine know which hash function to use?
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 20309085
>> So, split address begins at 0, and is incremented every time a split occurs, right?

Right.


>> And a split occurs every time there is a collision.

Right.


>> If you have a table of 100 buckets, and a split occurs, then we take half the values in bucket 0 and move them to bucket 100.  Then, buckets 1-99 are addressed using h0, but bucket 0 and bucket 100 are addressed using h1.

Correct.


>> How does the lookup routine know which hash function to use?

Because of the current split address. Since the split address is incremented at each split, we know that the buckets under that address use h1, and the ones above it use h0.
0
 

Author Comment

by:chsalvia
ID: 20309149
Sorry if I'm being dense, but I'm still not seeing how the lookup actually works.  I understand that we have the split address, and that any bucket below the split address will use h1.  But in order to even find out what bucket we're in, we need to first perform a hash function.  So how do we know which hash function to use?  Do we just always start by using h0, and then perform h1 if we hash to a bucket below the current split address?
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 20309155
>> Do we just always start by using h0, and then perform h1 if we hash to a bucket below the current split address?

Indeed. That's what you do.
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 20309166
>> >> Do we just always start by using h0, and then perform h1 if we hash to a bucket below the current split address?
>>
>> Indeed. That's what you do.

Of course : if you find the value in the bucket resulting from h0, you don't need to apply h1.
0
 

Author Comment

by:chsalvia
ID: 20309178
Okay, I see.  Now, what happens when the number of splits equals the original table size?

For example, if we have a starting table size of 10, and then we do 10 splits, we'll have 20 slots in total, and all of them will use h1.  I assume at this point we create another hash function, h2, and we just repeat this process ad infinitum.  So then, am I correct that every time the size of the table doubles, we must increment the hash function?
0
Why You Should Analyze Threat Actor TTPs

After years of analyzing threat actor behavior, it’s become clear that at any given time there are specific tactics, techniques, and procedures (TTPs) that are particularly prevalent. By analyzing and understanding these TTPs, you can dramatically enhance your security program.

 
LVL 53

Expert Comment

by:Infinity08
ID: 20309182
>> Now, what happens when the number of splits equals the original table size?

You go to the next level, and start using h2 for new splits. (the split address is also reset to 0)


>> So then, am I correct that every time the size of the table doubles, we must increment the hash function?

So, yes indeed. This shouldn't happen too often though, assuming that you properly dimensioned your hash table.
0
 

Author Comment

by:chsalvia
ID: 20309205
Okay, I think I understand now.  Two last questions:

I was confused earlier when I read the wikipedia article linear hashing because it says that:

"In linear hashing, the address calculation is always bounded by a size that is a power of two.
    * address(level,key) = hash(key) mod (2^level)"

But shouldn't they actually be saying: hash(key) mod (2^level) * N
where N = # of slots in the table at the beginning of the current stage?

Secondly, if an insertion occurs which causes a split, you're supposed to move half of the values in the split address bucket to the newly created end bucket.  But what if the split address bucket happens to be empty at the time, or has only 1 value in it.  Does that matter?
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 20309268
>> But shouldn't they actually be saying: hash(key) mod (2^level) * N

Probably. Unless they take N as a power of 2 too, and included that in level ...


>> But what if the split address bucket happens to be empty at the time, or has only 1 value in it.  Does that matter?

No, it doesn't matter. The split still occurs. If there are 0 elements in the bucket, then obviously none will be moved to the newly created bucket (but it still gets created and will be empty for now). If there is 1 element in the bucket, then it might or might not be moved to the new bucket (depending on the h1 hashing function). Etc.
0
 

Author Comment

by:chsalvia
ID: 20309292
Okay thanks.  One last thing that occurred to me that I wanted to ask:

Since most collisions will not occur in the split address bucket, it seems that a standard collision resolution technique would need to be employed to handle the majority of collisions.  If you use internal probing, e.g. linear/quadratic probing, to resolve collisions, what is the boundary of the probe?  Can you only probe within the original base size of the table, or, if previous splits have extended the total number of slots, can you probe into the extended part of the table as well?
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 20309308
If I understand your question correctly, I think this will answer it : The majority of collisions will indeed result in a split of a different bucket than the one that caused the collision (let's call that bucket B).

But, one of the next times there is a collision, bucket B will also be split, and at that moment, the standard collision handling is "undone", and the linear collision handling takes over. So, with a bit of delay, everything gets linearly hashed eventually.
0
 

Author Comment

by:chsalvia
ID: 20309331
Right, but what I'm saying is, when you have to perform a standard collision resolution method, if you happen to use internal probing, do you only probe within the original table size, or can you probe into the extended region?

For example, take again a table of size 10, and suppose that 5 splits occur, so we have a total table size of 15.

So, now our split address is 4.  Suppose then a collision happens at bucket 6 (for example).  Now we need to use internal probing to resolve the collision.  So, what is the range of the probe?  Can we probe all the way to bucket 14?  Or do we wrap-around after bucket 10? (Since 10 was the original size of the table)
0
 
LVL 53

Expert Comment

by:Infinity08
ID: 20309358
It's not very handy to use probing as standard collision resolution in a linear hashing table. I'd rather use chaining or a variant of it.

But if you do use probing, then it's probably best to limit yourself to the current "base" table. So, at level 1, that's only the buckets 1 ... N-1, at level 2, the buckets 1 ... 2N-1, etc.
So, in your example, it's probably best to wrap around on bucket 10. Although, you're of course free to implement this any way you like, as it's not part of the linear hashing algorithm ;) You just have to make sure that it's compatible with the linear has halgorithm.
0
 

Author Comment

by:chsalvia
ID: 20309370
Okay thanks for all your answers.  You really helped me to understand this.  There's another thing I wanted to ask, but I'll start a new thread.
0

Featured Post

Better Security Awareness With Threat Intelligence

See how one of the leading financial services organizations uses Recorded Future as part of a holistic threat intelligence program to promote security awareness and proactively and efficiently identify threats.

Join & Write a Comment

There is an easy way, in .NET, to centralize the treatment of all unexpected errors. First of all, instead of launching the application directly in a Form, you need first to write a Sub called Main, in a module. Then, set the Startup Object to th…
This is a short and sweet, but (hopefully) to the point article. There seems to be some fundamental misunderstanding about the function prototype for the "main" function in C and C++, more specifically what type this function should return. I see so…
This tutorial covers a step-by-step guide to install VisualVM launcher in eclipse.
The viewer will be introduced to the member functions push_back and pop_back of the vector class. The video will teach the difference between the two as well as how to use each one along with its functionality.

762 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

18 Experts available now in Live!

Get 1:1 Help Now