chsalvia

asked on

Disk Based Linear Hashing

I've recently been studying how linear hashing works, and have successfully implemented an in-memory linear hash table.

Disk-based linear hashing seems a bit more tricky because handling collisions/bucket overflows is not as simple as allocating another node on a linked list.  I've read through some collision resolution techniques, such as the one used in Berkeley DB, and there are a few things I don't understand.

It seems a common way to resolve collisions is to allocate specially designated overflow buckets, which are placed between regular buckets at every bucket number which is a power of 2.  So whenever a bucket becomes full, you just place the record in the first available overflow bucket.

The thing I don't understand is, what happens when a linear-hash split occurs?  In an in-memory implementation, when a split occurs, you look at all the records in the split bucket, and move about half of them to a new bucket at the end.  But, on a disk-based implementation, your records could be distributed through a chain of overflow buckets.  So you'd basically have to rehash every single record in the overflow region every time a split occurs, to see if a record needs to be moved.  This could become quite expensive as the hash table grows.  Am I understanding this correctly, or am I missing something about how this works?
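For concreteness, here is roughly what I imagine the split step would have to do on disk (just a sketch in C++ with made-up names, and with a std::vector standing in for the bucket file): every record in the split bucket and its whole overflow chain gets rehashed against the doubled modulus.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

struct Record { std::uint64_t key; /* ... payload ... */ };

struct Bucket {
    std::vector<Record> records;
    long next_overflow = -1;   // index of the next overflow bucket, -1 = end of chain
};

std::vector<Bucket> table;     // stand-in for the bucket file

std::uint64_t hash_key(std::uint64_t key) { return std::hash<std::uint64_t>{}(key); }

// Split 'old_bucket' into 'old_bucket' and 'new_bucket' using the doubled modulus.
void split(long old_bucket, long new_bucket, std::uint64_t new_modulus)
{
    std::vector<Record> keep, move;

    // Every record in the primary bucket AND its entire overflow chain must be rehashed.
    for (long b = old_bucket; b != -1; b = table[b].next_overflow)
        for (const Record& r : table[b].records) {
            if (hash_key(r.key) % new_modulus == (std::uint64_t)new_bucket)
                move.push_back(r);
            else
                keep.push_back(r);
        }

    // Rewrite both buckets (rebuilding their overflow chains is omitted here).
    table[old_bucket] = Bucket{};
    table[old_bucket].records = keep;
    table[new_bucket] = Bucket{};
    table[new_bucket].records = move;
}
```

Is walking (and rewriting) the whole chain like this really unavoidable?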
itsmeandnobodyelse

>>>> what happens when a linear-hash split occurs?
There is no split on a disk-based hash. You already answered what happens yourself:

>>>> So whenever a bucket becomes full, you just place the record in the first available overflow bucket.

Instead of a split the old bucket 'points' to the new 'overflow' bucket.

Regards, Alex
Infinity08
>> So you'd basically have to rehash every single record in the overflow region every time a split occurs, to see if a record needs to be moved.  This could become quite expensive as the hash table grows.  Am I understanding this correctly, or am I missing something about how this works?

That's pretty much it, yes. But that's because of your collision resolution mechanism (CRM). It will usually not involve too many extra lookups, since the overflow bucket will usually not be full (on a well-dimensioned hash table).
You can avoid this by using chaining as CRM though.
chsalvia

ASKER

>> You can avoid this by using chaining as CRM though.

What do you mean by chaining, in terms of a disk based layout?  I thought placing colliding records in overflow buckets is the same thing as chaining.
By chaining, do you mean placing overflow buckets in a separate file?
>>>> I thought placing colliding records in overflow buckets is the same thing as chaining.

'Chaining' means having two fixed-size records (buckets) pointing to each other. The pointing is normally done by relative record numbers.
>> What do you mean by chaining, in terms of a disk based layout?

Chaining in general means extending the size of a bucket when it gets full. Usually that's implemented by creating some kind of linked list, i.e. when a bucket is full, you create a new bucket and link it to the one that is full.

For disk based hash tables, you can implement that in a similar way. You can create a new bucket (at the end of the same file, or possibly in a new file), and reference it from the full bucket (by using a file offset, an index, ...)
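
For example (just a sketch, the page size and field names are arbitrary and not taken from any particular implementation), a fixed-size on-disk bucket with such a chaining link could look like this:

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t PAGE_SIZE   = 4096;                    // one bucket = one disk page
constexpr std::size_t HEADER_SIZE = sizeof(std::int64_t)     // next_overflow
                                  + sizeof(std::uint32_t);   // record_count

struct BucketPage {
    std::int64_t  next_overflow;                  // file offset of the chained bucket, -1 = none
    std::uint32_t record_count;                   // number of records packed into 'data'
    char          data[PAGE_SIZE - HEADER_SIZE];  // the records themselves
};

static_assert(sizeof(BucketPage) == PAGE_SIZE, "a bucket must be exactly one page");
```

When a bucket fills up, you append a fresh page at the end of the file (or take one from a free list) and store its offset in the full bucket's next_overflow field.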


>> I thought placing colliding records in overflow buckets is the same thing as chaining.

In the normal way of using overflow buckets, the same overflow bucket can be used for several hash indexes. That's not the case for chaining.

Take a look here for a nice comparison :

        http://en.wikipedia.org/wiki/Hash_table#Open_addressing_versus_chaining
Okay, so let me see if I understand correctly:

In the context of a disk-based linear hash table, when you say chaining you mean that, when an insert is attempted on a full bucket, a new empty bucket is allocated at the end of the file, and the record is placed there.  The full primary bucket is then linked to the chained bucket using an offset address.  Also, in this scenario, each chained bucket added at the end only accepts records from one primary hash bucket.

The normal way of using overflow buckets for disk-based linear hashing, however, is similar to chaining in that you still allocate overflow buckets at the end, except overflow buckets can be used to store records from any primary hash bucket.

Am I correct?
>>> a new empty bucket is allocated at the end of the file
Files are fixed-sized in order to avoid fragmentation. So the free buckets are rarely at the end of the file; they are simply unused normal buckets. Implementations may work with free lists, where all empty buckets are listed so that a new bucket can be found quickly (or they could use a bitmap where the bit for each free bucket is cleared).  Some disk-hashing schemes simply read the next buckets until they find an empty one. Inserting then gets very slow if the file is filled beyond a critical threshold or if the hash function produces a bad distribution.

>>>> The normal way of using overflow buckets for disk-based linear hashing
I don't know whether there is a 'normal' way. Good hashing depends on a good hash code. If you don't know whether your hash code will distribute evenly, you may choose different strategies for the overflow buckets than if you know that nearly every slot will be used at least once.

A simple method for insert is to calculate the hash and then read the bucket. If it is full, go to the next bucket already chained to the first one. 'Chained' means that at the end of each bucket that ever became full there is a 'pointer' to the next bucket (normally you would have a reverse pointer as well, in order to be able to 'compress' the data later). If all buckets in the chain are full, you search for the next free bucket, normally by reading the physically next buckets. If the bucket size is big, you can also take a bucket that is not yet chained and is less than half full (or any other threshold) and chain it to the end of the existing chain.

An entry can be found by calculating the hash and running through all chained buckets. If the buckets are big, it may make sense to sort the entries within a bucket in order to get a faster search.
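
Just to illustrate the chained lookup (my own names, and with a vector standing in for the data file; real code would read fixed-size buckets from disk instead):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

struct Entry { std::uint64_t key; std::uint64_t value; };

struct ChainedBucket {
    std::vector<Entry> entries;
    long next = -1;              // relative record number of the next chained bucket, -1 = none
};

std::vector<ChainedBucket> buckets;   // stand-in for the fixed-size data file

std::optional<std::uint64_t> lookup(std::uint64_t key, std::uint64_t bucket_count)
{
    long b = (long)(key % bucket_count);   // simplistic hash: the key itself
    while (b != -1) {                      // run through all buckets chained to the primary one
        for (const Entry& e : buckets[b].entries)
            if (e.key == key)
                return e.value;
        b = buckets[b].next;
    }
    return std::nullopt;                   // not found anywhere in the chain
}
```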
>> Files are fixed-sized in order to avoid fragmentation.

Hmm, it sounds like you're talking about regular hashing rather than linear hashing.  In linear hashing, the file is not fixed size, but grows/shrinks dynamically.
>>>> In linear hashing, the file is not fixed size, but grows/shrinks dynamically.

You are right. I was not aware that you were asking about "linear disk-based hashing", because I didn't think it had practical relevance (because of fragmentation). But at least the author of the following article has a different opinion: http://swig.stanford.edu/pub/summaries/database/linhash.html
Disk-based linear hashing is in fact quite often used in database applications, which is why I am interested in learning about it.  Right now, my main source of confusion lies in how exactly overflow/collision handling works in disk-based linear hashing.

Infinity08 earlier mentioned that you can use either chaining or overflow buckets.  I'm not exactly sure what the difference between these two methods is, but it seems, if I understood correctly, that the primary difference is that with chaining, a separate chain of overflow buckets is allocated for each primary bucket.  But with the normal way, you have a bunch of designated overflow buckets, which are shared among the primary buckets.
>> but it seems, if I understood correctly, that the primary difference is that with chaining, a separate chain of overflow buckets is allocated for each primary bucket.  But with the normal way, you have a bunch of designated overflow buckets, which are shared among the primary buckets.

That's basically it, yes. Except that the "normal" way isn't really the normal way ... it's just A way.

Both (and other CRMs) have their advantages and disadvantages. It's just my opinion that chaining works better in most situations in the case of linear hashing. In the end, having the perfect CRM is less important than having a properly dimensioned hash table.
I see what you're saying.  Just to enhance my understanding here, let me ask this:  I was thinking the same thing, that chaining seems to be a better method for collision resolution in the case of linear hashing.

The reason is that, with the alternative method, wouldn't you need to attach pointers to each record stored in the overflow buckets?  After all, with the alternative method the records in each overflow bucket are unrelated and could have come from any primary bucket.

Therefore, in order to form "chains", you'd need to link the records themselves, rather than whole buckets.  In fact, couldn't the main difference between chaining and the alternative method be described simply as: with chaining you form chains of buckets, but with the alternative method you form chains of records throughout the overflow region?
>>>> Disk-based linear hashing is in fact quite often used in database applications
I don't know a major DBMS that allows database files to grow dynamically. Can you name one? Or did I misunderstand you?
>> The reason is that, with the alternative method, wouldn't you need to attach pointers to each record stored in the overflow buckets?

No, pointers are not needed in that case. When you look something up in a hash table, you already know what you're looking for ... You can just take the hash again to see which primary bucket an overflow record belongs to.
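
To make "take the hash again" concrete: it's just the usual linear hashing address calculation, something like this (variable names are mine, only a sketch):

```cpp
#include <cstdint>

// Which primary bucket does a key currently belong to?
// N          = initial number of buckets
// level      = number of completed table doublings
// next_split = index of the bucket that will be split next
std::uint64_t primary_bucket(std::uint64_t hash, std::uint64_t N,
                             unsigned level, std::uint64_t next_split)
{
    std::uint64_t b = hash % (N << level);   // address with the current (smaller) modulus
    if (b < next_split)                      // that bucket has already been split this round,
        b = hash % (N << (level + 1));       // so address with the doubled modulus instead
    return b;
}
```

So for any record sitting in a shared overflow bucket, recomputing this from its key tells you which primary bucket it belongs to; no back-pointer is needed.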
itsmeandnobodyelse:

Berkeley DB implements a dynamically growing linear hash table file.

Infinity08:

>> No, pointers are not needed in that case. When you look something up in a hash table, you already know what you're looking for ... You can just take the hash again to see which primary bucket an overflow record belongs to.

But this seems like a very slow way of doing things.  Suppose you look up a record and don't find it in the primary bucket.  Then wouldn't you have to check every single record in every overflow bucket at the current split point?  This would seem especially bad for an unsuccessful lookup.  To determine that a key is NOT in the table, you'd need to examine every single overflow record in the entire overflow region at that split point, wouldn't you?

>> But this seems like a very slow way of doing things.

That's indeed one of the downsides of that approach ... But on the other hand, when the hash table is properly dimensioned, you shouldn't have too many overflows before a linear hashing split occurs.


>> Then wouldn't you have to check every single record in every overflow bucket at the current split point?

Potentially, yes. But in practice, there's a well defined order in which you check the overflow buckets, and unless you have a lot of overflows, this shouldn't be such a problem. I can't repeat it often enough : a properly dimensioned hash table is VERY important.
Okay, I think I understand.  Once again, you've been extremely helpful to me on this issue.  The last thing I want to ask is this: one potential drawback I see in the chaining method is that it would leave unused buckets throughout the table.

For example, suppose you start with 8 primary buckets (0 - 7) and, after a collision, you end up with an overflow bucket appended after bucket 7.  Suppose bucket 0 has 5 records in the primary bucket, and 2 records in the overflow bucket.  Now, when a split happens, we need to check every record in bucket 0, including the 2 records in the overflow bucket, to see if we need to move them over to bucket 8.  Supposing the 2 overflow records get moved over to bucket 8, we now have an empty overflow bucket.  Further collisions will now presumably go in overflow buckets appended at the new end of the table (which would be after bucket 15).  So the original overflow bucket is just sitting there taking up space, never to be used again.  Is that simply a drawback of chaining, or is there some way to reuse old overflow buckets?
ASKER CERTIFIED SOLUTION
Infinity08
>>>> Is that simply a drawback of chaining
Chaining is normally done in both directions. So you won't use a new bucket until all buckets of the chain are full.
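
A possible layout (only an assumption of how it might be organized, not taken from any particular implementation): each bucket carries forward and backward links, and an overflow bucket that becomes empty after a split is unlinked from its chain and put on a free list, so it is reused before a new bucket is allocated.

```cpp
#include <cstdint>
#include <vector>

struct BucketHeader {
    std::int64_t  prev  = -1;   // previous bucket in the chain (-1: this is the primary bucket)
    std::int64_t  next  = -1;   // next overflow bucket in the chain (-1: end of chain)
    std::uint32_t count = 0;    // records currently stored in this bucket
};

std::vector<std::int64_t> free_list;   // record numbers of empty, reusable overflow buckets

// Called when a split has emptied an overflow bucket somewhere in a chain.
// (In real code the neighbour headers would be read from and written back to the file.)
void release_overflow(std::int64_t bucket_no, const BucketHeader& emptied,
                      BucketHeader& prev_header, BucketHeader* next_header)
{
    prev_header.next = emptied.next;       // unlink the emptied bucket in the forward direction
    if (next_header)
        next_header->prev = emptied.prev;  // ... and in the backward direction
    free_list.push_back(bucket_no);        // remember it; reuse it before extending the file
}
```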
>>>> Berkeley DB implements a dynamically growing linear hash table file.
I used Berkeley DB myself in a former project but was not aware of any dynamically growing data files. I tried to google for it but found no hint of growing data files:

---------------------------------------
2.1. Hash
Berkeley DB includes a Hash access method that implements extended linear hashing [Litw80]. Extended linear hashing adjusts the hash function as the hash table grows, attempting to keep all buckets under-full in the steady state. The Hash access method supports insertion and deletion of records and lookup by exact match only. Applications may iterate over all records stored in a table, but the order in which they are returned is undefined.
----------------------------------------

If I interpret the above correctly, 'linear hashing' operates with 'growing hash tables' by means of 'adjusting' the hash function. It doesn't mean that the data files grow with each new overflow bucket. That would be a very inefficient method, as the file would necessarily get fragmented, which can slow down performance dramatically. An example of such losses is the index file of a file system, which slows down over time in a typical developer environment where temporary files are constantly created and removed.

Regards, Alex