Disk Based Linear Hashing
Posted on 2007-11-26
I've recently been studying how linear hashing works, and have successfully implemented an in-memory linear hash table.
Disk based linear hashing seems a bit more tricky because handling collisions/bucket overflows is not as simple as allocating another node on a linked-list. I've read through some collision resolution techniques, such as the one used in Berkeley DB, and there's a few things I don't understand.
It seems a common way to resolve collisions is to allocate specially designated overflow buckets, which are placed between regular buckets at every bucket number which is a power of 2. So whenever a bucket becomes full, you just place the record in the first available overflow bucket.
The thing I don't understand is, what happens when a linear-hash split occurs? In an in-memory implementation, when a split occurs, you look at all the records in the split bucket, and move about half of them to a new bucket at the end. But, on a disk-based implementation, your records could be distributed through a chain of overflow buckets. So you'd basically have to rehash every single record in the overflow region every time a split occurs, to see if a record needs to be moved. This could become quite expensive as the hash table grows. Am I understanding this correctly, or am I missing something about how this works?