Solved

Posted on 2007-07-25

Hi

The hash algorithm designed by P. J. Weinberger is in widespread use. However, I've been trying to track down some discussion of why it only returns 28 significant bits out of 32, and I can't find anything on Google.

I've tested about 5 different variations of the coded algorithm, all producing the same results (not surprising, given the step in the loop that clears the top nybble of the hash value whenever it contains any set bits).

For illustration, I've included one version of the code I tested (far from optimised) below.

unsigned int PJWHash(const std::string& str)
{
    // For a 32-bit unsigned int these work out to 32, 24, 4, and 0xF0000000.
    unsigned int BitsInUnsignedInt = (unsigned int)(sizeof(unsigned int) * 8);
    unsigned int ThreeQuarters     = (unsigned int)((BitsInUnsignedInt * 3) / 4);
    unsigned int OneEighth         = (unsigned int)(BitsInUnsignedInt / 8);
    unsigned int HighBits          = (unsigned int)(0xFFFFFFFF) << (BitsInUnsignedInt - OneEighth);
    unsigned int hash = 0;
    unsigned int test = 0;

    for (std::size_t i = 0; i < str.length(); i++)
    {
        hash = (hash << OneEighth) + str[i];
        if ((test = hash & HighBits) != 0)
        {
            // Fold the top nybble back into the low bits, then clear it.
            hash = (hash ^ (test >> ThreeQuarters)) & ~HighBits;
        }
    }

    return hash;
}

So, my questions:

1. Am I correct in assuming that the hash function increases the chance of hash collisions by (IMHO) needlessly decreasing the spread of allowable hash values? I.e. 2^28 instead of 2^32 on a 32-bit system.

2. Is this even a problem?


6 Comments

The quality of a hash algorithm is measured by the uniformity of the hash values produced for a set of keys. As you can see from the article, this algorithm produces very evenly spread results, whether you take the full hash value, the least significant bits, or the most significant bits. The implication is that almost any collection of bits from the resulting hash would probably serve equally well.

So, the chance of a collision is about as low as it can get.

The first question is, in effect: would the algorithm have been even better if the resulting hash had been allowed to be 32 bits instead of 28?

I don't know the answer to that question. However, as you can see from the article, any improvement would be theoretical at best. That is, in practice, the algorithm as it stands is about as good as it gets. So there would be no noticeable improvement in performance for any but the most bizarre uses of it (e.g. a hash table with more than 2^28 (over 268 million) entries).

I understand that when it comes to hash tables a hash collision is not fatal - it just means more entries end up in the same bucket. Uniformity of distribution is an efficiency issue for hash tables; it's of no concern for compare-by-hash applications.

For compare-by-hash or index-by-hash applications, where implicitly it is assumed that there is a one to one relationship between the value and its hash, then reducing the hash space would seem to be a bad thing.

As for claims about distribution: in test cases here (we are hashing internal type names), strings of the same length with all but the last character the same produce hash values with all but the last digit the same, so it doesn't look like it spreads much to me. Specifically, it doesn't provide anywhere near a 50% chance of any output bit changing when a bit in the input changes (if that bit is in the last byte).

So I guess the answer in the end is to choose the right hash for the application. Maybe I need to use MD5, but I can't take the performance hit of that. Also, our dictionary is very small (but unknown), which is in favour of a light-weight hash function. We simply can't abide hash collisions, as they would screw up the type registry which is the foundation of our schema.

For example, you have many names that differ by only the last character - possibly one bit (xxxA - xxxB) - and this does not sufficiently alter the hash value. Have you looked at collisions when only the first character differs? In this case, F(x) means reverse the string, or just move the last character to the front. If that still doesn't solve the problem, you could make the difference greater by using a permutation function such that F(xxxB) = xBxBxB.

But changing any single input character will always change the output.

There is often a trade off between speed and thorough mixing.

PJW tends to make that in favor of speed.

If you want a perfect hash function for a fixed set of inputs, there are ways to develop them.

But even MD5 can't guarantee that you won't have collisions if someone is allowed to choose inputs to create a collision.

Logically, any time the range of the hash is smaller than the possible range of the input there is the possibility of a collision.

So I guess the thing to do is pick a hash function for good performance and low collision rate, but make sure one catches and handles collisions.

Thanks ozo and all for comments. I'm gonna give the points to ozo though (sorry others), since the link led me to an understanding of the issues that I can now deal with.
