Adrien de Croy

asked on

Weinberger hash algorithm flawed?

Hi

The hash algorithm designed by P. J. Weinberger is in widespread use.  I've been trying to track down some discussion, however, of why it only returns 28 significant bits out of 32, and I can't find anything on Google.

I've tested about five different variations of the coded algorithm, all producing the same results (not surprising, given the step in the loop that clears the top nybble of the hash value whenever it contains any set bits).

For illustration, I've included one version of the code I tested (far from optimised) below.

#include <cstddef>
#include <string>

unsigned int PJWHash(const std::string& str)
{
   unsigned int BitsInUnsignedInt = (unsigned int)(sizeof(unsigned int) * 8);
   unsigned int ThreeQuarters     = (unsigned int)((BitsInUnsignedInt  * 3) / 4);
   unsigned int OneEighth         = (unsigned int)(BitsInUnsignedInt / 8);
   unsigned int HighBits          = (unsigned int)(0xFFFFFFFF) << (BitsInUnsignedInt - OneEighth);
   unsigned int hash              = 0;
   unsigned int test              = 0;

   for(std::size_t i = 0; i < str.length(); i++)
   {
      // Shift the running hash left by one eighth of its width
      // (4 bits for a 32-bit unsigned int) and mix in the next character.
      hash = (hash << OneEighth) + str[i];

      // If anything spilled into the top nybble, fold it back into the
      // low bits and clear it - this is the step that limits the result
      // to 28 significant bits.
      if((test = hash & HighBits)  != 0)
      {
         hash = (( hash ^ (test >> ThreeQuarters)) & (~HighBits));
      }
   }

   return hash;
}
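
As a quick illustration (a sketch of my own, not part of the original test code, and assuming it is compiled together with the function above), the top nybble of the result is always zero, so the output range is 2^28 rather than 2^32:

#include <cassert>
#include <cstdio>

int main()
{
   // A few arbitrary sample strings; any input behaves the same way.
   const char* samples[] = { "TypeA", "TypeB", "SomeLongerInternalTypeName" };
   for (int i = 0; i < 3; i++)
   {
      unsigned int h = PJWHash(samples[i]);
      assert((h & 0xF0000000u) == 0);   // the top nybble is always cleared
      std::printf("%-30s -> 0x%08X\n", samples[i], h);
   }
   return 0;
}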

So, my questions:

1. Am I correct in assuming that the hash function increases the chance of hash collision by (IMHO) needlessly decreasing the spread of allowable hash values, i.e. 2^28 instead of 2^32 on a 32-bit system?

2. Is this even a problem?
ozo

Given the article that ozo has helpfully provided, you can confidently answer question 2 with no, it is not a problem.

The quality of a hash algorithm is measured by the uniformity of the hash values produced for a set of keys. As you can see from the article, this algorithm produces very evenly spread results, both for the full hash value and for collections of its least or most significant bits. The implication is that almost any collection of bits from the resulting hash would probably serve equally well.

So, the chance of a collision is about as low as it can get.

The first question is, in effect, would the algorithm have been even better if the resulting hash had been allowed to be 32 bits, instead of 28?

I don't know the answer to that question. However, as you can see from the article, any improvement would be theoretical at best. That is, in practice, the algorithm as it stands is about as good as it gets. So there would be no noticeable improvement in performance for any but the most bizarre uses of it (e.g. a hash table with more than 2^28 (over 268 million) entries).
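
If you want to eyeball that kind of uniformity on your own key set, a rough sketch along these lines (the generated keys and bucket count are arbitrary, and it assumes the PJWHash above is in scope) just counts how evenly the hashes land in a fixed number of buckets:

#include <cstdio>
#include <string>
#include <vector>

int main()
{
   const unsigned int numKeys    = 100000;
   const unsigned int numBuckets = 1024;
   std::vector<unsigned int> buckets(numBuckets, 0);

   // Hash a batch of machine-generated keys and tally the bucket each lands in.
   for (unsigned int i = 0; i < numKeys; i++)
   {
      char key[32];
      std::sprintf(key, "key%u", i);
      buckets[PJWHash(key) % numBuckets]++;
   }

   // For a well-spread hash, the per-bucket counts stay close to the average.
   unsigned int minCount = numKeys, maxCount = 0;
   for (unsigned int b = 0; b < numBuckets; b++)
   {
      if (buckets[b] < minCount) minCount = buckets[b];
      if (buckets[b] > maxCount) maxCount = buckets[b];
   }
   std::printf("average %u, min %u, max %u keys per bucket\n",
               numKeys / numBuckets, minCount, maxCount);
   return 0;
}
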
Adrien de Croy

ASKER

Sorry, I should have been a bit clearer.

I understand that when it comes to hash tables a hash collision is not fatal - it just means more entries end up in the same bucket.  Uniformity of distribution is an efficiency issue for hash tables; it's of no concern for compare-by-hash applications.

For compare-by-hash or index-by-hash applications, where it is implicitly assumed that there is a one-to-one relationship between a value and its hash, reducing the hash space would seem to be a bad thing.
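
As a rough sanity check on how much the smaller space actually costs (my own back-of-the-envelope figures, assuming ideally distributed hashes and using the standard birthday approximation p ≈ 1 - exp(-n(n-1)/2m) for n keys in a space of m values):

#include <cmath>
#include <cstdio>

// Approximate probability that at least two of n uniformly random hashes
// collide in a space of m possible values (birthday approximation).
double collisionProbability(double n, double m)
{
   return 1.0 - std::exp(-n * (n - 1.0) / (2.0 * m));
}

int main()
{
   const double n = 10000.0;   // e.g. 10,000 distinct type names
   std::printf("28-bit space: %.1f%%\n", 100.0 * collisionProbability(n, 268435456.0));
   std::printf("32-bit space: %.1f%%\n", 100.0 * collisionProbability(n, 4294967296.0));
   return 0;
}

With 10,000 keys that works out to roughly 17% versus roughly 1%, so the four lost bits do matter once you rely on uniqueness rather than bucketing.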

As for the claims about distribution: in test cases here (we are hashing internal type names), strings of the same length with all but the last character the same end up with hash values that differ only in the last digit, so it doesn't look like it spreads things much to me.  Specifically, it doesn't provide anywhere near a 50% chance of any output bit changing when a bit in the input changes (if that changed bit is in the last byte).
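
A minimal way to see that effect (an illustrative sketch only, reusing the PJWHash above) is to XOR the hashes of two names that differ in just the final character:

#include <cstdio>
#include <string>

int main()
{
   // Two names identical except for the last character.
   unsigned int a = PJWHash("InternalTypeA");
   unsigned int b = PJWHash("InternalTypeB");

   // The XOR of the two hashes shows which bits actually changed;
   // here the difference stays confined to the low-order bits.
   std::printf("0x%08X\n0x%08X\nXOR = 0x%08X\n", a, b, a ^ b);
   return 0;
}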

So I guess the answer in the end is to choose the right hash for the application.  Maybe I need to use MD5, but I can't take the performance hit of that.  Also, our dictionary is very small (but unknown), which is in favour of a light-weight hash function.  We simply can't abide hash collisions, as they would screw up the type registry, which is the foundation of our schema.
If PJWHash(x) is giving you too little distribution, but you like the performance, a different approach to try might be PJWHash(F(x)), where F(x) is some simple and fast permutation of x.

For example, you have many names that differ by only the last character - possibly one bit (xxxA - xxxB), and this does not sufficiently alter the hash value. Have you looked at collisions if only the first character differs? In this case, F(x) means reverse the string, or just move the last character to the front. If that still doesn't solve the problem, you could make the difference greater, by making a permutation function such that F(xxxB) = xBxBxB.
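
For example, a minimal sketch of that idea with F(x) taken to be plain string reversal (the names here are just illustrative, and it assumes the PJWHash above is available):

#include <algorithm>
#include <string>

// F(x): a cheap permutation of the key - here simply reversing it, so the
// trailing characters (which often differ) get mixed in first.
std::string ReverseKey(const std::string& str)
{
   std::string reversed(str);
   std::reverse(reversed.begin(), reversed.end());
   return reversed;
}

// Hash the permuted key with the unmodified PJW function.
unsigned int PJWHashPermuted(const std::string& str)
{
   return PJWHash(ReverseKey(str));
}
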
ASKER CERTIFIED SOLUTION
ozo

Looks like I'm looking for an ideal solution that is impossible - a perfect hash function for an unknown input set with zero chance of collision.

Logically, any time the range of the hash is smaller than the possible range of the input, there is the possibility of a collision.

So I guess the thing to do is pick a hash function with good performance and a low collision rate, but make sure to catch and handle collisions.
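
For what it's worth, a minimal sketch of that kind of check (the class and names are purely illustrative, assuming the PJWHash above is in scope): detect the collision at registration time instead of assuming the hash is unique.

#include <map>
#include <stdexcept>
#include <string>

// Hypothetical registry that refuses to register two different names
// whose hashes happen to collide.
class TypeRegistry
{
public:
   void registerType(const std::string& name)
   {
      unsigned int h = PJWHash(name);
      std::map<unsigned int, std::string>::const_iterator it = m_types.find(h);
      if (it != m_types.end() && it->second != name)
      {
         // Two distinct names share a hash - handle it explicitly
         // (rename, fall back to a different hash, or report an error).
         throw std::runtime_error("hash collision: " + name + " vs " + it->second);
      }
      m_types[h] = name;
   }

private:
   std::map<unsigned int, std::string> m_types;
};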

Thanks ozo and all for comments.  I'm gonna give the points to ozo though (sorry others), since the link led me to an understanding of the issues that I can now deal with.