Weinberger hash algorithm flawed?

Posted on 2007-07-25
Last Modified: 2008-01-09

The hash algorithm designed by P. J. Weinberger is in widespread use.  I've been trying to track down some discussion, however, of why it only returns 28 significant bits out of 32.  I can't find anything on Google.

I've tested about five different variations of the coded algorithm, all producing the same results (not surprising, given the step in the loop that clears the top nybble of the hash value whenever any of its bits are set).

For illustration, I've included one version of the code I tested (far from optimised) below.

#include <string>

unsigned int PJWHash(const std::string& str)
{
   unsigned int BitsInUnsignedInt = (unsigned int)(sizeof(unsigned int) * 8);
   unsigned int ThreeQuarters     = (unsigned int)((BitsInUnsignedInt * 3) / 4);
   unsigned int OneEighth         = (unsigned int)(BitsInUnsignedInt / 8);
   unsigned int HighBits          = (unsigned int)(0xFFFFFFFF) << (BitsInUnsignedInt - OneEighth);
   unsigned int hash              = 0;
   unsigned int test              = 0;

   for (std::size_t i = 0; i < str.length(); i++)
   {
      hash = (hash << OneEighth) + str[i];

      // If any bits have spilled into the top nybble, fold them back
      // into the low bits, then clear the top nybble.
      if ((test = hash & HighBits) != 0)
         hash = ((hash ^ (test >> ThreeQuarters)) & (~HighBits));
   }

   return hash;
}
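For what it's worth, the 28-bit ceiling is easy to check directly: the fold-and-clear step runs after every character, including the last, so the top nybble of the result is always zero. A minimal check (the routine is repeated so the snippet stands alone, and `FitsIn28Bits` is just a helper made up for this check):

```cpp
#include <cstddef>
#include <string>

unsigned int PJWHash(const std::string& str)
{
   unsigned int BitsInUnsignedInt = (unsigned int)(sizeof(unsigned int) * 8);
   unsigned int ThreeQuarters     = (unsigned int)((BitsInUnsignedInt * 3) / 4);
   unsigned int OneEighth         = (unsigned int)(BitsInUnsignedInt / 8);
   unsigned int HighBits          = (unsigned int)(0xFFFFFFFF) << (BitsInUnsignedInt - OneEighth);
   unsigned int hash              = 0;
   unsigned int test              = 0;

   for (std::size_t i = 0; i < str.length(); i++)
   {
      hash = (hash << OneEighth) + str[i];
      if ((test = hash & HighBits) != 0)
         hash = ((hash ^ (test >> ThreeQuarters)) & (~HighBits));
   }
   return hash;
}

// Hypothetical helper: true if the top nybble of a 32-bit hash is clear.
bool FitsIn28Bits(unsigned int h)
{
   return (h >> 28) == 0;
}
```

Because the clearing step is the last thing that can touch the hash before the next character (or the return), no input can produce a value at or above 2^28.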

So, my questions:

1. Am I correct in assuming that the hash function needlessly (IMHO) increases the chance of a hash collision by decreasing the spread of allowable hash values?  I.e. 2^28 instead of 2^32 on a 32-bit system.

2. Is this even a problem?
Question by:Adrien de Croy
    LVL 84

    Expert Comment

    LVL 16

    Expert Comment

Given the article that ozo has helpfully provided, you can confidently answer question 2 with no, it is not a problem.

The quality of a hash algorithm is measured by the uniformity of the hash values produced for a set of keys. As you can see from the article, this algorithm produces very evenly spread results, whether you look at the full hash value, the collection of least significant bits, or the collection of most significant bits. The implication is that almost any collection of bits from the resulting hash would probably serve equally well.

    So, the chance of a collision is about as low as it can get.

The first question is, in effect: would the algorithm have been even better if the resulting hash had been allowed to be 32 bits instead of 28?

    I don't know the answer to that question. However, as you can see from the article, any improvement would be theoretical at best. That is, in practice, the algorithm as it stands is about as good as it gets. So there would be no noticeable improvement in performance for any but the most bizarre uses of it (e.g. a hash table with more than 2^28 (over 268 million) entries).
    LVL 3

    Author Comment

    by:Adrien de Croy
Sorry, I should have been a bit clearer.

I understand that when it comes to hash tables, a hash collision is not fatal - it just means more entries end up in the same bucket.  Uniformity of distribution is an efficiency issue for hash tables; it's of no concern for compare-by-hash applications.

For compare-by-hash or index-by-hash applications, where it is implicitly assumed that there is a one-to-one relationship between the value and its hash, reducing the hash space would seem to be a bad thing.

As for the claims about distribution: in test cases here (we are hashing internal type names), strings of the same length with all but the last character the same end up with hash values that are identical in all but the last digit, so it doesn't look like it spreads much to me.  Specifically, it doesn't provide anywhere near a 50% chance of any output bit changing when a bit in the input changes (if that bit is in the last byte).
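The effect is easy to reproduce. For short strings that differ only in the last character ("typeA"/"typeB" are made-up stand-ins for internal type names), every intermediate value stays below 2^28, so the fold step never fires and the last character lands untouched in the low bits:

```cpp
#include <cstddef>
#include <string>

unsigned int PJWHash(const std::string& str)
{
   unsigned int BitsInUnsignedInt = (unsigned int)(sizeof(unsigned int) * 8);
   unsigned int ThreeQuarters     = (unsigned int)((BitsInUnsignedInt * 3) / 4);
   unsigned int OneEighth         = (unsigned int)(BitsInUnsignedInt / 8);
   unsigned int HighBits          = (unsigned int)(0xFFFFFFFF) << (BitsInUnsignedInt - OneEighth);
   unsigned int hash              = 0;
   unsigned int test              = 0;

   for (std::size_t i = 0; i < str.length(); i++)
   {
      hash = (hash << OneEighth) + str[i];
      if ((test = hash & HighBits) != 0)
         hash = ((hash ^ (test >> ThreeQuarters)) & (~HighBits));
   }
   return hash;
}

// For 5-character ASCII strings, the hash never reaches the top nybble,
// so the last character contributes directly to the low byte:
//   PJWHash("typeA") == 0x7C0691
//   PJWHash("typeB") == 0x7C0692
// The two hashes differ only where 'A' (0x41) and 'B' (0x42) differ.
```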

So I guess the answer in the end is to choose the right hash for the application.  Maybe I need to use MD5, but I can't take the performance hit of that.  Also, our dictionary is very small (but of unknown size), which is in favour of a lightweight hash function.  We simply can't abide hash collisions, as they would corrupt the type registry which is the foundation of our schema.
    LVL 22

    Expert Comment

If PJWHash(x) is giving you too little distribution, but you like the performance, a different approach to try might be PJWHash(F(x)), where F(x) is some simple and fast permutation of x.

For example, you have many names that differ only in the last character - possibly by a single bit (xxxA vs. xxxB) - and this does not sufficiently alter the hash value. Have you looked at collisions when only the first character differs? In that case, F(x) would mean reversing the string, or just moving the last character to the front. If that still doesn't solve the problem, you could make the difference greater with a permutation function such that F(xxxB) = xBxBxB.
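One cheap F(x) along those lines is string reversal. The wrapper below is only a sketch (`PJWHashReversed` is a made-up name, not from the thread): by hashing the reversed key, the character that varies passes through every subsequent shift-and-fold step instead of sitting in the low bits.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

unsigned int PJWHash(const std::string& str)
{
   unsigned int BitsInUnsignedInt = (unsigned int)(sizeof(unsigned int) * 8);
   unsigned int ThreeQuarters     = (unsigned int)((BitsInUnsignedInt * 3) / 4);
   unsigned int OneEighth         = (unsigned int)(BitsInUnsignedInt / 8);
   unsigned int HighBits          = (unsigned int)(0xFFFFFFFF) << (BitsInUnsignedInt - OneEighth);
   unsigned int hash              = 0;
   unsigned int test              = 0;

   for (std::size_t i = 0; i < str.length(); i++)
   {
      hash = (hash << OneEighth) + str[i];
      if ((test = hash & HighBits) != 0)
         hash = ((hash ^ (test >> ThreeQuarters)) & (~HighBits));
   }
   return hash;
}

// Hypothetical F(x): reverse the key before hashing, so a difference in
// the final character is mixed through all later shift-and-fold steps.
unsigned int PJWHashReversed(std::string s)
{
   std::reverse(s.begin(), s.end());
   return PJWHash(s);
}
```

Note this only relocates the weakness (keys differing in the *first* character now cluster), so it is worth testing against the actual key set.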
    LVL 84

    Accepted Solution

A one-bit change in the input of PJWHash will change an average of two bits in the output.
But changing any single input character will always change the output.
There is often a trade-off between speed and thorough mixing;
PJW tends to make that trade-off in favor of speed.

If you want a perfect hash function for a fixed set of inputs, there are ways to develop them.
But even MD5 can't guarantee that you won't have collisions if someone is allowed to choose inputs specifically to create one.

    LVL 3

    Author Comment

    by:Adrien de Croy
Looks like I'm looking for an ideal solution that is impossible - a perfect hash function for an unknown input set with zero chance of collision.

Logically, any time the range of the hash is smaller than the possible range of the input, there is the possibility of a collision.

So I guess the thing to do is to pick a hash function with good performance and a low collision rate, but make sure to catch and handle collisions.
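That "catch and handle" step can be sketched as a registry that stores the original key alongside its hash and refuses to proceed on a silent collision (`TypeRegistry` and its method name are illustrative, not from the thread):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>

unsigned int PJWHash(const std::string& str)
{
   unsigned int BitsInUnsignedInt = (unsigned int)(sizeof(unsigned int) * 8);
   unsigned int ThreeQuarters     = (unsigned int)((BitsInUnsignedInt * 3) / 4);
   unsigned int OneEighth         = (unsigned int)(BitsInUnsignedInt / 8);
   unsigned int HighBits          = (unsigned int)(0xFFFFFFFF) << (BitsInUnsignedInt - OneEighth);
   unsigned int hash              = 0;
   unsigned int test              = 0;

   for (std::size_t i = 0; i < str.length(); i++)
   {
      hash = (hash << OneEighth) + str[i];
      if ((test = hash & HighBits) != 0)
         hash = ((hash ^ (test >> ThreeQuarters)) & (~HighBits));
   }
   return hash;
}

// Hypothetical registry sketch: keep the name next to its hash so a
// collision is detected instead of silently corrupting the mapping.
class TypeRegistry
{
public:
   unsigned int add(const std::string& name)
   {
      unsigned int h = PJWHash(name);
      auto it = names_.find(h);
      if (it != names_.end() && it->second != name)
         throw std::runtime_error("hash collision: " + name + " vs " + it->second);
      names_[h] = name;   // re-registering the same name is harmless
      return h;
   }

private:
   std::unordered_map<unsigned int, std::string> names_;
};
```

The throw is just a placeholder; a real registry would more likely rehash with a salt or fall back to full string comparison.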

Thanks ozo and all for the comments.  I'm gonna give the points to ozo though (sorry others), since the link led me to an understanding of the issues that I can now deal with.
