[Okta Webinar] Learn how to a build a cloud-first strategyRegister Now

x
Solved

# weinberger hash algorithm flawed?

Posted on 2007-07-25
Medium Priority
1,009 Views
Hi

the hash algorithm designed by P.J.Weinberger is in widespread use.  I've been trying to track down some discussion however of why it only returns 24 significant bits out of 32.  I can't find anything on google.

I've tested about 5 different variations of the coded algorithm, all producing the same results (not surprisingly because of the stage in the loop that clears the top nybble in the hash value if it contains any bits)

for illustration, I've included one version of the code I tested (far from optimised) below.

unsigned int PJWHash(const std::string& str)
{
unsigned int BitsInUnsignedInt = (unsigned int)(sizeof(unsigned int) * 8);
unsigned int ThreeQuarters     = (unsigned int)((BitsInUnsignedInt  * 3) / 4);
unsigned int OneEighth         = (unsigned int)(BitsInUnsignedInt / 8);
unsigned int HighBits          = (unsigned int)(0xFFFFFFFF) << (BitsInUnsignedInt - OneEighth);
unsigned int hash              = 0;
unsigned int test              = 0;

for(std::size_t i = 0; i < str.length(); i++)
{
hash = (hash << OneEighth) + str[i];

if((test = hash & HighBits)  != 0)
{
hash = (( hash ^ (test >> ThreeQuarters)) & (~HighBits));
}
}

return hash;
}

So my questions.

1. Am I correct in assuming that the hash function increases the chance of hash collision by (IMHO) needlessly decreasing the spread of allowable hash values?  I.e. 2^28 instead of 2^32 for a 32-bit system.

2. Is this even a problem?
0
Question by:Adrien de Croy

LVL 85

Expert Comment

ID: 19567355
0

LVL 16

Expert Comment

ID: 19567687
Given the article that ozo has helpfully provided you can confidently answer question 2 with no, it is not a problem.

The quality of a hash algorithm is measured by the uniformity of the hash values produced for a set of keys. As you can see from the article this algorithm produces very evenly spread results, both for the full hash value, for the collection of least siginificant bits from the hash, or the collection of most significant bits. The implication is that almost any collection of bits from the resulting has would probably serve equally well.

So, the chance of a collision is about as low as it can get.

The first question is, in effect, would the algorithm have been even better if the resulting has had been allowed to be 32 bits, instead of 28?

I don't know the answer to that question. However, as you can see from the article, any improvement would be theoretical at best. That is, in practice, the algorithm as it stands is about as good as it gets. So there would be no noticeable improvement in performance for any but the most bizarre uses of it (e.g. a hash table with more than 2^28 (over 268 million) entries).
0

LVL 3

Author Comment

ID: 19571395
Sorry should have been a bit more clear.

I understand when it comes to hash tables that a hash collision is not fatal - it just means more entries end up in the same bucket.  uniformity of distribution is an efficiency issue for hash tables.  It's of no concern for compare-by-hash applications.

For compare-by-hash or index-by-hash applications, where implicitly it is assumed that there is a one to one relationship between the value and its hash, then reducing the hash space would seem to be a bad thing.

As for claims about distribution.  In test cases here (we are hashing internal type names), strings of the same length with all but the last character the same end up in hash values with all but the last digit the same, so it doesnt' look like it spreads it much to me.  Specifically it doesn't provide anywhere near 50% chance of any bit changing if any bit in the input changes (if that last bit is in the last byte).

So I guess the answer in the end is choose the right hash for the application.  Maybe I need to use MD5, but I can't take the performance hit of that.  also our dictionary is very small (but unknown) which is in favour of a light-weight hash function.  We simply can't abide hash collisions, as they would screw the type registry which is the foundation of our schema.
0

LVL 22

Expert Comment

ID: 19571785
If PJWHash(x) is giving you too little distribution, but you like the ferformance, a different approch to try might be PJWHash(F(x)), where F(x) is some simple and fast permutation of x.

For example, you have many names that differ by only the last character - possibly one bit (xxxA - xxxB), and this does not sufficiently alter the hash value. Have you looked at collisions if only the first character differs? In this case, F(x) means reverse the string, or just move the last character to the front. If that still doesn't solve the problem, you could make the difference greater, by making a permutation function such that F(xxxB) = xBxBxB.
0

LVL 85

Accepted Solution

ozo earned 2000 total points
ID: 19572133
one bit change in the input of PJWHash will change an average of two bits in the output.
But changing any single input character will always change the output
There is often a trade off between speed and thorough mixing.
PJW tends to make that in favor of speed.

If you want a perfect hash function for a fixed set of inputs., there are ways to develop them.
But even MD5 can't guarantee that you won't have collisions if someone is allowed to choose inputs to create a collision

0

LVL 3

Author Comment

ID: 19572192
looks like I'm looking for an ideal solution that is impossible - a perfect hash function for an unknown input set with zero chance of collision.

logically any time the range of the hash is smaller than the possible range of the input there is the possibility of a collision.

So I guess the thing to do is pick a hash function for good performance, and low collision rate, but make sure one catches and handles collisions.

Thanks ozo and all for comments.  I'm gonna give the points to ozo though (sorry others), since the link led me to an understanding of the issues that I can now deal with.
0

## Featured Post

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

If you’re thinking to yourself “That description sounds a lot like two people doing the work that one could accomplish,” you’re not alone.
This article will show how Aten was able to supply easy management and control for Artear's video walls and wide range display configurations of their newsroom.
With the power of JIRA, there's an unlimited number of ways you can customize it, use it and benefit from it. With that in mind, there's bound to be things that I wasn't able to cover in this course. With this summary we'll look at some places to go…
Screencast - Getting to Know the Pipeline
###### Suggested Courses
Course of the Month19 days, 20 hours left to enroll

#### 873 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.