Hash function for better distribution

I have this class that represents a range of IP addresses:
class IPRange {
      System.Net.IPAddress m_rangeStart;
      System.Net.IPAddress m_rangeEnd;
This class is used as a key in a hashtable, so it overrides Equals (by simply comparing both fields of both instances) and GetHashCode (which returns m_rangeStart.GetHashCode() ^ m_rangeEnd.GetHashCode() - i.e. XORing the hashcodes of both IP addresses).

My problem is that inserting a large amount of records to the hashtable is SLOW (more than 100 seconds for ~40,000 records). My guess is that the hash code is not "random" enough, and does not distribute uniformly among the hashtable's buckets.

Does anyone have an idea for a simple hash function that will distribute better?

(.NET framework 1.0)
Who is Participating?
anarki_jimbelConnect With a Mentor Commented:
Read http://forum.java.sun.com/thread.jspa?threadID=788958

Using prime numbers helps better distribution. May be it's better to use bigger prime numbers like 17 , 23 etc.
100 s! Huge!...

Are you creating big enough hashtable? Capacity should be about 60.000 for 40.000 records.

I'd also check somehow number of hash code collisions, it may be the main source of problem. Every time we have a clash for hash code Equals is invoked, and equals is much slower. You may try something like

m_rangeStart.GetHashCode() *3 +  m_rangeEnd.GetHashCode()*7
Try using string form of IPAddress:

return (m_rangeStart.ToString() + m_rangeEnd()).GetHashCode();
Cloud Class® Course: C++ 11 Fundamentals

This course will introduce you to C++ 11 and teach you about syntax fundamentals.

I believe ToString() functions are slow enoigh for GetHashCode functions. The main point to make GetHashCode as fast as possible so that you don't spend much time looking for a vacant slot in the hashtable.

Have found quite interesting piece from Java guys:


"The major advantage of a number like "31" (any power of 2 plus/minus 1) is that the multiplication is reduced to a shift and an add (actually a subtract):

a * 31 == (a << 5 - a)

which can be done very fast, even in interpreted code. Usually the compiler performs these sorts of optimizations when generating code, even at very low optimization levels.

And, of course, you want this to be a prime, too, so combinations of (2**n +/- 1 *and* prime) are very few:

3, 5, 7 (all too small), 17 (aha! everyone's favorite "pick a number"), 31, 127, ... "

So it's better to use 17, 31, 127...

Well, I would test it out and compare before declaring it too slow.  (I have no idea how the 1.0 runtime computes string hashs - but it should be pretty fast as this is core part of architecture).

Your idea about multiplying the existing hash is slightly flawed in that you can end up overflowing the hash.  Hash code is an integer.   In practice you will need to handle that error - which will add overhead.  Perhaps if you multipled the Address by a prime  then took its hash you would accomplish better distribution without the overhead of trapping for overflows.    However, Address is deprecated in .Net version 2.

So for simplicity, i suggest taking the hash of the string.  You do not have to worry about upward compatability, nor error trapping and it should perform reasonably well (but i would test with targeted runtime).


I don't need to test. You need to test and think before you try to give wrong advice :)

Your proposal:
"return (m_rangeStart.ToString() + m_rangeEnd()).GetHashCode();"

First, thre is a mistake, you add a string to an IP. This is just a misprint but it shows you never tried your code. It's not necessary however.
Still! You add two strings and that makes the operation slow. Believe me, it's much slower than arithmetic operations as you work with objects. And you slow down entire GetHashCode.

Second. I use the technique with multiplications all the time, that's why I propose it here. So I have tried it, you - never!
You will not get overflow, you just do not inderstand how it's working. That's OK but  it's not too wise to give advices in such a situation.
As an example:

            int a = 2147483647;// max int
            int b = a*10;

result: b equals -10;

Who told about flaws and overflows?
Appologies to you my friend, i chose bad words and yes should have done more homework before posting.  It is always a tradeoff - accuracy for time.... :|

I (think I - gads that was hours ago and i am not at the same machine) did try the code (qualifier: on .net 2.0).  The ToString() method tranlates the IPaddress to a string in "xxx.xxx.xxx.xxx" form.   return (m_rangeStart.ToString() + m_rangeEnd()).GetHashCode();"  is not adding a string it is concatenating a string.  

I was offering a simple means to obtain the desired result.  Not the "best" or the "quickest".  Certainly with the current performance, it would have been real easy for yodant to plug it in and compare results.

It is true i did not test your idea -  instead i relied on my memory about arithmetic overflows.  Your short demo led me to investigate the results.   Running "unchecked" your algorythm is dropping the most significant bit if the number exceeds Int.MaxValue.   Running "checked" it tosses an error.   This is at least how it works in 2.0, not sure in 1.0 - are you?

Ok.  I hereby declare i do not want any points!  I do want Yodant to let us know how well each idea worked in practice....
Accepted :)

What's the difference between "adding" strings and their "concatenation"? I can't see any. Anyway:

From http://www.c-sharpcorner.com/Code/2002/Oct/StringBuilderComp.asp

"String concatenation is one of the commonly used operations among programmers. If you don't handle the string concatenation properly, it may be decrease the performance of an application.
You can concatenate strings in two ways. First, traditional way of using string and adding the new string to an existing string. In the .NET Framework, this operation is costly. When you add a string to an existing string, the Framework copies both the existing and new  data to the memory, deletes the existing string, and reads data in a new string. This operation can be very time consuming in lengthy string concatenation operations."

Second way is StringBuilder, of course.

About my approach to HashCodes. In our company we use it in quite serious product where performance is VERY (lot of xml handling) important and cacheing is used evrywhere. For faster put and get something into cache we need proper GetHashCode.

For those who's going to use the approach: check that the code is not 0. If you use multiplication of all field hashcodes to get an object hashcode be aware that boolean false gives 0, so entire result is 0. true gives 1 and is also useless.
yodantAuthor Commented:
First, thank you both for your input.
I would just like to mention that right after I posted the question, I realized that many of my IP ranges may be "one IP long" (i.e. m_rangeStart==m_rangeEnd). So, the first implementation that I tried right away was simply:

return m_rangeStart.GetHashCode();

This alone has improved the execution time (for inserting ~40,000 records) from 100 seconds to less than a second! Obviously, this solution is not yet ideal, because all ranges that share the same starting IP will also share the same hash code. With that in mind, I tried this:

if (m_rangeStart.Equals(m_rangeEnd))
    return m_rangeStart.GetHashCode();
return m_rangeStart.GetHashCode() ^ m_rangeEnd.GetHashCode();

This downgraded the performance significantly - around 80 seconds. It is 20% better than the first one - and I appearantly do have many "one IP long" - but obviously that isn't the main problem here.

I will try using the idea of multiplying the hash codes of both IP Addresses with prime numbers. I also want to try shifting some bits around (could be useful because the IP Address has an underlying Int64 - perhaps I could take the low order bits from the first IP and XOR them with the high order bits from the second one?). Of course, I will report the results here.
yodantAuthor Commented:
As it seems with the data I have so far, the best performance is when I simply use m_rangeStart.GetHashCode as the hash code of the range. It actually gives better performance than all other options I've considered:
17 * m_rangeStart.GetHashCode() + (31 * m_rangeEnd.GetHashCode()
17 * m_rangeStart.GetHashCode() ^ 31 * m_rangeEnd.GetHashCode()
(m_rangeStart.Address << 32) ^ m_rangeEnd.Address

And of course using ToString is a very bad idea (it is both "logically incorrect" and bad performance).

Again, thank you both.
Obviously "simply use m_rangeStart.GetHashCode as the hash code" is faster than same plus some arithmetic operations.

However it would be interesting to know performance (for comparison) for other options you have listed.
It's not just curiosity, sooner or later we all have problems with performance...

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.