Solved

Hash function for better distribution

Posted on 2006-11-27
13
242 Views
Last Modified: 2010-04-16
I have this class that represents a range of IP addresses:
class IPRange {
      System.Net.IPAddress m_rangeStart;
      System.Net.IPAddress m_rangeEnd;
}
This class is used as a key in a hashtable, so it overrides Equals (by simply comparing both fields of both instances) and GetHashCode (which returns m_rangeStart.GetHashCode() ^ m_rangeEnd.GetHashCode() - i.e. XORing the hashcodes of both IP addresses).

My problem is that inserting a large amount of records to the hashtable is SLOW (more than 100 seconds for ~40,000 records). My guess is that the hash code is not "random" enough, and does not distribute uniformly among the hashtable's buckets.

Does anyone have an idea for a simple hash function that will distribute better?

(.NET framework 1.0)
0
Comment
Question by:yodant
  • 8
  • 3
  • 2
13 Comments
 
LVL 29

Expert Comment

by:anarki_jimbel
Comment Utility
100 s! Huge!...

Are you creating big enough hashtable? Capacity should be about 60.000 for 40.000 records.

I'd also check somehow number of hash code collisions, it may be the main source of problem. Every time we have a clash for hash code Equals is invoked, and equals is much slower. You may try something like

m_rangeStart.GetHashCode() *3 +  m_rangeEnd.GetHashCode()*7
0
 
LVL 29

Accepted Solution

by:
anarki_jimbel earned 75 total points
Comment Utility
Read http://forum.java.sun.com/thread.jspa?threadID=788958

Using prime numbers helps better distribution. May be it's better to use bigger prime numbers like 17 , 23 etc.
0
 
LVL 7

Expert Comment

by:mjmarlow
Comment Utility
Try using string form of IPAddress:

return (m_rangeStart.ToString() + m_rangeEnd()).GetHashCode();
0
 
LVL 29

Expert Comment

by:anarki_jimbel
Comment Utility
I believe ToString() functions are slow enoigh for GetHashCode functions. The main point to make GetHashCode as fast as possible so that you don't spend much time looking for a vacant slot in the hashtable.
0
 
LVL 29

Expert Comment

by:anarki_jimbel
Comment Utility

Have found quite interesting piece from Java guys:

http://forum.java.sun.com/thread.jspa?threadID=682352&messageID=3979066

"The major advantage of a number like "31" (any power of 2 plus/minus 1) is that the multiplication is reduced to a shift and an add (actually a subtract):

a * 31 == (a << 5 - a)

which can be done very fast, even in interpreted code. Usually the compiler performs these sorts of optimizations when generating code, even at very low optimization levels.

And, of course, you want this to be a prime, too, so combinations of (2**n +/- 1 *and* prime) are very few:

3, 5, 7 (all too small), 17 (aha! everyone's favorite "pick a number"), 31, 127, ... "

So it's better to use 17, 31, 127...
0
 
LVL 7

Expert Comment

by:mjmarlow
Comment Utility
anarki,

Well, I would test it out and compare before declaring it too slow.  (I have no idea how the 1.0 runtime computes string hashs - but it should be pretty fast as this is core part of architecture).

Your idea about multiplying the existing hash is slightly flawed in that you can end up overflowing the hash.  Hash code is an integer.   In practice you will need to handle that error - which will add overhead.  Perhaps if you multipled the Address by a prime  then took its hash you would accomplish better distribution without the overhead of trapping for overflows.    However, Address is deprecated in .Net version 2.

So for simplicity, i suggest taking the hash of the string.  You do not have to worry about upward compatability, nor error trapping and it should perform reasonably well (but i would test with targeted runtime).





0
Enabling OSINT in Activity Based Intelligence

Activity based intelligence (ABI) requires access to all available sources of data. Recorded Future allows analysts to observe structured data on the open, deep, and dark web.

 
LVL 29

Expert Comment

by:anarki_jimbel
Comment Utility
mjmarlow,

I don't need to test. You need to test and think before you try to give wrong advice :)

Your proposal:
"return (m_rangeStart.ToString() + m_rangeEnd()).GetHashCode();"

First, thre is a mistake, you add a string to an IP. This is just a misprint but it shows you never tried your code. It's not necessary however.
Still! You add two strings and that makes the operation slow. Believe me, it's much slower than arithmetic operations as you work with objects. And you slow down entire GetHashCode.

Second. I use the technique with multiplications all the time, that's why I propose it here. So I have tried it, you - never!
You will not get overflow, you just do not inderstand how it's working. That's OK but  it's not too wise to give advices in such a situation.
0
 
LVL 29

Expert Comment

by:anarki_jimbel
Comment Utility
As an example:

            int a = 2147483647;// max int
            int b = a*10;
            Debug.WriteLine(b);

result: b equals -10;

Who told about flaws and overflows?
0
 
LVL 7

Expert Comment

by:mjmarlow
Comment Utility
Appologies to you my friend, i chose bad words and yes should have done more homework before posting.  It is always a tradeoff - accuracy for time.... :|

I (think I - gads that was hours ago and i am not at the same machine) did try the code (qualifier: on .net 2.0).  The ToString() method tranlates the IPaddress to a string in "xxx.xxx.xxx.xxx" form.   return (m_rangeStart.ToString() + m_rangeEnd()).GetHashCode();"  is not adding a string it is concatenating a string.  

I was offering a simple means to obtain the desired result.  Not the "best" or the "quickest".  Certainly with the current performance, it would have been real easy for yodant to plug it in and compare results.

It is true i did not test your idea -  instead i relied on my memory about arithmetic overflows.  Your short demo led me to investigate the results.   Running "unchecked" your algorythm is dropping the most significant bit if the number exceeds Int.MaxValue.   Running "checked" it tosses an error.   This is at least how it works in 2.0, not sure in 1.0 - are you?

Ok.  I hereby declare i do not want any points!  I do want Yodant to let us know how well each idea worked in practice....
0
 
LVL 29

Expert Comment

by:anarki_jimbel
Comment Utility
Accepted :)

What's the difference between "adding" strings and their "concatenation"? I can't see any. Anyway:

From http://www.c-sharpcorner.com/Code/2002/Oct/StringBuilderComp.asp

"String concatenation is one of the commonly used operations among programmers. If you don't handle the string concatenation properly, it may be decrease the performance of an application.
You can concatenate strings in two ways. First, traditional way of using string and adding the new string to an existing string. In the .NET Framework, this operation is costly. When you add a string to an existing string, the Framework copies both the existing and new  data to the memory, deletes the existing string, and reads data in a new string. This operation can be very time consuming in lengthy string concatenation operations."

Second way is StringBuilder, of course.

About my approach to HashCodes. In our company we use it in quite serious product where performance is VERY (lot of xml handling) important and cacheing is used evrywhere. For faster put and get something into cache we need proper GetHashCode.

For those who's going to use the approach: check that the code is not 0. If you use multiplication of all field hashcodes to get an object hashcode be aware that boolean false gives 0, so entire result is 0. true gives 1 and is also useless.
0
 
LVL 1

Author Comment

by:yodant
Comment Utility
First, thank you both for your input.
I would just like to mention that right after I posted the question, I realized that many of my IP ranges may be "one IP long" (i.e. m_rangeStart==m_rangeEnd). So, the first implementation that I tried right away was simply:

return m_rangeStart.GetHashCode();

This alone has improved the execution time (for inserting ~40,000 records) from 100 seconds to less than a second! Obviously, this solution is not yet ideal, because all ranges that share the same starting IP will also share the same hash code. With that in mind, I tried this:

if (m_rangeStart.Equals(m_rangeEnd))
    return m_rangeStart.GetHashCode();
return m_rangeStart.GetHashCode() ^ m_rangeEnd.GetHashCode();

This downgraded the performance significantly - around 80 seconds. It is 20% better than the first one - and I appearantly do have many "one IP long" - but obviously that isn't the main problem here.

I will try using the idea of multiplying the hash codes of both IP Addresses with prime numbers. I also want to try shifting some bits around (could be useful because the IP Address has an underlying Int64 - perhaps I could take the low order bits from the first IP and XOR them with the high order bits from the second one?). Of course, I will report the results here.
0
 
LVL 1

Author Comment

by:yodant
Comment Utility
As it seems with the data I have so far, the best performance is when I simply use m_rangeStart.GetHashCode as the hash code of the range. It actually gives better performance than all other options I've considered:
17 * m_rangeStart.GetHashCode() + (31 * m_rangeEnd.GetHashCode()
17 * m_rangeStart.GetHashCode() ^ 31 * m_rangeEnd.GetHashCode()
(m_rangeStart.Address << 32) ^ m_rangeEnd.Address

And of course using ToString is a very bad idea (it is both "logically incorrect" and bad performance).

Again, thank you both.
0
 
LVL 29

Expert Comment

by:anarki_jimbel
Comment Utility
Obviously "simply use m_rangeStart.GetHashCode as the hash code" is faster than same plus some arithmetic operations.

However it would be interesting to know performance (for comparison) for other options you have listed.
It's not just curiosity, sooner or later we all have problems with performance...

thanks
0

Featured Post

IT, Stop Being Called Into Every Meeting

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

Join & Write a Comment

Introduction Although it is an old technology, serial ports are still being used by many hardware manufacturers. If you develop applications in C#, Microsoft .NET framework has SerialPort class to communicate with the serial ports.  I needed to…
Real-time is more about the business, not the technology. In day-to-day life, to make real-time decisions like buying or investing, business needs the latest information(e.g. Gold Rate/Stock Rate). Unlike traditional days, you need not wait for a fe…
Illustrator's Shape Builder tool will let you combine shapes visually and interactively. This video shows the Mac version, but the tool works the same way in Windows. To follow along with this video, you can draw your own shapes or download the file…
This video demonstrates how to create an example email signature rule for a department in a company using CodeTwo Exchange Rules. The signature will be inserted beneath users' latest emails in conversations and will be displayed in users' Sent Items…

772 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

15 Experts available now in Live!

Get 1:1 Help Now