Question

HashTable - Possiblity of Duplicate Hash Values

Asked by: Ignyte_Software

I am working on an application, and I want to use a HashTable to store and lookup data.  The key for the HashTable will be a 20-digit bar code.

Everything was going fine until I ran across a couple of articles that says it is possible for more than one text string to result in the same hash value.

*****************

Article 1

The concept of "hashing" in general needs to be clearly understood. A hash (or 'hash value') is a number generated from a string of text. The string could have numbers in it, but for hashing, consider them to be just text characters. The hash value is calculated so that it's unlikely - but not impossible - for some other text string to result in the same hash value. But the calculation guarantees that the same string will always produce the same hash value. Think of the hash as being a nickname for the entire string - a nickname on steroids.

http://visualbasic.about.com/od/usingvbnet/l/aa071203a.htm

*****************

Article 2

A hash function may map two or more keys to the same hash value. In many applications, it is desirable to minimize the occurrence of such collisions; which means that the hash function must map the keys to the hash values as evenly as possible.

http://en.wikipedia.org/wiki/Hash_function

*****************

This concerns me, because I have to be able to pass in a bar code and have the HashTable accurately return the correct data.  I cannot have two bar codes creating the same hash value.

Can someone either confirm my fears or put them to rest?  What is the likelihood that I could get a duplicate hash value if I load 1 million 20-digit bar codes into a HashTable?

This Question has been solved and asker verified All Experts Exchange premium technology solutions are available to subscription members.

Subscribe now for full access to Experts Exchange and get

Instant Access to this Solution

  • Plus...
  • 30 Day FREE access, no risk, no obligation
  • Collaborate with the world's top tech experts
  • Unlimited access to our exclusive solution database
  • Never be left without tech help again

Subscribe Now

Asked On
2009-09-04 at 07:53:39ID24707885
Tags

hash

,

hash table

,

visual basic

,

.net

,

.net framework

,

unique values

,

hash values

Topics

Algorithms

,

Theory

,

.NET

Participating Experts
7
Points
500
Comments
26

Trusted by hundreds of thousands everyday for fast, accurate and reliable tech support.

  • "The time we save is the biggest benefit of Experts Exchange to Warner Bros. What could take multiple guys 2 hours or more each to find is accessed in around 15 minutes on Experts Exchange." Mike Kapnisakis, Warner Bros.
  • "Our team likes having a resource that is more secure than just using Google and most experts using this service really know their stuff. It's nice to look here first versus using Google." Dayna Sellner, Lockheed Martin
  • "Anytime that I've been stumped with a problem, 9 out of 10 times Experts Exchange has either the accepted solution or an open discussion of the potential solution to the problem." Kenny Red, eBay Inc.

See what Experts Exchange can do for you.

Got a question?

We've got the answer.

Experts Exchange has been collecting answers to technology questions since 1996…3 million and counting! If you have a question, chances are we already have your answer.

Screenshot of Experts Exchange Knowledgebase

Need individual assistance?

Our experts are ready to help.

If you can't find the exact answer you're looking for, ask our exclusive community of 50,000 experts. You’ll get a personalized answer from a trusted professional.

Screenshot of Experts Exchange Knowledgebase

Want to learn from the best?

Read articles from industry experts.

Thousands of free tech tips, tricks, how-to’s and tutorials are available in our peer reviewed articles section. See for yourself how smart our experts are, no login required.

Screenshot of an Article

Working on a long term project?

Store your work and research.

Save solutions to your questions, answers you’ve discovered through searching plus helpful articles in your personal knowledgebase for easy future access.

Screenshot of Experts Exchange Knowledgebase

Access the answers to your technology questions today.

Subscribe Now

30-day free trial. Register in 60 seconds.

What Makes Experts Exchange Unique?

Members of the expert community talk about why the experience at Experts Exchange is different than what you will find anywhere else.

Trusted by the world's most respected brands.

image of each brand's logo

Faithfully serving IT professionals since 1996.

Experts Exchange Logo

Try it out and discover for yourself.

Subscribe Now

30-day free trial. Register in 60 seconds.

Related Solutions

  1. HashTable ...
    Hi experts, I am studying hashtable ... is there any written hashtable class in STL or anywhere in C++ ? or I need to make one myself ? thanks.
  2. Hashtable structure
    I'm slightly confused as to one aspect of how hashtables work. I understand that a hashtable is really just an array of key values, which can be turned into numbers using a hash function, which then allows you to map those values to data values. But, I'm not sure if a hasht...

Free Tech Articles

  1. WARNING: 5 Reasons why you should NEVER fix a computer for free.
    It is in our nature to love the puzzle. We are obsessed. The lot of us. We love puzzles. We love the challenge. We thrive on finding the answer. We hate disarray. It bothers us deep in our soul. W...
  2. SCCM OSD Basic troubleshooting
    SCCM 2007 OSD is a fantastic way to deploy operating systems, however, like most things SCCM issues can sometimes be difficult to resolve due to the sheer volume of logs to sift through and the dispe...
  3. Migrate Small Business Server 2003 to Exchange 2010 and Windows 2008 R2
    This guide is intended to provide step by step instructions on how to migrate from Small Business Server 2003 to Windows 2008 R2 with Exchange 2010. For this migration to work you will need the fo...
  4. Create a Win7 Gadget
    This article shows you how to create a simple "Gadget" -- a sort of mini-application supported by Windows 7 and Vista. Gadgets can be dropped anywhere on the desktop to provide instant information, ...
  5. Outlook continually prompting for username and password
    There have been a lot of questions recently regarding Outlook prompting for a username and password whilst using Exchange 2007. There are a few reasons why this would happen and I will try to cover t...
  6. Backup Exchange 2010 Information Store using Windows Backup
    There seems to be quite a lot of confusion around the ability to backup Exchange 2010 using the built in Windows Backup feature. This stems from the omission of this feature prior to Exchange 2007 s...

Cloud Class Webinars

  1. Avoiding Bugs in Microsoft Access
    Alison Balter takes and in-depth look at avoiding bugs in Access. In this webinar you will learn about using the immediate window to debug your applications, invoking the debugger, using breakpoints to troubleshoot, stepping through code, setting the next statement to execute, ...
  2. Top 10 Best New Features in Visio 2010
    Scott Helmers gives live demonstrations of the top 10 new features in Visio 2010. This webinar will teach you how to create compelling diagrams by adding shapes to the page with a single click, linking the shapes in a diagram to data in Excel (or SQL Server, or SharePoint), ...
  3. IT Consultant Business Secrets Revealed
    Michael Munger, Experts Exchange tech pro and IT consultant, pulls back the curtain on his very successful businesses and answers question on every IT consultant and business owner should know about. He shares secrets on what he did to solve the 5 most common problems in IT, ...
  4. Disaster Recovery and Business Continuity
    Quest CTO, Mike Billon, gives an overview of the steps involved in building a dunamic disaster recovery plan. Through case studies and an examination of software/hardware tooles for monitoring and testing, you'll gain a better understandin of where you are, where you want ...
  5. Organize Your Visio Diagrams with Containers and Lists
    Scott Helmers uses cross functional flowcharts, wireframe diagrams, data graphic legends and seating charts to teach you: how to ustilize all three new structured diagram components in Visio 2010, the best practices for organizeing shapes in previous version of Visio, how to organize ...
  6. How to Us Objects, Properties, Events and Methods in Microsoft Access
    Alison Dalter gives an in-depbth look at objects, properties, events and methods in Microsoft Access. In this webinar you will learn about using the object browser, referring to objects, working with properties and methods, working with object variables, understanding the ...

Join the Community

Give a Little. Get a Lot.

Join the community of experts here and help other tech pros by answering question in your area of expertise. You can earn FREE access to all Experts Exchange's premium features and resources.

Join the Community

Answers

 

by: ladarlingPosted on 2009-09-04 at 08:02:31ID: 25260351

If you have 1 million unique barcodes, then you are very unlikely to encounter this problem. If you have duplicates, you will almost certainly have it.

"But the calculation guarantees that the same string will always produce the same hash value."

That is true so that comparisons can be made.

 

by: MistralolPosted on 2009-09-04 at 08:04:14ID: 25260365


Yes you will get multiple hash values that re-occur for different strings.

If you are building your own hash table. you may will to look at howto implement skip lists. Or some similar method to overcome this issue. This efftivly lets you have duplicate keys. Which a small performance hit where duplicate keys occurs. If you get a lot of the same duplicate key you then either expend the number of bits in the key and alter how the keys are generated and rebuild the hash table.

This can work really well for a hash implementation that does not know how many keys it is likly to have in the system at design time.


 

by: ladarlingPosted on 2009-09-04 at 08:12:51ID: 25260450

But (and especially using something like barcodes), it may be to your advantage overall to be able to identify duplicates, because that lets you know that one of them is potentially wrong or has changed. The hashtable will throw a System.Argument exception if you try to add a duplicate key, which you can handle and investigate.

 

by: daryalPosted on 2009-09-04 at 08:13:30ID: 25260456

The methods used to solve the problem of collision is called collision resolution. The two most popular method of resolving collision are,
- Separate chaining
- Open addressing

Maybe you should try to implement one of these methods instead of trying to avoid collisions.

 

by: Ignyte_SoftwarePosted on 2009-09-04 at 08:22:37ID: 25260519

Let me clarify . . .

The bar codes that go into the HashTable will be unique.  There will be 1 million unique bar codes / unique keys inserted into the hash table.  Knowing this, can the hash value created for each bar code ever be duplicated?

 

by: ladarlingPosted on 2009-09-04 at 08:33:52ID: 25260616

Then you very little to worry about. This is pretty easy to test:


Dim ht As New Hashtable
For i = 1000000 To 2000000
ht.Add(i.GetHashCode, i)
Next
MsgBox(ht.Count)

'Generate a unique hash code for a million 7 digit integers. At 20 digits, you will get an even more diverse hash range

 

by: BusacPosted on 2009-09-04 at 09:24:39ID: 25261134

Ignyte -- even if some of the barcodes will have the same hash values, the HashTable / Dictionary classes will correctly store them as two different entries. You don't need to worry about this problem at all -- the hash value is only used to improve the performance of the data structure. (Sort of.)

 

by: MistralolPosted on 2009-09-04 at 09:30:46ID: 25261186


Busac --

you may or may not be correct. Some implementation will not handle duplicate keys. Some however will. Its impossible to tell unless you see the documentation for the hash table being used.

 

by: BusacPosted on 2009-09-04 at 09:32:08ID: 25261195

I had the .net base class library one in mind.

 

by: ladarlingPosted on 2009-09-04 at 09:43:35ID: 25261284

Mistralol: ...Some however will. Its impossible to tell unless you see the documentation for the hash table being used.

This is just muddying the waters on the whole discussion. The base class, as stated, is going to return unique keys on base types. I dont know how many million entries it would take to break that 'rule', but I have personally tried it with up to 10,000,000 entries.

Hash functions go wrong when custom types define their own hash schemes, or create hashes based on non-unique members of their classes. That is not what the author is looking at. If its unique strings, integers, or the other base types, he is not going to have duplication problems.

 

by: BusacPosted on 2009-09-04 at 09:46:19ID: 25261308

Calling Integer.GetHashCode() just returns the same integer so obviously you're never going to see a collision. I think that the OP will work with strings though.

But as I said, even if the collision occurs, the HashTable class will still correctly handle it. (I am assuming that OP is using the HasTable or Dictionary class from .net BCL.)

 

by: ladarlingPosted on 2009-09-04 at 09:58:30ID: 25261407

Ahh, yeah, forgot to .ToString. Makes no difference though, this also produces no duplicates:


Dim ht As New Hashtable
For i As Integer = 0 To 2000000
Dim s As String = i.ToString
ht.Add(s.GetHashCode, s)
Next

 

by: GwynforWebPosted on 2009-09-05 at 05:37:29ID: 25266010

Hash insertion  consists of 2 parts

(1) Creation of a hash address for a record key

(2) Insertion of record into table using a collision resolution algorithm ie linear hash open addressing etc.

The very essence of hashing is the collision resolution, I suggest you read a good article on the subject
eg  http://www.pctechtalk.com/forums/community-news/12557-introduction-hashing.html
(the Wiki pages on this are not great imo btw). Hashing is fundamental and commonly used in many areas of computer science, reading up about it is worth it.

If your canned routines are not  implementing collision resolution of some then they are not hash routines.


 

by: GwynforWebPosted on 2009-09-05 at 05:39:33ID: 25266021

typo  "...resolution of some kind then they.."

 

by: thehagmanPosted on 2009-09-05 at 12:40:43ID: 25267451

If your bar codes are 20 digits, there are 10^20 possible inputs. If your hash value is a simple 4 byte integer, there are only about 4*10^9 possible hash values.
Hence, if you'd really use up *all* possible input values, each hash value would be produced by about 2.5*10^10 (!) input values on average.
Of course, if you have only 1million random entries and a good hash function (with a really good hash function, the input need not even be random) there should be plenty of room for avoiding duplicates.
And yet beware - the so-called "Birthday paradox" may strike: if we have about sqrt(N) independant random values between 1 and N, inclusive, then the probability of no collision occuring is approximately 0.5.
With 4*10^9 possible hash values this would correspond to only 65,000 inputs.
Hecne with 1,000,000 inputs collisions are virtually unavoidable.

If the hash values are bigger, say 128 bit numbers as in MD5 hash, then the probability of collisions occuring is of course much lower: The birthday paradox would strike at about 1.8*10^44 inputs - with 1,000,000 inputs you can be relatively confident that no collisions occur. (In fact, 2^128 is much more than 2^20, so the re's no reason to use hashes at all).
However, if you want to use the hash (or part of it) as an address e.g. to a memory based array, then you'll necessarily squeeze the hash values into some useable size for addresses and we're back at the 32bit limit discussed above.

 

by: CoryDambachPosted on 2009-09-05 at 18:05:42ID: 25268447

I have used hashtables for many very large sets of data and never had this problem.  As I understand it, the key value is hashed and this hash maps to a bucket, the bucket still contains the original keys and their corresponding object values, so if two or more keys map to the same bucket, it is not a problem, as the bucket is then searched sequentially until the correct entry is found.  Since the buckets should be guaranteed to be fairly small by a good hash function, the performance of the hashtable remains at virtually constant time.

 

by: Ignyte_SoftwarePosted on 2009-09-10 at 18:55:48ID: 25306162

Several of you have suggested problems with canned routines or implementing my own hash function.

I plan on using the Hashtable Class in the System.Collections Namespace in the Microsoft .Net Framework.

Will the Hastable Class in the .Net Framework handle collision resolution?

 

by: BusacPosted on 2009-09-11 at 01:36:09ID: 25307554

As I already said twice :D, yes it will. You don't have to worry about it.

 

by: MistralolPosted on 2009-09-11 at 01:37:53ID: 25307566


Yes it will handle duplicate keys.

What some of the people here seem to fail to realize. Is that if you have a 128 bit value and implement a hash to reduce this size into smaller bits. You will always have collisions. there is no way around this. But there is a way to deal with it in a more efficent manner.



 

by: ladarlingPosted on 2009-09-11 at 07:01:26ID: 25309584

What some of the people here seem to fail to realize. Is that if you have a 128 bit value and implement a hash to reduce this size into smaller bits. You will always have collisions. there is no way around this. But there is a way to deal with it in a more efficent manner.

What you seem to fail to realize is that the author of this post asked a specific question regarding the .NET hastable object. He is not trying to research how to create a has-table scheme, skip list, collision resolution function, or anything of the sort. You are again using the term 'if you are going to implement a hash', which of course the author is not, he is using the stock object and the hash functions written by Microsoft for their base types.

If you are going to be critical of the answers given by other experts, at least address why it does not satisfy the problem that the author wishes to avoid.

 

by: MistralolPosted on 2009-09-11 at 08:26:09ID: 25310509


Sorry if i assume it was an general hash table algorithms as it was post in Programming Algorithms. Therefore i was not being specific about certain framework implementations. I was also not making any assumtions that barcodes only containe numbers. Some support u to around 120 - 140 different characters.

ladarlin:

I did point out that it avioded the problem. You cut that bit from your trolling quote. I was only pointing out that you will always have collisions in a hash table. som others (including you) were pointing out that you dont. So if you could show me how to compact 7 * 20 bits into something smaller than a 32bit value without having duplicates I would really be intrested.

I have something like the problem here. We use bar code to process returned mail. We use different sections in the bar code to store different record id's eg job / batch / letter numbers. The barcode's produced are not sequentally generated. There is know way to test these at design time for all combinations.

So testing these things with a loop from 0 - 2m really doesnt actually test for anything usefull in a situation like this.
Testing with a loop from 0 - 1000m also doesnt even cover 1% of what may be required in the live enviroment.

We know that the C# hashtable implementation will deal with duplicate entires. But i think the real quesiton that should have been asked here. Is how well will it deal with many multiple duplicates keys returned from the data?
For anything at best guess up to approx 50k or so keys.

 

by: ladarlingPosted on 2009-09-11 at 10:08:33ID: 25311482

I did point out that it avioded the problem. You cut that bit from your trolling quote...

Trolling? I believe that I have been in this discussion from the start, and It appears that others as well as the author agreed that further consideration of the issues that you insist are relevant are not, in this case.

So if you could show me how to compact 7 * 20 bits into something smaller than a 32bit value without having duplicates I would really be intrested.

Again, not relevant, and, happily, we dont need to solve that particular problem, since Microsoft had the forethough to return a 32bit signed integer from their string.getHashCode function.

But i think the real quesiton that should have been asked here. Is how well will it deal with many multiple duplicates keys returned from the data?

Ahhh, I see, I didnt realize that the author failed to asked the correct question. What he really wanted to know, then, is how the hastable object's underlying collision resolution algorithm is written?

For anything at best guess up to approx 50k or so keys

Based on what? Whos best guess? So, with 4,294,967,295 possible unique values for the hash itself, are you asserting that 1 million unique strings (the authors number, mind you) hashed could result in 50,000 duplicates? That is 5%. Or, 1 in every 20 unique strings passed to that function will return a duplicate hashcode... I dont believe that.

 

by: MistralolPosted on 2009-09-11 at 10:29:06ID: 25311662


You do know as the data that you are generating the key from gets longer you get more duplicates.

So for a 32bit int.

Assume a 7 bit char set (large barcode)

The longest barcode you can have which can fit its key into 32bit is 4 chars.
So for an 20 char barcode you require a hash key length of up to 140 bits for no duplicates ot occur. This is of course if your hash function is "perfect". So the difference between 32bit and 140bit number is somewhat easy to generate 50k duplicates ;)
In fact 50k is less that 0.0001% of the key size.

You are not working off 1 million unique strings. You are working off a 32bit key generated from a 140bit value. Which data so long for a key its possible to really hit the bad side of things and generate a lot of duplicate keys.

If you dont belive me you should go and try it :)

 

by: ladarlingPosted on 2009-09-11 at 11:59:09ID: 25312469

You do know as the data that you are generating the key from gets longer you get more duplicates.

Actually no, I dont know that. Since string.getHashCode works on both the characters value and its position within the string, performing a simple mathematical function on an initial value, it follows that a longer string will result in more iterations of modification to the value that a shorter string.

Given that it *is* a fixed seed and that the functions performed on the initial value are hard coded (and thus reliable) , you can positively say that string values whos characters 'add up' to identical values will exist, but that is all that you can 'guess' at.

----

See System.String.GetHashCode() below for further study. I think I'll move on to other Q's now. Thats for the discussion.

[ReliabilityContract(Consistency.WillNotCorruptState, Cer.MayFail)]
public override unsafe int GetHashCode()
{
    fixed (char* text = ((char*) this))
    {
        char* chPtr = text;
        int num = 0x15051505;
        int num2 = num;
        int* numPtr = (int*) chPtr;
        for (int i = this.Length; i > 0; i -= 4)
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
            if (i <= 2)
            {
                break;
            }
            num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
            numPtr += 2;
        }
        return (num + (num2 * 0x5d588b65));
    }
}
                                              
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
22:

Select allOpen in new window

 

by: ladarlingPosted on 2009-09-11 at 12:11:49ID: 25312580

Thats for the discussion. << Thanks I mean. Have a great day.

 

by: MistralolPosted on 2009-09-11 at 12:40:47ID: 25312829


Sure and when you compact a long string to 32bits you get duplicates. So far i have now found several. If this was left running for long enough. A pattern would emerge. Which is working out at approx 1 duplicate key per 32bits. Assuming that the barcodes are allocated in sequence this is ok. But if the allocation happens to match the pattern then there is a problem. These are predictable. Note how the numbers are actually quite close together on matching numbers. In fact in some systems they are predictable enough to effectivly disable it.

Crossed Max Int
Duplicate 00000000012023564166 = -566267043 Keys Generated So Far
Crossed Max Int
Duplicate 00000000012625554161 = -566267043 Keys Generated So Far 4532459951
Crossed Max Int
Duplicate 00000000022428565768 = -566267043 Keys Generated So Far 7934753016
Crossed Max Int
Crossed Max Int
Duplicate 00000000032028535463 = -566267043 Keys Generated So Far 11249330602
Duplicate 00000000032724585768 = -566267043 Keys Generated So Far 11548564938
Crossed Max Int
Crossed Max Int

    class Program
    {
        static int HashCode = 0;
        static Int64 Calculated = 0;
        static bool CrossedMaxInt = false;
 
        static void Main(string[] args)
        {
            StringBuilder Str = new StringBuilder("01234567890123456789");
 
            HashCode = Str.ToString().GetHashCode();
 
            Work(Str, 0);
        }
 
        static void Work(StringBuilder Str, int Pos)
        {
            Str[Pos] = '0';
            while (Str[Pos] != '9')
            {
                if (Pos == Str.Length - 1)
                {
                    Calculated++;
 
                    if (Calculated % int.MaxValue == 0)
                        Console.WriteLine("Crossed Max Int");
 
                    if (Str.ToString().GetHashCode() == HashCode)
                        Console.WriteLine("Duplicate {0} = {1} Keys Generated So Far {2}", Str.ToString(), Str.ToString().GetHashCode(), Calculated);
                }
                else
                {
                    Work(Str, Pos + 1);
                }
 
                Str[Pos] = Convert.ToChar(Convert.ToInt16(Str[Pos]) + 1);
            }
        }
    }

                                              
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
22:
23:
24:
25:
26:
27:
28:
29:
30:
31:
32:
33:
34:
35:
36:
37:
38:
39:

Select allOpen in new window

20120131-EE-VQP-002

3 Ways to Join

30-Day Free Trial

The Experts

98% positive feedback on 31,087 answers since March 2000. angeliii is a Microsoft Most Valuable Professional for his work with MS SQL Server & Develoment.

He has also proven his knowledge of Visual Basic Programming, PHP Scripting and Oracle Databases.

The Experts

97% positive feedback on 10,752 answers since July 2000. lrmoore has more than 18 years experience in the networking industry.

The six-time Mircosoft MVPs specialties include firewalls, virtual private networking, and network management.

Testimonials

"...and excellent source for support... Kind of like having your very own IT dept." Electriciansnet

Testimonials

"I was apprehensive at signing up at first. However... it has already made my life as an IT administrator much easier." JaCrews

Testimonials

"WOW! You guys have great, active, and knowledgeable people on here." moore50

Business Clients

Business Clients

In the Press

"If you’ve got a question... Experts Exchange can supply an answer.”

In the Press

"...an invaluable aid for both IT professionals and those who require tech support."

In the Press

"where IT professionals provide quick answers on just about any topic"

Business Account Plans

Loading Advertisement...