Solved

Hashing

Posted on 2002-03-28
255 Views
Hi, im having a bit of trouble. ok. a lot of trouble.

I am learning C  and i have been given an exercise on arrays and pointers.

The exercise is to make 1 list to store student names, and another list to store the course they are enrolled on.

I know basically how it works, ie the ascii values are added together and then you use the mod operator to convert the total into a number which becomes the size of the array.

So can anybody give me a bit of help to start me on my way.  I must state that I am not asking anybody to do the work for me, but the more help the better.  Could you please answer asap. I look forward to hearing from youin the near future.

Thanx

~OutOvSyt
0
Question by:OutOvSyt

LVL 16

Expert Comment

ID: 6902829
How about I give a slightly more precise description:

Suppose you have 100 entries in your list. Then, for each name you would find the index for it by adding the ASCII values together, the applying mod 100. That is your hash function. So, for example, a student named "Li", consists of ASCII values 0x4c and 0x69. When you add them you get 0xb5 which is 181 decimal. If you take mod 100 then you get 81. So Li should be stored at index 81.

Does that help?

If you're having trouble coding it, please provide whatever attempt you have made so far, complete with compiler errors if you can't solve those yourself, and we'll take it one step at a time.
0

LVL 84

Expert Comment

ID: 6904602
/* a better hash could be */
int hash(char *s){
int h=0;
while( *s ){
h += (h<<5) + *s++;
}
return h;
}
0

LVL 1

Expert Comment

ID: 6904702
For ozo: how is that better?!
Try your function with the input "Johnny":
This are the intermediaries at every step:
74
2553
84353
2783759
91864157
-1263449994
This is the result:
Final hash value:-1263449994
Am I missing something?

For imladris. Yes, you are right. And this is just for the sake of argument: isn't, statistically speeking, the hash table distribution better when the length of the entry list is a prime number.

For OutOvSyt.
How do you intend to solve conflicts in your implementation. There are at least two approaches, depending very much on the number of entries you estimate in your list.
If the number is at most the length of the list, you can try adding the conflicting value at the imediate empty position.
If you have more than that, you should consider your initial array as an array of pointers to some dynamic structure, like a linked list, which holds all the values with the same hash key, or better yet, a binary tree, sorted by name (in your case).

Bye!
0

LVL 1

Expert Comment

ID: 6904705
Sorry, imladris, forgot the question mark -- ?
0

LVL 1

Expert Comment

ID: 6905335
For imladris - forget about my question, goes beyond the scope of this discussion, of course, the bigger the number of entries, the wider the dispersion, but then again, no point making this number too big (what's the point of having 10000 entries when the average number of records is about 100).
What I had in mind was the way to decide on the optimum number of entries given an estimation of the number of records, but, as I said, I think this is not the place for such debate, unless, of course, OutOvSyt is interested.

By, the way, OutOvSyt, have you been helped?
Anything to ask, some code to post?
0

Author Comment

ID: 6915700
Thanks. You are all being a great help, but its not so much the code i am having trouble with, its more like the whole concept of hashing.

I have read your comments and i appreciate them, but can we try going a bit more basic?

Sorry but the only code i have is the algorithm to get the total and convert it to a range of 0 to MaxNum-1 where MaxNum of the size of the array.

This is basically add the ascii values together and divide them by 24. (I am using 24 as the size of the array).

I would be very grateful if you could post your revised comments after taking this comment into consideration.

~OutOvSyt

0

LVL 1

Accepted Solution

Robalitoru earned 300 total points
ID: 6917555
Well, I see the others got bored, so I might as well tell you a thing or two.
Now, "the whole concept of hashing"... Basically, is an alternative to the ordinary sort. Sometimes is preferable to use hash tables instead of sorted lists (arrays) or binary trees in order to retrieve items fast from a collection.
Why is that? Well, just consider you have a sorted, index based,  array: searching for an element is quite fast with binary search O(lg2 n), but adding or deleting one is O(n).
Let's say than that you have a binary search tree. Basically on these kind of strucutures the complexity of add/delete/retrieve is around O(lg2 n), but, depending on the order of all operations, the tree might get seriously unbalanced, then you will have to balance it in order to ensure the targeted complexity... and so on.
So, depending on the size of data being processed, and on the presumed dynamics of the operations, one might choose a hashing table for manipulating the set of records. The discussion on when exactly each approach is preferable is really too involved...

There are two, as I see it, main use cases for hashing:
1.  You don't have such a large amount of records, but your program acceses them extensively -- you need a way to retrieve the location of one record in almost no time -- let's say you have about 50 records -- making a hash table of about 200 records with pointers to such records -- resolving eventual conflicts with assigning to the imediate free position is the best approach, far better than a sorted array or a binary tree, or any other structure.

2. You have a large amount of data, and adding and deleting are very, very frequent. You don't want a sorted list from reasons of complexity. You don't want a binary tree, because frequent adds and deletes will most likely debalance it.
You might use a hash table with a quite big number of entries, and then, for the records on the same key decide again on the kind of structure. Migth be just a list, a priority list, a sorted array, a binary tree, or even another hash table on another key...

Again, if you think the answer is not specific enough, try making your question more specific.
Cheers!
0

LVL 2

Expert Comment

ID: 6933927
Slight clarify on the above:

Alternative to a sort and search. Data is often sorted so you can do the binary search (throws away half the candidates each time, like a human with a phonebook, open to middle, move 1/2 way in the correct direction each time (because its sorted)).  This search is fast, but not always fast enough...

The best luck I've had for collision resolution is to have several hash functions, and to use each in turn until a space is found.  But I usually only use it on small known data sets with very carefully chosen functions, not on user entered data...

0

Author Comment

ID: 8240052
alot more information than i expected. much more detail than i thought i would need but very very helpful. thanx alot
0

Featured Post

Suggested Solutions

Preface I don't like visual development tools that are supposed to write a program for me. Even if it is Xcode and I can use Interface Builder. Yes, it is a perfect tool and has helped me a lot, mainly, in the beginning, when my programs were smallâ€¦
This tutorial is posted by Aaron Wojnowski, administrator at SDKExpert.net.  To view more iPhone tutorials, visit www.sdkexpert.net. This is a very simple tutorial on finding the user's current location easily. In this tutorial, you will learn hoâ€¦
The goal of this video is to provide viewers with basic examples to understand and use structures in the C programming language.
The goal of this video is to provide viewers with basic examples to understand how to create, access, and change arrays in the C programming language.