Link to home
Start Free TrialLog in
Avatar of resonantwa
resonantwa

asked on

Resolve / Convert string or MD5 hash into a number within a specific range

Hello,

I want to convert a string into a unique integer value that lies within a range.  What I will be doing is feeding a given URL, or MD5 hash of that URL as input to a formula that returns a value not less than the bottom of my range, and not greater than my range.  

For example, lets say I have http://www.yourdomain.com/cats/ as a string, and a range from one (1) to sixty (60).
All I need is to feed that URL into the formula, and it evaluates it somehow into a number that falls between 1 and 60. But if I have a large set of very similar, but unique URLS, I need to get a fairly even distribution between that range. That is to say, if I had 60 unique strings, I could probably get numbers anywhere in that range.

As another example, two similar strings such as:
http://www.your...../cats/ 
and
http://www.your...../cats1/ 

might return what seems like random values between 1 and 60, as oppose to two integers that are very close to each other (because the strings are the same)

So I was thinking, maybe perform an md5 hash on the string, and then add up the ASCII values of each character in the MD5 hash. Then somehow use this get my number betwwen 1 and 60.


This "number" will act sort of as a serial number for the string, always the same unique evaluation between those ranges.  The upper value in my range with be the last record in a database table. I'll use the "serial number" to always retreive the same record from the table.

At some point in the future, I'll want each url to have a new unique serial number, so all I do is extend my range (for example, as records are added to the database table).
ASKER CERTIFIED SOLUTION
Avatar of cookre
cookre
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
>>Since collisions are always possible:

Virtually guaranteed ...

I do not think you can achieve what you are asking in this way.

Hashing systems have always worked on what is known as a "bathtub" distribuion and you will naturally come across collisions when your "bathtub" is aproximately 60% full.

with a "bathtub" of 60 you are almost guarnteed to hit your first collision by the time you have added just 36 hash entries to your table of approx 60.


I beleive your requirement is unrealastic, I would advise that you rethink it!

Avatar of resonantwa
resonantwa

ASKER

HI cookre,

I knew there was an easy way of getting a number within a range, I just couldn't remember my math (mod). How would the hash string be evaluated into a numerical value on which MOD 'N' would be performed?

David,

The bathub metaphor gives me an idea how effective this would be. I am not advanced enough to know, but given what you've said I think it is reasonable to assume that the uniqueness of the hash will dictate how good the distribution is. Correct?  Uniqueness is actually not critical in this situation, but a distribution covering better than 36 of 60 values would be nice. Do you have any constructive advice to move forward?
Sample code:
http://www.obviex.com/samples/hash.aspx

I would suspect most would provide reasonably well distributed results.
I guess what I mean is, how is a string represented numerically when math is performed on it?

Given a set of hashes, with an acceptable amount of collisions, each hash is going to be the same length, and use a similar character set. Because whatever numerical form the hash translates to mathmatically would also be a factor in the distribution throughout that range.
A quick and dirty way that ought to give reasonable results:

Pad the string with its first n bytes to make it a multiple of 4 bytes.
Treat each 4-byte climp as an unsigned int.
Add them up.
Use the result as the seed to a random generator and mod the result.

To get the int representing a given 4-byte clump, use a union or, if c#, a [StructLayout(LayoutKind.Explicit)]
>>I guess what I mean is, how is a string represented numerically when math is performed on it?
normally the char are converted to ascii codes and it's the ascii code value that math is being performed on it
union {
         s char[4];
         i int;
         } u;
Plop four chars from the target string into u.s, then use u.i.


Or, for c#:
[StructLayout(LayoutKind.Explicit)]
struct FourBytesToInt
{
[FieldOffset(0)] public byte b0;
[FieldOffset(1)] public byte b1;
[FieldOffset(2)] public byte b2;
[FieldOffset(3)] public byte b3;
[FieldOffset(0)] public IntOut;
}
well, I didn't do a very great job of weeding out the ancillary, but essentially  my question was a math question.

And the MOD gets it!

Thanks.
<I guess what I mean is, how is a string represented numerically when math is performed on it?
normally the char are converted to ascii codes and it's the ascii code value that math is being performed on it>

That's what the union does for you.  You copy 4 characters into u.s and use it numerically as u.i

For example, let's say your url is http://www.there.com.  The corresponding 4-byte clumps are
http
://w
ww.t
here
.com

Looking at the first one in RAM you would see the bytes  68 74 74 70.  Then, by referring to u.i you'ld have the integer 1886680168.  Adding up all the u.i integers for each 4-byte clump would give you the final integer you'ld eather take the mod of. or, optionally, use it as a seed for a random generator.

Unless I'm having an extended senior moment and have missed the whole point.