Issues in using a predictable hash function for a service

I was going through the URL shortener example on this page -
My query is only about the hashing function that can be used to compute the shortened URL.
Here it gives a way to encode the URL -
I have a few queries about this -
1) Suppose I use a base62 hash function, or some other function, to shorten a URL. Since anyone can easily discover the function, this is said to be undesirable.
What exactly are the reasons for this? It does feel a bit insecure that someone knows the hash function, but so what? What are the pros and cons?

2) Database ID encoded to base_62 also won't be suitable for a production environment because it will leak information about the database. For example, patterns can be learnt to compute the growth rate(new URLs added per day) and in the worst case, copy the whole database.

How is this possible? At most, a user can predict the shortened URL, that's all.
The shortened URL and the original URL are inserted into a database.

Please help me understand why both of the above are true.
Rohit Bajaj, Asked:

ste5an, Senior Developer, Commented:

1) base62 is NOT a hash function. It is an encoding function that maps binary data to ASCII, so it MUST be reversible.
MD5 is a hash function, which is not reversible. It returns a 128-bit result.

The idea is that you always get a value that is safe to use as part of a URL without further URL-encoding.
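
To make the difference concrete, here is a minimal Python sketch of a base62 encoder/decoder next to an MD5 digest. The alphabet order and the function names are my own assumptions, not something taken from the page being discussed; any fixed ordering of 62 URL-safe characters would work the same way.

```python
import hashlib
import string

# Assumed alphabet order: 0-9, a-z, A-Z (62 URL-safe characters).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
BASE = len(ALPHABET)  # 62


def base62_encode(n: int) -> str:
    """Encode a non-negative integer as a base62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, BASE)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def base62_decode(s: str) -> int:
    """Reverse of base62_encode: anyone who knows the alphabet can do this."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n


def md5_hex(data: str) -> str:
    """MD5 digest as 32 hex characters (128 bits); there is no decode step."""
    return hashlib.md5(data.encode("utf-8")).hexdigest()
```

The encoding round-trips (decoding an encoded ID gives the ID back), while the digest only goes one way.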

2) This argument is against using sequential, monotonically increasing values. It does not apply to random GUIDs.

But the whole point is that the user must use the shortening service to translate/expand the shortened URL.
Rohit Bajaj, Author, Commented:
I need some more clarity, especially on point 2):
Database ID encoded to base_62 also won't be suitable for a production environment because it will leak information about the database.
Can you please give an example to illustrate it?
Bernard S., CTO, Commented:
Since these are URLs to web pages, any hiding of the DB structure should be done before shortening.

What exactly do you display on the web page? Which security system do you use to control DB access? If none, then you should be prepared that some day a robot will eventually access your DB after testing some random URLs.

Securing by concealing is usually a poor defence if it is the only one, because sooner or later the concealed facts will leak.
Rohit Bajaj, Author, Commented:
My web page will allow the user to enter a URL, and I will return a shortened URL. The DB structure will not be visible to the user anyway.
A user can open a shortened URL in their browser, and my server will redirect it to the original URL if it exists.
Basically, I am trying to design such a service for learning purposes.
I am just working out how to design a URL shortening service like the above, and which database I should use.

Can you shed some more light on "Securing by concealing"?
Normally the database I create in a project can only be accessed with a username and password.
ste5an, Senior Developer, Commented:
Point 2 is about the database ID. In more precise terms: about using primary key values as part of the URL.

For example, suppose you have a table whose primary key is a single column using a sequential identity, and you use it directly as https://yourdomain.tld/id. Then an attacker can simply test and retrieve data by querying all natural numbers, and so see which URLs were shortened. This is also possible when using an encoding like base62, e.g. https://yourdomain.tld/base62(id). It's a little trickier, because the attacker must first identify the encoding scheme, but in many cases that is a simple exercise.
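
As a rough illustration of that attack, the following Python sketch uses a toy in-memory shortener. The class `SequentialShortener`, the helper names, and the base62 alphabet order are all hypothetical, made up for this example. It shows the two leaks mentioned in the question: walking the IDs copies every stored URL, and subtracting two decoded codes gives the number of URLs added in between (i.e. the growth rate).

```python
import string

# Assumed alphabet order: 0-9, a-z, A-Z.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def base62_encode(n):
    """Encode a non-negative integer as base62."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def base62_decode(s):
    """Decode a base62 string back to the integer ID."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n


class SequentialShortener:
    """Toy shortener (hypothetical): the short code is base62(auto-increment ID)."""

    def __init__(self):
        self._rows = {}      # id -> original URL, stands in for the DB table
        self._next_id = 1

    def shorten(self, url):
        code = base62_encode(self._next_id)
        self._rows[self._next_id] = url
        self._next_id += 1
        return code

    def expand(self, code):
        return self._rows.get(base62_decode(code))


def enumerate_all(service, max_misses=3):
    """Attacker's view: try 1, 2, 3, ... and stop after a few misses in a row."""
    found, misses, n = [], 0, 1
    while misses < max_misses:
        url = service.expand(base62_encode(n))
        if url is None:
            misses += 1
        else:
            misses = 0
            found.append(url)
        n += 1
    return found


def urls_added_between(code_earlier, code_later):
    """Two codes obtained at different times reveal how many rows were added."""
    return base62_decode(code_later) - base62_decode(code_earlier)
```

An attacker only needs the public expand endpoint for this; no database credentials are involved.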

The only way to avoid this kind of attack is to use a randomized number from a larger domain. This can be done with a hash function. Normally a database ID is an INT (integer, often 4 bytes, as in T-SQL), whereas a hash function like MD5 returns a 128-bit value (16 bytes). Finding a used value in this larger space is therefore much more time-consuming.
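
To put rough numbers on "much more time-consuming", here is a small Python sketch. It assumes the stored keys look uniformly random to the attacker and that the attacker guesses blindly; that would not hold if the attacker could recompute the hash of guessable inputs. The figure of one million stored URLs is only an illustration.

```python
def expected_guesses(stored_keys: int, key_bits: int) -> float:
    """Average number of random guesses needed to hit one stored key.

    Each guess hits with probability stored_keys / 2**key_bits, so the
    expected number of guesses until the first hit is the inverse of that.
    """
    return 2 ** key_bits / stored_keys


# Illustrative assumption: one million shortened URLs are stored.
STORED = 1_000_000
guesses_32_bit = expected_guesses(STORED, 32)    # 4-byte INT key space
guesses_128_bit = expected_guesses(STORED, 128)  # 16-byte key space
```

With a 32-bit space a hit comes after a few thousand guesses on average; with a 128-bit space the expected count is on the order of 10^32.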

BUT: a database ID need not be such an integer. It could also be a 128-bit value. In T-SQL we can use the UNIQUEIDENTIFIER data type and the NEWID() function to get a randomized value. In this case we don't need a hash function, because the value already comes from the same (16-byte) domain.
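
For readers not on T-SQL, a roughly comparable approach in Python is sketched below. This is an analogy, not the NEWID() implementation: `uuid.uuid4()` produces a random 16-byte identifier, and `secrets.token_urlsafe()` produces a random URL-safe string directly. The function names and the 16-byte default are my own choices for the example.

```python
import secrets
import uuid


def new_random_id() -> uuid.UUID:
    """Random 128-bit identifier (version 4 UUID), usable as a primary key."""
    return uuid.uuid4()


def new_short_code(num_bytes: int = 16) -> str:
    """Random URL-safe token built from num_bytes of OS-level randomness.

    No hash function and no sequential ID are involved, so one code
    says nothing about the previous or next one.
    """
    return secrets.token_urlsafe(num_bytes)
```

A shorter token gives shorter URLs but a smaller space to guess in, so the length is a trade-off to decide per service.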