Solved

converting (string) urls to 4 byte integer numbers

Posted on 2004-09-07
Last Modified: 2007-12-19
Hi,
I was wondering if anyone knows how a URL (in its string representation) could be mapped to a unique identifier number (4 bytes if possible).  What I would like to do is map up to 100,000,000 URLs to 4-byte unique identifiers to save memory, as these UIDs will be used to represent the URLs within a database.
      Cheers,
      everton690.
Question by:everton690
10 Comments
 
LVL 11

Expert Comment

by:cjjclifford
ID: 11996829
You could create a lookup table (LUT) in the database, with ID and URL as columns (e.g. in Oracle, use a SEQUENCE to generate the ID), and use the ID as a foreign key everywhere.

Other than that, you could apply a CRC, or some other type of hash function, to the URL to convert it to a code. This would not be guaranteed to be unique, though...
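
To make the lookup-table suggestion concrete, here is a minimal in-memory sketch in Java. The class name `UrlTable` and its methods are hypothetical and only stand in for the database table and sequence: each new URL gets the next sequential 4-byte `int`, and the table supports lookups in both directions. In a real deployment the table would live in the database, so the URL string is stored once and everything else refers to the ID.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the ID/URL lookup table. */
final class UrlTable {
    private final Map<String, Integer> idByUrl = new HashMap<>();
    private final List<String> urlById = new ArrayList<>();

    /** Returns the existing ID for url, or assigns the next sequential one. */
    int idFor(String url) {
        Integer existing = idByUrl.get(url);
        if (existing != null) {
            return existing;
        }
        int id = urlById.size();      // plays the role of the database sequence
        urlById.add(url);
        idByUrl.put(url, id);
        return id;
    }

    /** Reverse lookup: the URL that was assigned this ID. */
    String urlFor(int id) {
        return urlById.get(id);
    }

    public static void main(String[] args) {
        UrlTable table = new UrlTable();
        int a = table.idFor("www.abc.com");
        int b = table.idFor("www.xyz.com");
        System.out.println(a + " " + b + " " + table.idFor("www.abc.com")); // prints "0 1 0"
        System.out.println(table.urlFor(b));                               // prints "www.xyz.com"
    }
}
```

Unlike a hash, the IDs are unique by construction and can be mapped back to the URL.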
 
LVL 19

Expert Comment

by:drichards
ID: 11997076
That sounds like the best idea (creating an autonumber as an ID).  The other problem with a hash, besides the fact that with a 4-byte hash you will have a reasonable probability of collisions, is that it is one way.  You cannot back-figure the URL from the hash, so you would need to store the URL anyway unless you are only looking up based on a URL as input.
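
How likely collisions are with a 4-byte hash can be estimated with the birthday bound. The sketch below assumes an ideal hash that spreads keys uniformly over all 2^32 values (real hashes can only do worse); the class and method names are made up for illustration.

```java
/** Birthday-bound estimate for 32-bit hashes (assumes a uniformly distributed hash). */
final class CollisionEstimate {
    /** Expected number of colliding pairs among n keys spread over m hash values. */
    static double expectedCollidingPairs(double n, double m) {
        return n * (n - 1) / (2 * m);
    }

    /** Approximate probability of at least one collision: 1 - exp(-n(n-1)/(2m)). */
    static double collisionProbability(double n, double m) {
        return -Math.expm1(-expectedCollidingPairs(n, m));
    }

    public static void main(String[] args) {
        double m = Math.pow(2, 32);
        System.out.printf("pairs for 100,000,000 URLs: %.0f%n", expectedCollidingPairs(1e8, m));
        System.out.printf("P(collision) for 77,163 URLs: %.3f%n", collisionProbability(77163, m));
    }
}
```

For 100,000,000 URLs this gives roughly 1.16 million expected colliding pairs, and the chance of at least one collision already reaches about 50% at around 77,000 URLs.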
 
LVL 19

Expert Comment

by:RanjeetRain
ID: 12000368
CRC generation is a proven technology, and it works on strings of unlimited length. Any function you code yourself may or may not guarantee the uniqueness of the key associated with a URL.

Another solution, and a little easier, is to generate and use GUIDs. They are standard on the Windows platform and can be reliably used as keys. It is virtually guaranteed that a GUID will be unique.
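
For comparison, here is a small Java sketch of both options using standard library classes (`java.util.zip.CRC32` and `java.util.UUID`); the wrapper class `Crc32AndGuid` is hypothetical. Two caveats apply: a CRC-32 is itself a 32-bit hash, so distinct URLs can still share a value, and a GUID/UUID is 128 bits (16 bytes), so it does not fit the 4-byte target from the question.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import java.util.zip.CRC32;

final class Crc32AndGuid {
    /** CRC-32 of the URL's UTF-8 bytes; the returned value fits in 32 bits. */
    static long crc32(String url) {
        CRC32 crc = new CRC32();
        crc.update(url.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    public static void main(String[] args) {
        System.out.printf("%08x%n", crc32("http://www.experts-exchange.com/Programming/"));
        UUID guid = UUID.randomUUID();
        // A UUID/GUID is two 64-bit halves, i.e. 16 bytes in total.
        System.out.println(guid + " uses " + (2 * Long.BYTES) + " bytes");
    }
}
```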
 
LVL 37

Expert Comment

by:gregoryyoung
ID: 12001582
let's say you have a 50 character url ...

there is no way it could possibly be uniquely hashed into a 4 byte integer (simple math on the number of combinations)

the idea of the uniqueid is good ...

another idea would just be to use a hash function ...

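    #include <stdint.h>

    /* djb2 string hash; uint32_t keeps the result at 4 bytes on every platform */
    uint32_t
    hash(const unsigned char *str)
    {
        uint32_t hash = 5381;
        int c;

        /* extra parentheses: the assignment inside the condition is intentional */
        while ((c = *str++))
            hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

        return hash;
    }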


is a decent one.
 
LVL 51

Expert Comment

by:Julian Hansen
ID: 12005057
If you are using MS SQL

create table myURLS
(
    recid int IDENTITY(1,1) NOT NULL,
    URL varchar(256) PRIMARY KEY NOT NULL
)

Then

insert into myURLS ( URL ) values ( 'www.1.com' )
insert into myURLS ( URL ) values ( 'www.2.com' )
...

Good for 2^31 - 1 URLs (about 2.1 billion, since int is signed and the identity starts at 1), which is slightly more URLs than are currently out there.

 
LVL 11

Accepted Solution

by:
cjjclifford earned 200 total points
ID: 12005145
all these great repeats of my suggestions :-)

btw, if you are using Java, the built-in hashCode() method on String is the correct one to use for the hash value. Note that this is not going to be unique, but it is a 32-bit int, so it fits in 4 bytes, and it is quite likely to be well distributed.

E.g. System.out.println( "http://www.experts-exchange.com/Programming/".hashCode() ); generated the output: 505291298
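
A short Java illustration of why hashCode() cannot serve as a unique ID: "Aa" and "BB" are a well-known pair of strings with the same hash, and any two URLs that differ only in such a pair collide as well. The example.com URLs are made up for the demonstration.

```java
final class HashCodeCollision {
    public static void main(String[] args) {
        // String.hashCode() is s[0]*31^(n-1) + ... + s[n-1] in 32-bit int arithmetic.
        System.out.println("Aa".hashCode()); // prints 2112
        System.out.println("BB".hashCode()); // prints 2112

        // The same prefix followed by "Aa" or "BB" therefore collides too.
        String u1 = "http://example.com/Aa";
        String u2 = "http://example.com/BB";
        System.out.println(u1.hashCode() == u2.hashCode()); // prints true
    }
}
```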

For the LUT, to expand my original suggestion, the syntax for Oracle would be:

CREATE TABLE urls (
    id NUMBER NOT NULL,
    url VARCHAR2(255) NOT NULL
);
CREATE SEQUENCE url_id_seq;

then,

insert into urls( id, url ) VALUES( url_id_seq.nextval, 'www.abc.com' );

Note, you'll probably want a primary key on urls(id) if this is going to be used...
 
LVL 3

Expert Comment

by:aravindtj
ID: 12008537
hi,
 you can represent an IP address in integer format (uint32 on Windows, i.e. unsigned long).
 you can get it with inet_addr ( const char * ipaddress ), where ipaddress is in dotted notation format.
 you can get the host name / url using the gethostbyaddr function.
Syntax:
 struct hostent FAR * gethostbyaddr ( const char FAR * addr,  int len, int type )
 addr - pointer to the address in network byte order (the value from inet_addr, not the dotted string).
 len - length of the address in bytes.
 type - type of address (e.g. AF_INET)

hostent structure:

struct hostent {
    char FAR *       h_name;
    char FAR * FAR * h_aliases;
    short            h_addrtype;
    short            h_length;
    char FAR * FAR * h_addr_list;
};

try that.
all the best
 
LVL 11

Expert Comment

by:cjjclifford
ID: 12008820
aravindtj, at best this would work for domain names, not full URLs, and the overhead of doing the DNS lookup (etc) to get the ID would probably be over the top...
 
LVL 51

Expert Comment

by:Julian Hansen
ID: 12008878
Also, there is the fact that IPs have a one-to-many relationship with domain names: a server with a single IP can host more than one domain.
 

Author Comment

by:everton690
ID: 12009035
Thanks everyone who posted a comment, and thanks to cjjclifford for some relevant suggestions.  In the end (in case anyone is interested) I decided to use Java's GZIP to compress the string URL.  Cheers,
          everton690.
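
The post does not show the GZIP code, so the following is only a guess at what such an approach might look like, using `java.util.zip.GZIPOutputStream` and `GZIPInputStream`. The class `UrlGzip` and its method names are hypothetical, and `readAllBytes()` assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class UrlGzip {
    /** Compresses the URL's UTF-8 bytes into a gzip stream. */
    static byte[] compress(String url) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(url.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray(); // complete once the gzip stream has been closed
    }

    /** Restores the original URL from the gzip bytes. */
    static String decompress(byte[] data) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String url = "http://www.experts-exchange.com/Programming/";
        byte[] packed = compress(url);
        System.out.println(url.length() + " chars -> " + packed.length + " bytes");
        System.out.println(decompress(packed).equals(url)); // prints true
    }
}
```

One thing worth measuring before relying on this: the gzip format adds a 10-byte header and an 8-byte trailer to every stream, so a single short URL will often come out larger than it went in, and the result is a variable-length byte array rather than a 4-byte ID.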
