elmoredaniel

TLD algorithm challenge!

How could you extract the subdomain, domain and tld from an FQDN?

I'm hung up on determining "second-level" tlds, or whatever their proper name is.

Example:  www.google.co.uk

The domain is "google" not "co", my current algo would return:

subdomain: "www.google"
domain: "co"
tld: "uk"

Any thoughts on how to accurately handle the dotted tlds?  I don't know of any tld lists that include these.


scampgb

Hi elmoredaniel,
The simple answer is that there isn't a simple answer.

Each TLD (such as .com or .uk) has a registry which defines what (if any) SLDs exist.
Essentially, you'd need to check with each TLD registry to find out whether or not they use second-level domains.  Once you've got that info, you can parse the domain name correctly.

You can find a list of country TLDs at http://www.iana.org/cctld/cctld-whois.htm and a list of "generic" TLDs at http://www.iana.org/gtld/gtld.htm
These lists will let you know who the administrator of each TLD is, so you can check on their website.
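
As a rough illustration of that idea, here is a minimal sketch of a lookup-table parser. The table contents, the `SECOND_LEVEL` name and the `splitHost` function are assumptions made up for this example; a real table would have to be compiled by hand from each TLD administrator's documentation and would be far longer.

```javascript
// Hypothetical, hand-maintained table: ccTLD -> second-level labels that the
// TLD administrator uses as part of the suffix. Entries are illustrative only.
const SECOND_LEVEL = new Map([
  ["uk", ["co", "org", "ac", "gov"]],
  ["in", ["co", "net", "org"]],
]);

// Split a host name into { subdomain, domain, tld } using the table above.
function splitHost(host) {
  const labels = host.toLowerCase().split(".");
  const n = labels.length;
  const slds = SECOND_LEVEL.get(labels[n - 1]) || [];
  // The suffix is two labels long when the label before the TLD is a known SLD.
  const suffixLen = n >= 3 && slds.includes(labels[n - 2]) ? 2 : 1;
  const domainIndex = n - suffixLen - 1;
  if (domainIndex < 0) return null; // a bare TLD has nothing to split
  return {
    subdomain: labels.slice(0, domainIndex).join("."),
    domain: labels[domainIndex],
    tld: labels.slice(domainIndex + 1).join("."),
  };
}
```

With this table, `splitHost("www.google.co.uk")` gives subdomain "www", domain "google" and tld "co.uk", while a TLD that is not in the table simply falls back to a one-label suffix.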

You can find a more detailed explanation of this at http://www.faqts.com/knowledge_base/view.phtml/aid/3564

Does that help?
Harisha M G
Hi elmoredaniel,
    Since you are asking for an algorithm...
    1) Get the whole string "http://www.google.co.uk/search?q..."
    2) Find whether the string has "://" and find its location, say x. In your case "://" is at the fifth position.
    3) Remove the characters up to x + 2. Now you are left with "www.google.co.uk/search?q..."
    4) Now find the first occurrence of "/". If it exists, remove everything from that position onward. You are now left with "www.google.co.uk"
    5) Count the number of "." in the string: 3
    6) Split the string into substrings using a function similar to Split in VB.
    7) If there is 1 dot, the first substring is the domain and the second is the tld (google.com)
    8) If there are 3 dots, the second substring is the domain, the fourth is the tld and the third is the subdomain. Ignore the first substring, typically www (www.google.co.in)
    9) If there are 2 dots, check the first substring. If it is "www", then the second is the domain and the third is the tld (www.google.com)
    10) If there are 2 dots and the first substring is not "www", then the first is the subdomain, the second is the domain and the third is the tld (gmail.google.com)
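
The steps above can be sketched in JavaScript roughly as follows. The function name `parseByDotCount` and the shape of the returned object are made up for this example, and dot counts that the steps do not mention simply return null.

```javascript
// Follows the numbered steps above; names are hypothetical.
function parseByDotCount(url) {
  // Steps 2-3: drop everything up to and including "://", if present.
  const schemeAt = url.indexOf("://");
  let host = schemeAt >= 0 ? url.slice(schemeAt + 3) : url;
  // Step 4: drop the path, starting at the first "/".
  const slashAt = host.indexOf("/");
  if (slashAt >= 0) host = host.slice(0, slashAt);
  // Steps 5-6: split on "." and count the dots.
  const parts = host.split(".");
  const dots = parts.length - 1;
  // Step 7: domain.tld
  if (dots === 1) return { subdomain: "", domain: parts[0], tld: parts[1] };
  // Steps 9-10: www.domain.tld or subdomain.domain.tld
  if (dots === 2) {
    const sub = parts[0] === "www" ? "" : parts[0];
    return { subdomain: sub, domain: parts[1], tld: parts[2] };
  }
  // Step 8: first label ignored; third label reported as "subdomain", as in the steps.
  if (dots === 3) return { subdomain: parts[2], domain: parts[1], tld: parts[3] };
  return null; // other dot counts are not covered by the steps
}
```

Because the decision rests only on the number of dots, a name such as a.b.google.com (3 dots) would come back with "b" as the domain, so this only holds for hosts shaped like the examples in the steps.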

Hope this helps :)

Bye
---
Harish
elmoredaniel

ASKER

scampgb, I feared that was the case. Do you know of any links with more detailed information on the format of these SLDs? In particular, I wonder whether there is a length restriction; two or three characters seems to be all that I see. If that's the case, then I could check whether the TLD is a CC and then check the length of the SLD: if it is 2 or 3, I could probably conclude that it's "part" of the TLD. What do you think?
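
For what it's worth, the length check proposed above could look something like the sketch below. The function name `splitByLengthHeuristic` is made up, and treating every two-letter TLD as a country code is an assumption of the sketch, not something established in this thread.

```javascript
// Heuristic from the post above: under a two-letter (country-code) TLD, treat a
// 2- or 3-character label in front of it as part of the suffix.
function splitByLengthHeuristic(host) {
  const labels = host.toLowerCase().split(".");
  const n = labels.length;
  if (n < 2) return null;
  const isCountryCode = labels[n - 1].length === 2; // assumption: 2 letters means ccTLD
  const sld = labels[n - 2];
  const shortSld = sld.length === 2 || sld.length === 3;
  // Only widen the suffix when a label is left over to act as the domain.
  const suffixLen = isCountryCode && shortSld && n >= 3 ? 2 : 1;
  const domainIndex = n - suffixLen - 1;
  return {
    subdomain: labels.slice(0, domainIndex).join("."),
    domain: labels[domainIndex],
    tld: labels.slice(domainIndex + 1).join("."),
  };
}
```

This handles www.google.co.uk as hoped, but any short name registered directly under a country code gets swallowed into the suffix: for www.bl.uk it reports "www" as the domain and "bl.uk" as the tld.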

Your links were very helpful!
ASKER CERTIFIED SOLUTION
scampgb
This solution is only available to members.
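#!/usr/bin/perl
use strict;
use warnings;
$_ = "www.google.co.uk";
# Capture the label in front of the suffix; the trailing country code is optional,
# so plain names such as www.google.com match as well.
print /([^.]*)\.\w*(?:\.(?:ac|at|au|be|ca|cn|co|ec|fr|hk|il|in|jp|kr|mc|mm|mx|pl|ro|ru|sg|th|tr|uk|za))?$/, "\n";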

in javascript :)

<script type="text/javascript">
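 // Strip the scheme and the path, then split the host name into its labels.
 var url = "http://www.google.co.uk/search?q=puzzle";
 var labels = url.split('//')[1].split('/')[0].split('.');
 for (var i = 0; i < labels.length; i++) {
   alert(labels[i]);
 }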
</script>
Just to add another spanner in the works: in www.bl.uk, bl is a domain rather than a ccSLD.
ozo: Just as a matter of interest, how did you get the list "ac|at|au|be|ca|cn|co|ec|fr|hk|il|in|jp|kr|mc|mm|mx|pl|ro|ru|sg|th|tr|uk|za" ?

andyalder: Very good point - as is www.nic.uk :-)
Good old Nominet. Notice that their whois (at least the web based version) knows who nic.uk are but it doesn't resolve jet.uk, sl.uk and a handful of others that still have domains rather than ccSLDs under the UK ccTLD. Other bodies may have their own quirks but at least http://lo.ve.ly is back again ;)