• Status: Solved
• Priority: Medium
• Security: Public

TLD algorithm challenge!

How could you extract the subdomain, domain and tld from an FQDN?

I'm hung up on determining "second-level" TLDs, or whatever their proper name is.

For www.google.co.uk, the domain is "google", not "co", but my current algo would return:

domain: "co"
tld: "uk"

Any thoughts on how to accurately handle the dotted tlds?  I don't know of any tld lists that include these.

elmoredaniel
1 Solution

Commented:
Hi elmoredaniel,

Each TLD (such as .com or .uk) has a registry that defines which SLDs (if any) exist.
Essentially, you'd need to check with each TLD registry to find out whether or not it uses second-level domains.  Once you've got that info, you can parse the domain name correctly.

You can find a list of country TLDs at http://www.iana.org/cctld/cctld-whois.htm and a list of "generic" TLDs at http://www.iana.org/gtld/gtld.htm
These lists will let you know who the administrator of each TLD is, so you can check on their website.

You can find a more detailed explanation of this at http://www.faqts.com/knowledge_base/view.phtml/aid/3564

Does that help?

Commented:
Hi elmoredaniel,
Since you are asking for an algorithm...
1) Get the whole string "http://www.google.co.uk/search?q..."
2) Find whether the string has "://" and find its location, say x. In your case "://" is at the fifth position.
3) Remove the characters up to x + 2. Now you are left with "www.google.co.uk/search?q..."
4) Now find the first occurrence of "/". If it exists, remove everything from that position onward. You are now left with "www.google.co.uk"
5) Count the number of "." in the string: 3 in this case.
6) Split the string into substrings on "." using a function similar to Split in VB.
7) If the dots are 1, first one is domain and second one is tld (google.com)
8) If the dots are 3, second one is domain, fourth one is tld, third one is subdomain. Ignore first substring(typically www) (www.google.co.in)
9) If the dots are 2, check the first substring. If it is "www", then second is domain and third is tld. (www.google.com)
10) If the dots are 2 and first substring is not "www", then first one is subdomain, second one is domain, third is tld (gmail.google.com)
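
The steps above can be sketched roughly as follows. This is only an illustration of the dot-counting idea, not a tested implementation; the function name parseByDotCount and the shape of the returned object are made up for the example.

```javascript
// Sketch of the dot-counting steps above; parseByDotCount is a hypothetical name.
function parseByDotCount(url) {
  // Steps 2-3: drop the scheme ("http://") if present.
  const schemeAt = url.indexOf("://");
  let host = schemeAt >= 0 ? url.slice(schemeAt + 3) : url;

  // Step 4: drop everything from the first "/" onward.
  const slashAt = host.indexOf("/");
  if (slashAt >= 0) host = host.slice(0, slashAt);

  // Steps 5-6: split on "." and count the dots.
  const parts = host.split(".");
  const dots = parts.length - 1;

  if (dots === 1) {
    // Step 7: domain.tld
    return { subdomain: null, domain: parts[0], tld: parts[1] };
  }
  if (dots === 2) {
    // Steps 9-10: a leading "www" is ignored, anything else is the subdomain.
    const sub = parts[0] === "www" ? null : parts[0];
    return { subdomain: sub, domain: parts[1], tld: parts[2] };
  }
  if (dots === 3) {
    // Step 8: first label ignored, third label reported as described in step 8.
    return { subdomain: parts[2], domain: parts[1], tld: parts[3] };
  }
  return null; // other shapes are not covered by the steps
}
```

Because the rules depend only on the number of dots, "google.co.uk" without the leading "www" falls into the two-dot case and comes back with domain "co" and tld "uk", which is the result the original question was trying to avoid.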

Hope this helps :)

Bye
---
Harish

Author Commented:
scampgb, I feared that was the case. Do you know of any links with more detailed information on the format of these SLDs? In particular, I wonder whether there is a length restriction: two or three characters seems to be all that I see. If that's the case, I could check whether the TLD is a country code and then check the length of the SLD; if it is 2 or 3 characters, I could probably conclude that it's "part" of the TLD. What do you think?
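
A rough sketch of that length heuristic, to make the idea concrete. The helper name splitByLengthHeuristic is hypothetical, and the sketch assumes that any two-letter TLD is a country code.

```javascript
// Hypothetical helper sketching the length heuristic described above.
function splitByLengthHeuristic(host) {
  const labels = host.toLowerCase().split(".");
  if (labels.length < 2) return null; // nothing to split

  const tld = labels[labels.length - 1];
  const candidate = labels[labels.length - 2];

  // Assumption: a two-letter TLD is a country code.
  const isCountryCode = tld.length === 2;
  // Heuristic: a 2- or 3-character label before a ccTLD is treated as part of the TLD.
  const looksLikeSld =
    isCountryCode &&
    labels.length >= 3 &&
    candidate.length >= 2 &&
    candidate.length <= 3;

  const suffixLen = looksLikeSld ? 2 : 1;
  const domainIdx = labels.length - suffixLen - 1;
  return {
    subdomain: labels.slice(0, domainIdx).join(".") || null,
    domain: labels[domainIdx],
    tld: labels.slice(domainIdx + 1).join("."),
  };
}
```

With this sketch, "www.google.co.uk" yields domain "google" and tld "co.uk". The guess rests entirely on label length, so a short name registered directly under a country code would be misread as part of the TLD.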


Commented:
elmoredaniel:
Sorry, once again it's not that straightforward :-(
For example, there's an SLD for police.uk - this has the same number of characters as "google".

I think what you'll have to do is go through all the CC registries and see whether they use SLDs for administrative reasons (as "UK" does, for example).
You could then build the TLD list into your process and look up whether or not a given TLD uses SLDs.  If it does, you know how to treat the domain name.
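
A minimal sketch of that lookup approach, assuming a hand-built table. The table below is illustrative and far from complete; the names SLD_TABLE and splitWithTable are made up for the example, and a real table would have to be compiled from each registry.

```javascript
// Illustrative, incomplete table: ccTLD -> second-level labels reserved under it.
// These entries are examples only, not an authoritative list.
const SLD_TABLE = new Map([
  ["uk", new Set(["co", "org", "ac", "gov", "police"])],
  ["jp", new Set(["co", "ne", "ac"])],
  ["au", new Set(["com", "net", "org"])],
]);

// Hypothetical helper: split a host name using the table above.
function splitWithTable(host) {
  const labels = host.toLowerCase().split(".");
  if (labels.length < 2) return null; // nothing to split

  const tld = labels[labels.length - 1];
  const second = labels[labels.length - 2];
  const slds = SLD_TABLE.get(tld);

  // The suffix is two labels long only when the table lists this SLD for this TLD.
  const hasSld = slds !== undefined && labels.length >= 3 && slds.has(second);
  const suffixLen = hasSld ? 2 : 1;
  const domainIdx = labels.length - suffixLen - 1;

  return {
    subdomain: labels.slice(0, domainIdx).join(".") || null,
    domain: labels[domainIdx],
    tld: labels.slice(domainIdx + 1).join("."),
  };
}
```

Because the decision comes from the table rather than from label length or dot count, a long SLD such as police.uk is handled the same way as co.uk, and a name that is not listed is treated as an ordinary domain directly under the TLD.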

Well, that's the theory - Canada could prove a notable exception.


Commented:
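#!/usr/bin/perl -wn
# Reads one hostname per line (-n) and prints the label that precedes the suffix.
# The suffix is one word, optionally followed by one of the listed country codes.
print /([^.]*)\.\w*(?:\.(?:ac|at|au|be|ca|cn|co|ec|fr|hk|il|in|jp|kr|mc|mm|mx|pl|ro|ru|sg|th|tr|uk|za))?$/, "\n";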


Commented:
in javascript :)

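<script type="text/javascript">
// Example input (assumed; the original snippet did not define url).
var url = "http://www.google.co.uk/search?q=tld";
// Keep the host part and split it into its dot-separated labels.
var labels = url.split('//')[1].split('/')[0].split('.');
// The original loop had no body; logging each label is an assumption.
for (var i = 0; i < labels.length; i++) console.log(i, labels[i]);
</script>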

Saggar makers bottom knocker Commented:
Just to throw another spanner in the works: in www.bl.uk, "bl" is a domain rather than a ccSLD.

Commented:
ozo: Just as a matter of interest, how did you get the list "ac|at|au|be|ca|cn|co|ec|fr|hk|il|in|jp|kr|mc|mm|mx|pl|ro|ru|sg|th|tr|uk|za" ?

andyalder: Very good point - as is www.nic.uk :-)

Saggar makers bottom knocker Commented:
Good old Nominet. Notice that their whois (at least the web based version) knows who nic.uk are but it doesn't resolve jet.uk, sl.uk and a handful of others that still have domains rather than ccSLDs under the UK ccTLD. Other bodies may have their own quirks but at least http://lo.ve.ly is back again ;)
