Solved

Script to find Top Level Domains Only

Posted on 2013-11-05
Last Modified: 2013-11-06
I've got hundreds of domains, all over the world, with various TLDs and ccTLDs, plus a list of the subdomains (thousands). I want to use a script to match the list of TLDs and cut any subdomains off the front of the input...

Domain list:
subdomain1.subdomain2.subdomain3.example.co.jp
subdomain1.subdomain2.example.co.uk
subdomain1.subdomain2.example.com.mx
subdomain.example.gov.tx
subdomain.example.org
example.info

TLD's: (for this example, see attached for full txt)
.org
.info
.gov.tx
.com.mx
.co.uk
.co.jp

and so on. Sometimes I've got subdomains, other times not, so regex *seems* out of the question because the number of dots varies so much (anywhere from 2-6 dots in an entry) in the domain list.
So I was thinking read each line of the domain list (above), match a TLD to the end, put that aside, match anything left of the TLD up to one dot or beginning of line (if no dot found) and then combine that into one whole TLD.
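
The idea described above can be sketched in Perl as follows. This is only a minimal illustration, not anyone's posted solution: the %is_tld hash is a hypothetical stand-in for the attached valid-domains.txt, and registered_domain is a name invented for this example. It tries the longest possible suffix first, so ".co.jp" wins over ".jp" when both are listed.

```perl
use strict;
use warnings;

# Hypothetical stand-in for valid-domains.txt (entries keep their leading dot).
my %is_tld = map { $_ => 1 } qw(.org .info .jp .co.jp .co.uk .com.mx);

# Return one label plus the longest listed suffix, or nothing if no suffix is listed.
sub registered_domain {
    my ($host) = @_;
    my @labels = split /\./, $host;
    # Start at index 1 so at least one label stays in front of the suffix;
    # smaller indexes mean longer suffixes, so the first hit is the longest.
    for my $i (1 .. $#labels) {
        my $suffix = '.' . join('.', @labels[$i .. $#labels]);
        return "$labels[$i - 1]$suffix" if $is_tld{$suffix};
    }
    return;
}

for my $host (qw(subdomain1.subdomain2.subdomain3.example.co.jp subdomain.example.org example.info)) {
    my $domain = registered_domain($host);
    print defined $domain ? "$domain\n" : "no listed TLD in $host\n";
}
```

Because each candidate suffix is a plain hash lookup, the number of dots in an entry does not matter.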

Again, I've got thousands of subdomains and hundreds of TLDs. Our registrar is a mess because it's not letting us export them as just the TLDs :( We can only export the DNS records for some reason (and we want to leave this registrar).

I am attaching a list of valid domains (ccTLD and gTLD). I'd like the script to read from that and the other file, strip off any subdomains, and leave me with just TLDs like
Example.info
Example.org
Example.co.uk
Example.co.jp
Example.gov.tx
Example.com.mx
-rich
valid-domains.txt
0
Question by:Rich Rumble
20 Comments
 
LVL 26

Assisted Solution

by:wilcoxon
wilcoxon earned 250 total points
ID: 39625718
I think it's as simple as this...
use strict;
use warnings;
my $fil = shift or die "Usage: $0 inputfile\n";
open IN, 'valid-domains.txt' or die "could not open valid-domains.txt: $!";
my @list = map { chomp; s{\.}{\\.}g; $_ } <IN>;
close IN;
my $rx = join '|', @list;
open IN, $fil or die "could not open $fil: $!";
while (<IN>) {
    chomp;
    if (s{^.*([^.]+\.(?:$rx)$}{$1}) { # lazy regex - could replace .* with valid char class
        print $_, "\n";
    } else {
        warn "could not match a TLD in $_";
    }
}


0
 
LVL 11

Accepted Solution

by:
tel2 earned 250 total points
ID: 39625800
Hi richrumble,

When I put your sample domain list (including subdomains) in my-sites.txt, then run this:
#!/usr/bin/perl

open(TLD, '<valid-domains.txt') or die "Can't open TLD file: $!";
while (<TLD>)
{
    chomp;
    s/^\.//;
    $tld{$_} ++;
}

open(SITE, '<my-sites.txt') or die "Can't open SITE file: $!";
while (<SITE>)
{
    chomp;
    $site = $_;
    while (s/^([^.]+)\.(.+)/$2/)
    {
        if ($tld{$2})
        {
            print "$1.$2\n" unless $seen{"$1.$2"};
            $seen{"$1.$2"} ++;
            last;
        }
    }
}

I get this output:
example.co.jp
example.co.uk
example.com.mx
example.org
example.info
Is that what you want?
0
 
LVL 38

Author Closing Comment

by:Rich Rumble
ID: 39625827
Wow, that was fast, and both work equally well, ty both!
0
 
LVL 11

Expert Comment

by:tel2
ID: 39625834
Line 15 can be removed from mine, rich:
    $site = $_;
It was just there for testing.

It looks as if wilcoxon's should tell you if no TLD was matched, which is good.  I can change mine to do that if needed.
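
A minimal sketch of what that change might look like, assuming the same strip-one-label-at-a-time loop. The in-memory lists stand in for valid-domains.txt and my-sites.txt, and strip_subdomains is a name made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for valid-domains.txt and my-sites.txt.
my @tld_lines  = qw(.org .info .jp .co.jp .co.uk .com.mx);
my @site_lines = qw(
    subdomain1.subdomain2.example.co.uk
    subdomain.example.gov.tx
    subdomain.example.org
    other.example.org
);

my %tld;
for my $entry (@tld_lines) {
    (my $bare = $entry) =~ s/^\.//;   # store "co.uk" rather than ".co.uk"
    $tld{$bare} = 1;
}

# Strip one leading label at a time until what remains is a listed TLD.
sub strip_subdomains {
    my ($rest) = @_;
    while ($rest =~ s/^([^.]+)\.(.+)/$2/) {
        return "$1.$2" if $tld{$2};
    }
    return;   # no listed TLD found
}

my %seen;
for my $site (@site_lines) {
    my $domain = strip_subdomains($site);
    if (!defined $domain) {
        warn "could not match a TLD in $site\n";
        next;
    }
    print "$domain\n" unless $seen{$domain}++;
}
```

Unmatched entries go to STDERR via warn, so the list on STDOUT stays clean.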
0
 
LVL 11

Expert Comment

by:tel2
ID: 39626012
Hi wilcoxon.

Nice looking script.

I was about half way through my solution when you posted yours, so I was planning to abandon my attempt, but I couldn't get yours to work so I continued.

Here's the error I get when I run yours:

Unmatched ( in regex; marked by <-- HERE in m/^.*( <-- HERE [^.]+\.(?:\.a\.se|\.ab\.ca|\.abo\.pa|\.ac|\.ac\.ae|\.ac\.at|\.ac\.cn|\.ac\.cr|\.ac\.cy|\.ac\.fj|\.ac\.fk|\.ac\.gn|\.ac\.id|\.ac\.il|\.ac\.in|\.ac\.ir|\.ac\.jp|\.ac\.ke|\.ac\.kr|\.ac\.ma|\.ac\.me|\.ac\.mu|\.ac\.mw|\.ac\.mz|\.ac\.ni|\.ac\.nz|\.ac\.om|\.ac\.pa|\.ac\.pr|\.ac\.rs|\.ac\.ru|\.ac\.rw|\.ac\.se|\.ac\.sz|\.ac\.th|\.ac\.tj|\.ac\.tz|\.ac\.ug|\.ac\.uk|\.ac\.vn|\.ac\.yu|\.ac\.za|\.ac\.zm|\.ad|\.ad\.jp|\.adm\.br|\.adv\.br|\.adygeya\.ru|\.ae|\.aero|\.aero\.mv|\.aero\.tt|\.af|\.ag|\.agr\.br|\.agric\.za|\.agrinet\.tn|\.ah\.cn|\.ai|\.al|\.alt\.za|\.altai\.ru|\.am|\.am\.br|\.amur\.ru|\.an|\.ao|\.aq|\.ar|\.arkhangelsk\.ru|\.arpa|\.arq\.br|\.art\.br|\.art\.do|\.art\.dz|\.art\.pl|\.art\.sn|\.arts\.nf|\.arts\.ro|\.as|\.asia|\.asn\.au|\.asn\.lv|\.assn\.lk|\.asso\.dz|\.asso\.fr|\.asso\.km|\.asso\.mc|\.asso\.re|\.astrakhan\.ru|\.at|\.ato\.br|\.au|\.av\.tr|\.aw|\.ax|\.az|\.b\.br|\.b\.se|\.ba|\.bashkiria\.ru|\.bb|\.bbs\.tr|\.bc\.ca|\.bd|\.bd\.se|\.be|\.bel\.tr|\.belau\.pw|\.belgorod\.ru|\.bf|\.bg|\.bh|\.bi|\.bialystok\.pl|\.bio\.br|\.bir\.ru|\.biz|\.biz\.bb|\.biz\.bh|\.biz\.ck|\.biz\.cy|\.biz\.et|\.biz\.fj|\.biz\.ki|\.biz\.mv|\.biz\.ng|\.biz\.nr|\.biz\.om|\.biz\.pk|\.biz\.pl|\.biz\.pr|\.biz\.tj|\.biz\.tr|\.biz\.tt|\.biz\.ua|\.biz\.vn|\.bj|\.bj\.cn|\.bl|\.bl\.uk|\.blog\.br|\.bm|\.bmd\.br|\.bn|\.bo|\.bourse\.za|\.bq|\.br|\.british-library\.uk|\.bryansk\.ru|\.bs|\.bt|\.buryatia\.ru|\.busan\.kr|\.bv|\.bw|\.by|\.bz|\.c\.se|\.ca|\.cat|\.cat\.tt|\.cbg\.ru|\.cc|\.cc\.bh|\.cd|\.cf|\.cg|\.ch|\.chel\.ru|\.chelyabinsk\.ru|\.cherkassy\.ua|\.chernigov\.ua|\.chernovtsy\.ua|\.chita\.ru|\.chukotka\.ru|\.chungbuk\.kr|\.chungnam\.kr|\.chuvashia\.ru|\.ci|\.cim\.br|\.city\.za|\.ck|\.ck\.ua|\.cl|\.club\.tw|\.cm|\.cn|\.cn\.ua|\.cng\.br|\.cnt\.br|\.co|\.co\.ae|\.co\.ao|\.co\.at|\.co\.ba|\.co\.bb|\.co\.ck|\.co\.cr|\.co\.fk|\.co\.gg|\.co\.id|\.co\.il|\.co\.in|\.co\.ir|\.co\.je|\.co\.jp|\.co\.ke|\.co\.kr|\.co\.ma|\.co\.me|\.co\.mu|\.co\.mw|\.co\.
mz|\.co\.na|\.co\.ni|\.co\.nz|\.co\.om|\.co\.pw|\.co\.rs|\.co\.rw|\.co\.sh|\.co\.st|\.co\.sz|\.co\.th|\.co\.tj|\.co\.tt|\.co\.tz|\.co\.ua|\.co\.ug|\.co\.uk|\.co\.ve|\.co\.vi|\.co\.ye|\.co\.yu|\.co\.za|\.co\.zm|\.com|\.com\.ac|\.com\.af|\.com\.al|\.com\.ar|\.com\.au|\.com\.ba|\.com\.bb|\.com\.bh|\.com\.bn|\.com\.bo|\.com\.br|\.com\.bs|\.com\.bz|\.com\.cn|\.com\.co|\.com\.cy|\.com\.do|\.com\.dz|\.com\.ec|\.com\.eg|\.com\.er|\.com\.es|\.com\.et|\.com\.fj|\.com\.fr|\.com\.gh|\.com\.gn|\.com\.gr|\.com\.gt|\.com\.gu|\.com\.hk|\.com\.iq|\.com\.jo|\.com\.kh|\.com\.ki|\.com\.km|\.com\.kw|\.com\.ky|\.com\.kz|\.com\.lb|\.com\.lk|\.com\.lr|\.com\.lv|\.com\.ly|\.com\.mg|\.com\.mk|\.com\.ml|\.com\.mo|\.com\.mt|\.com\.mu|\.com\.mv|\.com\.mw|\.com\.mx|\.com\.my|\.com\.na|\.com\.nf|\.com\.ng|\.com\.ni|\.com\.np|\.com\.nr|\.com\.om|\.com\.pa|\.com\.pe|\.com\.ph|\.com\.pk|\.com\.pl|\.com\.pr|\.com\.ps|\.com\.pt|\.com\.py|\.com\.qa|\.com\.re|\.com\.ro|\.com\.ru|\.com\.rw|\.com\.sa|\.com\.sb|\.com\.sc|\.com\.sd|\.com\.sg|\.com\.sh|\.com\.sl|\.com\.sn|\.com\.st|\.com\.sv|\.com\.sy|\.com\.tj|\.com\.tn|\.com\.tr|\.com\.tt|\.com\.tw|\.com\.ua|\.com\.uy|\.com\.ve|\.com\.vi|\.com\.vn|\.com\.ye|\.com\.zm|\.conf\.lv|\.consulado\.st|\.coop|\.coop\.br|\.coop\.km|\.coop\.mv|\.coop\.mw|\.coop\.tt|\.cq\.cn|\.cr|\.cri\.nz|\.crimea\.ua|\.csiro\.au|\.cu|\.cv|\.cv\.ua|\.cw|\.cx|\.cy|\.cybernet\.za|\.cym\.uk|\.cz|\.d\.se|\.daegu\.kr|\.daejeon\.kr|\.dagestan\.ru|\.db\.za|\.de|\.de\.ki|\.defense\.tn|\.dj|\.dk|\.dm|\.dn\.ua|\.dnepropetrovsk\.ua|\.dni\.us|\.dnssec\.ir|\.do|\.donetsk\.ua|\.dp\.ua|\.dr\.tr|\.dz|\.e-burg\.ru|\.e\.se|\.ebiz\.tw|\.ec|\.ecape\.school\.za|\.ecn\.br|\.ed\.ao|\.ed\.cr|\.ed\.jp|\.ed\.pw|\.edu|\.edu\.af|\.edu\.al|\.edu\.ar|\.edu\.au|\.edu\.ba|\.edu\.bb|\.edu\.bh|\.edu\.bn|\.edu\.bo|\.edu\.br|\.edu\.bs|\.edu\.bz|\.edu\.ck|\.edu\.cn|\.edu\.co|\.edu\.do|\.edu\.dz|\.edu\.ec|\.edu\.eg|\.edu\.er|\.edu\.es|\.edu\.et|\.edu\.gh|\.edu\.gr|\.edu\.gt|\.edu\.gu|\.edu\.hk|\.edu\.in|\.edu\.iq|\.edu\.
it|\.edu\.jo|\.edu\.kh|\.edu\.ki|\.edu\.km|\.edu\.kn|\.edu\.kw|\.edu\.ky|\.edu\.kz|\.edu\.lb|\.edu\.lk|\.edu\.lr|\.edu\.lv|\.edu\.ly|\.edu\.me|\.edu\.mg|\.edu\.mk|\.edu\.ml|\.edu\.mn|\.edu\.mo|\.edu\.mt|\.edu\.mv|\.edu\.mw|\.edu\.mx|\.edu\.my|\.edu\.mz|\.edu\.ng|\.edu\.ni|\.edu\.np|\.edu\.nr|\.edu\.om|\.edu\.pa|\.edu\.pe|\.edu\.ph|\.edu\.pk|\.edu\.pl|\.edu\.pr|\.edu\.ps|\.edu\.pt|\.edu\.py|\.edu\.qa|\.edu\.rs|\.edu\.ru|\.edu\.rw|\.edu\.sa|\.edu\.sb|\.edu\.sc|\.edu\.sd|\.edu\.sg|\.edu\.sh|\.edu\.sl|\.edu\.sn|\.edu\.st|\.edu\.sv|\.edu\.sy|\.edu\.tj|\.edu\.tr|\.edu\.tt|\.edu\.tw|\.edu\.ua|\.edu\.uy|\.edu\.ve|\.edu\.vn|\.edu\.yu|\.edu\.za|\.edu\.zm|\.edunet\.tn|\.ee|\.eg|\.eh|\.ekloges\.cy|\.embaixada\.st|\.eng\.br|\.ens\.tn|\.er|\.ernet\.in|\.es|\.es\.kr|\.esp\.br|\.est\.pr|\.et|\.etc\.br|\.eti\.br|\.eu|\.eun\.eg|\.f\.se|\.fam\.pk|\.far\.br|\.fed\.us|\.fi|\.fi\.cr|\.fin\.ec|\.fin\.tn|\.firm\.in|\.firm\.nf|\.firm\.ro|\.fj|\.fj\.cn|\.fk|\.flog\.br|\.fm|\.fm\.br|\.fnd\.br|\.fo|\.fot\.br|\.fr|\.fs\.school\.za|\.fst\.br|\.g\.se|\.g12\.br|\.ga|\.game\.tw|\.gangwon\.kr|\.gb|\.gd|\.gd\.cn|\.gda\.pl|\.gdansk\.pl|\.ge|\.geek\.nz|\.gen\.ck|\.gen\.in|\.gen\.nz|\.gen\.tr|\.gf|\.gg|\.ggf\.br|\.gh|\.gi|\.gl|\.gm|\.gn|\.go\.cr|\.go\.id|\.go\.jp|\.go\.ke|\.go\.kr|\.go\.pw|\.go\.th|\.go\.tj|\.go\.tz|\.go\.ug|\.gob\.ar|\.gob\.bo|\.gob\.do|\.gob\.es|\.gob\.gt|\.gob\.mx|\.gob\.ni|\.gob\.pa|\.gob\.pe|\.gob\.pk|\.gob\.sv|\.gob\.ve|\.gok\.pk|\.gon\.pk|\.gop\.pk|\.gorzow\.pl|\.gos\.pk|\.gouv\.fr|\.gouv\.km|\.gouv\.rw|\.gouv\.sn|\.gov|\.gov\.ac|\.gov\.ae|\.gov\.af|\.gov\.al|\.gov\.ar|\.gov\.au|\.gov\.ba|\.gov\.bb|\.gov\.bh|\.gov\.bn|\.gov\.bo|\.gov\.br|\.gov\.bs|\.gov\.bz|\.gov\.ck|\.gov\.cn|\.gov\.co|\.gov\.cy|\.gov\.do|\.gov\.dz|\.gov\.ec|\.gov\.eg|\.gov\.er|\.gov\.et|\.gov\.fk|\.gov\.gh|\.gov\.gn|\.gov\.gr|\.gov\.gu|\.gov\.hk|\.gov\.il|\.gov\.in|\.gov\.iq|\.gov\.ir|\.gov\.it|\.gov\.jo|\.gov\.kh|\.gov\.ki|\.gov\.kn|\.gov\.kw|\.gov\.ky|\.gov\.kz|\.gov\.lb|\.gov\.lk|\.gov\.lr|\.gov\.lv|\.gov\.l
y|\.gov\.ma|\.gov\.me|\.gov\.mg|\.gov\.mk|\.gov\.ml|\.gov\.mn|\.gov\.mo|\.gov\.mt|\.gov\.mu|\.gov\.mv|\.gov\.mw|\.gov\.my|\.gov\.mz|\.gov\.ng|\.gov\.np|\.gov\.nr|\.gov\.om|\.gov\.ph|\.gov\.pk|\.gov\.pl|\.gov\.pr|\.gov\.ps|\.gov\.pt|\.gov\.py|\.gov\.qa|\.gov\.rs|\.gov\.ru|\.gov\.rw|\.gov\.sa|\.gov\.sb|\.gov\.sc|\.gov\.sd|\.gov\.sg|\.gov\.sh|\.gov\.sl|\.gov\.st|\.gov\.sy|\.gov\.tj|\.gov\.tn|\.gov\.tr|\.gov\.tt|\.gov\.tw|\.gov\.ua|\.gov\.uk|\.gov\.vn|\.gov\.ye|\.gov\.yu|\.gov\.za|\.gov\.zm|\.govt\.nz|\.govt\.uk|\.gp|\.gp\.school\.za|\.gq|\.gr|\.gr\.jp|\.grondar\.za|\.grozny\.ru|\.grp\.lk|\.gs|\.gs\.cn|\.gt|\.gu|\.gub\.uy|\.gv\.ao|\.gv\.at|\.gw|\.gwangju\.kr|\.gx\.cn|\.gy|\.gyeongbuk\.kr|\.gyeonggi\.kr|\.gyeongnam\.kr|\.gz\.cn|\.h\.se|\.ha\.cn|\.hb\.cn|\.he\.cn|\.health\.nz|\.health\.vn|\.hi\.cn|\.hk|\.hl\.cn|\.hm|\.hn|\.hn\.cn|\.hotel\.lk|\.hr|\.hs\.kr|\.ht|\.hu|\.i\.ph|\.i\.se|\.iaccess\.za|\.icnet\.uk|\.id|\.id\.au|\.id\.ir|\.id\.lv|\.id\.ly|\.idf\.il|\.idn\.sg|\.idv\.hk|\.idv\.tw|\.ie|\.if\.ua|\.il|\.im|\.imb\.br|\.imt\.za|\.in|\.in\.rs|\.in\.th|\.in\.ua|\.inca\.za|\.incheon\.kr|\.ind\.br|\.ind\.er|\.ind\.gt|\.ind\.in|\.ind\.tn|\.inf\.br|\.inf\.mk|\.info|\.info\.bb|\.info\.bh|\.info\.ck|\.info\.ec|\.info\.et|\.info\.fj|\.info\.ke|\.info\.ki|\.info\.mv|\.info\.nf|\.info\.nr|\.info\.pl|\.info\.pr|\.info\.ro|\.info\.sd|\.info\.tj|\.info\.tn|\.info\.tr|\.info\.tt|\.info\.ve|\.info\.vn|\.ing\.pa|\.int|\.int\.ar|\.int\.bo|\.int\.lk|\.int\.mv|\.int\.mw|\.int\.pt|\.int\.ru|\.int\.rw|\.int\.tj|\.int\.tt|\.int\.vn|\.intl\.tn|\.io|\.iq|\.ir|\.irkutsk\.ru|\.is|\.isa\.us|\.isla\.pr|\.it|\.it\.ao|\.its\.me|\.ivano-frankivsk\.ua|\.ivanovo\.ru|\.iwi\.nz|\.izhevsk\.ru|\.jar\.ru|\.je|\.jeju\.kr|\.jeonbuk\.kr|\.jeonnam\.kr|\.jet\.uk|\.jl\.cn|\.jm|\.jo|\.jobs|\.jobs\.tt|\.jor\.br|\.joshkar-ola\.ru|\.jp|\.js\.cn|\.jus\.br|\.jx\.cn|\.k\.se|\.k12\.il|\.k12\.tr|\.k12\.vi|\.kalmykia\.ru|\.kaluga\.ru|\.kamchatka\.ru|\.karelia\.ru|\.katowice\.pl|\.kazan\.ru|\.kchr\.ru|\.ke|\.kemerovo\.ru|\.kg
|\.kg\.kr|\.kh|\.kh\.ua|\.khabarovsk\.ru|\.khakassia\.ru|\.kharkov\.ua|\.kherson\.ua|\.khmelnitskiy\.ua|\.khv\.ru|\.ki|\.kids\.us|\.kiev\.ua|\.kirov\.ru|\.kirovograd\.ua|\.km|\.km\.ua|\.kn|\.koenig\.ru|\.komi\.ru|\.kostroma\.ru|\.kp|\.kr|\.kr\.ua|\.krakow\.pl|\.kranoyarsk\.ru|\.ks\.ua|\.kuban\.ru|\.kurgan\.ru|\.kursk\.ru|\.kv\.ua|\.kw|\.ky|\.kz|\.kzn\.school\.za|\.l\.se|\.la|\.landesign\.za|\.law\.za|\.lb|\.lc|\.lea\.uk|\.lel\.br|\.lg\.jp|\.lg\.ua|\.li|\.lipetsk\.ru|\.lk|\.ln\.cn|\.lodz\.pl|\.lp\.school\.za|\.lr|\.ls|\.lt|\.ltd\.cy|\.ltd\.lk|\.ltd\.uk|\.ltd\.ye|\.lu|\.lublin\.pl|\.lugansk\.ua|\.lutsk\.ua|\.lv|\.lviv\.ua|\.ly|\.m\.se|\.ma|\.magadan\.ru|\.maori\.nz|\.mari-el\.ru|\.mari\.ru|\.marine\.ru|\.mat\.br|\.mb\.ca|\.mc|\.md|\.me|\.me\.ke|\.me\.ua|\.me\.uk|\.me\.ye|\.med\.br|\.med\.ec|\.med\.ly|\.med\.om|\.med\.pa|\.med\.sa|\.med\.sd|\.medecin\.km|\.mf|\.mg|\.mh|\.mi\.th|\.mil|\.mil\.ac|\.mil\.ae|\.mil\.al|\.mil\.ar|\.mil\.ba|\.mil\.bo|\.mil\.br|\.mil\.cn|\.mil\.co|\.mil\.do|\.mil\.ec|\.mil\.eg|\.mil\.er|\.mil\.fj|\.mil\.gh|\.mil\.gr|\.mil\.gt|\.mil\.id|\.mil\.in|\.mil\.iq|\.mil\.jo|\.mil\.kh|\.mil\.km|\.mil\.kr|\.mil\.kz|\.mil\.lv|\.mil\.mg|\.mil\.mv|\.mil\.my|\.mil\.ng|\.mil\.ni|\.mil\.np|\.mil\.nz|\.mil\.om|\.mil\.pe|\.mil\.ph|\.mil\.pl|\.mil\.py|\.mil\.qa|\.mil\.ru|\.mil\.rw|\.mil\.st|\.mil\.sy|\.mil\.tj|\.mil\.tt|\.mil\.tw|\.mil\.uk|\.mil\.uy|\.mil\.ve|\.mil\.za|\.mincom\.tn|\.mk|\.mk\.ua|\.ml|\.mm|\.mn|\.mo|\.mob\.ki|\.mobi|\.mobi\.ke|\.mobi\.ng|\.mobi\.tt|\.mod\.uk|\.mordovia\.ru|\.mosreg\.ru|\.mp|\.mpm\.school\.za|\.mq|\.mr|\.ms|\.ms\.kr|\.msk\.ru|\.mt|\.mu|\.muni\.il|\.murmansk\.ru|\.mus\.br|\.museum|\.museum\.mv|\.museum\.mw|\.museum\.om|\.museum\.tt|\.mv|\.mw|\.mx|\.my|\.mz|\.n\.se|\.na|\.nalchik\.ru|\.name|\.name\.ae|\.name\.cy|\.name\.eg|\.name\.et|\.name\.fj|\.name\.jo|\.name\.mk|\.name\.mv|\.name\.my|\.name\.ng|\.name\.pr|\.name\.tj|\.name\.tr|\.name\.tt|\.name\.vn|\.nat\.tn|\.national-library-scotland\.uk|\.nb\.ca|\.nc|\.ncape\.school\.za|\.ne|\.
ne\.jp|\.ne\.ke|\.ne\.kr|\.ne\.pw|\.ne\.tz|\.ne\.ug|\.nel\.uk|\.net|\.net\.ac|\.net\.ae|\.net\.af|\.net\.al|\.net\.ar|\.net\.au|\.net\.ba|\.net\.bb|\.net\.bh|\.net\.bn|\.net\.bo|\.net\.br|\.net\.bs|\.net\.bz|\.net\.ck|\.net\.cn|\.net\.co|\.net\.cy|\.net\.do|\.net\.dz|\.net\.ec|\.net\.eg|\.net\.er|\.net\.et|\.net\.fj|\.net\.fk|\.net\.gg|\.net\.gn|\.net\.gr|\.net\.gt|\.net\.gu|\.net\.hk|\.net\.id|\.net\.il|\.net\.in|\.net\.iq|\.net\.ir|\.net\.je|\.net\.jo|\.net\.kh|\.net\.ki|\.net\.kn|\.net\.kw|\.net\.ky|\.net\.kz|\.net\.lb|\.net\.lk|\.net\.lr|\.net\.lv|\.net\.ly|\.net\.ma|\.net\.me|\.net\.mk|\.net\.ml|\.net\.mo|\.net\.mt|\.net\.mu|\.net\.mv|\.net\.mw|\.net\.mx|\.net\.my|\.net\.nf|\.net\.ng|\.net\.ni|\.net\.np|\.net\.nr|\.net\.nz|\.net\.om|\.net\.pa|\.net\.pe|\.net\.ph|\.net\.pk|\.net\.pl|\.net\.pr|\.net\.ps|\.net\.pt|\.net\.py|\.net\.qa|\.net\.ru|\.net\.rw|\.net\.sa|\.net\.sb|\.net\.sc|\.net\.sd|\.net\.sg|\.net\.sh|\.net\.sl|\.net\.st|\.net\.sy|\.net\.th|\.net\.tj|\.net\.tn|\.net\.tr|\.net\.tt|\.net\.tw|\.net\.ua|\.net\.uk|\.net\.uy|\.net\.ve|\.net\.vi|\.net\.vn|\.net\.ye|\.net\.za|\.net\.zm|\.news\.sy|\.nf|\.nf\.ca|\.ng|\.ngo\.lk|\.ngo\.ph|\.ngo\.pl|\.ngo\.za|\.nhs\.uk|\.ni|\.nic\.in|\.nic\.tj|\.nic\.uk|\.nikolaev\.ua|\.nis\.za|\.nl|\.nl\.ca|\.nls\.uk|\.nm\.cn|\.nnov\.ru|\.no|\.nom\.br|\.nom\.co|\.nom\.es|\.nom\.fk|\.nom\.fr|\.nom\.km|\.nom\.mg|\.nom\.ni|\.nom\.pa|\.nom\.pe|\.nom\.re|\.nom\.ro|\.nom\.sh|\.nom\.za|\.nome\.pt|\.not\.br|\.notaires\.km|\.nov\.ru|\.novosibirsk\.ru|\.np|\.nr|\.ns\.ca|\.nsk\.ru|\.nsn\.us|\.nt\.ca|\.nt\.ro|\.ntr\.br|\.nu|\.nu\.ca|\.nw\.school\.za|\.nx\.cn|\.nz|\.o\.se|\.od\.ua|\.odessa\.ua|\.odo\.br|\.og\.ao|\.olivetti\.za|\.olsztyn\.pl|\.om|\.omsk\.ru|\.on\.ca|\.or\.at|\.or\.cr|\.or\.id|\.or\.jp|\.or\.ke|\.or\.kr|\.or\.mu|\.or\.pw|\.or\.th|\.or\.tz|\.or\.ug|\.orenburg\.ru|\.org|\.org\.ac|\.org\.ae|\.org\.af|\.org\.al|\.org\.ar|\.org\.au|\.org\.ba|\.org\.bb|\.org\.bh|\.org\.bn|\.org\.bo|\.org\.br|\.org\.bs|\.org\.bz|\.org\.ck|\.org\.cn|\.or
g\.co|\.org\.cy|\.org\.do|\.org\.dz|\.org\.ec|\.org\.eg|\.org\.er|\.org\.es|\.org\.et|\.org\.fj|\.org\.fk|\.org\.gg|\.org\.gh|\.org\.gn|\.org\.gr|\.org\.gt|\.org\.gu|\.org\.hk|\.org\.il|\.org\.in|\.org\.iq|\.org\.ir|\.org\.je|\.org\.jo|\.org\.kh|\.org\.ki|\.org\.kn|\.org\.kw|\.org\.ky|\.org\.kz|\.org\.lb|\.org\.lk|\.org\.lr|\.org\.lv|\.org\.ly|\.org\.ma|\.org\.me|\.org\.mg|\.org\.mk|\.org\.ml|\.org\.mn|\.org\.mo|\.org\.mt|\.org\.mu|\.org\.mv|\.org\.mw|\.org\.mx|\.org\.my|\.org\.mz|\.org\.ng|\.org\.ni|\.org\.np|\.org\.nr|\.org\.nz|\.org\.om|\.org\.pa|\.org\.pe|\.org\.ph|\.org\.pk|\.org\.pl|\.org\.pr|\.org\.ps|\.org\.pt|\.org\.py|\.org\.qa|\.org\.ro|\.org\.rs|\.org\.ru|\.org\.sa|\.org\.sb|\.org\.sc|\.org\.sd|\.org\.se|\.org\.sg|\.org\.sh|\.org\.sl|\.org\.sn|\.org\.st|\.org\.sv|\.org\.sy|\.org\.sz|\.org\.tj|\.org\.tn|\.org\.tr|\.org\.tt|\.org\.tw|\.org\.ua|\.org\.ug|\.org\.uk|\.org\.uy|\.org\.ve|\.org\.vi|\.org\.vn|\.org\.ye|\.org\.yu|\.org\.za|\.org\.zm|\.orgn\.uk|\.oryol\.ru|\.other\.nf|\.p\.se|\.pa|\.parliament\.cy|\.parliament\.nz|\.parliament\.uk|\.parti\.se|\.pb\.ao|\.pe|\.pe\.ca|\.pe\.kr|\.penza\.ru|\.per\.kh|\.per\.nf|\.per\.sg|\.perm\.ru|\.perso\.sn|\.perso\.tn|\.pf|\.pg|\.ph|\.pharmaciens\.km|\.pix\.za|\.pk|\.pl|\.pl\.ua|\.plc\.ly|\.plc\.uk|\.plc\.ye|\.plo\.ps|\.pm|\.pn|\.pol\.dz|\.pol\.tr|\.police\.uk|\.poltava\.ua|\.post|\.poznan\.pl|\.pp\.ru|\.pp\.se|\.pp\.ua|\.ppg\.br|\.pr|\.prd\.fr|\.prd\.mg|\.press\.cy|\.press\.ma|\.press\.se|\.presse\.fr|\.presse\.km|\.presse\.ml|\.principe\.st|\.priv\.me|\.pro|\.pro\.ae|\.pro\.br|\.pro\.cy|\.pro\.ec|\.pro\.fj|\.pro\.mk|\.pro\.mv|\.pro\.om|\.pro\.pr|\.pro\.tt|\.pro\.vn|\.prof\.pr|\.ps|\.psc\.br|\.psi\.br|\.pskov\.ru|\.pt|\.ptz\.ru|\.pub\.sa|\.publ\.pt|\.pw|\.pwr\.pl|\.py|\.qa|\.qc\.ca|\.qh\.cn|\.qsl\.br|\.r\.se|\.radom\.pl|\.re|\.re\.kr|\.rec\.br|\.rec\.nf|\.rec\.ro|\.red\.sv|\.res\.in|\.rnd\.ru|\.rnrt\.tn|\.rns\.tn|\.rnu\.tn|\.ro|\.rochest\.er|\.rovno\.ua|\.rs|\.rs\.ba|\.ru|\.rv\.ua|\.rw|\.ryazan\.ru|\.s\.se|\.sa|\.sa
\.cr|\.sakhalin\.ru|\.samara\.ru|\.saotome\.st|\.saratov\.ru|\.sb|\.sc|\.sc\.cn|\.sc\.ke|\.sc\.kr|\.sc\.ug|\.sch\.ae|\.sch\.id|\.sch\.ir|\.sch\.jo|\.sch\.lk|\.sch\.ly|\.sch\.my|\.sch\.ng|\.sch\.om|\.sch\.sa|\.sch\.uk|\.sch\.zm|\.school\.nz|\.school\.za|\.sci\.eg|\.scot\.uk|\.sd|\.sd\.cn|\.se|\.sebastopol\.ua|\.sec\.ps|\.seoul\.kr|\.sg|\.sh|\.sh\.cn|\.si|\.simbirsk\.ru|\.sj|\.sk|\.sk\.ca|\.sl|\.sld\.do|\.sld\.pa|\.sld\.pe|\.slg\.br|\.slupsk\.pl|\.sm|\.smolensk\.ru|\.sn|\.sn\.cn|\.so|\.soc\.lk|\.soc\.uk|\.spb\.ru|\.sr|\.srv\.br|\.ss|\.st|\.stavropol\.ru|\.store\.bb|\.store\.nf|\.store\.ro|\.store\.st|\.stv\.ru|\.su|\.sumy\.ua|\.surgut\.ru|\.sv|\.sx|\.sx\.cn|\.sy|\.sz|\.szczecin\.pl|\.t\.se|\.tambov\.ru|\.tatarstan\.ru|\.tc|\.td|\.te\.ua|\.tel|\.tel\.ki|\.tel\.tr|\.tel\.tt|\.ternopil\.ua|\.test\.tj|\.tf|\.tg|\.th|\.tj|\.tj\.cn|\.tk|\.tl|\.tm|\.tm\.cy|\.tm\.fr|\.tm\.km|\.tm\.mc|\.tm\.mg|\.tm\.ro|\.tm\.se|\.tm\.za|\.tmp\.br|\.tn|\.to|\.tom\.ru|\.tomsk\.ru|\.torun\.pl|\.tourism\.tn|\.tp|\.tr|\.travel|\.travel\.tt|\.trd\.br|\.tsaritsyn\.ru|\.tsk\.ru|\.tsk\.tr|\.tt|\.tula\.ru|\.tur\.ar|\.tur\.br|\.tuva\.ru|\.tv|\.tv\.bb|\.tv\.bo|\.tv\.br|\.tv\.sd|\.tv\.tr|\.tver\.ru|\.tw|\.tw\.cn|\.tyumen\.ru|\.tz|\.u\.se|\.ua|\.udm\.ru|\.udmurtia\.ru|\.ug|\.uk|\.ulan-ude\.ru|\.ulsan\.kr|\.um|\.unbi\.ba|\.univ\.sn|\.unmo\.ba|\.unsa\.ba|\.untz\.ba|\.unze\.ba|\.us|\.uy|\.uz|\.uzhgorod\.ua|\.va|\.vc|\.ve|\.vet\.br|\.veterinaire\.km|\.vg|\.vi|\.vinnica\.ua|\.vladikavkaz\.ru|\.vladimir\.ru|\.vladivostok\.ru|\.vlog\.br|\.vn|\.vn\.ua|\.volgograd\.ru|\.vologda\.ru|\.voronezh\.ru|\.vrn\.ru|\.vu|\.vyatka\.ru|\.w\.er|\.w\.se|\.war\.net\.id|\.warszawa\.pl|\.waw\.pl|\.wcape\.school\.za|\.web\.do|\.web\.id|\.web\.lk|\.web\.nf|\.web\.pk|\.web\.tj|\.web\.tr|\.web\.ve|\.web\.za|\.wf|\.wiki\.br|\.wroc\.pl|\.wroclaw\.pl|\.ws|\.www\.ro|\.x\.se|\.xj\.cn|\.xxx|\.xz\.cn|\.y\.se|\.yakutia\.ru|\.yamal\.ru|\.ye|\.yekaterinburg\.ru|\.yk\.ca|\.yn\.cn|\.yt|\.yuzhno-sakhalinsk\.ru|\.z\.se|\.za|\.zaporizhzhe\.ua|\.zgora\
.pl|\.zhitomir\.ua|\.zj\.cn|\.zlg\.br|\.zm|\.zp\.ua|\.zt\.ua|\.zw)$/ at wilcoxon2.pl line 13, <IN> line 1.

It seems richrumble was able to run it OK.  Any ideas why I'm getting the above error?  My version of Perl is 5.10.1.

Thanks.
tel2
0
 
LVL 11

Expert Comment

by:tel2
ID: 39626019
PS: Having had a further look at it, it seems to be a missing ")" in line 11, but I'm not sure where it's meant to go.  Did you have to change it to get it to work, richrumble?
0
 
LVL 38

Author Comment

by:Rich Rumble
ID: 39626152
Yes I did. Well, my coworker finally had a chance to; I'll see if he still has it.
-rich
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39626386
Oops.  Yep.  That line is missing a ).  It should go right at the end just before the $.
0
 
LVL 11

Expert Comment

by:tel2
ID: 39626703
Is this what you mean, wilcoxon?

$ cat wilcoxon2.pl
use strict;
use warnings;
my $fil = shift or die "Usage: $0 inputfile\n";
open IN, 'valid-domains.txt' or die "could not open valid-domains.txt: $!";
my @list = map { chomp; s{\.}{\\.}g; $_ } <IN>;
close IN;
my $rx = join '|', @list;
open IN, $fil or die "could not open $fil: $!";
while (<IN>) {
    chomp;
    if (s{^.*([^.]+\.(?:$rx))$}{$1}) { # lazy regex - could replace .* with valid char class
        print $_, "\n";
    } else {
        warn "could not match a TLD in $_";
    }
}
When I run that, I get this:

$ perl wilcoxon2.pl my-sites.txt
could not match a TLD in subdomain0.subdomain2.subdomain3.example.co.jp at wilcoxon2.pl line 14, <IN> line 1.
could not match a TLD in subdomain1.subdomain2.subdomain3.example.co.jp at wilcoxon2.pl line 14, <IN> line 2.
could not match a TLD in subdomain1.subdomain2.example.co.uk at wilcoxon2.pl line 14, <IN> line 3.
could not match a TLD in subdomain1.subdomain2.example.com.mx at wilcoxon2.pl line 14, <IN> line 4.
could not match a TLD in subdomain.example.gov.tx at wilcoxon2.pl line 14, <IN> line 5.
could not match a TLD in subdomain.example.org at wilcoxon2.pl line 14, <IN> line 6.
could not match a TLD in example.info at wilcoxon2.pl line 14, <IN> line 7.
Is that what you got?
It's not correct, is it?
Or am I doing something wrong?
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39627615
Oops.  Lots of little mistakes in that one regex.  The important change is removing \. inside the match but I also changed .* to hopefully make it more efficient.
if (s{^(?:.+\.)?([^.]+(?:$rx))$}{$1}) {
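
To illustrate why the extra \. matters: the entries in the TLD list already start with a dot, so a pattern that adds its own \. in front of the alternation demands two dots in a row. The sketch below uses a small hypothetical sample of the list and quotemeta to do the escaping; the variable names are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical sample of valid-domains.txt; entries keep their leading dot.
my @tlds = qw(.org .info .co.uk);

# quotemeta escapes the dots, so ".co.uk" becomes "\.co\.uk".
my $rx = join '|', map { quotemeta } @tlds;

my $with_extra_dot = qr/^(?:.+\.)?([^.]+\.(?:$rx))$/;   # would need "example..org"
my $without_dot    = qr/^(?:.+\.)?([^.]+(?:$rx))$/;     # matches "example.org"

for my $host (qw(subdomain.example.org example.info)) {
    my $old = $host =~ $with_extra_dot ? 'match' : 'no match';
    my $new = $host =~ $without_dot    ? $1      : 'no match';
    print "$host: extra dot -> $old, without -> $new\n";
}
```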


0

 
LVL 11

Expert Comment

by:tel2
ID: 39628423
Thanks wilcoxon.

I've tried that and this is the new output:
co.jp
co.jp
co.uk
com.mx
could not match a TLD in subdomain.example.gov.tx at wilcoxon3.pl line 14, <IN> line 5.
example.org
example.info
The kind of output that Rich has requested in his first post is:
Example.info
Example.org
Example.co.uk
Example.co.jp
Example.gov.tx    [A mistake I guess, since "gov.tx" is not in valid-domains.txt]
Example.com.mx
Your first 3 lines of output are TLDs, but based on his sample output, Rich needed websites.

Also, although it wasn't explicitly requested, since Rich seems to have a lot of subdomains, your current code would give a lot of duplicates in its output.  For example, if this was the input:
subdomain1.domain1.co.uk
subdomain2.domain1.co.uk
Your code would give 2 lines of output instead of 1.
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39628554
Ah.  I missed that he was listing .co.jp and .jp as TLDs.  It may take a very carefully crafted regex to work in this case.  However, I think valid-domains.txt is in error as, as far as I'm aware (though I certainly could be wrong not having dealt with international domains much), you'll never have just site.jp (it will always be site.somethingstandard.jp).

As a quick try, what happens if I make everything non-greedy?
if (s{^(?:.+?\.)?([^.]+?(?:$rx))$}{$1}) {
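
A small sketch of the difference, assuming a hypothetical sample list in which both ".co.jp" and ".jp" are present. The greedy prefix swallows as many labels as it can, leaving only the shortest suffix that still matches; the non-greedy version stops at the first position from which a listed suffix reaches the end of the string. The sub names are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical sample list in which both ".co.jp" and ".jp" are valid suffixes.
my $rx = join '|', map { quotemeta } qw(.co.jp .jp .org);

sub greedy_match {
    my ($host) = @_;
    return $host =~ /^(?:.+\.)?([^.]+(?:$rx))$/ ? $1 : undef;
}

sub lazy_match {
    my ($host) = @_;
    return $host =~ /^(?:.+?\.)?([^.]+?(?:$rx))$/ ? $1 : undef;
}

my $host = 'subdomain1.example.co.jp';
print "greedy: ", greedy_match($host), "\n";   # co.jp
print "lazy:   ", lazy_match($host), "\n";     # example.co.jp
```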


0
 
LVL 11

Expert Comment

by:tel2
ID: 39628859
> However, I think valid-domains.txt is in error as, as far as I'm aware (though I certainly could be wrong not having dealt with international domains much), you'll never have just site.jp (it will always be site.somethingstandard.jp).
I don't know about Japan or other countries, but for New Zealand, domain.nz is being proposed now:
    www.stuff.co.nz/technology/digital-living/9273374/Websites-simpler-with-a-nz
I also note that the registrar OnlyDomains.com (and probably others) sells China domains .cn, .com.cn & .cn.com.

Did you test your original script before posting it, wilcoxon?  If not, errors on only 1 line is not bad for an untested script of that size.

With your latest fix, your script sample output is now looking good:
example.co.jp
example.co.jp
example.co.uk
example.com.mx
could not match a TLD in subdomain.example.gov.tx at wilcoxon4.pl line 14, <IN> line 5.
example.org
example.info
It will still print duplicates, of course, which could be hundreds in Rich's case.
0
 
LVL 11

Expert Comment

by:tel2
ID: 39628958
But all this begs the question, Rich:
How did you (or your co-worker) get wilcoxon's solution to work?  And even if it did work, didn't it give you hundreds of duplicates because of the subdomains?
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39629006
Honestly, no I didn't test it before posting.  I'm often not where I can easily get to a perl instance to test.

A modified version to omit dupes is easy...
use strict;
use warnings;
my $fil = shift or die "Usage: $0 inputfile\n";
open IN, 'valid-domains.txt' or die "could not open valid-domains.txt: $!";
my @list = map { chomp; s{\.}{\\.}g; $_ } <IN>;
close IN;
my $rx = join '|', @list;
open IN, $fil or die "could not open $fil: $!";
my %domain;
while (<IN>) {
    chomp;
    if (s{^(?:.+?\.)?([^.]+?(?:$rx))$}{$1}) {
        $domain{$_}++;
    } else {
        warn "could not match a TLD in $_";
    }
}
print $_, "\n" for sort keys %domains; # could add a custom sort if desired


0
 
LVL 11

Expert Comment

by:tel2
ID: 39629023
Fair enough, wilcoxon,

> A modified version to omit dupes is easy...
I know it's easy.  My code does it.  However, your code now fails with this error:
Global symbol "%domains" requires explicit package name at wilcoxon5.pl line 18.
Execution of wilcoxon5.pl aborted due to compilation errors.
0
 
LVL 38

Author Comment

by:Rich Rumble
ID: 39629137
First:
.jp is valid for consumers. Like many country codes, it's the TOP level domain, and other prefixes are added to it too, like co.uk, or gov.tx for the Texas government sites.
http://www.iana.org/domains/root/db (google.jp is reserved but not available, google.co.jp is in use by google)
I did not have all the XX.gov domains as we don't really have any, and they are for state-run domains. Dot-gov is its own registrar; Verisign or GoDaddy can't sell you one :)

My co-worker said she fixed the error in the first post, but used the second since I sent them in the same email to her. She said either worked, and piped them through sort|uniq anyway by force of habit :) She just didn't have time to address the issue, but quickly sorted it out once there was something to work with first.
-rich
0
 
LVL 11

Expert Comment

by:tel2
ID: 39629167
OK, thanks for that, Rich.
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39629184
Sigh.  That's what I get for quickly editing code that I haven't tested.  The %domains in the final print/for line should be %domain...
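
For reference, a stripped-down sketch of the de-duplication step on its own, using one hash name throughout. The @stripped list is a hypothetical stand-in for domains that have already had their subdomains removed.

```perl
use strict;
use warnings;

# Hypothetical list of already-stripped domains, including repeats.
my @stripped = qw(example.co.uk domain1.co.uk example.co.uk example.org domain1.co.uk);

# Count each domain once. With "use strict" in force, a misspelled name such as
# %domains is rejected at compile time rather than silently printing nothing.
my %domain;
$domain{$_}++ for @stripped;

my @unique = sort keys %domain;
print "$_\n" for @unique;
```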
0
 
LVL 11

Expert Comment

by:tel2
ID: 39629195
Oh.  I should have noticed that.  Here's the output:

could not match a TLD in subdomain.example.gov.tx at wilcoxon5.pl line 15, <IN> line 4.
domain1.co.uk
example.co.jp
example.co.uk
example.com.mx
example.info
example.org
I think you've "sorted" it (if you'll excuse the pun).
0
