Solved

Script to find Top Level Domains Only

Posted on 2013-11-05
Last Modified: 2013-11-06
I've got hundreds of domains all over the world, under various TLDs and ccTLDs, plus a list of their subdomains (thousands). I want a script that matches each entry against the list of TLDs and cuts any subdomains off the front of the input...

Domain list:
subdomain1.subdomain2.subdomain3.example.co.jp
subdomain1.subdomain2.example.co.uk
subdomain1.subdomain2.example.com.mx
subdomain.example.gov.tx
subdomain.example.org
example.info

TLDs (for this example; see the attachment for the full list):
.org
.info
.gov.tx
.com.mx
.co.uk
.co.jp

and so on. Sometimes I've got subdomains, other times not, so regex *seems* out of the question because the number of dots in an entry varies widely (anywhere from 2 to 6 dots) in the domain list.
So I was thinking: read each line of the domain list (above), match a TLD at the end and put that aside, match anything left of the TLD up to one dot or the beginning of the line (if no dot is found), and then combine the two into one whole TLD.
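
The steps described above can be sketched without one large regex by splitting each entry on dots and looking up candidate suffixes in a hash, longest first. This is only a minimal sketch under stated assumptions: `registrable_domain` is a hypothetical helper name, the suffix list is a small inline sample rather than the attached file, and suffixes are stored without their leading dot.

```perl
use strict;
use warnings;

# Hypothetical helper: return one label plus the longest known suffix,
# or nothing when no suffix in %$suffixes matches.
sub registrable_domain {
    my ($host, $suffixes) = @_;
    my @labels = split /\./, lc $host;
    # Try the longest candidate suffix first, so "co.jp" wins over "jp".
    for my $i (1 .. $#labels) {
        my $suffix = join '.', @labels[$i .. $#labels];
        return "$labels[$i - 1].$suffix" if $suffixes->{$suffix};
    }
    return;
}

# Sample suffixes from the question, stored without the leading dot
# (an assumption; the real list would be read from the attachment).
my %suffixes = map { $_ => 1 } qw(org info gov.tx com.mx co.uk co.jp jp);

for my $host (qw(subdomain1.subdomain2.subdomain3.example.co.jp
                 subdomain.example.org example.info)) {
    my $domain = registrable_domain($host, \%suffixes);
    print defined $domain ? "$domain\n" : "no suffix matched for $host\n";
}
```

Because the lookup starts from the longest suffix, the number of dots in an entry does not matter, and an entry with no known suffix is reported instead of silently dropped.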

Again, I've got thousands of subdomains and hundreds of TLDs. Our registrar is a mess because it won't let us export just the TLDs :( For some reason we can only export the DNS records (and we want to leave this registrar).

I am attaching a list of valid domains (ccTLD and gTLD). I'd like the script to read from that and from the other file, strip off any subdomains, and leave me with just TLDs like
Example.info
Example.org
Example.co.uk
Example.co.jp
Example.gov.tx
Example.com.mx
-rich
valid-domains.txt
Question by:Rich Rumble
20 Comments
 
LVL 26

Assisted Solution

by:wilcoxon
wilcoxon earned 1000 total points
ID: 39625718
I think it's as simple as this...
use strict;
use warnings;
my $fil = shift or die "Usage: $0 inputfile\n";
open IN, 'valid-domains.txt' or die "could not open valid-domains.txt: $!";
my @list = map { chomp; s{\.}{\\.}g; $_ } <IN>;
close IN;
my $rx = join '|', @list;
open IN, $fil or die "could not open $fil: $!";
while (<IN>) {
    chomp;
    if (s{^.*([^.]+\.(?:$rx)$}{$1}) { # lazy regex - could replace .* with valid char class
        print $_, "\n";
    } else {
        warn "could not match a TLD in $_";
    }
}

 
LVL 12

Accepted Solution

by:
tel2 earned 1000 total points
ID: 39625800
Hi richrumble,

When I put your sample domain list (including subdomains) in my-sites.txt, then run this:
#!/usr/bin/perl

open(TLD, '<valid-domains.txt') or die "Can't open TLD file: $!";
while (<TLD>)
{
    chomp;
    s/^\.//;
    $tld{$_} ++;
}

open(SITE, '<my-sites.txt') or die "Can't open SITE file: $!";
while (<SITE>)
{
    chomp;
    $site = $_;
    while (s/^([^.]+)\.(.+)/$2/)
    {
        if ($tld{$2})
        {
            print "$1.$2\n" unless $seen{"$1.$2"};
            $seen{"$1.$2"} ++;
            last;
        }
    }
}

I get this output:
example.co.jp
example.co.uk
example.com.mx
example.org
example.info
Is that what you want?
 
LVL 38

Author Closing Comment

by:Rich Rumble
ID: 39625827
Wow, that was fast, and both work equally well, ty both!
LVL 12

Expert Comment

by:tel2
ID: 39625834
Line 15 can be removed from mine, rich:
    $site = $_;
It was just there for testing.

It looks as if wilcoxon's should tell you if no TLD was matched, which is good.  I can change mine to do that if needed.
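
A hedged sketch of that change, keeping the same strip-one-label-at-a-time loop but recording entries for which no TLD was found. The inline sample lists stand in for valid-domains.txt and my-sites.txt, and the `$matched` flag and `@unmatched` array are illustrative names, not part of the posted script.

```perl
use strict;
use warnings;

# Sample data standing in for valid-domains.txt and my-sites.txt.
my %tld   = map { $_ => 1 } qw(org info com.mx co.uk co.jp jp uk);
my @sites = qw(sub.example.co.uk example.info sub.example.gov.tx);

my (%seen, @found, @unmatched);
for my $site (@sites) {
    my $rest    = $site;
    my $matched = 0;
    # Strip one leading label at a time until the remainder is a known TLD.
    while ($rest =~ s/^([^.]+)\.(.+)/$2/) {
        my ($label, $suffix) = ($1, $2);
        next unless $tld{$suffix};
        push @found, "$label.$suffix" unless $seen{"$label.$suffix"}++;
        $matched = 1;
        last;
    }
    push @unmatched, $site unless $matched;
}

print "$_\n" for @found;
warn "could not match a TLD in $_\n" for @unmatched;
```

Collecting the misses in a separate array keeps the normal output clean while still surfacing entries such as the gov.tx one, whose suffix is not in the list.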
 
LVL 12

Expert Comment

by:tel2
ID: 39626012
Hi wilcoxon.

Nice looking script.

I was about half way through my solution when you posted yours, so I was planning to abandon my attempt, but I couldn't get yours to work so I continued.

Here's the error I get when I run yours:

Unmatched ( in regex; marked by <-- HERE in m/^.*( <-- HERE [^.]+\.(?:\.a\.se|\.ab\.ca|\.abo\.pa|\.ac|\.ac\.ae|\.ac\.at|\.ac\.cn|\.ac\.cr|\.ac\.cy|\.ac\.fj|\.ac\.fk|\.ac\.gn|\.ac\.id|\.ac\.il|\.ac\.in|\.ac\.ir|\.ac\.jp|\.ac\.ke|\.ac\.kr|\.ac\.ma|\.ac\.me|\.ac\.mu|\.ac\.mw|\.ac\.mz|\.ac\.ni|\.ac\.nz|\.ac\.om|\.ac\.pa|\.ac\.pr|\.ac\.rs|\.ac\.ru|\.ac\.rw|\.ac\.se|\.ac\.sz|\.ac\.th|\.ac\.tj|\.ac\.tz|\.ac\.ug|\.ac\.uk|\.ac\.vn|\.ac\.yu|\.ac\.za|\.ac\.zm|\.ad|\.ad\.jp|\.adm\.br|\.adv\.br|\.adygeya\.ru|\.ae|\.aero|\.aero\.mv|\.aero\.tt|\.af|\.ag|\.agr\.br|\.agric\.za|\.agrinet\.tn|\.ah\.cn|\.ai|\.al|\.alt\.za|\.altai\.ru|\.am|\.am\.br|\.amur\.ru|\.an|\.ao|\.aq|\.ar|\.arkhangelsk\.ru|\.arpa|\.arq\.br|\.art\.br|\.art\.do|\.art\.dz|\.art\.pl|\.art\.sn|\.arts\.nf|\.arts\.ro|\.as|\.asia|\.asn\.au|\.asn\.lv|\.assn\.lk|\.asso\.dz|\.asso\.fr|\.asso\.km|\.asso\.mc|\.asso\.re|\.astrakhan\.ru|\.at|\.ato\.br|\.au|\.av\.tr|\.aw|\.ax|\.az|\.b\.br|\.b\.se|\.ba|\.bashkiria\.ru|\.bb|\.bbs\.tr|\.bc\.ca|\.bd|\.bd\.se|\.be|\.bel\.tr|\.belau\.pw|\.belgorod\.ru|\.bf|\.bg|\.bh|\.bi|\.bialystok\.pl|\.bio\.br|\.bir\.ru|\.biz|\.biz\.bb|\.biz\.bh|\.biz\.ck|\.biz\.cy|\.biz\.et|\.biz\.fj|\.biz\.ki|\.biz\.mv|\.biz\.ng|\.biz\.nr|\.biz\.om|\.biz\.pk|\.biz\.pl|\.biz\.pr|\.biz\.tj|\.biz\.tr|\.biz\.tt|\.biz\.ua|\.biz\.vn|\.bj|\.bj\.cn|\.bl|\.bl\.uk|\.blog\.br|\.bm|\.bmd\.br|\.bn|\.bo|\.bourse\.za|\.bq|\.br|\.british-library\.uk|\.bryansk\.ru|\.bs|\.bt|\.buryatia\.ru|\.busan\.kr|\.bv|\.bw|\.by|\.bz|\.c\.se|\.ca|\.cat|\.cat\.tt|\.cbg\.ru|\.cc|\.cc\.bh|\.cd|\.cf|\.cg|\.ch|\.chel\.ru|\.chelyabinsk\.ru|\.cherkassy\.ua|\.chernigov\.ua|\.chernovtsy\.ua|\.chita\.ru|\.chukotka\.ru|\.chungbuk\.kr|\.chungnam\.kr|\.chuvashia\.ru|\.ci|\.cim\.br|\.city\.za|\.ck|\.ck\.ua|\.cl|\.club\.tw|\.cm|\.cn|\.cn\.ua|\.cng\.br|\.cnt\.br|\.co|\.co\.ae|\.co\.ao|\.co\.at|\.co\.ba|\.co\.bb|\.co\.ck|\.co\.cr|\.co\.fk|\.co\.gg|\.co\.id|\.co\.il|\.co\.in|\.co\.ir|\.co\.je|\.co\.jp|\.co\.ke|\.co\.kr|\.co\.ma|\.co\.me|\.co\.mu|\.co\.mw|\.co\.
mz|\.co\.na|\.co\.ni|\.co\.nz|\.co\.om|\.co\.pw|\.co\.rs|\.co\.rw|\.co\.sh|\.co\.st|\.co\.sz|\.co\.th|\.co\.tj|\.co\.tt|\.co\.tz|\.co\.ua|\.co\.ug|\.co\.uk|\.co\.ve|\.co\.vi|\.co\.ye|\.co\.yu|\.co\.za|\.co\.zm|\.com|\.com\.ac|\.com\.af|\.com\.al|\.com\.ar|\.com\.au|\.com\.ba|\.com\.bb|\.com\.bh|\.com\.bn|\.com\.bo|\.com\.br|\.com\.bs|\.com\.bz|\.com\.cn|\.com\.co|\.com\.cy|\.com\.do|\.com\.dz|\.com\.ec|\.com\.eg|\.com\.er|\.com\.es|\.com\.et|\.com\.fj|\.com\.fr|\.com\.gh|\.com\.gn|\.com\.gr|\.com\.gt|\.com\.gu|\.com\.hk|\.com\.iq|\.com\.jo|\.com\.kh|\.com\.ki|\.com\.km|\.com\.kw|\.com\.ky|\.com\.kz|\.com\.lb|\.com\.lk|\.com\.lr|\.com\.lv|\.com\.ly|\.com\.mg|\.com\.mk|\.com\.ml|\.com\.mo|\.com\.mt|\.com\.mu|\.com\.mv|\.com\.mw|\.com\.mx|\.com\.my|\.com\.na|\.com\.nf|\.com\.ng|\.com\.ni|\.com\.np|\.com\.nr|\.com\.om|\.com\.pa|\.com\.pe|\.com\.ph|\.com\.pk|\.com\.pl|\.com\.pr|\.com\.ps|\.com\.pt|\.com\.py|\.com\.qa|\.com\.re|\.com\.ro|\.com\.ru|\.com\.rw|\.com\.sa|\.com\.sb|\.com\.sc|\.com\.sd|\.com\.sg|\.com\.sh|\.com\.sl|\.com\.sn|\.com\.st|\.com\.sv|\.com\.sy|\.com\.tj|\.com\.tn|\.com\.tr|\.com\.tt|\.com\.tw|\.com\.ua|\.com\.uy|\.com\.ve|\.com\.vi|\.com\.vn|\.com\.ye|\.com\.zm|\.conf\.lv|\.consulado\.st|\.coop|\.coop\.br|\.coop\.km|\.coop\.mv|\.coop\.mw|\.coop\.tt|\.cq\.cn|\.cr|\.cri\.nz|\.crimea\.ua|\.csiro\.au|\.cu|\.cv|\.cv\.ua|\.cw|\.cx|\.cy|\.cybernet\.za|\.cym\.uk|\.cz|\.d\.se|\.daegu\.kr|\.daejeon\.kr|\.dagestan\.ru|\.db\.za|\.de|\.de\.ki|\.defense\.tn|\.dj|\.dk|\.dm|\.dn\.ua|\.dnepropetrovsk\.ua|\.dni\.us|\.dnssec\.ir|\.do|\.donetsk\.ua|\.dp\.ua|\.dr\.tr|\.dz|\.e-burg\.ru|\.e\.se|\.ebiz\.tw|\.ec|\.ecape\.school\.za|\.ecn\.br|\.ed\.ao|\.ed\.cr|\.ed\.jp|\.ed\.pw|\.edu|\.edu\.af|\.edu\.al|\.edu\.ar|\.edu\.au|\.edu\.ba|\.edu\.bb|\.edu\.bh|\.edu\.bn|\.edu\.bo|\.edu\.br|\.edu\.bs|\.edu\.bz|\.edu\.ck|\.edu\.cn|\.edu\.co|\.edu\.do|\.edu\.dz|\.edu\.ec|\.edu\.eg|\.edu\.er|\.edu\.es|\.edu\.et|\.edu\.gh|\.edu\.gr|\.edu\.gt|\.edu\.gu|\.edu\.hk|\.edu\.in|\.edu\.iq|\.edu\.
it|\.edu\.jo|\.edu\.kh|\.edu\.ki|\.edu\.km|\.edu\.kn|\.edu\.kw|\.edu\.ky|\.edu\.kz|\.edu\.lb|\.edu\.lk|\.edu\.lr|\.edu\.lv|\.edu\.ly|\.edu\.me|\.edu\.mg|\.edu\.mk|\.edu\.ml|\.edu\.mn|\.edu\.mo|\.edu\.mt|\.edu\.mv|\.edu\.mw|\.edu\.mx|\.edu\.my|\.edu\.mz|\.edu\.ng|\.edu\.ni|\.edu\.np|\.edu\.nr|\.edu\.om|\.edu\.pa|\.edu\.pe|\.edu\.ph|\.edu\.pk|\.edu\.pl|\.edu\.pr|\.edu\.ps|\.edu\.pt|\.edu\.py|\.edu\.qa|\.edu\.rs|\.edu\.ru|\.edu\.rw|\.edu\.sa|\.edu\.sb|\.edu\.sc|\.edu\.sd|\.edu\.sg|\.edu\.sh|\.edu\.sl|\.edu\.sn|\.edu\.st|\.edu\.sv|\.edu\.sy|\.edu\.tj|\.edu\.tr|\.edu\.tt|\.edu\.tw|\.edu\.ua|\.edu\.uy|\.edu\.ve|\.edu\.vn|\.edu\.yu|\.edu\.za|\.edu\.zm|\.edunet\.tn|\.ee|\.eg|\.eh|\.ekloges\.cy|\.embaixada\.st|\.eng\.br|\.ens\.tn|\.er|\.ernet\.in|\.es|\.es\.kr|\.esp\.br|\.est\.pr|\.et|\.etc\.br|\.eti\.br|\.eu|\.eun\.eg|\.f\.se|\.fam\.pk|\.far\.br|\.fed\.us|\.fi|\.fi\.cr|\.fin\.ec|\.fin\.tn|\.firm\.in|\.firm\.nf|\.firm\.ro|\.fj|\.fj\.cn|\.fk|\.flog\.br|\.fm|\.fm\.br|\.fnd\.br|\.fo|\.fot\.br|\.fr|\.fs\.school\.za|\.fst\.br|\.g\.se|\.g12\.br|\.ga|\.game\.tw|\.gangwon\.kr|\.gb|\.gd|\.gd\.cn|\.gda\.pl|\.gdansk\.pl|\.ge|\.geek\.nz|\.gen\.ck|\.gen\.in|\.gen\.nz|\.gen\.tr|\.gf|\.gg|\.ggf\.br|\.gh|\.gi|\.gl|\.gm|\.gn|\.go\.cr|\.go\.id|\.go\.jp|\.go\.ke|\.go\.kr|\.go\.pw|\.go\.th|\.go\.tj|\.go\.tz|\.go\.ug|\.gob\.ar|\.gob\.bo|\.gob\.do|\.gob\.es|\.gob\.gt|\.gob\.mx|\.gob\.ni|\.gob\.pa|\.gob\.pe|\.gob\.pk|\.gob\.sv|\.gob\.ve|\.gok\.pk|\.gon\.pk|\.gop\.pk|\.gorzow\.pl|\.gos\.pk|\.gouv\.fr|\.gouv\.km|\.gouv\.rw|\.gouv\.sn|\.gov|\.gov\.ac|\.gov\.ae|\.gov\.af|\.gov\.al|\.gov\.ar|\.gov\.au|\.gov\.ba|\.gov\.bb|\.gov\.bh|\.gov\.bn|\.gov\.bo|\.gov\.br|\.gov\.bs|\.gov\.bz|\.gov\.ck|\.gov\.cn|\.gov\.co|\.gov\.cy|\.gov\.do|\.gov\.dz|\.gov\.ec|\.gov\.eg|\.gov\.er|\.gov\.et|\.gov\.fk|\.gov\.gh|\.gov\.gn|\.gov\.gr|\.gov\.gu|\.gov\.hk|\.gov\.il|\.gov\.in|\.gov\.iq|\.gov\.ir|\.gov\.it|\.gov\.jo|\.gov\.kh|\.gov\.ki|\.gov\.kn|\.gov\.kw|\.gov\.ky|\.gov\.kz|\.gov\.lb|\.gov\.lk|\.gov\.lr|\.gov\.lv|\.gov\.l
y|\.gov\.ma|\.gov\.me|\.gov\.mg|\.gov\.mk|\.gov\.ml|\.gov\.mn|\.gov\.mo|\.gov\.mt|\.gov\.mu|\.gov\.mv|\.gov\.mw|\.gov\.my|\.gov\.mz|\.gov\.ng|\.gov\.np|\.gov\.nr|\.gov\.om|\.gov\.ph|\.gov\.pk|\.gov\.pl|\.gov\.pr|\.gov\.ps|\.gov\.pt|\.gov\.py|\.gov\.qa|\.gov\.rs|\.gov\.ru|\.gov\.rw|\.gov\.sa|\.gov\.sb|\.gov\.sc|\.gov\.sd|\.gov\.sg|\.gov\.sh|\.gov\.sl|\.gov\.st|\.gov\.sy|\.gov\.tj|\.gov\.tn|\.gov\.tr|\.gov\.tt|\.gov\.tw|\.gov\.ua|\.gov\.uk|\.gov\.vn|\.gov\.ye|\.gov\.yu|\.gov\.za|\.gov\.zm|\.govt\.nz|\.govt\.uk|\.gp|\.gp\.school\.za|\.gq|\.gr|\.gr\.jp|\.grondar\.za|\.grozny\.ru|\.grp\.lk|\.gs|\.gs\.cn|\.gt|\.gu|\.gub\.uy|\.gv\.ao|\.gv\.at|\.gw|\.gwangju\.kr|\.gx\.cn|\.gy|\.gyeongbuk\.kr|\.gyeonggi\.kr|\.gyeongnam\.kr|\.gz\.cn|\.h\.se|\.ha\.cn|\.hb\.cn|\.he\.cn|\.health\.nz|\.health\.vn|\.hi\.cn|\.hk|\.hl\.cn|\.hm|\.hn|\.hn\.cn|\.hotel\.lk|\.hr|\.hs\.kr|\.ht|\.hu|\.i\.ph|\.i\.se|\.iaccess\.za|\.icnet\.uk|\.id|\.id\.au|\.id\.ir|\.id\.lv|\.id\.ly|\.idf\.il|\.idn\.sg|\.idv\.hk|\.idv\.tw|\.ie|\.if\.ua|\.il|\.im|\.imb\.br|\.imt\.za|\.in|\.in\.rs|\.in\.th|\.in\.ua|\.inca\.za|\.incheon\.kr|\.ind\.br|\.ind\.er|\.ind\.gt|\.ind\.in|\.ind\.tn|\.inf\.br|\.inf\.mk|\.info|\.info\.bb|\.info\.bh|\.info\.ck|\.info\.ec|\.info\.et|\.info\.fj|\.info\.ke|\.info\.ki|\.info\.mv|\.info\.nf|\.info\.nr|\.info\.pl|\.info\.pr|\.info\.ro|\.info\.sd|\.info\.tj|\.info\.tn|\.info\.tr|\.info\.tt|\.info\.ve|\.info\.vn|\.ing\.pa|\.int|\.int\.ar|\.int\.bo|\.int\.lk|\.int\.mv|\.int\.mw|\.int\.pt|\.int\.ru|\.int\.rw|\.int\.tj|\.int\.tt|\.int\.vn|\.intl\.tn|\.io|\.iq|\.ir|\.irkutsk\.ru|\.is|\.isa\.us|\.isla\.pr|\.it|\.it\.ao|\.its\.me|\.ivano-frankivsk\.ua|\.ivanovo\.ru|\.iwi\.nz|\.izhevsk\.ru|\.jar\.ru|\.je|\.jeju\.kr|\.jeonbuk\.kr|\.jeonnam\.kr|\.jet\.uk|\.jl\.cn|\.jm|\.jo|\.jobs|\.jobs\.tt|\.jor\.br|\.joshkar-ola\.ru|\.jp|\.js\.cn|\.jus\.br|\.jx\.cn|\.k\.se|\.k12\.il|\.k12\.tr|\.k12\.vi|\.kalmykia\.ru|\.kaluga\.ru|\.kamchatka\.ru|\.karelia\.ru|\.katowice\.pl|\.kazan\.ru|\.kchr\.ru|\.ke|\.kemerovo\.ru|\.kg
|\.kg\.kr|\.kh|\.kh\.ua|\.khabarovsk\.ru|\.khakassia\.ru|\.kharkov\.ua|\.kherson\.ua|\.khmelnitskiy\.ua|\.khv\.ru|\.ki|\.kids\.us|\.kiev\.ua|\.kirov\.ru|\.kirovograd\.ua|\.km|\.km\.ua|\.kn|\.koenig\.ru|\.komi\.ru|\.kostroma\.ru|\.kp|\.kr|\.kr\.ua|\.krakow\.pl|\.kranoyarsk\.ru|\.ks\.ua|\.kuban\.ru|\.kurgan\.ru|\.kursk\.ru|\.kv\.ua|\.kw|\.ky|\.kz|\.kzn\.school\.za|\.l\.se|\.la|\.landesign\.za|\.law\.za|\.lb|\.lc|\.lea\.uk|\.lel\.br|\.lg\.jp|\.lg\.ua|\.li|\.lipetsk\.ru|\.lk|\.ln\.cn|\.lodz\.pl|\.lp\.school\.za|\.lr|\.ls|\.lt|\.ltd\.cy|\.ltd\.lk|\.ltd\.uk|\.ltd\.ye|\.lu|\.lublin\.pl|\.lugansk\.ua|\.lutsk\.ua|\.lv|\.lviv\.ua|\.ly|\.m\.se|\.ma|\.magadan\.ru|\.maori\.nz|\.mari-el\.ru|\.mari\.ru|\.marine\.ru|\.mat\.br|\.mb\.ca|\.mc|\.md|\.me|\.me\.ke|\.me\.ua|\.me\.uk|\.me\.ye|\.med\.br|\.med\.ec|\.med\.ly|\.med\.om|\.med\.pa|\.med\.sa|\.med\.sd|\.medecin\.km|\.mf|\.mg|\.mh|\.mi\.th|\.mil|\.mil\.ac|\.mil\.ae|\.mil\.al|\.mil\.ar|\.mil\.ba|\.mil\.bo|\.mil\.br|\.mil\.cn|\.mil\.co|\.mil\.do|\.mil\.ec|\.mil\.eg|\.mil\.er|\.mil\.fj|\.mil\.gh|\.mil\.gr|\.mil\.gt|\.mil\.id|\.mil\.in|\.mil\.iq|\.mil\.jo|\.mil\.kh|\.mil\.km|\.mil\.kr|\.mil\.kz|\.mil\.lv|\.mil\.mg|\.mil\.mv|\.mil\.my|\.mil\.ng|\.mil\.ni|\.mil\.np|\.mil\.nz|\.mil\.om|\.mil\.pe|\.mil\.ph|\.mil\.pl|\.mil\.py|\.mil\.qa|\.mil\.ru|\.mil\.rw|\.mil\.st|\.mil\.sy|\.mil\.tj|\.mil\.tt|\.mil\.tw|\.mil\.uk|\.mil\.uy|\.mil\.ve|\.mil\.za|\.mincom\.tn|\.mk|\.mk\.ua|\.ml|\.mm|\.mn|\.mo|\.mob\.ki|\.mobi|\.mobi\.ke|\.mobi\.ng|\.mobi\.tt|\.mod\.uk|\.mordovia\.ru|\.mosreg\.ru|\.mp|\.mpm\.school\.za|\.mq|\.mr|\.ms|\.ms\.kr|\.msk\.ru|\.mt|\.mu|\.muni\.il|\.murmansk\.ru|\.mus\.br|\.museum|\.museum\.mv|\.museum\.mw|\.museum\.om|\.museum\.tt|\.mv|\.mw|\.mx|\.my|\.mz|\.n\.se|\.na|\.nalchik\.ru|\.name|\.name\.ae|\.name\.cy|\.name\.eg|\.name\.et|\.name\.fj|\.name\.jo|\.name\.mk|\.name\.mv|\.name\.my|\.name\.ng|\.name\.pr|\.name\.tj|\.name\.tr|\.name\.tt|\.name\.vn|\.nat\.tn|\.national-library-scotland\.uk|\.nb\.ca|\.nc|\.ncape\.school\.za|\.ne|\.
ne\.jp|\.ne\.ke|\.ne\.kr|\.ne\.pw|\.ne\.tz|\.ne\.ug|\.nel\.uk|\.net|\.net\.ac|\.net\.ae|\.net\.af|\.net\.al|\.net\.ar|\.net\.au|\.net\.ba|\.net\.bb|\.net\.bh|\.net\.bn|\.net\.bo|\.net\.br|\.net\.bs|\.net\.bz|\.net\.ck|\.net\.cn|\.net\.co|\.net\.cy|\.net\.do|\.net\.dz|\.net\.ec|\.net\.eg|\.net\.er|\.net\.et|\.net\.fj|\.net\.fk|\.net\.gg|\.net\.gn|\.net\.gr|\.net\.gt|\.net\.gu|\.net\.hk|\.net\.id|\.net\.il|\.net\.in|\.net\.iq|\.net\.ir|\.net\.je|\.net\.jo|\.net\.kh|\.net\.ki|\.net\.kn|\.net\.kw|\.net\.ky|\.net\.kz|\.net\.lb|\.net\.lk|\.net\.lr|\.net\.lv|\.net\.ly|\.net\.ma|\.net\.me|\.net\.mk|\.net\.ml|\.net\.mo|\.net\.mt|\.net\.mu|\.net\.mv|\.net\.mw|\.net\.mx|\.net\.my|\.net\.nf|\.net\.ng|\.net\.ni|\.net\.np|\.net\.nr|\.net\.nz|\.net\.om|\.net\.pa|\.net\.pe|\.net\.ph|\.net\.pk|\.net\.pl|\.net\.pr|\.net\.ps|\.net\.pt|\.net\.py|\.net\.qa|\.net\.ru|\.net\.rw|\.net\.sa|\.net\.sb|\.net\.sc|\.net\.sd|\.net\.sg|\.net\.sh|\.net\.sl|\.net\.st|\.net\.sy|\.net\.th|\.net\.tj|\.net\.tn|\.net\.tr|\.net\.tt|\.net\.tw|\.net\.ua|\.net\.uk|\.net\.uy|\.net\.ve|\.net\.vi|\.net\.vn|\.net\.ye|\.net\.za|\.net\.zm|\.news\.sy|\.nf|\.nf\.ca|\.ng|\.ngo\.lk|\.ngo\.ph|\.ngo\.pl|\.ngo\.za|\.nhs\.uk|\.ni|\.nic\.in|\.nic\.tj|\.nic\.uk|\.nikolaev\.ua|\.nis\.za|\.nl|\.nl\.ca|\.nls\.uk|\.nm\.cn|\.nnov\.ru|\.no|\.nom\.br|\.nom\.co|\.nom\.es|\.nom\.fk|\.nom\.fr|\.nom\.km|\.nom\.mg|\.nom\.ni|\.nom\.pa|\.nom\.pe|\.nom\.re|\.nom\.ro|\.nom\.sh|\.nom\.za|\.nome\.pt|\.not\.br|\.notaires\.km|\.nov\.ru|\.novosibirsk\.ru|\.np|\.nr|\.ns\.ca|\.nsk\.ru|\.nsn\.us|\.nt\.ca|\.nt\.ro|\.ntr\.br|\.nu|\.nu\.ca|\.nw\.school\.za|\.nx\.cn|\.nz|\.o\.se|\.od\.ua|\.odessa\.ua|\.odo\.br|\.og\.ao|\.olivetti\.za|\.olsztyn\.pl|\.om|\.omsk\.ru|\.on\.ca|\.or\.at|\.or\.cr|\.or\.id|\.or\.jp|\.or\.ke|\.or\.kr|\.or\.mu|\.or\.pw|\.or\.th|\.or\.tz|\.or\.ug|\.orenburg\.ru|\.org|\.org\.ac|\.org\.ae|\.org\.af|\.org\.al|\.org\.ar|\.org\.au|\.org\.ba|\.org\.bb|\.org\.bh|\.org\.bn|\.org\.bo|\.org\.br|\.org\.bs|\.org\.bz|\.org\.ck|\.org\.cn|\.or
g\.co|\.org\.cy|\.org\.do|\.org\.dz|\.org\.ec|\.org\.eg|\.org\.er|\.org\.es|\.org\.et|\.org\.fj|\.org\.fk|\.org\.gg|\.org\.gh|\.org\.gn|\.org\.gr|\.org\.gt|\.org\.gu|\.org\.hk|\.org\.il|\.org\.in|\.org\.iq|\.org\.ir|\.org\.je|\.org\.jo|\.org\.kh|\.org\.ki|\.org\.kn|\.org\.kw|\.org\.ky|\.org\.kz|\.org\.lb|\.org\.lk|\.org\.lr|\.org\.lv|\.org\.ly|\.org\.ma|\.org\.me|\.org\.mg|\.org\.mk|\.org\.ml|\.org\.mn|\.org\.mo|\.org\.mt|\.org\.mu|\.org\.mv|\.org\.mw|\.org\.mx|\.org\.my|\.org\.mz|\.org\.ng|\.org\.ni|\.org\.np|\.org\.nr|\.org\.nz|\.org\.om|\.org\.pa|\.org\.pe|\.org\.ph|\.org\.pk|\.org\.pl|\.org\.pr|\.org\.ps|\.org\.pt|\.org\.py|\.org\.qa|\.org\.ro|\.org\.rs|\.org\.ru|\.org\.sa|\.org\.sb|\.org\.sc|\.org\.sd|\.org\.se|\.org\.sg|\.org\.sh|\.org\.sl|\.org\.sn|\.org\.st|\.org\.sv|\.org\.sy|\.org\.sz|\.org\.tj|\.org\.tn|\.org\.tr|\.org\.tt|\.org\.tw|\.org\.ua|\.org\.ug|\.org\.uk|\.org\.uy|\.org\.ve|\.org\.vi|\.org\.vn|\.org\.ye|\.org\.yu|\.org\.za|\.org\.zm|\.orgn\.uk|\.oryol\.ru|\.other\.nf|\.p\.se|\.pa|\.parliament\.cy|\.parliament\.nz|\.parliament\.uk|\.parti\.se|\.pb\.ao|\.pe|\.pe\.ca|\.pe\.kr|\.penza\.ru|\.per\.kh|\.per\.nf|\.per\.sg|\.perm\.ru|\.perso\.sn|\.perso\.tn|\.pf|\.pg|\.ph|\.pharmaciens\.km|\.pix\.za|\.pk|\.pl|\.pl\.ua|\.plc\.ly|\.plc\.uk|\.plc\.ye|\.plo\.ps|\.pm|\.pn|\.pol\.dz|\.pol\.tr|\.police\.uk|\.poltava\.ua|\.post|\.poznan\.pl|\.pp\.ru|\.pp\.se|\.pp\.ua|\.ppg\.br|\.pr|\.prd\.fr|\.prd\.mg|\.press\.cy|\.press\.ma|\.press\.se|\.presse\.fr|\.presse\.km|\.presse\.ml|\.principe\.st|\.priv\.me|\.pro|\.pro\.ae|\.pro\.br|\.pro\.cy|\.pro\.ec|\.pro\.fj|\.pro\.mk|\.pro\.mv|\.pro\.om|\.pro\.pr|\.pro\.tt|\.pro\.vn|\.prof\.pr|\.ps|\.psc\.br|\.psi\.br|\.pskov\.ru|\.pt|\.ptz\.ru|\.pub\.sa|\.publ\.pt|\.pw|\.pwr\.pl|\.py|\.qa|\.qc\.ca|\.qh\.cn|\.qsl\.br|\.r\.se|\.radom\.pl|\.re|\.re\.kr|\.rec\.br|\.rec\.nf|\.rec\.ro|\.red\.sv|\.res\.in|\.rnd\.ru|\.rnrt\.tn|\.rns\.tn|\.rnu\.tn|\.ro|\.rochest\.er|\.rovno\.ua|\.rs|\.rs\.ba|\.ru|\.rv\.ua|\.rw|\.ryazan\.ru|\.s\.se|\.sa|\.sa
\.cr|\.sakhalin\.ru|\.samara\.ru|\.saotome\.st|\.saratov\.ru|\.sb|\.sc|\.sc\.cn|\.sc\.ke|\.sc\.kr|\.sc\.ug|\.sch\.ae|\.sch\.id|\.sch\.ir|\.sch\.jo|\.sch\.lk|\.sch\.ly|\.sch\.my|\.sch\.ng|\.sch\.om|\.sch\.sa|\.sch\.uk|\.sch\.zm|\.school\.nz|\.school\.za|\.sci\.eg|\.scot\.uk|\.sd|\.sd\.cn|\.se|\.sebastopol\.ua|\.sec\.ps|\.seoul\.kr|\.sg|\.sh|\.sh\.cn|\.si|\.simbirsk\.ru|\.sj|\.sk|\.sk\.ca|\.sl|\.sld\.do|\.sld\.pa|\.sld\.pe|\.slg\.br|\.slupsk\.pl|\.sm|\.smolensk\.ru|\.sn|\.sn\.cn|\.so|\.soc\.lk|\.soc\.uk|\.spb\.ru|\.sr|\.srv\.br|\.ss|\.st|\.stavropol\.ru|\.store\.bb|\.store\.nf|\.store\.ro|\.store\.st|\.stv\.ru|\.su|\.sumy\.ua|\.surgut\.ru|\.sv|\.sx|\.sx\.cn|\.sy|\.sz|\.szczecin\.pl|\.t\.se|\.tambov\.ru|\.tatarstan\.ru|\.tc|\.td|\.te\.ua|\.tel|\.tel\.ki|\.tel\.tr|\.tel\.tt|\.ternopil\.ua|\.test\.tj|\.tf|\.tg|\.th|\.tj|\.tj\.cn|\.tk|\.tl|\.tm|\.tm\.cy|\.tm\.fr|\.tm\.km|\.tm\.mc|\.tm\.mg|\.tm\.ro|\.tm\.se|\.tm\.za|\.tmp\.br|\.tn|\.to|\.tom\.ru|\.tomsk\.ru|\.torun\.pl|\.tourism\.tn|\.tp|\.tr|\.travel|\.travel\.tt|\.trd\.br|\.tsaritsyn\.ru|\.tsk\.ru|\.tsk\.tr|\.tt|\.tula\.ru|\.tur\.ar|\.tur\.br|\.tuva\.ru|\.tv|\.tv\.bb|\.tv\.bo|\.tv\.br|\.tv\.sd|\.tv\.tr|\.tver\.ru|\.tw|\.tw\.cn|\.tyumen\.ru|\.tz|\.u\.se|\.ua|\.udm\.ru|\.udmurtia\.ru|\.ug|\.uk|\.ulan-ude\.ru|\.ulsan\.kr|\.um|\.unbi\.ba|\.univ\.sn|\.unmo\.ba|\.unsa\.ba|\.untz\.ba|\.unze\.ba|\.us|\.uy|\.uz|\.uzhgorod\.ua|\.va|\.vc|\.ve|\.vet\.br|\.veterinaire\.km|\.vg|\.vi|\.vinnica\.ua|\.vladikavkaz\.ru|\.vladimir\.ru|\.vladivostok\.ru|\.vlog\.br|\.vn|\.vn\.ua|\.volgograd\.ru|\.vologda\.ru|\.voronezh\.ru|\.vrn\.ru|\.vu|\.vyatka\.ru|\.w\.er|\.w\.se|\.war\.net\.id|\.warszawa\.pl|\.waw\.pl|\.wcape\.school\.za|\.web\.do|\.web\.id|\.web\.lk|\.web\.nf|\.web\.pk|\.web\.tj|\.web\.tr|\.web\.ve|\.web\.za|\.wf|\.wiki\.br|\.wroc\.pl|\.wroclaw\.pl|\.ws|\.www\.ro|\.x\.se|\.xj\.cn|\.xxx|\.xz\.cn|\.y\.se|\.yakutia\.ru|\.yamal\.ru|\.ye|\.yekaterinburg\.ru|\.yk\.ca|\.yn\.cn|\.yt|\.yuzhno-sakhalinsk\.ru|\.z\.se|\.za|\.zaporizhzhe\.ua|\.zgora\
.pl|\.zhitomir\.ua|\.zj\.cn|\.zlg\.br|\.zm|\.zp\.ua|\.zt\.ua|\.zw)$/ at wilcoxon2.pl line 13, <IN> line 1.

It seems richrumble was able to run it OK.  Any ideas why I'm getting the above error?  My version of Perl is 5.10.1.

Thanks.
tel2
 
LVL 12

Expert Comment

by:tel2
ID: 39626019
PS: Having had a further look at it, it seems to be a missing ")" in line 11, but I'm not sure where it's meant to go.  Did you have to change it to get it to work, richrumble?
 
LVL 38

Author Comment

by:Rich Rumble
ID: 39626152
Yes i did, well my coworker finally had a chance, I'll see if he has it still.
-rich
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39626386
Oops.  Yep.  That line is missing a ).  It should go right at the end just before the $.
 
LVL 12

Expert Comment

by:tel2
ID: 39626703
Is this what you mean, wilcoxon?

$ cat wilcoxon2.pl
use strict;
use warnings;
my $fil = shift or die "Usage: $0 inputfile\n";
open IN, 'valid-domains.txt' or die "could not open valid-domains.txt: $!";
my @list = map { chomp; s{\.}{\\.}g; $_ } <IN>;
close IN;
my $rx = join '|', @list;
open IN, $fil or die "could not open $fil: $!";
while (<IN>) {
    chomp;
    if (s{^.*([^.]+\.(?:$rx))$}{$1}) { # lazy regex - could replace .* with valid char class
        print $_, "\n";
    } else {
        warn "could not match a TLD in $_";
    }
}
When I run that, I get this:

$ perl wilcoxon2.pl my-sites.txt
could not match a TLD in subdomain0.subdomain2.subdomain3.example.co.jp at wilcoxon2.pl line 14, <IN> line 1.
could not match a TLD in subdomain1.subdomain2.subdomain3.example.co.jp at wilcoxon2.pl line 14, <IN> line 2.
could not match a TLD in subdomain1.subdomain2.example.co.uk at wilcoxon2.pl line 14, <IN> line 3.
could not match a TLD in subdomain1.subdomain2.example.com.mx at wilcoxon2.pl line 14, <IN> line 4.
could not match a TLD in subdomain.example.gov.tx at wilcoxon2.pl line 14, <IN> line 5.
could not match a TLD in subdomain.example.org at wilcoxon2.pl line 14, <IN> line 6.
could not match a TLD in example.info at wilcoxon2.pl line 14, <IN> line 7.
Is that what you got?
It's not correct, is it?
Or am I doing something wrong?
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39627615
Oops.  Lots of little mistakes in that one regex.  The important change is removing \. inside the match but I also changed .* to hopefully make it more efficient.
if (s{^(?:.+\.)?([^.]+(?:$rx))$}{$1}) {
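
To illustrate why the extra `\.` mattered: each entry in the TLD file already starts with a dot, so a literal `\.` in front of the alternation asks for two dots in a row. The following is a small sketch using a three-entry sample list; it uses `quotemeta` for the escaping, which is a substitution for the `s{\.}{\\.}g` in the script, not what the script itself does.

```perl
use strict;
use warnings;

# Entries keep their leading dot, as in the TLD file; quotemeta escapes
# each dot so ".org" becomes "\.org" in the alternation.
my $rx = join '|', map { quotemeta } qw(.co.uk .org .info);

my $host = 'subdomain.example.org';

# Earlier shape: a literal "\." followed by an alternative that itself
# starts with "\." demands two consecutive dots, so nothing matches.
my $double_dot = $host =~ /^.*([^.]+\.(?:$rx))$/ ? 1 : 0;

# Corrected shape: the dot comes only from the alternation.
my $single_dot = $host =~ /^(?:.+\.)?([^.]+(?:$rx))$/ ? 1 : 0;
my $captured   = $1;

print "double dot matched: $double_dot\n";
print "single dot matched: $single_dot ($captured)\n";
```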

 
LVL 12

Expert Comment

by:tel2
ID: 39628423
Thanks wilcoxon.

I've tried that and this is the new output:
co.jp
co.jp
co.uk
com.mx
could not match a TLD in subdomain.example.gov.tx at wilcoxon3.pl line 14, <IN> line 5.
example.org
example.info
The kind of output that Rich has requested in his first post is:
Example.info
Example.org
Example.co.uk
Example.co.jp
Example.gov.tx    [A mistake I guess, since "gov.tx" is not in valid-domains.txt]
Example.com.mx
Your first 4 lines of output are TLDs, but based on his sample output, Rich needed websites.

Also, although it wasn't explicitly requested, since Rich seems to have a lot of subdomains, your current code would give a lot of duplicates in its output.  For example, if this was the input:
subdomain1.domain1.co.uk
subdomain2.domain1.co.uk
Your code would give 2 lines of output instead of 1.
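
For reference, one common way to drop such duplicates while keeping the input order is a `%seen` hash inside `grep`. A minimal sketch, assuming the subdomains have already been stripped (the `@domains` list is made-up sample data):

```perl
use strict;
use warnings;

# Hypothetical results after the subdomains have been stripped.
my @domains = qw(domain1.co.uk domain1.co.uk example.org domain1.co.uk);

# Keep only the first occurrence of each domain, preserving input order.
my %seen;
my @unique = grep { !$seen{$_}++ } @domains;

print "$_\n" for @unique;
```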
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39628554
Ah.  I missed that he was listing .co.jp and .jp as TLDs.  It may take a very carefully crafted regex to work in this case.  However, I think valid-domains.txt is in error as, as far as I'm aware (though I certainly could be wrong not having dealt with international domains much), you'll never have just site.jp (it will always be site.somethingstandard.jp).

As a quick try, what happens if I make everything non-greedy?
if (s{^(?:.+?\.)?([^.]+?(?:$rx))$}{$1}) {
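
A small sketch of how the greedy and non-greedy forms behave when the list holds both a ccTLD and one of its second-level suffixes. The three-entry list and the `strip_with` helper are assumptions for illustration only. The last call covers a case not tested in this thread: an entry with no subdomain at all, where the optional leading group can still consume the first label.

```perl
use strict;
use warnings;

# A list that contains both a ccTLD and one of its second-level suffixes,
# as valid-domains.txt does for .jp and .co.jp.
my $rx = join '|', map { quotemeta } qw(.jp .co.jp .org);

# Hypothetical helper: apply the substitution to a copy of the host.
sub strip_with {
    my ($pattern, $host) = @_;
    (my $copy = $host) =~ s/$pattern/$1/;
    return $copy;
}

my $greedy = qr/^(?:.+\.)?([^.]+(?:$rx))$/;
my $lazy   = qr/^(?:.+?\.)?([^.]+?(?:$rx))$/;

print strip_with($greedy, 'sub.example.co.jp'), "\n";  # co.jp
print strip_with($lazy,   'sub.example.co.jp'), "\n";  # example.co.jp
print strip_with($lazy,   'example.co.jp'),     "\n";  # co.jp
```

If bare entries like the last one occur in the real data, the hash lookup in the accepted solution avoids the problem because it tests the longest remaining suffix first.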

 
LVL 12

Expert Comment

by:tel2
ID: 39628859
> However, I think valid-domains.txt is in error as, as far as I'm aware (though I certainly could be wrong not having dealt with international domains much), you'll never have just site.jp (it will always be site.somethingstandard.jp).
I don't know about Japan or other countries, but for New Zealand, domain.nz is being proposed now:
    www.stuff.co.nz/technology/digital-living/9273374/Websites-simpler-with-a-nz
I also note that the registrar OnlyDomains.com (and probably others) sells China domains .cn, .com.cn & .cn.com.

Did you test your original script before posting it, wilcoxon?  If not, errors on only 1 line is not bad for an untested script of that size.

With your latest fix, your script sample output is now looking good:
example.co.jp
example.co.jp
example.co.uk
example.com.mx
could not match a TLD in subdomain.example.gov.tx at wilcoxon4.pl line 14, <IN> line 5.
example.org
example.info
It will still print duplicates, of course, which could be hundreds in Rich's case.
 
LVL 12

Expert Comment

by:tel2
ID: 39628958
But all this begs the question, Rich:
How did you (or your co-worker) get wilcoxon's solution to work?  And even if it did work, didn't it give you hundreds of duplicates because of the subdomains?
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39629006
Honestly, no I didn't test it before posting.  I'm often not where I can easily get to a perl instance to test.

A modified version to omit dupes is easy...
use strict;
use warnings;
my $fil = shift or die "Usage: $0 inputfile\n";
open IN, 'valid-domains.txt' or die "could not open valid-domains.txt: $!";
my @list = map { chomp; s{\.}{\\.}g; $_ } <IN>;
close IN;
my $rx = join '|', @list;
open IN, $fil or die "could not open $fil: $!";
my %domain;
while (<IN>) {
    chomp;
    if (s{^(?:.+?\.)?([^.]+?(?:$rx))$}{$1}) {
        $domain{$_}++;
    } else {
        warn "could not match a TLD in $_";
    }
}
print $_, "\n" for sort keys %domains; # could add a custom sort if desired
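
As one possible custom sort, the keys could be ordered by their labels read right to left, so that domains under the same TLD end up next to each other. This is a sketch only; `reversed_key` is a hypothetical helper and the hash below is sample data.

```perl
use strict;
use warnings;

# Sample results keyed by domain, as the script's hash would be.
my %domain = map { $_ => 1 }
    qw(example.org example.co.uk domain1.co.uk example.co.jp);

# Build a sort key with the labels reversed, e.g. "uk.co.example".
sub reversed_key { join '.', reverse split /\./, $_[0] }

my @sorted = sort { reversed_key($a) cmp reversed_key($b) } keys %domain;
print "$_\n" for @sorted;
```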

 
LVL 12

Expert Comment

by:tel2
ID: 39629023
Fair enough, wilcoxon,

> A modified version to omit dupes is easy...
I know it's easy.  My code does it.  However, your code now fails with this error:
Global symbol "%domains" requires explicit package name at wilcoxon5.pl line 18.
Execution of wilcoxon5.pl aborted due to compilation errors.
 
LVL 38

Author Comment

by:Rich Rumble
ID: 39629137
First:
.jp is valid for consumers. Like many country codes, it's the TOP level domain, and other prefixes are added to it too, like co.uk, or gov.tx for the Texas government sites.
http://www.iana.org/domains/root/db (google.jp is reserved but not available, google.co.jp is in use by google)
I did not have all the XX.gov domains as we don't really have any, and they are for state-run domains. Dot-gov is its own registrar; Verisign or GoDaddy can't sell you one :)

My co-worker said she fixed the error in the first post, but used the second since I sent them in the same email to her. She said either worked, and piped them through sort|uniq anyway by force of habit :) She just didn't have time to address the issue, but quickly sorted it out once there was something to work with first.
-rich
 
LVL 12

Expert Comment

by:tel2
ID: 39629167
OK, thanks for that, Rich.
 
LVL 26

Expert Comment

by:wilcoxon
ID: 39629184
Sigh.  That's what I get for quickly editing code that I haven't tested.  The %domains in the final print/for line should be %domain...
 
LVL 12

Expert Comment

by:tel2
ID: 39629195
Oh.  I should have noticed that.  Here's the output:

could not match a TLD in subdomain.example.gov.tx at wilcoxon5.pl line 15, <IN> line 4.
domain1.co.uk
example.co.jp
example.co.uk
example.com.mx
example.info
example.org
I think you've "sorted" it (if you'll excuse the pun).