Populate a database with domain names from an rdf file

Hello,

I use a script, provided by za-k/ (adrpo), to populate a database with URLs from an RDF file:
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_23120051.html

Here is the RDF file:
http://rdf.dmoz.org/rdf/content.rdf.u8.gz

It populates my database with over 400,000 URLs and many are sub-pages of the same domain.

Now I only want domain names in my database.

http://www.example.com

should be included as is, but

http://www.domain.com/subdirectory/

should be placed into the database as

http://www.domain.com/

because I only want the domain names and not the full URL.

Thanks for the help!
hankknight Asked:
adrpo Commented:

You could use this:
http://textsnippets.com/posts/show/523

Cheers,
za-k/

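# Assumes $SavedLink holds one URL from the RDF file and
# $dbh is an already-connected DBI database handle.
$_ = $SavedLink;
    # Capture the scheme ($1) and the host ($2); the path is ignored,
    # so URLs with no path (http://www.example.com) match as well.
    if ( m{^(https?|ftp)://([^/:?#\s]+)}i )
    {
        # do the insert here
        $domainLinkOnly = "$1://$2";
        $dbh->do( 'INSERT INTO url VALUES (?)', undef, $domainLinkOnly );
    }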
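
If it helps to see the same idea outside of Perl, here is a minimal sketch in Python of reducing each URL to its domain and dropping the repeats, since many of the 400,000 URLs are sub-pages of the same domain. The function names domain_only and unique_domains are made up for this illustration, and the sketch assumes every URL starts with http, https or ftp. It always appends a trailing slash so that http://www.example.com and http://www.example.com/ count as the same domain; that normalisation is my assumption, not something the original script does.

```python
from urllib.parse import urlsplit


def domain_only(url):
    """Return 'scheme://host/' for an absolute URL, or None if it cannot be reduced."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit rejects some malformed URLs (e.g. unbalanced IPv6 brackets)
        return None
    if parts.scheme not in ("http", "https", "ftp") or not parts.hostname:
        return None
    # hostname is lower-cased and excludes any port or user:password part
    return f"{parts.scheme}://{parts.hostname}/"


def unique_domains(urls):
    """Reduce URLs to domains, keeping the first occurrence of each, in order."""
    seen = set()
    result = []
    for url in urls:
        domain = domain_only(url)
        if domain and domain not in seen:
            seen.add(domain)
            result.append(domain)
    return result
```

Each value returned by unique_domains could then be inserted into the database once, instead of inserting every URL and cleaning up afterwards.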

Adam314 Commented:
Do you have a smaller version of the RDF file?

I took a quick look at the other script.  At first glance, I don't see a need for the sleep.  It just slows the script down, and the script should run fine without it.