hankknight
asked on
Populate a database with domain names from an rdf file
Hello,
I use a script provided by za-k/ (adrpo) to populate a database with URLs from an RDF file.
https://www.experts-exchange.com/questions/23120051/Placing-1-8-GB-of-data-in-database-without-hogging-resources.html
Here is the RDF file:
http://rdf.dmoz.org/rdf/content.rdf.u8.gz
It populates my database with over 400,000 URLs, many of which are sub-pages of the same domain.
Now I only want domain names in my database.
http://www.example.com
should be included as is but
http://www.domain.com/subdirectory/
should be placed into the database as
http://www.domain.com/
because I only want the domain names and not the full URL.
Thanks for the help!
I took a quick look at the other script. At first glance, I don't see a need for the sleep. It just slows the script down, and the script should run fine without it.
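As for storing only the domain instead of the full URL, here is a minimal sketch of one way to do it. It is written in Python purely for illustration, since the import script itself is not shown in this thread; the function names `to_domain_url` and `unique_domains` are made up for this example. It follows the two cases in the question: a URL with no path is kept as is, and anything longer is cut back to scheme, host, and a trailing slash.

```python
from urllib.parse import urlsplit


def to_domain_url(url):
    """Reduce a URL to its domain root (hypothetical helper).

    A URL with no path, query, or fragment is returned unchanged;
    anything else becomes scheme://host/. Returns None when the
    input is not an absolute URL.
    """
    url = url.strip()
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        return None  # relative or malformed entry; caller can skip it
    if not parts.path and not parts.query and not parts.fragment:
        return url  # already a bare domain, keep as is
    return f"{parts.scheme}://{parts.netloc}/"


def unique_domains(urls):
    """Yield each domain URL once, in first-seen order (hypothetical helper)."""
    seen = set()
    for url in urls:
        domain = to_domain_url(url)
        if domain is None:
            continue
        # Compare without the trailing slash and case-insensitively, so
        # "http://www.example.com" and "http://www.example.com/" count as one.
        key = domain.rstrip("/").lower()
        if key not in seen:
            seen.add(key)
            yield domain


if __name__ == "__main__":
    sample = [
        "http://www.example.com",
        "http://www.domain.com/subdirectory/",
        "http://www.domain.com/page.html",
    ]
    for domain in unique_domains(sample):
        print(domain)
```

Because many of the 400,000 URLs share a domain, the de-duplication step matters as much as the trimming. Keeping a set in memory is an assumption that fits this data size; a unique index on the database column would be another way to get the same effect.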