Placing 1.8 GB of data in database without hogging resources

Hello,

The entire dmoz directory in RDF format can be downloaded here:

http://rdf.dmoz.org/
http://rdf.dmoz.org/rdf/content.rdf.u8.gz

I downloaded the entire thing.  It is compressed to about 300mb but when I extract it it is actually about 1.8 gigabytes.

I want a PHP script to place every single URL in  a mySQL database like this:
          $query = "INSERT INTO URL VALUES ('http://www.example.com')";

The problem is that I am worried that this could kill my poor server by hogging all its resources.

I would like your ideas on how this could be done in such a way as to preserve system resources.  I don't care if it takes two weeks to completely place all URLs in the database.

Also, please tell me if you think another language such as PERL or a linux sh script would be a better choice for this than PHP.
LVL 16
hankknightAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

adrpoCommented:

You could use perl.
See the script below.

You call it like this:
gzcat content.rdf.u8.gz | script.pl

This way you don't even need to unpack the file.

Cheers,
za-k/
#!/usr/bin/perl
use DBI;
 
my $conn = DBI->connect("dbi:MySQL:dbname=YOURDB", "password", "");
 
# read one line at a time
while (($URL=chomp($line=<STDIN>)) ne '') 
{
  my $query = $conn->prepare("insert into URL values('" . $URL . "');");
  $query->execute();
  sleep 5; # sleep for 5 seconds
}

Open in new window

0
adrpoCommented:

Ok. I disregarded the RDF format.
What do you want from there?
Here is a small example:
http://rdf.dmoz.org/rdf/content.example.txt
You want all the
<link r:resource="http://www.geocities.com/Hollywood/Boulevard/8928/" />
To be inserted in the DB? Or you want also these links:
<ExternalPage about="http://us.imdb.com/title/tt0118531/">

Let me know so I can change the script.

Cheers,
za-k/
0
hankknightAuthor Commented:
Thanks, adrpo for your response and your question.

1. I do NOT want "link r:resource" to be included
   <link r:resource="http://www.example.com/" />

2. I ONLY want "ExternalPage about" to be included
   <ExternalPage about="http://www.example.com/">

3. I have a long list of catagories that should be EXCLUDED including:

          Top/Adult/
          Top/Business/Arts_and_Entertainment/Gaming/
          Top/Games/Gambling/
          Top/Society/Sexuality/Activities_and_Practices/Swinging/
          Top/World/

Any URL in one of those catagories should NOT be included in the database.

For example, this website should NOT be included:

          <ExternalPage about="http://www.xxxxxxxx.xxx/">
            <d:Title>XXX XXX</d:Title>
            <d:Description>XXX XXX XXX</d:Description>
            <topic>Top/Adult/Image_Galleries/Fetishes/xxx/</topic>
          </ExternalPage>
0
Cloud Class® Course: Python 3 Fundamentals

This course will teach participants about installing and configuring Python, syntax, importing, statements, types, strings, booleans, files, lists, tuples, comprehensions, functions, and classes.

adrpoCommented:

Ok, here we go, are you ready? :))
Just copy the snipped or take the attached file and rename
ParseAndInsert-DMOZ.pl.txt
to
ParseAndInsert-DMOZ.pl
Edit you MySQL particulars in the connect statement.
Edit your $sleep to match your compuer power.

Enjoy!

Cheers,
za-k/

Check result.txt to see how the output looks like.

#!/usr/bin/perl
use DBI;
use XML::Parser;
 
# how much I sleep before I attempt a new insert
my $sleep = 5; 
 
# I HAD ONLY PostgreSQL, so you need to test this!
# my $conn = DBI->connect("dbi:Pg:dbname=assets;host=localhost;port=5432", "postgres", "YOUR_PASS");
my $conn = DBI->connect("dbi:mysql:dbname=YOUR_DATABASE;host=localhost", "YOUR_USER", "YOUR_PASS");
 
print "Deleting all that is in the url table\n";
my $queryDel = $conn->prepare("delete from url;");
$queryDel->execute(); 
 
my $parser = new XML::Parser(ErrorContext => 2, Style => 'Stream' );
 
$parser->setHandlers(End => \&handle_end,
		     Start=>\&handle_start,
		     Char=>\&handle_char);
 
$parser->parse(STDIN);
 
my $SavedLink = '';
my $WeAreInTopic = 'YES';
 
 
sub isBanned
{
    my $topic = shift(@_);
    my @banned = ("Top/Adult", "Top/Business/Arts_and_Entertainment/Gaming", "Top/Games/Gambling", "Top/Society/Sexuality/Activities_and_Practices/Swinging", "Top/World");
    foreach $b (@banned)
    {
       if (($b =~ m/$topic/) == 1) # if topic start
       {
	   return 1;
       }  
    }
    return 0;
}
 
sub handle_char
{
    if ($WeAreInTopic eq 'YES')
    {
	my ($p, $topic) = @_;
	chomp($topic); # kill the \n
	if ($SavedLink ne '' && $topic ne '') # only when we already have a link
	{
	    if(isBanned($topic))
	    {
		print "link: $SavedLink --BANNED-- | topic: $topic.\n";
	    }
	    else
	    {
		$strQuery = "insert into url values(\'" . $SavedLink . "\');";
		my $query = $conn->prepare($strQuery);
		$query->execute();
		print "link: $SavedLink -INSERTED- | topic: $topic. Now sleep $sleep seconds\n";
		# sleep for some seconds
		sleep $sleep; 
	    }     
	}
    }  
}# End char_handler
 
 
sub handle_start {
    my( $expat, $element, %attrs ) = @_;
    if ($element eq 'topic')
    {
	$WeAreInTopic = 'YES';
    }
    else
    {
	$WeAreInTopic = 'NO';
    }
    if ($element eq 'ExternalPage')
    {
      # ask the expat object about our position
      my $line = $expat->current_line;
      if( %attrs ) {
          while( my( $key, $value ) = each( %attrs )) 
          {
              if(($key eq 'about') && ($value ne '')) 
	    { 
		# found our ABOUT, and the value is not null 
		# remember it for later because we need to see 
		# what is in the <topic>....</topic>
		$SavedLink = $value;
              }
          }
      }
    }
}
 
 
# process an end-of-element event
#
sub handle_end {
# just ignore the end of an element.
}

Open in new window

ParseAndInsert-DMOZ.pl.txt
result.txt
0
adrpoCommented:

Hi again,

Just small details.
Make the script executable:
$ chmod +x ParseAndInsert-DMOZ.pl

Then after fixing your MySQL stuff and $sleep, just call the script like this:
$ gzcat content.rdf.u8.gz | ./ParseAndInsert-DMOZ.pl

I hope it works. For me it does very well :)
I tested it with this input:
http://rdf.dmoz.org/rdf/content.example.txt

Cheers,
za-k/
0
adrpoCommented:

One more thing.
Just add your banned topics to the "@banned" array in
the "isBanned" subroutine and if the there exists any
@banned[i] that starts with the string contained in the
current topic, then the link will NOT be inserted.

Cheers,
za-k/
0
adrpoCommented:

Small error:
if (($b =~ m/$topic/) == 1) # if topic is part of the banned
should be:
# if topic is part of the banned or banned is part of the topic
if (($b =~ m/$topic/) == 1 || ($topic =~ m/$b/) == 1)

This way:
if $b = "Top/Adult"; and $topic = "Top/Adult/MoreWeirdosHere/Etc"
the inclusion is actually $b is part of the $topic, not the other way around.

Cheers,
za-k/

0
hankknightAuthor Commented:
Thanks,  adrpo, you rock!

It works perfectly with one exception: the database is filling up with adult content!

Cheers,
Hank
sub isBanned
{
    my $topic = shift(@_);
    my @banned = ("Top/Adult/", "Top/Arts/Bodyart/", "Top/Business/Arts_and_Entertainment/Gaming/", "Top/Religion_and_Spirituality/", "Top/Arts/Performing_Arts/Hypnotism/", "Top/Arts/Performing_Arts/Magic/", "Top/Games/Dice/", "Top/Games/Gambling/", "Top/Games/Roleplaying/", "Top/Recreation/Nudism/", "Top/Recreation/Tobacco/", "Top/Health/Men%27s_Health/Sexuality/", "Top/Shopping/Tobacco/", "Top/Society/Gay,_Lesbian,_and_Bisexual/", "Top/Society/Issues/Abortion/", "Top/Society/Sexuality/", "Top/Society/Transgendered/", "Top/Society/Relationships/Alternative_Lifestyles/", "Top/World/");
    foreach $b (@banned)
    {
       if (($b =~ m/$topic/) == 1) # if topic start
       {
	   return 1;
       }  
    }
    return 0;
}

Open in new window

sample.rdf.txt
0
adrpoCommented:

Hehe, sorry :)
Not intentional. I'll review the script with the sample.rdf.txt you gave me.
I'll post it here asap.

Cheers,
za-k/
0
adrpoCommented:

I did also the change i talked about:
if (($b =~ m/$topic/) == 1 || ($topic =~ m/$b/) == 1)
This should fix the problem.

All the links are displayed but the adult ones are marked --BANNED--
and the one inserted are marked --INSERTED--

So here it is.

Cheers,
za-k/
#!/usr/bin/perl
use DBI;
use XML::Parser;
 
# how much I sleep before I attempt a new insert
my $sleep = 5; 
 
# I HAD ONLY PostgreSQL, so you need to test this!
#my $conn = DBI->connect("dbi:Pg:dbname=assets;host=localhost;port=5432", "postgres", "PASS");
my $conn = DBI->connect("dbi:mysql:dbname=YOUR_DATABASE;host=localhost", "YOUR_USER", "YOUR_PASS");
 
print "Deleting all that is in the url table\n";
my $queryDel = $conn->prepare("delete from url;");
$queryDel->execute(); 
 
my $parser = new XML::Parser(ErrorContext => 2, Style => 'Stream' );
 
$parser->setHandlers(End => \&handle_end,
		     Start=>\&handle_start,
		     Char=>\&handle_char);
 
$parser->parse(STDIN);
 
my $SavedLink = '';
my $WeAreInTopic = 'YES';
 
 
sub isBanned
{
    my $topic = shift(@_);
    my @banned = ("Top/Adult", "Top/Business/Arts_and_Entertainment/Gaming", "Top/Games/Gambling", "Top/Society/Sexuality/Activities_and_Practices/Swinging", "Top/World");
    foreach $b (@banned)
    {
       # if topic starts with banned or the other way around
       if (($b =~ m/$topic/) == 1 || ($topic =~ m/$b/))
       {
	   return 1;
       }  
    }
    return 0;
}
 
sub handle_char
{
    if ($WeAreInTopic eq 'YES')
    {
	my ($p, $topic) = @_;
	chomp($topic); # kill the \n
	if ($SavedLink ne '' && $topic ne '') # only when we already have a link
	{
	    if(isBanned($topic))
	    {
		print "link: $SavedLink --BANNED-- | topic: $topic.\n";
	    }
	    else
	    {
		$strQuery = "insert into url values(\'" . $SavedLink . "\');";
		my $query = $conn->prepare($strQuery);
		$query->execute();
		print "link: $SavedLink -INSERTED- | topic: $topic. Now sleep $sleep seconds\n";
		# sleep for some seconds
		sleep $sleep; 
	    }     
	}
    }  
}# End char_handler
 
 
sub handle_start {
    my( $expat, $element, %attrs ) = @_;
    if ($element eq 'topic')
    {
	$WeAreInTopic = 'YES';
    }
    else
    {
	$WeAreInTopic = 'NO';
    }
    if ($element eq 'ExternalPage')
    {
      # ask the expat object about our position
      my $line = $expat->current_line;
      if( %attrs ) {
          while( my( $key, $value ) = each( %attrs )) 
          {
              if(($key eq 'about') && ($value ne '')) 
	    { 
		# found our ABOUT, and the value is not null 
		# remember it for later because we need to see 
		# what is in the <topic>....</topic>
		$SavedLink = $value;
              }
          }
      }
    }
}
 
 
# process an end-of-element event
#
sub handle_end {
# just ignore the end of an element.
}

Open in new window

ParseAndInsert-DMOZ.pl.txt
trace-execution.txt
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
adrpoCommented:

Look at the end in trace-execution.txt
In the db are only the Parental Filtered Ones :)

adrpo@Kafka ~/dmoz
$ psql -U postgres -h localhost assets
Password:
Welcome to psql 8.0.7, the PostgreSQL interactive terminal.

Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit

assets=# select * from url;
                 link                  
---------------------------------------
 http://www.shamrockcottages.co.uk/
 http://www.holidayrentalsireland.com/
 http://www.dreamireland.com/
 http://www.groundwork.ie
 http://homepage.tinet.ie/~knp/duchas/
(5 rows)

assets=#

Cheers,
za-k/
0
hankknightAuthor Commented:
0
adrpoCommented:

The link you gave is pointing to exactly this question :)

Cheers,
za-k/
0
hankknightAuthor Commented:
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
PHP

From novice to tech pro — start learning today.