Link to home
Start Free TrialLog in
Avatar of hankknight
hankknightFlag for Canada

asked on

Perl Script for placing data from DMOZ to mySQL causing grief

Hello,

I use a script, provided by za-k/ (adrpo) to populate a database from an rdf file.
https://www.experts-exchange.com/questions/23120051/Placing-1-8-GB-of-data-in-database-without-hogging-resources.html


Here is the RDF file:
http://rdf.dmoz.org/rdf/content.rdf.u8.gz

It populates my database with over 400,000 URLs however some items cause errors.

I get the following three errors:

  Wide character in print at ./getURLs4.pl line 59.

  DBD::mysql::st execute failed: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'Isaac_La(The_Miracle_Play_of_Abraham_and_Isaac).htm')' at line 1 at ./getURLs4.pl line 58.

  Nested quantifiers in regex; marked by <-- HERE in m/Top/Business/Publishing_and_Printing/Publishing/Books/Business/O'Reilly_and_Associates/Technical_Books/C,_C++ <-- HERE / at ./getURLs4.pl line 34.

I think that the problems are cauased by character-encoding issues and un-escaped charactors but I could be wrong.

To run the script:

First:
   wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz

Then:
gzcat content.rdf.u8.gz | ./dmoz2db.pl >results.txt
   or
zcat content.rdf.u8.gz | ./dmoz2db.pl >results.txt

Thanks for the help!
#!/usr/bin/perl
use DBI;
use XML::Parser;
 
# how much I sleep before I attempt a new insert
my $sleep = .15; 
 
# Connect to MySQL
my $conn = DBI->connect("dbi:mysql:dbname=dbname_dbname;host=localhost", "user_user", "password");
 
print "Deleting all that is in the url table\n";
my $queryDel = $conn->prepare("delete from url;");
$queryDel->execute(); 
 
my $parser = new XML::Parser(ErrorContext => 2, Style => 'Stream' );
 
$parser->setHandlers(End => \&handle_end,
		     Start=>\&handle_start,
		     Char=>\&handle_char);
 
$parser->parse(STDIN);
 
my $SavedLink = '';
my $WeAreInTopic = 'YES';
 
 
sub isBanned
{
    my $topic = shift(@_);
    my @banned = ("Top/Adult/", "Top/Arts/Bodyart/", "Top/Business/Arts_and_Entertainment/Gaming/", "Top/Religion_and_Spirituality/", "Top/Arts/Performing_Arts/Hypnotism/", "Top/Arts/Performing_Arts/Magic/", "Top/Games/Dice/", "Top/Games/Gambling/", "Top/Games/Roleplaying/", "Top/Recreation/Nudism/", "Top/Recreation/Tobacco/", "Top/Health/Men%27s_Health/Sexuality/", "Top/Shopping/Tobacco/", "Top/Society/Gay,_Lesbian,_and_Bisexual/", "Top/Society/Issues/Abortion/", "Top/Society/Sexuality/", "Top/Society/Transgendered/", "Top/Society/Relationships/Alternative_Lifestyles/", "Top/World/");
    foreach $b (@banned)
    {
       # if topic starts with banned or the other way around
       if (($b =~ m/$topic/) == 1 || ($topic =~ m/$b/))
       {
	   return 1;
       }  
    }
    return 0;
}
 
sub handle_char
{
    if ($WeAreInTopic eq 'YES')
    {
	my ($p, $topic) = @_;
	chomp($topic); # kill the \n
	if ($SavedLink ne '' && $topic ne '') # only when we already have a link
	{
	    if(isBanned($topic))
	    {
		print "link: $SavedLink --BANNED-- | topic: $topic.\n";
	    }
	    else
	    {
		$strQuery = "insert into url values(\'" . $SavedLink . "\');";
		my $query = $conn->prepare($strQuery);
		$query->execute();
		print "link: $SavedLink -INSERTED- | topic: $topic. Now sleep $sleep seconds\n";
		# sleep for some seconds
		sleep $sleep; 
	    }     
	}
    }  
}# End char_handler
 
 
sub handle_start {
    my( $expat, $element, %attrs ) = @_;
    if ($element eq 'topic')
    {
	$WeAreInTopic = 'YES';
    }
    else
    {
	$WeAreInTopic = 'NO';
    }
    if ($element eq 'ExternalPage')
    {
      # ask the expat object about our position
      my $line = $expat->current_line;
      if( %attrs ) {
          while( my( $key, $value ) = each( %attrs )) 
          {
              if(($key eq 'about') && ($value ne '')) 
	    { 
		# found our ABOUT, and the value is not null 
		# remember it for later because we need to see 
		# what is in the <topic>....</topic>
		$SavedLink = $value;
              }
          }
      }
    }
}
 
 
# process an end-of-element event
#
sub handle_end {
# just ignore the end of an element.
}

Open in new window

SOLUTION
Avatar of Adam314
Adam314

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of hankknight

ASKER


To get rid of the SQL insert problem:
### Escape the string quotes ...
my $quotedString = $conn->quote( $string );
 
### Use quoted string as a string literal in a SQL statement
my $sth = $conn->prepare( "insert into url values($quotedString);");
$sth->execute();

To get rid of: Nested quantifiers in regex...
$quotedTopic = quotemeta($topic);
$quotedBanned = quotemeta($b);
if (($quotedBanned =~ m/$quotedTopic/) == 1 ||
    ($quotedTopic =~ m/$quotedBanned/))

Cheers,
za-k/


To get rid of: Wide character in print at ./getURLs4.pl line 59.
Put this line, at the start of the script (line 4 or so):
binmode(STDOUT, ":utf8");

Cheers,
za-k/
Thank you.  adrpo, where should I put this:

              my $quotedString = $conn->quote( $string );
              my $sth = $conn->prepare( "insert into url values($quotedString);");
              $sth->execute();

and this:

              $quotedTopic = quotemeta($topic);
              $quotedBanned = quotemeta($b);
              if (($quotedBanned =~ m/$quotedTopic/) == 1 ||
                   ($quotedTopic =~ m/$quotedBanned/))

?
             
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of Adam314
Adam314

The script would also run faster if you moved the sth prepare to outside of the sub handle_char, and just ran the execute in the handle_char.  
Thanks!  I used parts of both your ideas to fix the errors.

Now my new script includes items from banned categories.

I have posted a question about this here:
https://www.experts-exchange.com/questions/23127636/Populate-a-database-with-from-an-rdf-file-exclude-banned-categories.html