Perl Script for placing data from DMOZ to mySQL causing grief

Hello,

I use a script, provided by za-k/ (adrpo) to populate a database from an rdf file.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_23120051.html


Here is the RDF file:
http://rdf.dmoz.org/rdf/content.rdf.u8.gz

It populates my database with over 400,000 URLs however some items cause errors.

I get the following three errors:

  Wide character in print at ./getURLs4.pl line 59.

  DBD::mysql::st execute failed: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'Isaac_La(The_Miracle_Play_of_Abraham_and_Isaac).htm')' at line 1 at ./getURLs4.pl line 58.

  Nested quantifiers in regex; marked by <-- HERE in m/Top/Business/Publishing_and_Printing/Publishing/Books/Business/O'Reilly_and_Associates/Technical_Books/C,_C++ <-- HERE / at ./getURLs4.pl line 34.

I think that the problems are cauased by character-encoding issues and un-escaped charactors but I could be wrong.

To run the script:

First:
   wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz

Then:
gzcat content.rdf.u8.gz | ./dmoz2db.pl >results.txt
   or
zcat content.rdf.u8.gz | ./dmoz2db.pl >results.txt

Thanks for the help!
#!/usr/bin/perl
use DBI;
use XML::Parser;
 
# how much I sleep before I attempt a new insert
my $sleep = .15; 
 
# Connect to MySQL
my $conn = DBI->connect("dbi:mysql:dbname=dbname_dbname;host=localhost", "user_user", "password");
 
print "Deleting all that is in the url table\n";
my $queryDel = $conn->prepare("delete from url;");
$queryDel->execute(); 
 
my $parser = new XML::Parser(ErrorContext => 2, Style => 'Stream' );
 
$parser->setHandlers(End => \&handle_end,
		     Start=>\&handle_start,
		     Char=>\&handle_char);
 
$parser->parse(STDIN);
 
my $SavedLink = '';
my $WeAreInTopic = 'YES';
 
 
sub isBanned
{
    my $topic = shift(@_);
    my @banned = ("Top/Adult/", "Top/Arts/Bodyart/", "Top/Business/Arts_and_Entertainment/Gaming/", "Top/Religion_and_Spirituality/", "Top/Arts/Performing_Arts/Hypnotism/", "Top/Arts/Performing_Arts/Magic/", "Top/Games/Dice/", "Top/Games/Gambling/", "Top/Games/Roleplaying/", "Top/Recreation/Nudism/", "Top/Recreation/Tobacco/", "Top/Health/Men%27s_Health/Sexuality/", "Top/Shopping/Tobacco/", "Top/Society/Gay,_Lesbian,_and_Bisexual/", "Top/Society/Issues/Abortion/", "Top/Society/Sexuality/", "Top/Society/Transgendered/", "Top/Society/Relationships/Alternative_Lifestyles/", "Top/World/");
    foreach $b (@banned)
    {
       # if topic starts with banned or the other way around
       if (($b =~ m/$topic/) == 1 || ($topic =~ m/$b/))
       {
	   return 1;
       }  
    }
    return 0;
}
 
sub handle_char
{
    if ($WeAreInTopic eq 'YES')
    {
	my ($p, $topic) = @_;
	chomp($topic); # kill the \n
	if ($SavedLink ne '' && $topic ne '') # only when we already have a link
	{
	    if(isBanned($topic))
	    {
		print "link: $SavedLink --BANNED-- | topic: $topic.\n";
	    }
	    else
	    {
		$strQuery = "insert into url values(\'" . $SavedLink . "\');";
		my $query = $conn->prepare($strQuery);
		$query->execute();
		print "link: $SavedLink -INSERTED- | topic: $topic. Now sleep $sleep seconds\n";
		# sleep for some seconds
		sleep $sleep; 
	    }     
	}
    }  
}# End char_handler
 
 
sub handle_start {
    my( $expat, $element, %attrs ) = @_;
    if ($element eq 'topic')
    {
	$WeAreInTopic = 'YES';
    }
    else
    {
	$WeAreInTopic = 'NO';
    }
    if ($element eq 'ExternalPage')
    {
      # ask the expat object about our position
      my $line = $expat->current_line;
      if( %attrs ) {
          while( my( $key, $value ) = each( %attrs )) 
          {
              if(($key eq 'about') && ($value ne '')) 
	    { 
		# found our ABOUT, and the value is not null 
		# remember it for later because we need to see 
		# what is in the <topic>....</topic>
		$SavedLink = $value;
              }
          }
      }
    }
}
 
 
# process an end-of-element event
#
sub handle_end {
# just ignore the end of an element.
}

Open in new window

LVL 16
hankknightAsked:
Who is Participating?
 
adrpoConnect With a Mentor Commented:

Sorry, I thought it was obvious :))

Line 34 in the script you posted:
 $quotedTopic = quotemeta($topic);
 $quotedBanned = quotemeta($b);
 if (($quotedBanned =~ m/$quotedTopic/) == 1 ||
     ($quotedTopic =~ m/$quotedBanned/))

Line 56-58 in the script you posted:
              my $quotedString = $conn->quote( $string );
              my $sth = $conn->prepare( "insert into url values($quotedString);");
              $sth->execute();

Cheers,
za-k/
0
 
Adam314Connect With a Mentor Commented:
Try this... (leaving the rest unchanged):
sub handle_char
{
	if ($WeAreInTopic eq 'YES')
	{
		my ($p, $topic) = @_;
		chomp($topic); # kill the \n
		if ($SavedLink ne '' && $topic ne '') # only when we already have a link
		{
			if(isBanned($topic))
			{
				print "link: $SavedLink --BANNED-- | topic: $topic.\n";
			}
			else
			{
				$strQuery = "insert into url values(?);";
				my $query = $conn->prepare($strQuery);
				$query->execute($SavedLink);
				print "link: $SavedLink -INSERTED- | topic: $topic. Now sleep $sleep seconds\n";
				# sleep for some seconds
				sleep $sleep; 
			}     
		}
	}
}# End char_handler

Open in new window

0
 
hankknightAuthor Commented:
0
Free Tool: Path Explorer

An intuitive utility to help find the CSS path to UI elements on a webpage. These paths are used frequently in a variety of front-end development and QA automation tasks.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

 
adrpoCommented:

To get rid of the SQL insert problem:
### Escape the string quotes ...
my $quotedString = $conn->quote( $string );
 
### Use quoted string as a string literal in a SQL statement
my $sth = $conn->prepare( "insert into url values($quotedString);");
$sth->execute();
0
 
adrpoCommented:

To get rid of: Nested quantifiers in regex...
$quotedTopic = quotemeta($topic);
$quotedBanned = quotemeta($b);
if (($quotedBanned =~ m/$quotedTopic/) == 1 ||
    ($quotedTopic =~ m/$quotedBanned/))

Cheers,
za-k/

0
 
adrpoCommented:

To get rid of: Wide character in print at ./getURLs4.pl line 59.
Put this line, at the start of the script (line 4 or so):
binmode(STDOUT, ":utf8");

Cheers,
za-k/
0
 
hankknightAuthor Commented:
Thank you.  adrpo, where should I put this:

              my $quotedString = $conn->quote( $string );
              my $sth = $conn->prepare( "insert into url values($quotedString);");
              $sth->execute();

and this:

              $quotedTopic = quotemeta($topic);
              $quotedBanned = quotemeta($b);
              if (($quotedBanned =~ m/$quotedTopic/) == 1 ||
                   ($quotedTopic =~ m/$quotedBanned/))

?
             
0
 
Adam314Commented:
The script would also run faster if you moved the sth prepare to outside of the sub handle_char, and just ran the execute in the handle_char.  
0
 
hankknightAuthor Commented:
Thanks!  I used parts of both your ideas to fix the errors.

Now my new script includes items from banned categories.

I have posted a question about this here:
http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_23127636.html
0
All Courses

From novice to tech pro — start learning today.