Perl Script for placing data from DMOZ to mySQL causing grief

Hello,

I use a script, provided by za-k/ (adrpo) to populate a database from an rdf file.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/Q_23120051.html


Here is the RDF file:
http://rdf.dmoz.org/rdf/content.rdf.u8.gz

It populates my database with over 400,000 URLs however some items cause errors.

I get the following three errors:

  Wide character in print at ./getURLs4.pl line 59.

  DBD::mysql::st execute failed: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'Isaac_La(The_Miracle_Play_of_Abraham_and_Isaac).htm')' at line 1 at ./getURLs4.pl line 58.

  Nested quantifiers in regex; marked by <-- HERE in m/Top/Business/Publishing_and_Printing/Publishing/Books/Business/O'Reilly_and_Associates/Technical_Books/C,_C++ <-- HERE / at ./getURLs4.pl line 34.

I think that the problems are cauased by character-encoding issues and un-escaped charactors but I could be wrong.

To run the script:

First:
   wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz

Then:
gzcat content.rdf.u8.gz | ./dmoz2db.pl >results.txt
   or
zcat content.rdf.u8.gz | ./dmoz2db.pl >results.txt

Thanks for the help!
#!/usr/bin/perl
use DBI;
use XML::Parser;
 
# how much I sleep before I attempt a new insert
my $sleep = .15; 
 
# Connect to MySQL
my $conn = DBI->connect("dbi:mysql:dbname=dbname_dbname;host=localhost", "user_user", "password");
 
print "Deleting all that is in the url table\n";
my $queryDel = $conn->prepare("delete from url;");
$queryDel->execute(); 
 
my $parser = new XML::Parser(ErrorContext => 2, Style => 'Stream' );
 
$parser->setHandlers(End => \&handle_end,
		     Start=>\&handle_start,
		     Char=>\&handle_char);
 
$parser->parse(STDIN);
 
my $SavedLink = '';
my $WeAreInTopic = 'YES';
 
 
sub isBanned
{
    my $topic = shift(@_);
    my @banned = ("Top/Adult/", "Top/Arts/Bodyart/", "Top/Business/Arts_and_Entertainment/Gaming/", "Top/Religion_and_Spirituality/", "Top/Arts/Performing_Arts/Hypnotism/", "Top/Arts/Performing_Arts/Magic/", "Top/Games/Dice/", "Top/Games/Gambling/", "Top/Games/Roleplaying/", "Top/Recreation/Nudism/", "Top/Recreation/Tobacco/", "Top/Health/Men%27s_Health/Sexuality/", "Top/Shopping/Tobacco/", "Top/Society/Gay,_Lesbian,_and_Bisexual/", "Top/Society/Issues/Abortion/", "Top/Society/Sexuality/", "Top/Society/Transgendered/", "Top/Society/Relationships/Alternative_Lifestyles/", "Top/World/");
    foreach $b (@banned)
    {
       # if topic starts with banned or the other way around
       if (($b =~ m/$topic/) == 1 || ($topic =~ m/$b/))
       {
	   return 1;
       }  
    }
    return 0;
}
 
sub handle_char
{
    if ($WeAreInTopic eq 'YES')
    {
	my ($p, $topic) = @_;
	chomp($topic); # kill the \n
	if ($SavedLink ne '' && $topic ne '') # only when we already have a link
	{
	    if(isBanned($topic))
	    {
		print "link: $SavedLink --BANNED-- | topic: $topic.\n";
	    }
	    else
	    {
		$strQuery = "insert into url values(\'" . $SavedLink . "\');";
		my $query = $conn->prepare($strQuery);
		$query->execute();
		print "link: $SavedLink -INSERTED- | topic: $topic. Now sleep $sleep seconds\n";
		# sleep for some seconds
		sleep $sleep; 
	    }     
	}
    }  
}# End char_handler
 
 
sub handle_start {
    my( $expat, $element, %attrs ) = @_;
    if ($element eq 'topic')
    {
	$WeAreInTopic = 'YES';
    }
    else
    {
	$WeAreInTopic = 'NO';
    }
    if ($element eq 'ExternalPage')
    {
      # ask the expat object about our position
      my $line = $expat->current_line;
      if( %attrs ) {
          while( my( $key, $value ) = each( %attrs )) 
          {
              if(($key eq 'about') && ($value ne '')) 
	    { 
		# found our ABOUT, and the value is not null 
		# remember it for later because we need to see 
		# what is in the <topic>....</topic>
		$SavedLink = $value;
              }
          }
      }
    }
}
 
 
# process an end-of-element event
#
sub handle_end {
# just ignore the end of an element.
}

Open in new window

LVL 16
hankknightAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Adam314Commented:
Try this... (leaving the rest unchanged):
sub handle_char
{
	if ($WeAreInTopic eq 'YES')
	{
		my ($p, $topic) = @_;
		chomp($topic); # kill the \n
		if ($SavedLink ne '' && $topic ne '') # only when we already have a link
		{
			if(isBanned($topic))
			{
				print "link: $SavedLink --BANNED-- | topic: $topic.\n";
			}
			else
			{
				$strQuery = "insert into url values(?);";
				my $query = $conn->prepare($strQuery);
				$query->execute($SavedLink);
				print "link: $SavedLink -INSERTED- | topic: $topic. Now sleep $sleep seconds\n";
				# sleep for some seconds
				sleep $sleep; 
			}     
		}
	}
}# End char_handler

Open in new window

0
hankknightAuthor Commented:
0
adrpoCommented:

To get rid of the SQL insert problem:
### Escape the string quotes ...
my $quotedString = $conn->quote( $string );
 
### Use quoted string as a string literal in a SQL statement
my $sth = $conn->prepare( "insert into url values($quotedString);");
$sth->execute();
0
Cloud Class® Course: Amazon Web Services - Basic

Are you thinking about creating an Amazon Web Services account for your business? Not sure where to start? In this course you’ll get an overview of the history of AWS and take a tour of their user interface.

adrpoCommented:

To get rid of: Nested quantifiers in regex...
$quotedTopic = quotemeta($topic);
$quotedBanned = quotemeta($b);
if (($quotedBanned =~ m/$quotedTopic/) == 1 ||
    ($quotedTopic =~ m/$quotedBanned/))

Cheers,
za-k/

0
adrpoCommented:

To get rid of: Wide character in print at ./getURLs4.pl line 59.
Put this line, at the start of the script (line 4 or so):
binmode(STDOUT, ":utf8");

Cheers,
za-k/
0
hankknightAuthor Commented:
Thank you.  adrpo, where should I put this:

              my $quotedString = $conn->quote( $string );
              my $sth = $conn->prepare( "insert into url values($quotedString);");
              $sth->execute();

and this:

              $quotedTopic = quotemeta($topic);
              $quotedBanned = quotemeta($b);
              if (($quotedBanned =~ m/$quotedTopic/) == 1 ||
                   ($quotedTopic =~ m/$quotedBanned/))

?
             
0
adrpoCommented:

Sorry, I thought it was obvious :))

Line 34 in the script you posted:
 $quotedTopic = quotemeta($topic);
 $quotedBanned = quotemeta($b);
 if (($quotedBanned =~ m/$quotedTopic/) == 1 ||
     ($quotedTopic =~ m/$quotedBanned/))

Line 56-58 in the script you posted:
              my $quotedString = $conn->quote( $string );
              my $sth = $conn->prepare( "insert into url values($quotedString);");
              $sth->execute();

Cheers,
za-k/
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
Adam314Commented:
The script would also run faster if you moved the sth prepare to outside of the sub handle_char, and just ran the execute in the handle_char.  
0
hankknightAuthor Commented:
Thanks!  I used parts of both your ideas to fix the errors.

Now my new script includes items from banned categories.

I have posted a question about this here:
http://www.experts-exchange.com/Programming/Languages/Scripting/Perl/Q_23127636.html
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Perl

From novice to tech pro — start learning today.