hankknight
asked on
Perl Script for placing data from DMOZ to mySQL causing grief
Hello,
I use a script, provided by za-k/ (adrpo) to populate a database from an rdf file.
https://www.experts-exchange.com/questions/23120051/Placing-1-8-GB-of-data-in-database-without-hogging-resources.html
Here is the RDF file:
http://rdf.dmoz.org/rdf/content.rdf.u8.gz
It populates my database with over 400,000 URLs however some items cause errors.
I get the following three errors:
Wide character in print at ./getURLs4.pl line 59.
DBD::mysql::st execute failed: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'Isaac_La(The_Miracle_Play _of_Abraha m_and_Isaa c).htm')' at line 1 at ./getURLs4.pl line 58.
Nested quantifiers in regex; marked by <-- HERE in m/Top/Business/Publishing_ and_Printi ng/Publish ing/Books/ Business/O 'Reilly_an d_Associat es/Technic al_Books/C ,_C++ <-- HERE / at ./getURLs4.pl line 34.
I think that the problems are cauased by character-encoding issues and un-escaped charactors but I could be wrong.
To run the script:
First:
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
Then:
gzcat content.rdf.u8.gz | ./dmoz2db.pl >results.txt
or
zcat content.rdf.u8.gz | ./dmoz2db.pl >results.txt
Thanks for the help!
I use a script, provided by za-k/ (adrpo) to populate a database from an rdf file.
https://www.experts-exchange.com/questions/23120051/Placing-1-8-GB-of-data-in-database-without-hogging-resources.html
Here is the RDF file:
http://rdf.dmoz.org/rdf/content.rdf.u8.gz
It populates my database with over 400,000 URLs however some items cause errors.
I get the following three errors:
Wide character in print at ./getURLs4.pl line 59.
DBD::mysql::st execute failed: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'Isaac_La(The_Miracle_Play
Nested quantifiers in regex; marked by <-- HERE in m/Top/Business/Publishing_
I think that the problems are cauased by character-encoding issues and un-escaped charactors but I could be wrong.
To run the script:
First:
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
Then:
gzcat content.rdf.u8.gz | ./dmoz2db.pl >results.txt
or
zcat content.rdf.u8.gz | ./dmoz2db.pl >results.txt
Thanks for the help!
#!/usr/bin/perl
use DBI;
use XML::Parser;
# how much I sleep before I attempt a new insert
my $sleep = .15;
# Connect to MySQL
my $conn = DBI->connect("dbi:mysql:dbname=dbname_dbname;host=localhost", "user_user", "password");
print "Deleting all that is in the url table\n";
my $queryDel = $conn->prepare("delete from url;");
$queryDel->execute();
my $parser = new XML::Parser(ErrorContext => 2, Style => 'Stream' );
$parser->setHandlers(End => \&handle_end,
Start=>\&handle_start,
Char=>\&handle_char);
$parser->parse(STDIN);
my $SavedLink = '';
my $WeAreInTopic = 'YES';
sub isBanned
{
my $topic = shift(@_);
my @banned = ("Top/Adult/", "Top/Arts/Bodyart/", "Top/Business/Arts_and_Entertainment/Gaming/", "Top/Religion_and_Spirituality/", "Top/Arts/Performing_Arts/Hypnotism/", "Top/Arts/Performing_Arts/Magic/", "Top/Games/Dice/", "Top/Games/Gambling/", "Top/Games/Roleplaying/", "Top/Recreation/Nudism/", "Top/Recreation/Tobacco/", "Top/Health/Men%27s_Health/Sexuality/", "Top/Shopping/Tobacco/", "Top/Society/Gay,_Lesbian,_and_Bisexual/", "Top/Society/Issues/Abortion/", "Top/Society/Sexuality/", "Top/Society/Transgendered/", "Top/Society/Relationships/Alternative_Lifestyles/", "Top/World/");
foreach $b (@banned)
{
# if topic starts with banned or the other way around
if (($b =~ m/$topic/) == 1 || ($topic =~ m/$b/))
{
return 1;
}
}
return 0;
}
sub handle_char
{
if ($WeAreInTopic eq 'YES')
{
my ($p, $topic) = @_;
chomp($topic); # kill the \n
if ($SavedLink ne '' && $topic ne '') # only when we already have a link
{
if(isBanned($topic))
{
print "link: $SavedLink --BANNED-- | topic: $topic.\n";
}
else
{
$strQuery = "insert into url values(\'" . $SavedLink . "\');";
my $query = $conn->prepare($strQuery);
$query->execute();
print "link: $SavedLink -INSERTED- | topic: $topic. Now sleep $sleep seconds\n";
# sleep for some seconds
sleep $sleep;
}
}
}
}# End char_handler
sub handle_start {
my( $expat, $element, %attrs ) = @_;
if ($element eq 'topic')
{
$WeAreInTopic = 'YES';
}
else
{
$WeAreInTopic = 'NO';
}
if ($element eq 'ExternalPage')
{
# ask the expat object about our position
my $line = $expat->current_line;
if( %attrs ) {
while( my( $key, $value ) = each( %attrs ))
{
if(($key eq 'about') && ($value ne ''))
{
# found our ABOUT, and the value is not null
# remember it for later because we need to see
# what is in the <topic>....</topic>
$SavedLink = $value;
}
}
}
}
}
# process an end-of-element event
#
sub handle_end {
# just ignore the end of an element.
}
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
To get rid of the SQL insert problem:
### Escape the string quotes ...
my $quotedString = $conn->quote( $string );
### Use quoted string as a string literal in a SQL statement
my $sth = $conn->prepare( "insert into url values($quotedString);");
$sth->execute();
To get rid of: Nested quantifiers in regex...
$quotedTopic = quotemeta($topic);
$quotedBanned = quotemeta($b);
if (($quotedBanned =~ m/$quotedTopic/) == 1 ||
($quotedTopic =~ m/$quotedBanned/))
Cheers,
za-k/
To get rid of: Wide character in print at ./getURLs4.pl line 59.
Put this line, at the start of the script (line 4 or so):
binmode(STDOUT, ":utf8");
Cheers,
za-k/
ASKER
Thank you. adrpo, where should I put this:
my $quotedString = $conn->quote( $string );
my $sth = $conn->prepare( "insert into url values($quotedString);");
$sth->execute();
and this:
$quotedTopic = quotemeta($topic);
$quotedBanned = quotemeta($b);
if (($quotedBanned =~ m/$quotedTopic/) == 1 ||
($quotedTopic =~ m/$quotedBanned/))
?
my $quotedString = $conn->quote( $string );
my $sth = $conn->prepare( "insert into url values($quotedString);");
$sth->execute();
and this:
$quotedTopic = quotemeta($topic);
$quotedBanned = quotemeta($b);
if (($quotedBanned =~ m/$quotedTopic/) == 1 ||
($quotedTopic =~ m/$quotedBanned/))
?
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
The script would also run faster if you moved the sth prepare to outside of the sub handle_char, and just ran the execute in the handle_char.
ASKER
Thanks! I used parts of both your ideas to fix the errors.
Now my new script includes items from banned categories.
I have posted a question about this here:
https://www.experts-exchange.com/questions/23127636/Populate-a-database-with-from-an-rdf-file-exclude-banned-categories.html
Now my new script includes items from banned categories.
I have posted a question about this here:
https://www.experts-exchange.com/questions/23127636/Populate-a-database-with-from-an-rdf-file-exclude-banned-categories.html
ASKER
https://www.experts-exchange.com/questions/23124592/Populate-a-database-with-domain-names-from-an-rdf-file.html