Converting RDF to Text with Perl: Topics corrupted

Hello,

I use the following script (Thanks Adam314) to parse DMOZ data in RDF format, found here:
http://rdf.dmoz.org/rdf/content.rdf.u8.gz

I want to gather the following information:

              URL || Title || Description || Topic \n

The problem is that Adam314's code scrambles some topics.

For example, some items are listed as being in
     "_and_Economy/Shopping"
instead of
     "Top/Regional/North_America/Canada/Ontario/Localities/O/Ottawa/Business_and_Economy/Shopping/"

How can this be resolved?

Thanks!
#!/usr/bin/perl -w 
use strict;
use DBI;
use XML::Parser;
 
# Thanks, Adam314 
binmode(STDOUT, ":utf8");
 
my $parser = new XML::Parser(ErrorContext => 2, Style => 'Stream' );
 
$parser->setHandlers(
  End => \&handle_end,
  Start=>\&handle_start,
  Char=>\&handle_char,
  );
open(IN, "<content.rdf.txt") or die "input: $!\n";
$parser->parse(*IN);
close(IN);
 
my $inExternalPage = 0;
my $Url;
my $inSubElement;
my %SubElement;
 
 
 
sub handle_char
{
	return unless $inSubElement;
	$SubElement{$inSubElement} = $_[1];
}# End char_handler
 
sub handle_start {
	if($_[1] eq "ExternalPage") {
		$inExternalPage = 1;
		$Url = $_[3];
		$inSubElement=0;
		return;
	}
	elsif($inExternalPage) {
		$inSubElement = $_[1];
	}
	else {
		$inSubElement=0;
	}
}
 
 
# process an end-of-element event
#
sub handle_end {
	$inSubElement=0,return if $_[1] ne 'ExternalPage';
	
	$inExternalPage = 0;
	$inSubElement = 0;
	print "$Url||" . $SubElement{'d:Title'} . '||' . $SubElement{'d:Description'} . "||$SubElement{topic}\n";
}

Asked by: hankknight
 
Adam314 commented:
You could do the same to the output file:
    binmode(OUT, ":utf8");
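
As a rough sketch of the same idea, the encoding layer can also be named in the open() call itself. The file name and sample text below are made up for illustration, and this uses the stricter ":encoding(UTF-8)" layer in place of ":utf8"; calling binmode on the handle right after opening it should behave the same way.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $file = 'utf8_demo.txt';          # hypothetical file name
my $text = "Caf\x{e9} \x{263A}";     # sample text with a character above 0xFF

# Naming the layer in open() has the same effect as calling
# binmode($out, ':encoding(UTF-8)') right after opening the file.
open(my $out, '>:encoding(UTF-8)', $file) or die "output: $!\n";
print {$out} "$text\n";              # no "Wide character in print" warning here
close($out) or die "close: $!\n";

# Read the line back through the same layer to confirm the round trip.
open(my $in, '<:encoding(UTF-8)', $file) or die "input: $!\n";
my $line = <$in>;
close($in);
chomp $line;
unlink $file;

print $line eq $text ? "round trip ok\n" : "mismatch\n";
```

Each filehandle carries its own layers, which is why setting one on STDOUT does nothing for OUT.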
 
Adam314 commented:
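The reason is the same as before: the parser is delivering the content of one element in two pieces, and the old handle_char kept only the last piece. I don't know why it splits the text, but here is an updated script that appends each piece instead: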

#!/usr/bin/perl -w 
use strict;
use XML::Parser;
 
binmode(STDOUT, ":utf8"); 
print "\n\nBegin...\n\n";
 
##### Create parser, and set handlers
my $parser = new XML::Parser(ErrorContext => 2, Style => 'Stream' );
 
$parser->setHandlers(
  End => \&handle_end,
  Start=>\&handle_start,
  Char=>\&handle_char,
  );
 
 
##### Open files and parse
open(OUT, ">output.txt") or die "output: $!\n";
open(IN, "<content.rdf.txt") or die "input: $!\n";
$parser->parse(*IN);
close(IN);
close(OUT);
 
##### Variables needed by subroutines below
my $inExternalPage = 0;
my $Url;
my %data;
my $datakey;
 
sub handle_char
{
	return unless defined($datakey);
	$data{$datakey} .= $_[1];
}
 
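sub handle_start {
	if($_[1] eq "ExternalPage") {
		$inExternalPage = 1;
		$Url = $_[3];   # value of the first attribute (about="...")
		%data = ();     # start each record empty; handle_char appends with .=
	}
	elsif( ($inExternalPage) and !defined($datakey) ){
		$datakey = $_[1];
	}
}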
 
sub handle_end {
	if($_[1] eq 'ExternalPage') {
		$inExternalPage = 0;
		print OUT "$Url||$data{'d:Title'}||$data{'d:Description'}||$data{topic}\n";
	}
	$datakey = undef;
}
 
print "\n\nDone\n\n";
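
For what it's worth, the XML::Parser documentation does say that a single run of character data may be delivered over several Char calls. The sketch below does not use the parser at all. It only simulates two such calls with hard-coded chunks (the split point is a guess based on the scrambled topic in the question) to show why assigning loses the first piece and appending keeps both.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical chunks: pretend one <topic> value arrived in two Char callbacks.
my @chunks = (
    'Top/Regional/North_America/Canada/Ontario/Localities/O/Ottawa/Business',
    '_and_Economy/Shopping',
);

my ($assigned, $appended) = ('', '');
for my $chunk (@chunks) {
    $assigned  = $chunk;    # old handle_char: the last chunk overwrites the rest
    $appended .= $chunk;    # updated handle_char: the chunks are joined
}

print "assigned: $assigned\n";
print "appended: $appended\n";
```

Because the handler now appends, the stored values have to be cleared at the start of each ExternalPage, otherwise text from earlier records carries over into later ones.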

 
hankknight (author) commented:
Awesome, thanks!

Just one thing: I get many of these warnings: "Wide character in print"

The problem is that this only applies to screen output, not to the file output:
     binmode(STDOUT, ":utf8");