Converting RDF to Text with Perl: Topics corrupted

Hello,

I use the following script (Thanks Adam314) to parse DMOZ data in RDF format, found here:
http://rdf.dmoz.org/rdf/content.rdf.u8.gz

I want to gather the following information:

              URL || Title || Description || Topic \n

The problem is that Adam314's code scrambles some topics.

For example, some items are listed as being in
     "_and_Economy/Shopping"
instead of
     "Top/Regional/North_America/Canada/Ontario/Localities/O/Ottawa/Business_and_Economy/Shopping/"

How can this be resolved?

Thanks!
#!/usr/bin/perl -w 
use strict;
use DBI;
use XML::Parser;
 
# Thanks, Adam314 
binmode(STDOUT, ":utf8");
 
my $parser = new XML::Parser(ErrorContext => 2, Style => 'Stream' );
 
$parser->setHandlers(
  End => \&handle_end,
  Start=>\&handle_start,
  Char=>\&handle_char,
  );
open(IN, "<content.rdf.txt") or die "input: $!\n";
$parser->parse(*IN);
close(IN);
 
my $inExternalPage = 0;
my $Url;
my $inSubElement;
my %SubElement;
 
 
 
sub handle_char
{
	return unless $inSubElement;
	$SubElement{$inSubElement} = $_[1];
}# End char_handler
 
sub handle_start {
	if($_[1] eq "ExternalPage") {
		$inExternalPage = 1;
		$Url = $_[3];
		$inSubElement=0;
		return;
	}
	elsif($inExternalPage) {
		$inSubElement = $_[1];
	}
	else {
		$inSubElement=0;
	}
}
 
 
# process an end-of-element event
#
sub handle_end {
	$inSubElement=0,return if $_[1] ne 'ExternalPage';
	
	$inExternalPage = 0;
	$inSubElement = 0;
	print "$Url||" . $SubElement{'d:Title'} . '||' . $SubElement{'d:Description'} . "||$SubElement{topic}\n";
}

Asked by hankknight

Adam314 commented:
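The reason is the same as before: the parser is splitting the content into two pieces and calling handle_char once per piece, so the plain assignment in handle_char keeps only the last piece. I don't know why it splits there, but here is an updated script that appends each piece instead: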

#!/usr/bin/perl -w 
use strict;
use XML::Parser;
 
binmode(STDOUT, ":utf8"); 
print "\n\nBegin...\n\n";
 
##### Create parser, and set handlers
my $parser = new XML::Parser(ErrorContext => 2, Style => 'Stream' );
 
$parser->setHandlers(
  End => \&handle_end,
  Start=>\&handle_start,
  Char=>\&handle_char,
  );
 
 
##### Open files and parse
open(OUT, ">output.txt") or die "output: $!\n";
open(IN, "<content.rdf.txt") or die "input: $!\n";
$parser->parse(*IN);
close(IN);
close(OUT);
 
##### Variables needed by subroutines below
my $inExternalPage = 0;
my $Url;
my %data;
my $datakey;
 
sub handle_char
{
	return unless defined($datakey);
	$data{$datakey} .= $_[1];
}
 
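sub handle_start {
	if($_[1] eq "ExternalPage") {
		$inExternalPage = 1;
		$Url = $_[3];
		# Start each record empty; otherwise the .= in handle_char
		# appends to the fields left over from the previous record.
		%data = ();
	}
	elsif( ($inExternalPage) and !defined($datakey) ){
		$datakey = $_[1];
	}
}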
 
sub handle_end {
	if($_[1] eq 'ExternalPage') {
		$inExternalPage = 0;
		print OUT "$Url||$data{'d:Title'}||$data{'d:Description'}||$data{topic}\n";
	}
	$datakey = undef;
}
 
print "\n\nDone\n\n";
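
For anyone wondering about the splitting: as far as I know, XML::Parser's documentation says a single run of character data may produce several calls to the Char handler. The sketch below compares overwriting with appending. The sample XML and all variable names are made up for illustration, and the entity reference is only there because it is a likely place for a split; the chunk count may differ between versions.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample record, not taken from the real DMOZ dump.
# The &amp; entity makes it likely that the topic text arrives in several chunks.
my $xml = <<'XML';
<RDF>
  <ExternalPage about="http://example.com/">
    <topic>Top/Business_&amp;_Economy/Shopping</topic>
  </ExternalPage>
</RDF>
XML

my $in_topic    = 0;
my $chunks      = 0;
my $overwritten = '';   # what "=" would keep
my $accumulated = '';   # what ".=" keeps

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $element) = @_;
            $in_topic = 1 if $element eq 'topic';
        },
        End => sub {
            my ($expat, $element) = @_;
            $in_topic = 0 if $element eq 'topic';
        },
        Char => sub {
            my ($expat, $text) = @_;
            return unless $in_topic;
            $chunks++;
            $overwritten  = $text;   # only the most recent chunk survives
            $accumulated .= $text;   # every chunk is kept, in order
        },
    },
);
$parser->parse($xml);

print "chunks:      $chunks\n";
print "overwritten: $overwritten\n";
print "accumulated: $accumulated\n";
```

If the parser does split the text, the "overwritten" line shows only the tail of the topic, which is the same symptom as "_and_Economy/Shopping" in the question, while "accumulated" always holds the full path.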

hankknight (author) commented:
Awesome, thanks!

Just one thing: I get many of these warnings: "Wide character in print"

The problem is that this only applies to screen output, not to the file output:
     binmode(STDOUT, ":utf8");
Adam314 commented:
You could do the same to the output file:
    binmode(OUT, ":utf8");
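
An alternative to calling binmode afterwards is to put the encoding layer directly in the open call. This is only a sketch: the file name utf8_demo.txt and the sample string are made up, and it uses a lexical filehandle rather than the bareword OUT from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $file = 'utf8_demo.txt';          # hypothetical file name
my $text = "Caf\x{e9} \x{263A}";     # contains a character above 0xFF

# Three-argument open with an encoding layer: characters are encoded
# as UTF-8 on the way out, so print has no "wide character" to complain about.
open(my $out, '>:encoding(UTF-8)', $file) or die "output: $!\n";
print {$out} "$text\n";
close($out) or die "close: $!\n";

# Read the file back through the same layer to confirm the round trip.
open(my $in, '<:encoding(UTF-8)', $file) or die "input: $!\n";
my $line = <$in>;
close($in);
chomp $line;

print( ($line eq $text) ? "round trip ok\n" : "round trip FAILED\n" );
unlink $file;
```

The same idea should apply to the input side: if the dump is UTF-8, the layer can go on the handle that is read from as well, although XML::Parser does its own decoding of the XML it is given.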