sapbucket
how to parse stubborn html page with funny symbol?
Hello,
I have an HTML page that I want to parse, but I cannot get my regexes to do it. I'm a newb when it comes to regex. How can I change the 1/2 symbol (½) to the number .5? Here is a sample of the input and the output I want:
input file: (this same format keeps repeating N times)
<tr>
<td><font size="2">6/14/2000</font></td>
<td align="center"><font size="2">-125</font></td>
<td align="center"><font size="2">-120</font></td>
<td align="center"><font size="2">(919)</font></td>
<td><font size="2">SEA Mariners</font></td>
<td align="center"><font size="2">4</font></td>
<td align="center"></td>
<td align="center"><font size="2" color="red">1</font></td>
</tr>
<tr>
<td><font size="2">11:05 AM</font></td>
<td align="center"><font size="2">10½ov</font></td>
<td align="center"><font size="2">10½ov</font></td>
<td align="center"><font size="2">(920)</font></td>
<td><font size="2">KC Royals</font></td>
<td align="center"><font size="2">5</font></td>
<td><font size="2">Final© [<a href="Game.Asp?ID=29573">Details</a> ]</font></td>
<td align="center"><font size="2" color="red">9</font></td>
</tr>
<tr>
<td colspan="8"><font size="1"><hr></font></td>
</tr>
<tr>
<td><font size="2">6/14/2000</font></td>
<td align="center"><font size="2">11½ov</font></td>
<td align="center"><font size="2">12un</font></td>
<td align="center"><font size="2">(921)</font></td>
<td><font size="2">CHI White Sox</font></td>
<td align="center"><font size="2">11</font></td>
<td align="center"></td>
<td align="center"><font size="2" color="red">7</font></td>
</tr>
<tr>
<td><font size="2">4:05 PM</font></td>
<td align="center"><font size="2">-145</font></td>
<td align="center"><font size="2">-160</font></td>
<td align="center"><font size="2">(922)</font></td>
<td><font size="2">CLE Indians</font></td>
<td align="center"><font size="2">4</font></td>
<td><font size="2">Final© [<a href="Game.Asp?ID=29577">Details</a> ]</font></td>
<td align="center"><font size="2" color="red">15</font></td>
</tr>
<tr>
<td colspan="8"><font size="1"><hr></font></td>
</tr>
sample output:
date: 6/14/2000
time: 11:05 am
open: -125
close: -120
run open: 10.5
run close: 10.5
visitor: SEA Mariners
visitor score: 4
home: KC Royals
home score: 5
margin: 1
total: 9
date: 6/14/2000
time: 4:05 pm
open: -145
close: -160
run open: 11.5
run close: 12
visitor: CHI White Sox
visitor score: 11
home: CLE Indians
home score: 4
margin: 7
total: 15
Thanks for any help experts!
sapbucket
ASKER
How do I grab what is in between the tags on a per line basis?
Output like this would work great for me:
6/14/2000,-125,-120,(919),SEA Mariners,4,1,11:05 AM,10½ov,10½ov,(920),KC Royals,5,Final© [<a href="Game.Asp?ID=29573">Details</a> ],9
How can I go through each line, grab what is in between the tags, and output it as a CSV string to a file?
This seems like such a typical problem. My brain hurts when I look at it. :)
<tr>
<td><font size="2">6/14/2000</font></td>
<td align="center"><font size="2">-125</font></td>
<td align="center"><font size="2">-120</font></td>
<td align="center"><font size="2">(919)</font></td>
<td><font size="2">SEA Mariners</font></td>
<td align="center"><font size="2">4</font></td>
<td align="center"></td>
<td align="center"><font size="2" color="red">1</font></td>
</tr>
<tr>
<td><font size="2">11:05 AM</font></td>
<td align="center"><font size="2">10½ov</font></td>
<td align="center"><font size="2">10½ov</font></td>
<td align="center"><font size="2">(920)</font></td>
<td><font size="2">KC Royals</font></td>
<td align="center"><font size="2">5</font></td>
<td><font size="2">Final© [<a href="Game.Asp?ID=29573">Details</a> ]</font></td>
<td align="center"><font size="2" color="red">9</font></td>
</tr>
<tr>
<td colspan="8"><font size="1"><hr></font></td>
</tr>
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

my @p;    # text chunks accumulated for the current <td>
my @c;    # finished cell values, one per <td>

my $p = HTML::Parser->new(api_version => 3,
    start_h => [\&start_handler, "self,tagname"],
    end_h   => [\&end_handler,   "self,tagname"],
);

# On <td>, start collecting decoded text into @p.
sub start_handler{
    my($self, $tag) = @_;
    if( $tag eq "td" ){
        @p=();
        $self->handler(text => \@p, "dtext");
    }
}

# On </td>, stop collecting and store the joined text as one cell.
sub end_handler{
    my($self, $tag) = @_;
    if( $tag eq "td" ){
        $self->handler(text => '');
        push @c,join'',map{@$_}@p;
        @p=();
    }
}

$p->parse(<<END
<tr>
<td><font size="2">6/14/2000</font></td>
<td align="center"><font size="2">-125</font></td>
<td align="center"><font size="2">-120</font></td>
<td align="center"><font size="2">(919)</font></td>
<td><font size="2">SEA Mariners</font></td>
<td align="center"><font size="2">4</font></td>
<td align="center"></td>
<td align="center"><font size="2" color="red">1</font></td>
</tr>
<tr>
<td><font size="2">11:05 AM</font></td>
<td align="center"><font size="2">10½ov</font></td>
<td align="center"><font size="2">10½ov</font></td>
<td align="center"><font size="2">(920)</font></td>
<td><font size="2">KC Royals</font></td>
<td align="center"><font size="2">5</font></td>
<td><font size="2">Final© [<a href="Game.Asp?ID=29573">Details</a> ]</font></td>
<td align="center"><font size="2" color="red">9</font></td>
</tr>
<tr>
<td colspan="8"><font size="1"><hr></font></td>
</tr>
END
);
$p->eof;    # signal end of input so any buffered text is flushed
print join",",@c;
print "\n";
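A plain join on "," is fine for the sample data, but it would produce a broken line as soon as a cell contains a comma or a double quote. Here is a minimal sketch of a quoting helper using only core Perl. The names csv_field and csv_line and the file name games.csv are made up for illustration; for anything serious, the Text::CSV module on CPAN is probably the safer choice.

```perl
use strict;
use warnings;

# Quote one field only when it needs it: embedded commas, double quotes,
# or line breaks. Embedded double quotes are doubled, as CSV readers expect.
sub csv_field {
    my ($field) = @_;
    $field = '' unless defined $field;
    if ($field =~ /[",\r\n]/) {
        $field =~ s/"/""/g;
        return qq{"$field"};
    }
    return $field;
}

# Join a list of cells into one CSV line (without a trailing newline).
sub csv_line {
    return join ',', map { csv_field($_) } @_;
}

# Hypothetical usage: cells as they would be collected in @c by the parser.
my @cells = ('6/14/2000', '-125', '-120', '(919)', 'SEA Mariners', '4', '', '1');
print csv_line(@cells), "\n";

# To write to a file instead of STDOUT (the file name is an assumption):
# open my $out, '>', 'games.csv' or die "Cannot open games.csv: $!";
# print {$out} csv_line(@cells), "\n";
# close $out;
```

Note that the empty cell is kept as an empty field, so every line has the same number of columns.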
ASKER
ozo, thanks for the code. Is this a useful idiom for parsing, or is it a one-shot design? Seems like I can toolbox this one...
HTML::Parser is very flexible, but its applicability to other designs would obviously depend on what those other designs are.
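If you do want to toolbox it, one way might be to wrap the same start/text/end handler idea in a function that returns the cells grouped by table row. This is only a sketch, not the code from the accepted answer: table_rows is a hypothetical name, and it assumes simple, non-nested tables like the sample page.

```perl
use strict;
use warnings;
use HTML::Parser;

# table_rows is a hypothetical helper name, not part of HTML::Parser.
# It returns one array ref of cell strings per <tr> in the given HTML.
sub table_rows {
    my ($html) = @_;
    my (@rows, $row, $cell);

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            $row  = [] if $tag eq 'tr';   # begin a new row
            $cell = '' if $tag eq 'td';   # begin a new cell
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            $cell .= $text if defined $cell;   # keep text only inside a <td>
        }, 'dtext' ],
        end_h => [ sub {
            my ($tag) = @_;
            if ($tag eq 'td' && defined $cell) {
                push @$row, $cell if $row;
                undef $cell;
            }
            elsif ($tag eq 'tr' && $row) {
                push @rows, $row;
                undef $row;
            }
        }, 'tagname' ],
    );

    $parser->parse($html);
    $parser->eof;
    return @rows;
}

my $html = <<'HTML';
<tr>
<td><font size="2">KC Royals</font></td>
<td align="center"><font size="2">5</font></td>
</tr>
HTML

print join(',', @$_), "\n" for table_rows($html);   # prints "KC Royals,5"
```

On the sample page the separator rows (the ones holding only an hr) would come back as a row with a single empty cell, so you would probably want to skip those before writing anything out.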
ASKER
From CPAN:
"XML::Dumper dumps Perl data to XML format. XML::Dumper can also read XML data that was ***previously dumped*** by the module and convert it back to Perl."
I am concerned about "previously dumped". The XML source is from a .NET environment, not XML::Dumper. I use XMLDocumentReader class to read and write XML files in VB.NET. They are not too complicated. Any experience with using XML::Dumper with XML files *not* generated by XML::Dumper?
The way it is stated on CPAN, I would think it wouldn't work with XML generated by anything other than XML::Dumper.
use XML::Simple;
my $ref = XMLin([<xml file or string>] [, <options>]);
my $xml = XMLout($hashref [, <options>]);
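Unlike XML::Dumper, XMLin does not care which program wrote the XML. A small sketch of reading a string, assuming a made-up document layout (the games, game, visitor and home names are hypothetical, not taken from the .NET files in question):

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# Hypothetical XML; the element and attribute names are made up for illustration.
my $xml = <<'XML';
<games>
  <game id="29573">
    <visitor>SEA Mariners</visitor>
    <home>KC Royals</home>
  </game>
</games>
XML

# ForceArray keeps <game> as a list even when only one is present;
# KeyAttr => [] turns off folding of lists into hashes keyed on attributes.
my $data = XMLin($xml, ForceArray => ['game'], KeyAttr => []);

for my $game (@{ $data->{game} }) {
    print "$game->{id}: $game->{visitor} at $game->{home}\n";
}
```

Setting ForceArray and KeyAttr explicitly tends to make the resulting structure more predictable. The XML::Simple documentation itself discourages the module for new code, so it is probably best kept for small, simple files like the ones described.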
Would
s/½/.5/g;
work for you then?
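To get from cells like 10½ov and 12un to the 10.5 and 12 in the sample output, the substitution can be combined with stripping the over/under suffix. A minimal sketch, assuming the text has already been decoded to characters so that ½ is the single character U+00BD; run_line_to_number is a made-up name:

```perl
use strict;
use warnings;

# Turn a run-line cell such as "10½ov" or "12un" into a plain number string.
# \x{BD} is U+00BD (VULGAR FRACTION ONE HALF), written as an escape so the
# script does not depend on the encoding of its own source file.
sub run_line_to_number {
    my ($cell) = @_;
    $cell =~ s/\x{BD}/.5/g;           # 10½ov -> 10.5ov
    $cell =~ s/\s*(?:ov|un)\z//i;     # drop a trailing over/under marker
    return $cell;
}

print run_line_to_number($_), "\n" for "10\x{BD}ov", "11\x{BD}ov", "12un";
# prints 10.5, 11.5 and 12, one per line
```

If the page is read from a UTF-8 file without decoding, the ½ arrives as two bytes and the pattern would leave a stray byte behind, so opening the file with an encoding layer such as '<:encoding(UTF-8)' is probably needed in that case.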