How to parse a stubborn HTML page with a funny symbol?

Hello,

I have an HTML page that I want to parse, but I cannot get my regexes to do it. I'm a newbie when it comes to regex. How can I change the 1/2 symbol (½) into the number .5? Here is a sample of the input and the output I want:

input file: (this same format keeps repeating N times)
    <tr>
            <td><font size="2">6/14/2000</font></td>
            <td align="center"><font size="2">-125</font></td>
            <td align="center"><font size="2">-120</font></td>
            <td align="center"><font size="2">(919)</font></td>
            <td><font size="2">SEA Mariners</font></td>
            <td align="center"><font size="2">4</font></td>
            <td align="center"></td>
            <td align="center"><font size="2" color="red">1</font></td>
      </tr>
      <tr>
            <td><font size="2">11:05 AM</font></td>
            <td align="center"><font size="2">10½ov</font></td>
            <td align="center"><font size="2">10½ov</font></td>
            <td align="center"><font size="2">(920)</font></td>
            <td><font size="2">KC Royals</font></td>
            <td align="center"><font size="2">5</font></td>
            <td><font size="2">Final© [<a href="Game.Asp?ID=29573">Details</a>]</font></td>
            <td align="center"><font size="2" color="red">9</font></td>
      </tr>
      <tr>
            <td colspan="8"><font size="1"><hr></font></td>
      </tr>
      
            
   
    <tr>
            <td><font size="2">6/14/2000</font></td>
            <td align="center"><font size="2">11½ov</font></td>
            <td align="center"><font size="2">12un</font></td>
            <td align="center"><font size="2">(921)</font></td>
            <td><font size="2">CHI White Sox</font></td>
            <td align="center"><font size="2">11</font></td>
            <td align="center"></td>
            <td align="center"><font size="2" color="red">7</font></td>
      </tr>
      <tr>
            <td><font size="2">4:05 PM</font></td>
            <td align="center"><font size="2">-145</font></td>
            <td align="center"><font size="2">-160</font></td>
            <td align="center"><font size="2">(922)</font></td>
            <td><font size="2">CLE Indians</font></td>
            <td align="center"><font size="2">4</font></td>
            <td><font size="2">Final© [<a href="Game.Asp?ID=29577">Details</a>]</font></td>
            <td align="center"><font size="2" color="red">15</font></td>
      </tr>
      <tr>
            <td colspan="8"><font size="1"><hr></font></td>
      </tr>



sample output:

date: 6/14/2000
time: 11:05 am
open: -125
close: -120
run open: 10.5
run close: 10.5
visitor: SEA Mariners
visitor score: 4
home: KC Royals
home score: 5
margin: 1
total: 9

date: 6/14/2000
time: 4:05 pm
open: -145
close: -160
run open: 11.5
run close: 12
visitor: CHI White Sox
visitor score: 11
home: CLE Indians
home score: 4
margin: 7
total: 15


Thanks for any help, experts!

sapbucket
 
ozo commented:
Looking at your example again, it seems you want to ignore <font> tags, and include <a> tags.  Are there any other tags you want to ignore or include?

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;
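my @p;   # pieces collected for the <td> currently being read
my @c;   # finished cell contents, one entry per <td>
my $p = HTML::Parser->new(api_version => 3,
     start_h => [\&start_handler, "self,tagname,text"],
     end_h => [\&end_handler, "self,tagname,text"],
     );
sub start_handler{
    my($self, $tag, $text) = @_;
    if( $tag eq "td" ){
       # new cell: start accumulating decoded text events in @p
       @p=();
       $self->handler(text => \@p, "dtext");
   }else{
       # the only other reported tag is <a>; keep its source text verbatim
       push @p,[$text];
   }
}
sub end_handler{
    my($self, $tag, $text) = @_;
    if( $tag eq "td" ){
       # cell finished: stop collecting text and store the joined pieces
       $self->handler(text => '');
       push @c,join'',map{@$_}@p;
       @p=();
   }else{
       # keep the closing </a> verbatim as well
       push @p,[$text];
   }
}
# only <td> and <a> tags reach the handlers; <font>, <tr>, <hr> are skipped
$p->report_tags(qw(td a));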
$p->parse(<<END
    <tr>
          <td><font size="2">6/14/2000</font></td>
          <td align="center"><font size="2">-125</font></td>
          <td align="center"><font size="2">-120</font></td>
          <td align="center"><font size="2">(919)</font></td>
          <td><font size="2">SEA Mariners</font></td>
          <td align="center"><font size="2">4</font></td>
          <td align="center"></td>
          <td align="center"><font size="2" color="red">1</font></td>
     </tr>
     <tr>
          <td><font size="2">11:05 AM</font></td>
          <td align="center"><font size="2">10½ov</font></td>
          <td align="center"><font size="2">10½ov</font></td>
          <td align="center"><font size="2">(920)</font></td>
          <td><font size="2">KC Royals</font></td>
          <td align="center"><font size="2">5</font></td>
          <td><font size="2">Final© [<a href="Game.Asp?ID=29573">Details</a>]</font></td>
          <td align="center"><font size="2" color="red">9</font></td>
     </tr>
     <tr>
          <td colspan="8"><font size="1"><hr></font></td>
     </tr>
END
);
$p->eof;   # signal end of document and flush any buffered text
print join",",@c;
print "\n";
 
kandura commented:
doesn't

    s/½/.5/g;

work for you then?
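
A minimal sketch of how that substitution might be applied to a single cell value. It assumes the script and its input are both UTF-8 (hence `use utf8`); the helper name `normalize_total` and the stripping of the `ov`/`un` suffix are illustrative assumptions, not part of the suggestion above:

```perl
use strict;
use warnings;
use utf8;                          # this source file contains a literal ½

# Hypothetical helper: turn a cell such as "10½ov" into "10.5".
sub normalize_total {
    my ($cell) = @_;
    $cell =~ s/½/.5/g;             # the substitution suggested above
    $cell =~ s/(?:ov|un)\z//;      # assumption: drop the over/under suffix
    return $cell;
}

binmode STDOUT, ':encoding(UTF-8)';
print normalize_total($_), "\n" for '10½ov', '12un', '-125';
```

If the page is read from a file, the substitution only matches when the file is decoded with the same encoding the page was saved in, so that is worth checking first if the plain `s/½/.5/g` appears to do nothing.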
 
sapbucket (author) commented:
How do I grab what is in between the tags on a per line basis?

output like this would work great for me:
6/14/2000,-125,-120,(919),SEA Mariners,4,1,11:05 AM,10½ov,10½ov,(920),KC Royals,5,Final© [<a href="Game.Asp?ID=29573">Details</a>],9

How can I go through each line, grab what is in between the tags, and output it as a CSV string to a file?

This seems like such a typical problem. My brain hurts when I look at it. :)


    <tr>
          <td><font size="2">6/14/2000</font></td>
          <td align="center"><font size="2">-125</font></td>
          <td align="center"><font size="2">-120</font></td>
          <td align="center"><font size="2">(919)</font></td>
          <td><font size="2">SEA Mariners</font></td>
          <td align="center"><font size="2">4</font></td>
          <td align="center"></td>
          <td align="center"><font size="2" color="red">1</font></td>
     </tr>
     <tr>
          <td><font size="2">11:05 AM</font></td>
          <td align="center"><font size="2">10½ov</font></td>
          <td align="center"><font size="2">10½ov</font></td>
          <td align="center"><font size="2">(920)</font></td>
          <td><font size="2">KC Royals</font></td>
          <td align="center"><font size="2">5</font></td>
          <td><font size="2">Final© [<a href="Game.Asp?ID=29573">Details</a>]</font></td>
          <td align="center"><font size="2" color="red">9</font></td>
     </tr>
     <tr>
          <td colspan="8"><font size="1"><hr></font></td>
     </tr>
 
ozo commented:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;
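my @p;   # text pieces collected for the <td> currently being read
my @c;   # finished cell contents, one entry per <td>
my $p = HTML::Parser->new(api_version => 3,
     start_h => [\&start_handler, "self,tagname"],
     end_h => [\&end_handler, "self,tagname"],
     );
sub start_handler{
    my($self, $tag) = @_;
    if( $tag eq "td" ){
       # new cell: start accumulating decoded text events in @p
       @p=();
       $self->handler(text => \@p, "dtext");
    }
}
sub end_handler{
    my($self, $tag) = @_;
    if( $tag eq "td" ){
       # cell finished: stop collecting text and store the joined pieces
       $self->handler(text => '');
       push @c,join'',map{@$_}@p;
       @p=();
    }
}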
$p->parse(<<END
    <tr>
          <td><font size="2">6/14/2000</font></td>
          <td align="center"><font size="2">-125</font></td>
          <td align="center"><font size="2">-120</font></td>
          <td align="center"><font size="2">(919)</font></td>
          <td><font size="2">SEA Mariners</font></td>
          <td align="center"><font size="2">4</font></td>
          <td align="center"></td>
          <td align="center"><font size="2" color="red">1</font></td>
     </tr>
     <tr>
          <td><font size="2">11:05 AM</font></td>
          <td align="center"><font size="2">10½ov</font></td>
          <td align="center"><font size="2">10½ov</font></td>
          <td align="center"><font size="2">(920)</font></td>
          <td><font size="2">KC Royals</font></td>
          <td align="center"><font size="2">5</font></td>
          <td><font size="2">Final© [<a href="Game.Asp?ID=29573">Details</a>]</font></td>
          <td align="center"><font size="2" color="red">9</font></td>
     </tr>
     <tr>
          <td colspan="8"><font size="1"><hr></font></td>
     </tr>
END
);
$p->eof;   # signal end of document and flush any buffered text
print join",",@c;
print "\n";
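
A rough sketch of how the flat cell list might be turned into the labeled output from the original question. Everything here is an assumption layered on the sample data: the helper names `run_total` and `format_game` are hypothetical, each game is taken to be two rows of 8 cells, and the money line is taken to be whichever row has a signed integer in its second cell. The separator row with the `<hr>` would contribute one more empty cell per game in the parser output, which would need to be skipped when slicing the list into games.

```perl
use strict;
use warnings;
use utf8;                          # this source file contains literal ½ and ©

# Hypothetical helper: "10½ov" -> "10.5", "12un" -> "12".
sub run_total {
    my ($cell) = @_;
    $cell =~ s/½/.5/;
    $cell =~ s/(?:ov|un)\z//;
    return $cell;
}

# Takes the 16 cells of one game (two 8-cell rows), returns labeled lines.
sub format_game {
    my @top    = @_[0 .. 7];
    my @bottom = @_[8 .. 15];
    # Assumption: the row whose second cell is a signed integer holds the
    # money line; the other row holds the run total.
    my ($line, $total) = $top[1] =~ /^[+-]\d+$/ ? (\@top, \@bottom)
                                                 : (\@bottom, \@top);
    return (
        "date: $top[0]",
        "time: " . lc($bottom[0]),
        "open: $line->[1]",
        "close: $line->[2]",
        "run open: " . run_total($total->[1]),
        "run close: " . run_total($total->[2]),
        "visitor: $top[4]",
        "visitor score: $top[5]",
        "home: $bottom[4]",
        "home score: $bottom[5]",
        "margin: $top[7]",
        "total: $bottom[7]",
    );
}

# Cells of the second sample game, as the parser above would collect them.
my @game = ('6/14/2000', '11½ov', '12un', '(921)', 'CHI White Sox', '11', '', '7',
            '4:05 PM', '-145', '-160', '(922)', 'CLE Indians', '4',
            'Final© [Details]', '15');
binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for format_game(@game);
```

The field positions are read off the sample rows only, so they would need checking against the real page before relying on them.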
 
sapbucket (author) commented:
ozo, thanks for the code. Is this a useful idiom for parsing, or is this a one-shot design? Seems like I can toolbox this one...
 
ozo commented:
HTML::Parser is very flexible, but its applicability to other designs would obviously depend on what those other designs are.
 
sapbucket (author) commented:
from CPAN:

"XML::Dumper dumps Perl data to XML format. XML::Dumper can also read XML data that was ***previously dumped*** by the module and convert it back to Perl."


I am concerned about "previously dumped". The XML source is from a .NET environment, not XML::Dumper. I use the XMLDocumentReader class to read and write XML files in VB.NET. They are not too complicated. Any experience with using XML::Dumper with XML files *not* generated by XML::Dumper?

The way it is stated on CPAN, I would think it wouldn't work with XML generated by anything other than XML::Dumper.

 
ozo commented:
use XML::Simple;

my $ref = XMLin([<xml file or string>] [, <options>]);

my $xml = XMLout($hashref [, <options>]);
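
A small sketch of what that synopsis could look like in use, assuming XML::Simple is installed from CPAN (it is not a core module). The XML document below is made up for illustration and stands in for a file written by some other tool; `NoAttr` and `RootName` are XML::Simple options:

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin XMLout);   # CPAN module, not in the Perl core

# Hypothetical document; it was not produced by XML::Dumper or by Perl.
my $xml = <<'END';
<game>
  <visitor>SEA Mariners</visitor>
  <home>KC Royals</home>
</game>
END

# XMLin accepts a file name or a string of XML; by default the root
# element is dropped and child elements become hash keys.
my $ref = XMLin($xml);
print "$ref->{visitor} at $ref->{home}\n";

# Write the structure back out as nested elements under a <game> root.
print XMLout($ref, NoAttr => 1, RootName => 'game');
```

The shape of the hash that `XMLin` returns depends on its options and on the document, so it is worth inspecting with Data::Dumper before writing code against it.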
Question has a verified solution.