sapbucket

how to parse stubborn html page with funny symbol?

Hello,

  I have an HTML page that I want to parse, but I cannot get my regexes to do it. I'm a newbie when it comes to regex. How can I change the ½ symbol into the number .5? Here is a sample of the input and the output I want:

input file: (this same format keeps repeating N times)
    <tr>
            <td><font size="2">6/14/2000</font></td>
            <td align="center"><font size="2">-125</font></td>
            <td align="center"><font size="2">-120</font></td>
            <td align="center"><font size="2">(919)</font></td>
            <td><font size="2">SEA Mariners</font></td>
            <td align="center"><font size="2">4</font></td>
            <td align="center"></td>
            <td align="center"><font size="2" color="red">1</font></td>
      </tr>
      <tr>
            <td><font size="2">11:05 AM</font></td>
            <td align="center"><font size="2">10½ov</font></td>
            <td align="center"><font size="2">10½ov</font></td>
            <td align="center"><font size="2">(920)</font></td>
            <td><font size="2">KC Royals</font></td>
            <td align="center"><font size="2">5</font></td>
            <td><font size="2">Final© [<a href="Game.Asp?ID=29573">Details</a>]</font></td>
            <td align="center"><font size="2" color="red">9</font></td>
      </tr>
      <tr>
            <td colspan="8"><font size="1"><hr></font></td>
      </tr>
      
            
   
    <tr>
            <td><font size="2">6/14/2000</font></td>
            <td align="center"><font size="2">11½ov</font></td>
            <td align="center"><font size="2">12un</font></td>
            <td align="center"><font size="2">(921)</font></td>
            <td><font size="2">CHI White Sox</font></td>
            <td align="center"><font size="2">11</font></td>
            <td align="center"></td>
            <td align="center"><font size="2" color="red">7</font></td>
      </tr>
      <tr>
            <td><font size="2">4:05 PM</font></td>
            <td align="center"><font size="2">-145</font></td>
            <td align="center"><font size="2">-160</font></td>
            <td align="center"><font size="2">(922)</font></td>
            <td><font size="2">CLE Indians</font></td>
            <td align="center"><font size="2">4</font></td>
            <td><font size="2">Final© [<a href="Game.Asp?ID=29577">Details</a>]</font></td>
            <td align="center"><font size="2" color="red">15</font></td>
      </tr>
      <tr>
            <td colspan="8"><font size="1"><hr></font></td>
      </tr>



sample output:

date: 6/14/2000
time: 11:05 am
open: -125
close: -120
run open: 10.5
run close: 10.5
visitor: SEA Mariners
visitor score: 4
home: KC Royals
home score: 5
margin: 1
total: 9

date: 6/14/2000
time: 4:05 pm
open: -145
close: -160
run open: 11.5
run close: 12
visitor: CHI White Sox
visitor score: 11
home: CLE Indians
home score: 4
margin: 7
total: 15


Thanks for any help experts!

sapbucket
kandura

doesn't

    s/½/.5/g;

work for you then?
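
If the substitution appears to do nothing, one possible cause is an encoding mismatch: in a UTF-8 file the ½ character is stored as two bytes, so the pattern only matches once the input has been decoded into characters. The sketch below assumes the page is UTF-8 encoded; `half_to_decimal` and the sample bytes are hypothetical names and data used only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Replace every one-half character (U+00BD) with ".5".
sub half_to_decimal {
    my ($text) = @_;
    $text =~ s/\x{BD}/.5/g;
    return $text;
}

# Hypothetical raw bytes, as they might be read from a UTF-8 encoded file:
# the one-half character is the byte pair C2 BD.
my $bytes = "10\xC2\xBDov";

# Decode the bytes into characters before matching.
my $chars = decode('UTF-8', $bytes);

print half_to_decimal($chars), "\n";   # prints 10.5ov
```

If the page is Latin-1 instead, the character is the single byte BD and the same substitution matches without the decode step.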
sapbucket

ASKER

How do I grab what is in between the tags on a per line basis?

output like this would work great for me:
6/14/2000,-125,-120,(919),SEA Mariners,4,1,11:05 AM,10½ov,10½ov,(920),KC Royals,5,Final© [<a href="Game.Asp?ID=29573">Details</a>],9

How can I go through each line, grab what is in between the tags, and output it as a CSV string to a file?

This seems like such a typical problem. My brain hurts when I look at it. :)


    <tr>
          <td><font size="2">6/14/2000</font></td>
          <td align="center"><font size="2">-125</font></td>
          <td align="center"><font size="2">-120</font></td>
          <td align="center"><font size="2">(919)</font></td>
          <td><font size="2">SEA Mariners</font></td>
          <td align="center"><font size="2">4</font></td>
          <td align="center"></td>
          <td align="center"><font size="2" color="red">1</font></td>
     </tr>
     <tr>
          <td><font size="2">11:05 AM</font></td>
          <td align="center"><font size="2">10½ov</font></td>
          <td align="center"><font size="2">10½ov</font></td>
          <td align="center"><font size="2">(920)</font></td>
          <td><font size="2">KC Royals</font></td>
          <td align="center"><font size="2">5</font></td>
          <td><font size="2">Final© [<a href="Game.Asp?ID=29573">Details</a>]</font></td>
          <td align="center"><font size="2" color="red">9</font></td>
     </tr>
     <tr>
          <td colspan="8"><font size="1"><hr></font></td>
     </tr>
ozo
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;
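my @p;    # accumulator for text events inside the current <td>
my @c;    # collected cell contents, one entry per <td>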
my $p = HTML::Parser->new(api_version => 3,
     start_h => [\&start_handler, "self,tagname"],
     end_h => [\&end_handler, "self,tagname"],
     );
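# On <td>, start accumulating decoded text events into @p.
sub start_handler{
    my($self, $tag) = @_;
    if( $tag eq "td" ){
       @p=();
       $self->handler(text => \@p, "dtext");
    }
}
# On </td>, stop collecting text and store the cell's joined text in @c.
sub end_handler{
    my($self, $tag) = @_;
    if( $tag eq "td" ){
       $self->handler(text => '');
       push @c, join('', map { @$_ } @p);   # each event is an array ref holding its dtext
       @p=();
    }
}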
$p->parse(<<END
    <tr>
          <td><font size="2">6/14/2000</font></td>
          <td align="center"><font size="2">-125</font></td>
          <td align="center"><font size="2">-120</font></td>
          <td align="center"><font size="2">(919)</font></td>
          <td><font size="2">SEA Mariners</font></td>
          <td align="center"><font size="2">4</font></td>
          <td align="center"></td>
          <td align="center"><font size="2" color="red">1</font></td>
     </tr>
     <tr>
          <td><font size="2">11:05 AM</font></td>
          <td align="center"><font size="2">10½ov</font></td>
          <td align="center"><font size="2">10½ov</font></td>
          <td align="center"><font size="2">(920)</font></td>
          <td><font size="2">KC Royals</font></td>
          <td align="center"><font size="2">5</font></td>
          <td><font size="2">Final© [<a href="Game.Asp?ID=29573">Details</a>]</font></td>
          <td align="center"><font size="2" color="red">9</font></td>
     </tr>
     <tr>
          <td colspan="8"><font size="1"><hr></font></td>
     </tr>
END
);
$p->eof;    # signal end of input so any buffered text is flushed
print join(",", @c), "\n";
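
The flat list of cells can then be turned into the labelled output from the original question. This is only a rough sketch under stated assumptions: each game is two rows of eight cells, the visitor is on the first row, and the run total is on whichever row carries an "ov" or "un" suffix. The helper names `run_total` and `game_record` are made up for this example.

```perl
use strict;
use warnings;

# Turn a run-line cell such as "10\x{BD}ov" or "12un" into a plain number.
sub run_total {
    my ($cell) = @_;
    $cell =~ s/\x{BD}/.5/;        # U+00BD is the one-half character
    $cell =~ s/(?:ov|un)\z//;     # drop the over/under suffix
    return $cell;
}

# Build one record from the two 8-cell rows of a game (array refs).
sub game_record {
    my ($top, $bottom) = @_;      # assumed: visitor row, then home row
    # The run total is on the row whose second cell ends in ov/un;
    # the other row holds the opening and closing line.
    my ($run, $line) = $top->[1] =~ /(?:ov|un)\z/ ? ($top, $bottom)
                                                  : ($bottom, $top);
    return {
        'date'          => $top->[0],
        'time'          => lc $bottom->[0],
        'open'          => $line->[1],
        'close'         => $line->[2],
        'run open'      => run_total($run->[1]),
        'run close'     => run_total($run->[2]),
        'visitor'       => $top->[4],
        'visitor score' => $top->[5],
        'home'          => $bottom->[4],
        'home score'    => $bottom->[5],
        'margin'        => $top->[7],
        'total'         => $bottom->[7],
    };
}

# Field order taken from the sample output in the question.
my @fields = ('date', 'time', 'open', 'close', 'run open', 'run close',
              'visitor', 'visitor score', 'home', 'home score',
              'margin', 'total');

# Cells of the first sample game, typed in by hand for the demonstration.
my @row1 = ('6/14/2000', '-125', '-120', '(919)', 'SEA Mariners', '4', '', '1');
my @row2 = ('11:05 AM', "10\x{BD}ov", "10\x{BD}ov", '(920)', 'KC Royals', '5',
            'Final [Details]', '9');

my $rec = game_record(\@row1, \@row2);
print "$_: $rec->{$_}\n" for @fields;
```

Checking for the ov/un suffix, rather than relying on a fixed row, covers the second sample game, where the run total is on the visitor's row and the line is on the home row.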
ASKER CERTIFIED SOLUTION
ozo
ozo, thanks for the code. Is this a useful idiom for parsing? Or is this a one-shot design? Seems like I can toolbox this one...
HTML::Parser is very flexible, but its applicability to other designs would obviously depend on what those other designs are.
from CPAN:

"XML::Dumper dumps Perl data to XML format. XML::Dumper can also read XML data that was ***previously dumped*** by the module and convert it back to Perl."


I am concerned about "previously dumped". The XML source is from a .NET environment, not XML::Dumper. I use XMLDocumentReader class to read and write XML files in VB.NET. They are not too complicated. Any experience with using XML::Dumper with xml files *not* generated by XML::Dumper?

The way it is stated on CPAN, I would think it wouldn't work with XML generated by anything other than XML::Dumper.

use XML::Simple;

my $ref = XMLin([<xml file or string>] [, <options>]);

my $xml = XMLout($hashref [, <options>]);
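
As a minimal sketch of the synopsis above, XMLin can be pointed at XML that XML::Simple did not write itself. The XML string here is hypothetical and merely stands in for a file produced by the .NET program; the KeyAttr and ForceArray options are set explicitly so the resulting structure does not depend on the defaults.

```perl
use strict;
use warnings;
use XML::Simple;    # exports XMLin and XMLout

# Hypothetical XML, standing in for a file written by another program.
my $xml = '<settings><server>alpha</server><port>8080</port></settings>';

# Parse the string into a hash reference. By default the root element
# name is discarded and child elements become hash keys.
my $ref = XMLin($xml, KeyAttr => [], ForceArray => 0);

print "server: $ref->{server}\n";
print "port: $ref->{port}\n";
```

Passing a file name instead of a string works the same way. More deeply nested documents come back as nested hashes and arrays, so it may be worth dumping `$ref` with Data::Dumper first to see its shape.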