• Status: Solved

How to parse a stubborn HTML page with a funny symbol?

Hello,

  I have an HTML page that I want to parse, but I cannot get my regexes to do it. I'm a newbie when it comes to regex. How can I change the ½ symbol into the number .5? Here is a sample of the input and the output I want:

input file: (this same format keeps repeating N times)
    <tr>
            <td><font size="2">6/14/2000</font></td>
            <td align="center"><font size="2">-125</font></td>
            <td align="center"><font size="2">-120</font></td>
            <td align="center"><font size="2">(919)</font></td>
            <td><font size="2">SEA Mariners</font></td>
            <td align="center"><font size="2">4</font></td>
            <td align="center"></td>
            <td align="center"><font size="2" color="red">1</font></td>
      </tr>
      <tr>
            <td><font size="2">11:05 AM</font></td>
            <td align="center"><font size="2">10½ov</font></td>
            <td align="center"><font size="2">10½ov</font></td>
            <td align="center"><font size="2">(920)</font></td>
            <td><font size="2">KC Royals</font></td>
            <td align="center"><font size="2">5</font></td>
            <td><font size="2">Final© [<a href="Game.Asp?ID=29573">Details</a>]</font></td>
            <td align="center"><font size="2" color="red">9</font></td>
      </tr>
      <tr>
            <td colspan="8"><font size="1"><hr></font></td>
      </tr>
      
            
   
    <tr>
            <td><font size="2">6/14/2000</font></td>
            <td align="center"><font size="2">11½ov</font></td>
            <td align="center"><font size="2">12un</font></td>
            <td align="center"><font size="2">(921)</font></td>
            <td><font size="2">CHI White Sox</font></td>
            <td align="center"><font size="2">11</font></td>
            <td align="center"></td>
            <td align="center"><font size="2" color="red">7</font></td>
      </tr>
      <tr>
            <td><font size="2">4:05 PM</font></td>
            <td align="center"><font size="2">-145</font></td>
            <td align="center"><font size="2">-160</font></td>
            <td align="center"><font size="2">(922)</font></td>
            <td><font size="2">CLE Indians</font></td>
            <td align="center"><font size="2">4</font></td>
            <td><font size="2">Final© [<a href="Game.Asp?ID=29577">Details</a>]</font></td>
            <td align="center"><font size="2" color="red">15</font></td>
      </tr>
      <tr>
            <td colspan="8"><font size="1"><hr></font></td>
      </tr>



sample output:

date: 6/14/2000
time: 11:05 am
open: -125
close: -120
run open: 10.5
run close: 10.5
visitor: SEA Mariners
visitor score: 4
home: KC Royals
home score: 5
margin: 1
total: 9

date: 6/14/2000
time: 4:05 pm
open: -145
close: -160
run open: 11.5
run close: 12
visitor: CHI White Sox
visitor score: 11
home: CLE Indians
home score: 4
margin: 7
total: 15


Thanks for any help, experts!

sapbucket
 
kandura Commented:
Doesn't

    s/½/.5/g;

work for you then?
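
A minimal sketch of that substitution applied to the cell values from the question. The helper name `clean_line_value` is hypothetical, and stripping the trailing `ov`/`un` is an assumption based on the sample output ("10½ov" becoming 10.5). The script also assumes its own source is saved as UTF-8 and that any text read from a file has been decoded to characters first; the pattern only matches when the pattern and the data agree on encoding.

```perl
use strict;
use warnings;
use utf8;                                # this source file contains a literal ½
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: turn a cell such as "10½ov" into "10.5".
sub clean_line_value {
    my ($cell) = @_;
    $cell =~ s/½/.5/g;             # the substitution suggested above
    $cell =~ s/\s*(?:ov|un)\z//;   # assumption: drop the over/under suffix
    return $cell;
}

# Prints 10.5, 12 and -125, one per line.
print clean_line_value($_), "\n" for '10½ov', '12un', '-125';
```

When the page comes from a file, opening it with an explicit layer such as `'<:encoding(UTF-8)'` (or whatever encoding the page really uses) keeps the ½ as a single character.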
 
sapbucket (Author) Commented:
How do I grab what is between the tags on a per-line basis?

output like this would work great for me:
6/14/2000,-125,-120,(919),SEA Mariners,4,1,11:05 AM,10½ov,10½ov,(920),KC Royals,5,Final© [<a href="Game.Asp?ID=29573">Details</a>],9

How can I go through each line, grab what is in between the tags, and output it as a CSV string to a file?

This seems like such a typical problem. My brain hurts when I look at it. :)


    <tr>
          <td><font size="2">6/14/2000</font></td>
          <td align="center"><font size="2">-125</font></td>
          <td align="center"><font size="2">-120</font></td>
          <td align="center"><font size="2">(919)</font></td>
          <td><font size="2">SEA Mariners</font></td>
          <td align="center"><font size="2">4</font></td>
          <td align="center"></td>
          <td align="center"><font size="2" color="red">1</font></td>
     </tr>
     <tr>
          <td><font size="2">11:05 AM</font></td>
          <td align="center"><font size="2">10½ov</font></td>
          <td align="center"><font size="2">10½ov</font></td>
          <td align="center"><font size="2">(920)</font></td>
          <td><font size="2">KC Royals</font></td>
          <td align="center"><font size="2">5</font></td>
          <td><font size="2">Final© [<a href="Game.Asp?ID=29573">Details</a>]</font></td>
          <td align="center"><font size="2" color="red">9</font></td>
     </tr>
     <tr>
          <td colspan="8"><font size="1"><hr></font></td>
     </tr>
 
ozo Commented:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;
my @p;   # text chunks accumulated for the <td> currently being read
my @c;   # finished cell values, in document order
my $p = HTML::Parser->new(api_version => 3,
     start_h => [\&start_handler, "self,tagname"],
     end_h => [\&end_handler, "self,tagname"],
     );
# On <td>: start accumulating decoded text into @p.
sub start_handler{
    my($self, $tag) = @_;
    if( $tag eq "td" ){
       @p=();
       $self->handler(text => \@p, "dtext");
    }
}
# On </td>: stop accumulating and store the joined text as one cell.
sub end_handler{
    my($self, $tag) = @_;
    if( $tag eq "td" ){
       $self->handler(text => '');
       push @c,join'',map{@$_}@p;
       @p=();
    }
}
$p->parse(<<END
    <tr>
          <td><font size="2">6/14/2000</font></td>
          <td align="center"><font size="2">-125</font></td>
          <td align="center"><font size="2">-120</font></td>
          <td align="center"><font size="2">(919)</font></td>
          <td><font size="2">SEA Mariners</font></td>
          <td align="center"><font size="2">4</font></td>
          <td align="center"></td>
          <td align="center"><font size="2" color="red">1</font></td>
     </tr>
     <tr>
          <td><font size="2">11:05 AM</font></td>
          <td align="center"><font size="2">10½ov</font></td>
          <td align="center"><font size="2">10½ov</font></td>
          <td align="center"><font size="2">(920)</font></td>
          <td><font size="2">KC Royals</font></td>
          <td align="center"><font size="2">5</font></td>
          <td><font size="2">Final© [<a href="Game.Asp?ID=29573">Details</a>]</font></td>
          <td align="center"><font size="2" color="red">9</font></td>
     </tr>
     <tr>
          <td colspan="8"><font size="1"><hr></font></td>
     </tr>
END
);
$p->eof;   # signal end of input so any buffered text is flushed
print join",",@c;
print "\n";
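
The flat list collected in @c can be regrouped into the labelled record shown in the question. The following is only a sketch under stated assumptions: each game is two rows of 8 cells followed by one empty cell from the <hr> row, the over/under line is recognised by its `ov`/`un` suffix, and the other row is taken to hold the open/close prices. The names `run_value`, `games_from_cells`, and `@LABELS` are hypothetical, and the sample list stands in for what the script above would collect.

```perl
use strict;
use warnings;
use utf8;                              # this source contains literal ½ and ©
binmode STDOUT, ':encoding(UTF-8)';

# Labels in the order used by the sample output in the question.
my @LABELS = ('date', 'time', 'open', 'close', 'run open', 'run close',
              'visitor', 'visitor score', 'home', 'home score',
              'margin', 'total');

# Hypothetical helper: "10½ov" -> "10.5", "12un" -> "12".
sub run_value {
    my ($v) = @_;
    $v =~ s/½/.5/g;
    $v =~ s/(?:ov|un)\z//;
    return $v;
}

# Hypothetical helper: regroup a flat cell list into one hash per game.
sub games_from_cells {
    my @cells = @_;
    my @games;
    while (@cells >= 16) {
        my @top    = splice @cells, 0, 8;            # date row (visitor)
        my @bottom = splice @cells, 0, 8;            # time row (home)
        shift @cells if @cells && $cells[0] eq '';   # <hr> separator cell

        # Assumption: whichever row carries "ov"/"un" holds the run line.
        my ($run, $money) = $top[1] =~ /(?:ov|un)\z/
                          ? (\@top, \@bottom) : (\@bottom, \@top);
        push @games, {
            'date'          => $top[0],
            'time'          => lc($bottom[0]),
            'open'          => $money->[1],
            'close'         => $money->[2],
            'run open'      => run_value($run->[1]),
            'run close'     => run_value($run->[2]),
            'visitor'       => $top[4],
            'visitor score' => $top[5],
            'home'          => $bottom[4],
            'home score'    => $bottom[5],
            'margin'        => $top[7],
            'total'         => $bottom[7],
        };
    }
    return @games;
}

# Sample cells for the first game, in the order the parser emits them.
my @c = ('6/14/2000', '-125', '-120', '(919)', 'SEA Mariners', '4', '', '1',
         '11:05 AM', '10½ov', '10½ov', '(920)', 'KC Royals', '5',
         'Final© [Details]', '9', '');
for my $game (games_from_cells(@c)) {
    print "$_: $game->{$_}\n" for @LABELS;
    print "\n";
}
```

If the real page ever has rows with a different number of cells, the fixed 8/8/1 grouping would need to be replaced by tracking <tr> boundaries in the parser.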
 
ozo Commented:
Looking at your example again, it seems you want to ignore <font> tags, and include <a> tags.  Are there any other tags you want to ignore or include?

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;
my @p;
my @c;
my $p = HTML::Parser->new(api_version => 3,
     start_h => [\&start_handler, "self,tagname,text"],
     end_h => [\&end_handler, "self,tagname,text"],
     );
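sub start_handler{
    my($self, $tag) = @_;
    if( $tag eq "td" ){
       @p=();
       $self->handler(text => \@p, "dtext");
    }else{
       push @p,[$_[2]];   # keep the raw <a ...> tag text inside the cell
    }
}
sub end_handler{
    my($self, $tag) = @_;
    if( $tag eq "td" ){
       $self->handler(text => '');
       push @c,join'',map{@$_}@p;
       @p=();
    }else{
       push @p,[$_[2]];   # keep the raw </a> tag text inside the cell
    }
}
# Only <td> and <a> produce start/end events; other tags such as <font> are skipped.
$p->report_tags(qw(td a));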
$p->parse(<<END
    <tr>
          <td><font size="2">6/14/2000</font></td>
          <td align="center"><font size="2">-125</font></td>
          <td align="center"><font size="2">-120</font></td>
          <td align="center"><font size="2">(919)</font></td>
          <td><font size="2">SEA Mariners</font></td>
          <td align="center"><font size="2">4</font></td>
          <td align="center"></td>
          <td align="center"><font size="2" color="red">1</font></td>
     </tr>
     <tr>
          <td><font size="2">11:05 AM</font></td>
          <td align="center"><font size="2">10½ov</font></td>
          <td align="center"><font size="2">10½ov</font></td>
          <td align="center"><font size="2">(920)</font></td>
          <td><font size="2">KC Royals</font></td>
          <td align="center"><font size="2">5</font></td>
          <td><font size="2">Final© [<a href="Game.Asp?ID=29573">Details</a>]</font></td>
          <td align="center"><font size="2" color="red">9</font></td>
     </tr>
     <tr>
          <td colspan="8"><font size="1"><hr></font></td>
     </tr>
END
);
$p->eof;   # signal end of input so any buffered text is flushed
print join",",@c;
print "\n";
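
One caveat with a plain join on ",": once the <a> tags are kept, a cell contains double quotes, and a cell could also contain a comma. A minimal sketch of quoting the fields first, using only core Perl; the helper name `csv_line` is hypothetical and the quoting follows the usual CSV convention of doubling embedded quotes. The Text::CSV module on CPAN is the more robust choice if installing modules is an option.

```perl
use strict;
use warnings;
use utf8;                              # this source contains a literal ©
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: format one record as a CSV line, quoting any field
# that contains a comma, a double quote, or a line break.
sub csv_line {
    my @fields = @_;                   # copy, so the caller's data is untouched
    for my $f (@fields) {
        if ($f =~ /[",\r\n]/) {
            $f =~ s/"/""/g;            # double any embedded quotes
            $f = qq{"$f"};
        }
    }
    return join(',', @fields) . "\n";
}

my @row = ('6/14/2000', '-125',
           'Final© [<a href="Game.Asp?ID=29573">Details</a>]', '9');
print csv_line(@row);
# 6/14/2000,-125,"Final© [<a href=""Game.Asp?ID=29573"">Details</a>]",9
```

To write to a file instead of the screen, open a handle with `open my $out, '>', $path or die $!;` and print the lines to `$out`.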
 
sapbucket (Author) Commented:
ozo, thanks for the code. Is this a useful idiom for parsing, or is it a one-shot design? Seems like I can toolbox this one...
 
ozo Commented:
HTML::Parser is very flexible, but its applicability to other designs would obviously depend on what those other designs are.
 
sapbucket (Author) Commented:
from CPAN:

"XML::Dumper dumps Perl data to XML format. XML::Dumper can also read XML data that was ***previously dumped*** by the module and convert it back to Perl."


I am concerned about "previously dumped". The XML source is from a .NET environment, not XML::Dumper. I use the XMLDocumentReader class to read and write XML files in VB.NET. They are not too complicated. Does anyone have experience using XML::Dumper with XML files *not* generated by XML::Dumper?

The way it is stated on CPAN, I would think it wouldn't work with XML generated by anything other than XML::Dumper.

 
ozo Commented:
use XML::Simple;

my $ref = XMLin([<xml file or string>] [, <options>]);

my $xml = XMLout($hashref [, <options>]);