mock5c


HTML::Parser to strip tags or delete content

I am trying to use HTML::Parser to read an HTML file and either remove specific tags together with their content, or just strip off the tags and leave the content.  HTML-style comments also need to be removed.

For example, <div id=topTab> topTab content </div> should be removed
<div id=keep> keep this content </div> should only strip off the tags.
This is my <b>body</b> would also strip off the tags and result in "This is my body".

I've posted my code.  I'm not quite getting the result I need.  For example, the <div id=keep> tag is deleted.  Also, if I move any of the div tags or ul tag below the <other> tag, then I get no output at all.  I'm setting $flag so I can keep track if I'm inside a tag that needs to be deleted.  I'm not sure that's the proper way to do that.



#!/usr/bin/perl

#-----------------
# Required modules
#-----------------
use strict;
use warnings;
use HTML::Parser;nt

my $html;
my $flag = 0;

my $p = HTML::Parser->new(
    'api_version' => 3,
    'start_h'    => [ \&read_tag, 'self, tagname, attr, event, text' ],
    'default_h'  => [ \&parse_tag, 'self, text' ],
    'end_h'      => [ \&read_tag, 'self, tagname, attr, event, text' ],
);
$p->parse( do { local $/; <DATA> } );
$p->eof();

$html =~ s#<.+?>##g; # strip comments
print $html;


#----------------------
# S U B R O U T I N E S 
#----------------------
sub read_tag{
   my ($self, $tagname, $attr, $event, $origtext) = @_;

   if($event eq 'start'){
      if(!$flag){
         if($tagname eq 'div' and $attr->{id} eq 'topTab'){ $flag = 1; }
         if($tagname eq 'div' and $attr->{id} eq 'nav'){ $flag = 1; }
         if($tagname eq 'div' and $attr->{id} eq 'footer'){ $flag = 1; }
         if($tagname eq 'ul'  and $attr->{id} eq 'sectionNav'){ $flag = 1; }
      }
   }
   elsif($event eq 'end'){
      if($flag){
         if($tagname eq 'div'){ $flag = 0; }
         if($tagname eq 'ul'){ $flag = 0; }
      }
   }
}

sub parse_tag{
   my ($self, $origtext) = @_;

   if($flag){
      $html = "";
      $flag = 0;
   }
   else{
      $html .= $origtext;
   }
}


__DATA__
<html>
This is HTML
<body> 
<div id="topTab"> topTab content </div><!---end topTab--->
<div id="nav"> nav content </div> <!--end nav div-->
<div id="keep"> keep this content </div> <!-- end keep div -->
<div id=footer> footer content </div>
<ul id="sectionNav"> section nav </ul>
   This is my <b>body</b>.
<other>This is some other content</other>
</body>
</html>


Robert Schutt

I'm not sure of the whole logic of this but for starters, try removing line 52. That seems to return the output you want (or closer to it). This code is called repeatedly and resetting your global $html variable at that point seems counterproductive.
mock5c

ASKER

(I noticed my typo in line 8.  That "nt" at the end of the line must have crept in there when I was pasting text)


My problem is that the HTML::Parser documentation is a little confusing for me.

You are correct that when I comment out line 52, I seem to get the output I want.  At least it works for the little bit of test data at the bottom of the file.  However, when I run this on a real case, I don't get the output I want.  For example, take the source of this Experts Exchange page.  There is a tag:
<div id="uberContainer"> ... </div>

if I were to define the line:

if($tagname eq 'div'  and $attr->{id} eq 'uberContainer'){ $flag = 1; }

Then I would want that tag plus the content between the two div tags to be removed.  When I run this script on that html, it does remove the tags themselves but the content remains so I still see "My Account", "Log Out", etc.

Does anyone know why this is happening?

I'm not sure, but instead of always executing
if($tagname eq 'div'){ $flag = 0; }


wouldn't you need some kind of 'stack' (or at least a counter to know how many nested div tags have been opened) to determine when the flag can be reset?
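
To make the counter idea concrete, here is a minimal sketch, not a definitive implementation. The helper name strip_html and the %$remove_ids hash are my own assumptions for illustration; the constructor options and argspecs are standard HTML::Parser v3. It counts only tags with the same name as the element being removed, so tags without an end tag (such as <br>) do not upset the count, and it never resets the output: text is simply not appended while the counter is non-zero. No comment or default handler is registered, so comments are ignored.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns the text of $source with all tags stripped,
# and with the content of any element whose id is a key of %$remove_ids dropped.
sub strip_html {
    my ($source, $remove_ids) = @_;
    my $out   = '';
    my $skip_tag;     # tag name of the element currently being removed
    my $depth = 0;    # number of open elements with that name

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tagname, $attr) = @_;
            if ($depth) {
                # Already inside a removed element: count nested same-name tags.
                $depth++ if $tagname eq $skip_tag;
            }
            elsif (defined $attr->{id} and $remove_ids->{ $attr->{id} }) {
                $skip_tag = $tagname;
                $depth    = 1;
            }
        }, 'tagname, attr' ],
        end_h => [ sub {
            my ($tagname) = @_;
            $depth-- if $depth and $tagname eq $skip_tag;
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            $out .= $text unless $depth;    # keep text only outside removed elements
        }, 'text' ],
    );
    $p->parse($source);
    $p->eof;
    return $out;
}

my $sample = '<div id="nav">a<div>b</div>c</div><div id="keep">keep <b>this</b></div>';
print strip_html($sample, { nav => 1 }), "\n";
```

Matching on class instead of id would need an extra check in the start handler; I left that out to keep the sketch short.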
mock5c

ASKER

Yes, it looks like I need to implement a way to handle nested tags.  I have also run into a problem with so-called empty tags, e.g.

<input type="hidden" name="cx" value="somevalue" />

The start tag is "input" and there is no end tag so it is difficult to handle this.  In the documentation for HTML::Parser, I found the method $p->empty_element_tags( $bool )

I inserted the line $p->empty_element_tags(1) before the $p->parse() call at line 19, but I still don't seem to get the "end" event for that empty tag.  I assume I am using it incorrectly.  The documentation is not clear on how to use the empty_element_tags method, and I can't find any examples on the web.

Do you have any suggestions?
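
For reference, a small sketch of how empty_element_tags can be checked in isolation. The helper name tag_events is an assumption made up for this example. According to the HTML::Parser documentation, enabling the attribute makes a tag that ends in "/>" trigger an artificial end event in addition to the start event, so the first call should report both a start and an end for input, and the second only the start.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns the start/end events seen while parsing $source.
sub tag_events {
    my ($source, $recognize_empty) = @_;
    my @events;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub { push @events, "start $_[0]" }, 'tagname' ],
        end_h   => [ sub { push @events, "end $_[0]" },   'tagname' ],
    );
    # Must be set before parse() is called to affect the parse.
    $p->empty_element_tags($recognize_empty);
    $p->parse($source);
    $p->eof;
    return @events;
}

my $tag = '<input type="hidden" name="cx" value="somevalue" />';
print "enabled:  ", join(', ', tag_events($tag, 1)), "\n";
print "disabled: ", join(', ', tag_events($tag, 0)), "\n";
```

Note that this only helps for tags actually written with "/>". A plain <input type="hidden"> still produces no end event, so any nesting count has to tolerate start tags that are never closed.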
mock5c

ASKER

The $p->empty_element_tags option is no longer a problem; it is working fine for me.  However, my problem now goes back to $html being reset.  If the content I want to keep appears before a tag I want to remove, then everything is deleted.  This makes sense given the reset.  Here is my updated code that takes care of embedded tags (I didn't say it was pretty).  How can I have this truly remove only the content that I want to remove and keep the rest, i.e. not set $html = "" like I'm doing?

#!/usr/bin/perl

#-----------------
# Required modules
#-----------------
use strict;
use warnings;
use HTML::Parser;

my $filename = shift;
my $html;
my $start = 0;
my $end = 0;

open(DATA, "$filename");

my $p = HTML::Parser->new(
    'api_version' => 3,
    'empty_element_tags' => 1,
    'start_h'    => [ \&read_tag, 'self, tagname, attr, event, text' ],
    'default_h'  => [ \&parse_tag, 'self, text' ],
    'end_h'      => [ \&read_tag, 'self, tagname, attr, event, text' ],
    #'comment_h'  => [ \&parse_tag, 'self, tagname, attr, event, text' ],
);
#$p->empty_element_tags(1);
$p->parse( do { local $/; <DATA> } );
$p->eof();
close(DATA);

$html =~ s#<.+?>##g; # strip comments
print $html;

#----------------------
# S U B R O U T I N E S
#----------------------
sub read_tag{
   my ($self, $tagname, $attr, $event, $origtext) = @_;

   if($event eq 'start'){

      # We are not currently in a removable tag.
      if(!$start){
         if($attr->{id}){
            if($tagname eq 'div' and $attr->{id} eq 'topTab'){ $start++; }
            if($tagname eq 'div' and $attr->{id} eq 'nav'){ $start++; }
            #if($tagname eq 'div' and $attr->{id} eq 'footer') { $start++; }
            if($tagname eq 'ul'  and $attr->{id} eq 'sectionNav'){ $start++; }
         }
         if($attr->{class}){
            if($tagname eq 'div' and $attr->{class} eq 'additionalSection'){ $start++; }
         }
      }
      else{
         # We have already encountered a removable tag so we need to increment for embedded
         $start++;
      }
   }
   elsif($event eq 'end' and $start > 0){
      $end++;
   }
}

sub parse_tag{
   my ($self, $origtext) = @_;

   if($start and $start==$end){
      $html = "";
      $start = $end = 0;
   }
   else{
      $html .= $origtext;
   }
}


I'm not getting a grip on the structure yet. What I usually do to debug something like this, when I don't know the exact structure, is add a print statement, for example as the second line of the parse_tag sub:
print "$start/$end/*$origtext*\n";


Then try to figure out why it's not showing some of the text you want to keep. Hasn't worked yet but seeing how you have already made some changes in the right direction maybe this helps you to solve it yourself?
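
A variation on the same debugging idea, as a rough sketch: route every event through a single default handler and record the event name next to the source text. The helper name trace_events is an assumption for this example. It makes it visible that a default handler also receives comments (and any other event without its own handler), which is why comment text ends up in the output unless it is handled separately.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns one "event: *text*" line per parser event.
sub trace_events {
    my ($source) = @_;
    my @lines;
    my $p = HTML::Parser->new(
        api_version => 3,
        # No specific handlers are set, so every event falls through to default_h.
        default_h => [ sub {
            my ($event, $text) = @_;
            $text = '' unless defined $text;
            push @lines, "$event: *$text*";
        }, 'event, text' ],
    );
    $p->parse($source);
    $p->eof;
    return @lines;
}

print "$_\n" for trace_events('<div id="keep">keep<!-- note --></div>');
```

Running this on a small slice of the real page should show which event each piece of text arrives under, and in which order.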
ASKER CERTIFIED SOLUTION
mock5c

This solution is only available to members.
Great!

Well, looking at your code the structure still seemed strange to me at first but when I stripped out all parts that aren't used (sub default_tag and var $html) I get the feeling that it's closer to what I had been thinking. Only I was trying more in the line of incrementing and decrementing a $level but it seems to do exactly the same as your $open and $close. Sorry I couldn't help with actual code but maybe you still found my hints helpful.
mock5c

ASKER

This was my own solution.  I was able to figure out how to solve this problem.
mock5c

ASKER

I have posted my final code.  I used $open and $close flags.