Solved

HTML::Parser to strip tags or delete content

Posted on 2013-01-06
10
423 Views
Last Modified: 2013-08-29
I am trying to use HTML::Parser to read an html file and remove specific tags and content or just strip off tags leaving content.  HTML style comments also need to be removed.t

For example, <div id=topTab> topTab content </div> should be removed
<div id=keep> keep this content </div> should only strip off the tags.
This is my <b>body</b> would also strip off the tags and result in "This is my body".

I've posted my code.  I'm not quite getting the result I need.  For example, the <div id=keep> tag is deleted.  Also, if I move any of the div tags or ul tag below the <other> tag, then I get no output at all.  I'm setting $flag so I can keep track if I'm inside a tag that needs to be deleted.  I'm not sure that's the proper way to do that.



#!/usr/bin/perl

#-----------------
# Required modules
#-----------------
use strict;
use warnings;
use HTML::Parser;nt

my $html;
my $flag = 0;

my $p = HTML::Parser->new(
    'api_version' => 3,
    'start_h'    => [ \&read_tag, 'self, tagname, attr, event, text' ],
    'default_h'  => [ \&parse_tag, 'self, text' ],
    'end_h'      => [ \&read_tag, 'self, tagname, attr, event, text' ],
);
$p->parse( do { local $/; <DATA> } );
$p->eof();

$html =~ s#<.+?>##g; # strip comments
print $html;


#----------------------
# S U B R O U T I N E S 
#----------------------
sub read_tag{
   my ($self, $tagname, $attr, $event, $origtext) = @_;

   if($event eq 'start'){
      if(!$flag){
         if($tagname eq 'div' and $attr->{id} eq 'topTab'){ $flag = 1; }
         if($tagname eq 'div' and $attr->{id} eq 'nav'){ $flag = 1; }
         if($tagname eq 'div' and $attr->{id} eq 'footer'){ $flag = 1; }
         if($tagname eq 'ul'  and $attr->{id} eq 'sectionNav'){ $flag = 1; }
      }
   }
   elsif($event eq 'end'){
      if($flag){
         if($tagname eq 'div'){ $flag = 0; }
         if($tagname eq 'ul'){ $flag = 0; }
      }
   }
}

sub parse_tag{
   my ($self, $origtext) = @_;

   if($flag){
      $html = "";
      $flag = 0;
   }
   else{
      $html .= $origtext;
   }
}


__DATA__
<html>
This is HTML
<body> 
<div id="topTab"> topTab content </div><!---end topTab--->
<div id="nav"> nav content </div> <!--end nav div-->
<div id="keep"> keep this content </div> <!-- end keep div -->
<div id=footer> footer content </div>
<ul id="sectionNav"> section nav </ul>
   This is my <b>body</b>.
<other>This is some other content</other>
</body>
</html>

Open in new window

0
Comment
Question by:mock5c
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 6
  • 4
10 Comments
 
LVL 35

Expert Comment

by:Robert Schutt
ID: 38750807
I'm not sure of the whole logic of this but for starters, try removing line 52. That seems to return the output you want (or closer to it). This code is called repeatedly and resetting your global $html variable at that point seems counterproductive.
0
 

Author Comment

by:mock5c
ID: 38762521
(I noticed my typo in line 8.  That "nt" at the end of the line must have crept in there when I was pasting text)


My problem is that the HTML::Parser documentation is a little confusing for me.

You are correct, that when I comment out line 52, I seem to get the output I want.   At least it is working for the little bit of test data that is at the bottom of the file.  However, when I run this on a real case, I don't seem to be getting the output I want.  For example, take the source of this experts exchange page.  There is a tag:
<div id="uberContainer"> ... </div>

if I were to define the line:

if($tagname eq 'div'  and $attr->{id} eq 'uberContainer'){ $flag = 1; }

Then I would want that tag plus the content between the two div tags to be removed.  When I run this script on that html, it does remove the tags themselves but the content remains so I still see "My Account", "Log Out", etc.

Does anyone know why this is happening?
0
 
LVL 35

Expert Comment

by:Robert Schutt
ID: 38763276
I'm not sure but instead of always executing
if($tagname eq 'div'){ $flag = 0; }

Open in new window

wouldn't you need some kind of 'stack' (or at least a counter to know how many nested div tags have been opened) to determine when the flag can be reset?
0
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 

Author Comment

by:mock5c
ID: 38770022
Yes, it looks like I need to implement a way to handle nested tags.  I have run into a problem with so called empty tags, e.g.

<input type="hidden" name="cx" value="somevalue" />

The start tag is "input" and there is no end tag so it is difficult to handle this.  In the documentation for HTML::Parser, I found the method $p->empty_element_tags( $bool )

I inserted this line $p->empty_element_tags(1) before the $p->parser() line at line 19 but I still don't seem to get into the "end" event for that empty tag.  I assume I am using it incorrectly.  The documentation is not clear on how to use empty_element_tags method and I can't find any examples on the web.

Do you have any suggestions?
0
 

Author Comment

by:mock5c
ID: 38770342
the  $p->empty_element_tags option is no longer a problem.  That is working fine for me.  However, my problem now goes back to $html being reset.  If the content I want to keep appears before a tag I want to remove them I have the problem where everything is being deleted.  This makes sense.  Here is my updated code that takes care of embedded tags (I didn't say it was pretty).  How can I have this truly remove only the content that I want to remove and keep the rest, i.e. not set $html = "" like I'm doing.

#!/usr/bin/perl

#-----------------
# Required modules
#-----------------
use strict;
use warnings;
use HTML::Parser;

my $filename = shift;
my $html;
my $start = 0;
my $end = 0;

open(DATA, "$filename");

my $p = HTML::Parser->new(
    'api_version' => 3,
    'empty_element_tags' => 1,
    'start_h'    => [ \&read_tag, 'self, tagname, attr, event, text' ],
    'default_h'  => [ \&parse_tag, 'self, text' ],
    'end_h'      => [ \&read_tag, 'self, tagname, attr, event, text' ],
    #'comment_h'  => [ \&parse_tag, 'self, tagname, attr, event, text' ],
);
#$p->empty_element_tags(1);
$p->parse( do { local $/; <DATA> } );
$p->eof();
close(DATA);

$html =~ s#<.+?>##g; # strip comments
print $html;

#----------------------
# S U B R O U T I N E S
#----------------------
sub read_tag{
   my ($self, $tagname, $attr, $event, $origtext) = @_;

   if($event eq 'start'){

      # We are not currently in a removable tag.
      if(!$start){
         if($attr->{id}){
            if($tagname eq 'div' and $attr->{id} eq 'topTab'){ $start++; }
            if($tagname eq 'div' and $attr->{id} eq 'nav'){ $start++; }
            #if($tagname eq 'div' and $attr->{id} eq 'footer') { $start++; }
            if($tagname eq 'ul'  and $attr->{id} eq 'sectionNav'){ $start++; }
         }
         if($attr->{class}){
            if($tagname eq 'div' and $attr->{class} eq 'additionalSection'){ $start++; }
         }
      }
      else{
         # We have already encountered a removable tag so we need to increment for embedded
         $start++;
      }
   }
   elsif($event eq 'end' and $start > 0){
      $end++;
   }
}

sub parse_tag{
   my ($self, $origtext) = @_;

   if($start and $start==$end){
      $html = "";
      $start = $end = 0;
   }
   else{
      $html .= $origtext;
   }
}

Open in new window

0
 
LVL 35

Expert Comment

by:Robert Schutt
ID: 38770469
I'm not getting a grip on the structure yet. What I usually do to debug something like this of which I don't know the exact structure is put a print line in, like in the parse_tag sub as second line:
print "$start/$end/*$origtext*\n";

Open in new window

Then try to figure out why it's not showing some of the text you want to keep. Hasn't worked yet but seeing how you have already made some changes in the right direction maybe this helps you to solve it yourself?
0
 

Accepted Solution

by:
mock5c earned 0 total points
ID: 38770983
I think I have if figured out.  This HTML::Parser module has been a little tricky for me, dealing with events.  As it became more clear to me, I had some better ideas on how to handle it.  Here is my final code:

#!/usr/bin/perl

#-----------------
# Required modules
#-----------------
use strict;
use warnings;
use HTML::Parser;


#----------------------------------------------
# Define tags and IDs to remove here.  All of
# these combinations will be removed from the
# html code.  Nested code will also be removed
# if outer tags have any of the defined combos.
#----------------------------------------------
my @remove = (
   [ 'div','id','topTab' ],
   [ 'div','id','nav' ],
   [ 'div','id','footer' ],
   [ 'ul','id','sectionNav' ],
   [ 'div','class','additionalSection'],
   [ 'script','type','text/javascript'],
);


my $filename = shift;
my $html;
my $open = 0;
my $close = 0;

open(DATA, "$filename");

my $p = HTML::Parser->new(
    'api_version' => 3,
    'empty_element_tags' => 1,
    'start_h'    => [ \&parse_tag, 'self, tagname, attr, event, text' ],
    #'default_h'  => [ \&default_tag, 'self, text' ],
    'text_h'     => [ \&parse_tag, 'self, tagname, attr, event, text' ],
    'end_h'      => [ \&parse_tag, 'self, tagname, attr, event, text' ],
    #'comment_h'  => [ \&default_tag, 'self, tagname, attr, event, text' ],
);
$p->parse( do { local $/; <DATA> } );
$p->eof();
close(DATA);

#$html =~ s#<.+?>##g; # strip comments
#print $html;

#----------------------
# S U B R O U T I N E S
#----------------------
sub parse_tag{
   my ($self, $tagname, $attr, $event, $origtext) = @_;

   if($event eq 'start'){

      # We are not currently in a removable tag (open chunk).
      if(!$open){
         foreach my $row(@remove){
            if($tagname eq $row->[0] and $attr->{$row->[1]} and $attr->{$row->[1]} eq $row->[2]){
               $open++;
               last;
            }
         }
      }
      else{
         # We have already encountered a removable tag and are in an open chunk
         # so we need to increment for embedded tags
         $open++;
      }
   }
   elsif($event eq 'end' and $open > 0){
      $close++;
   }
   elsif($event eq 'text'){
      if(!$open){
         print "$origtext";# unless $origtext =~ /^$/;
      }
      elsif($open and $open==$close){
         $html = "";
         $open = $close = 0;
      }
      else{
         # Do nothing
      }
   }
   else{
      # Default case -- do nothing
   }
}

sub default_tag{
   my ($self, $origtext) = @_;

   if($open and $open==$close){
      $html = "";
      $open = $close = 0;
   }
   else{
      $html .= $origtext;
   }
}

Open in new window

0
 
LVL 35

Expert Comment

by:Robert Schutt
ID: 38771081
Great!

Well, looking at your code the structure still seemed strange to me at first but when I stripped out all parts that aren't used (sub default_tag and var $html) I get the feeling that it's closer to what I had been thinking. Only I was trying more in the line of incrementing and decrementing a $level but it seems to do exactly the same as your $open and $close. Sorry I couldn't help with actual code but maybe you still found my hints helpful.
0
 

Author Comment

by:mock5c
ID: 39435872
This was my own solution.  I was able to figure out how to solve this problem.
0
 

Author Closing Comment

by:mock5c
ID: 39448270
I have posted my final code.  I used $open and $close flags.
0

Featured Post

Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Email validation in proper way is  very important validation required in any web pages. This code is self explainable except that Regular Expression which I used for pattern matching. I originally published as a thread on my website : http://www…
In the distant past (last year) I hacked together a little toy that would allow a couple of Manager types to query, preview, and extract data from a number of MongoDB instances, to their tool of choice: Excel (http://dilbert.com/strips/comic/2007-08…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…

749 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question