Solved

HTML::Parser to strip tags or delete content

Posted on 2013-01-06
10
408 Views
Last Modified: 2013-08-29
I am trying to use HTML::Parser to read an html file and remove specific tags and content or just strip off tags leaving content.  HTML style comments also need to be removed.t

For example, <div id=topTab> topTab content </div> should be removed
<div id=keep> keep this content </div> should only strip off the tags.
This is my <b>body</b> would also strip off the tags and result in "This is my body".

I've posted my code.  I'm not quite getting the result I need.  For example, the <div id=keep> tag is deleted.  Also, if I move any of the div tags or ul tag below the <other> tag, then I get no output at all.  I'm setting $flag so I can keep track if I'm inside a tag that needs to be deleted.  I'm not sure that's the proper way to do that.



#!/usr/bin/perl

#-----------------
# Required modules
#-----------------
use strict;
use warnings;
use HTML::Parser;nt

my $html;
my $flag = 0;

my $p = HTML::Parser->new(
    'api_version' => 3,
    'start_h'    => [ \&read_tag, 'self, tagname, attr, event, text' ],
    'default_h'  => [ \&parse_tag, 'self, text' ],
    'end_h'      => [ \&read_tag, 'self, tagname, attr, event, text' ],
);
$p->parse( do { local $/; <DATA> } );
$p->eof();

$html =~ s#<.+?>##g; # strip comments
print $html;


#----------------------
# S U B R O U T I N E S 
#----------------------
sub read_tag{
   my ($self, $tagname, $attr, $event, $origtext) = @_;

   if($event eq 'start'){
      if(!$flag){
         if($tagname eq 'div' and $attr->{id} eq 'topTab'){ $flag = 1; }
         if($tagname eq 'div' and $attr->{id} eq 'nav'){ $flag = 1; }
         if($tagname eq 'div' and $attr->{id} eq 'footer'){ $flag = 1; }
         if($tagname eq 'ul'  and $attr->{id} eq 'sectionNav'){ $flag = 1; }
      }
   }
   elsif($event eq 'end'){
      if($flag){
         if($tagname eq 'div'){ $flag = 0; }
         if($tagname eq 'ul'){ $flag = 0; }
      }
   }
}

sub parse_tag{
   my ($self, $origtext) = @_;

   if($flag){
      $html = "";
      $flag = 0;
   }
   else{
      $html .= $origtext;
   }
}


__DATA__
<html>
This is HTML
<body> 
<div id="topTab"> topTab content </div><!---end topTab--->
<div id="nav"> nav content </div> <!--end nav div-->
<div id="keep"> keep this content </div> <!-- end keep div -->
<div id=footer> footer content </div>
<ul id="sectionNav"> section nav </ul>
   This is my <b>body</b>.
<other>This is some other content</other>
</body>
</html>

Open in new window

0
Comment
Question by:mock5c
  • 6
  • 4
10 Comments
 
LVL 35

Expert Comment

by:Robert Schutt
Comment Utility
I'm not sure of the whole logic of this but for starters, try removing line 52. That seems to return the output you want (or closer to it). This code is called repeatedly and resetting your global $html variable at that point seems counterproductive.
0
 

Author Comment

by:mock5c
Comment Utility
(I noticed my typo in line 8.  That "nt" at the end of the line must have crept in there when I was pasting text)


My problem is that the HTML::Parser documentation is a little confusing for me.

You are correct, that when I comment out line 52, I seem to get the output I want.   At least it is working for the little bit of test data that is at the bottom of the file.  However, when I run this on a real case, I don't seem to be getting the output I want.  For example, take the source of this experts exchange page.  There is a tag:
<div id="uberContainer"> ... </div>

if I were to define the line:

if($tagname eq 'div'  and $attr->{id} eq 'uberContainer'){ $flag = 1; }

Then I would want that tag plus the content between the two div tags to be removed.  When I run this script on that html, it does remove the tags themselves but the content remains so I still see "My Account", "Log Out", etc.

Does anyone know why this is happening?
0
 
LVL 35

Expert Comment

by:Robert Schutt
Comment Utility
I'm not sure but instead of always executing
if($tagname eq 'div'){ $flag = 0; }

Open in new window

wouldn't you need some kind of 'stack' (or at least a counter to know how many nested div tags have been opened) to determine when the flag can be reset?
0
 

Author Comment

by:mock5c
Comment Utility
Yes, it looks like I need to implement a way to handle nested tags.  I have run into a problem with so called empty tags, e.g.

<input type="hidden" name="cx" value="somevalue" />

The start tag is "input" and there is no end tag so it is difficult to handle this.  In the documentation for HTML::Parser, I found the method $p->empty_element_tags( $bool )

I inserted this line $p->empty_element_tags(1) before the $p->parser() line at line 19 but I still don't seem to get into the "end" event for that empty tag.  I assume I am using it incorrectly.  The documentation is not clear on how to use empty_element_tags method and I can't find any examples on the web.

Do you have any suggestions?
0
 

Author Comment

by:mock5c
Comment Utility
the  $p->empty_element_tags option is no longer a problem.  That is working fine for me.  However, my problem now goes back to $html being reset.  If the content I want to keep appears before a tag I want to remove them I have the problem where everything is being deleted.  This makes sense.  Here is my updated code that takes care of embedded tags (I didn't say it was pretty).  How can I have this truly remove only the content that I want to remove and keep the rest, i.e. not set $html = "" like I'm doing.

#!/usr/bin/perl

#-----------------
# Required modules
#-----------------
use strict;
use warnings;
use HTML::Parser;

my $filename = shift;
my $html;
my $start = 0;
my $end = 0;

open(DATA, "$filename");

my $p = HTML::Parser->new(
    'api_version' => 3,
    'empty_element_tags' => 1,
    'start_h'    => [ \&read_tag, 'self, tagname, attr, event, text' ],
    'default_h'  => [ \&parse_tag, 'self, text' ],
    'end_h'      => [ \&read_tag, 'self, tagname, attr, event, text' ],
    #'comment_h'  => [ \&parse_tag, 'self, tagname, attr, event, text' ],
);
#$p->empty_element_tags(1);
$p->parse( do { local $/; <DATA> } );
$p->eof();
close(DATA);

$html =~ s#<.+?>##g; # strip comments
print $html;

#----------------------
# S U B R O U T I N E S
#----------------------
sub read_tag{
   my ($self, $tagname, $attr, $event, $origtext) = @_;

   if($event eq 'start'){

      # We are not currently in a removable tag.
      if(!$start){
         if($attr->{id}){
            if($tagname eq 'div' and $attr->{id} eq 'topTab'){ $start++; }
            if($tagname eq 'div' and $attr->{id} eq 'nav'){ $start++; }
            #if($tagname eq 'div' and $attr->{id} eq 'footer') { $start++; }
            if($tagname eq 'ul'  and $attr->{id} eq 'sectionNav'){ $start++; }
         }
         if($attr->{class}){
            if($tagname eq 'div' and $attr->{class} eq 'additionalSection'){ $start++; }
         }
      }
      else{
         # We have already encountered a removable tag so we need to increment for embedded
         $start++;
      }
   }
   elsif($event eq 'end' and $start > 0){
      $end++;
   }
}

sub parse_tag{
   my ($self, $origtext) = @_;

   if($start and $start==$end){
      $html = "";
      $start = $end = 0;
   }
   else{
      $html .= $origtext;
   }
}

Open in new window

0
How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

 
LVL 35

Expert Comment

by:Robert Schutt
Comment Utility
I'm not getting a grip on the structure yet. What I usually do to debug something like this of which I don't know the exact structure is put a print line in, like in the parse_tag sub as second line:
print "$start/$end/*$origtext*\n";

Open in new window

Then try to figure out why it's not showing some of the text you want to keep. Hasn't worked yet but seeing how you have already made some changes in the right direction maybe this helps you to solve it yourself?
0
 

Accepted Solution

by:
mock5c earned 0 total points
Comment Utility
I think I have if figured out.  This HTML::Parser module has been a little tricky for me, dealing with events.  As it became more clear to me, I had some better ideas on how to handle it.  Here is my final code:

#!/usr/bin/perl

#-----------------
# Required modules
#-----------------
use strict;
use warnings;
use HTML::Parser;


#----------------------------------------------
# Define tags and IDs to remove here.  All of
# these combinations will be removed from the
# html code.  Nested code will also be removed
# if outer tags have any of the defined combos.
#----------------------------------------------
my @remove = (
   [ 'div','id','topTab' ],
   [ 'div','id','nav' ],
   [ 'div','id','footer' ],
   [ 'ul','id','sectionNav' ],
   [ 'div','class','additionalSection'],
   [ 'script','type','text/javascript'],
);


my $filename = shift;
my $html;
my $open = 0;
my $close = 0;

open(DATA, "$filename");

my $p = HTML::Parser->new(
    'api_version' => 3,
    'empty_element_tags' => 1,
    'start_h'    => [ \&parse_tag, 'self, tagname, attr, event, text' ],
    #'default_h'  => [ \&default_tag, 'self, text' ],
    'text_h'     => [ \&parse_tag, 'self, tagname, attr, event, text' ],
    'end_h'      => [ \&parse_tag, 'self, tagname, attr, event, text' ],
    #'comment_h'  => [ \&default_tag, 'self, tagname, attr, event, text' ],
);
$p->parse( do { local $/; <DATA> } );
$p->eof();
close(DATA);

#$html =~ s#<.+?>##g; # strip comments
#print $html;

#----------------------
# S U B R O U T I N E S
#----------------------
sub parse_tag{
   my ($self, $tagname, $attr, $event, $origtext) = @_;

   if($event eq 'start'){

      # We are not currently in a removable tag (open chunk).
      if(!$open){
         foreach my $row(@remove){
            if($tagname eq $row->[0] and $attr->{$row->[1]} and $attr->{$row->[1]} eq $row->[2]){
               $open++;
               last;
            }
         }
      }
      else{
         # We have already encountered a removable tag and are in an open chunk
         # so we need to increment for embedded tags
         $open++;
      }
   }
   elsif($event eq 'end' and $open > 0){
      $close++;
   }
   elsif($event eq 'text'){
      if(!$open){
         print "$origtext";# unless $origtext =~ /^$/;
      }
      elsif($open and $open==$close){
         $html = "";
         $open = $close = 0;
      }
      else{
         # Do nothing
      }
   }
   else{
      # Default case -- do nothing
   }
}

sub default_tag{
   my ($self, $origtext) = @_;

   if($open and $open==$close){
      $html = "";
      $open = $close = 0;
   }
   else{
      $html .= $origtext;
   }
}

Open in new window

0
 
LVL 35

Expert Comment

by:Robert Schutt
Comment Utility
Great!

Well, looking at your code the structure still seemed strange to me at first but when I stripped out all parts that aren't used (sub default_tag and var $html) I get the feeling that it's closer to what I had been thinking. Only I was trying more in the line of incrementing and decrementing a $level but it seems to do exactly the same as your $open and $close. Sorry I couldn't help with actual code but maybe you still found my hints helpful.
0
 

Author Comment

by:mock5c
Comment Utility
This was my own solution.  I was able to figure out how to solve this problem.
0
 

Author Closing Comment

by:mock5c
Comment Utility
I have posted my final code.  I used $open and $close flags.
0

Featured Post

IT, Stop Being Called Into Every Meeting

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

Join & Write a Comment

On Microsoft Windows, if  when you click or type the name of a .pl file, you get an error "is not recognized as an internal or external command, operable program or batch file", then this means you do not have the .pl file extension associated with …
Email validation in proper way is  very important validation required in any web pages. This code is self explainable except that Regular Expression which I used for pattern matching. I originally published as a thread on my website : http://www…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
Internet Business Fax to Email Made Easy - With eFax Corporate (http://www.enterprise.efax.com), you'll receive a dedicated online fax number, which is used the same way as a typical analog fax number. You'll receive secure faxes in your email, fr…

771 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

10 Experts available now in Live!

Get 1:1 Help Now