Solved

HTML::Parser to strip tags or delete content

Posted on 2013-01-06
10
422 Views
Last Modified: 2013-08-29
I am trying to use HTML::Parser to read an html file and remove specific tags and content or just strip off tags leaving content.  HTML style comments also need to be removed.t

For example, <div id=topTab> topTab content </div> should be removed
<div id=keep> keep this content </div> should only strip off the tags.
This is my <b>body</b> would also strip off the tags and result in "This is my body".

I've posted my code.  I'm not quite getting the result I need.  For example, the <div id=keep> tag is deleted.  Also, if I move any of the div tags or ul tag below the <other> tag, then I get no output at all.  I'm setting $flag so I can keep track if I'm inside a tag that needs to be deleted.  I'm not sure that's the proper way to do that.



#!/usr/bin/perl

#-----------------
# Required modules
#-----------------
use strict;
use warnings;
use HTML::Parser;nt

my $html;
my $flag = 0;

my $p = HTML::Parser->new(
    'api_version' => 3,
    'start_h'    => [ \&read_tag, 'self, tagname, attr, event, text' ],
    'default_h'  => [ \&parse_tag, 'self, text' ],
    'end_h'      => [ \&read_tag, 'self, tagname, attr, event, text' ],
);
$p->parse( do { local $/; <DATA> } );
$p->eof();

$html =~ s#<.+?>##g; # strip comments
print $html;


#----------------------
# S U B R O U T I N E S 
#----------------------
sub read_tag{
   my ($self, $tagname, $attr, $event, $origtext) = @_;

   if($event eq 'start'){
      if(!$flag){
         if($tagname eq 'div' and $attr->{id} eq 'topTab'){ $flag = 1; }
         if($tagname eq 'div' and $attr->{id} eq 'nav'){ $flag = 1; }
         if($tagname eq 'div' and $attr->{id} eq 'footer'){ $flag = 1; }
         if($tagname eq 'ul'  and $attr->{id} eq 'sectionNav'){ $flag = 1; }
      }
   }
   elsif($event eq 'end'){
      if($flag){
         if($tagname eq 'div'){ $flag = 0; }
         if($tagname eq 'ul'){ $flag = 0; }
      }
   }
}

sub parse_tag{
   my ($self, $origtext) = @_;

   if($flag){
      $html = "";
      $flag = 0;
   }
   else{
      $html .= $origtext;
   }
}


__DATA__
<html>
This is HTML
<body> 
<div id="topTab"> topTab content </div><!---end topTab--->
<div id="nav"> nav content </div> <!--end nav div-->
<div id="keep"> keep this content </div> <!-- end keep div -->
<div id=footer> footer content </div>
<ul id="sectionNav"> section nav </ul>
   This is my <b>body</b>.
<other>This is some other content</other>
</body>
</html>

Open in new window

0
Comment
Question by:mock5c
  • 6
  • 4
10 Comments
 
LVL 35

Expert Comment

by:Robert Schutt
ID: 38750807
I'm not sure of the whole logic of this but for starters, try removing line 52. That seems to return the output you want (or closer to it). This code is called repeatedly and resetting your global $html variable at that point seems counterproductive.
0
 

Author Comment

by:mock5c
ID: 38762521
(I noticed my typo in line 8.  That "nt" at the end of the line must have crept in there when I was pasting text)


My problem is that the HTML::Parser documentation is a little confusing for me.

You are correct, that when I comment out line 52, I seem to get the output I want.   At least it is working for the little bit of test data that is at the bottom of the file.  However, when I run this on a real case, I don't seem to be getting the output I want.  For example, take the source of this experts exchange page.  There is a tag:
<div id="uberContainer"> ... </div>

if I were to define the line:

if($tagname eq 'div'  and $attr->{id} eq 'uberContainer'){ $flag = 1; }

Then I would want that tag plus the content between the two div tags to be removed.  When I run this script on that html, it does remove the tags themselves but the content remains so I still see "My Account", "Log Out", etc.

Does anyone know why this is happening?
0
 
LVL 35

Expert Comment

by:Robert Schutt
ID: 38763276
I'm not sure but instead of always executing
if($tagname eq 'div'){ $flag = 0; }

Open in new window

wouldn't you need some kind of 'stack' (or at least a counter to know how many nested div tags have been opened) to determine when the flag can be reset?
0
Announcing the Most Valuable Experts of 2016

MVEs are more concerned with the satisfaction of those they help than with the considerable points they can earn. They are the types of people you feel privileged to call colleagues. Join us in honoring this amazing group of Experts.

 

Author Comment

by:mock5c
ID: 38770022
Yes, it looks like I need to implement a way to handle nested tags.  I have run into a problem with so called empty tags, e.g.

<input type="hidden" name="cx" value="somevalue" />

The start tag is "input" and there is no end tag so it is difficult to handle this.  In the documentation for HTML::Parser, I found the method $p->empty_element_tags( $bool )

I inserted this line $p->empty_element_tags(1) before the $p->parser() line at line 19 but I still don't seem to get into the "end" event for that empty tag.  I assume I am using it incorrectly.  The documentation is not clear on how to use empty_element_tags method and I can't find any examples on the web.

Do you have any suggestions?
0
 

Author Comment

by:mock5c
ID: 38770342
the  $p->empty_element_tags option is no longer a problem.  That is working fine for me.  However, my problem now goes back to $html being reset.  If the content I want to keep appears before a tag I want to remove them I have the problem where everything is being deleted.  This makes sense.  Here is my updated code that takes care of embedded tags (I didn't say it was pretty).  How can I have this truly remove only the content that I want to remove and keep the rest, i.e. not set $html = "" like I'm doing.

#!/usr/bin/perl

#-----------------
# Required modules
#-----------------
use strict;
use warnings;
use HTML::Parser;

my $filename = shift;
my $html;
my $start = 0;
my $end = 0;

open(DATA, "$filename");

my $p = HTML::Parser->new(
    'api_version' => 3,
    'empty_element_tags' => 1,
    'start_h'    => [ \&read_tag, 'self, tagname, attr, event, text' ],
    'default_h'  => [ \&parse_tag, 'self, text' ],
    'end_h'      => [ \&read_tag, 'self, tagname, attr, event, text' ],
    #'comment_h'  => [ \&parse_tag, 'self, tagname, attr, event, text' ],
);
#$p->empty_element_tags(1);
$p->parse( do { local $/; <DATA> } );
$p->eof();
close(DATA);

$html =~ s#<.+?>##g; # strip comments
print $html;

#----------------------
# S U B R O U T I N E S
#----------------------
sub read_tag{
   my ($self, $tagname, $attr, $event, $origtext) = @_;

   if($event eq 'start'){

      # We are not currently in a removable tag.
      if(!$start){
         if($attr->{id}){
            if($tagname eq 'div' and $attr->{id} eq 'topTab'){ $start++; }
            if($tagname eq 'div' and $attr->{id} eq 'nav'){ $start++; }
            #if($tagname eq 'div' and $attr->{id} eq 'footer') { $start++; }
            if($tagname eq 'ul'  and $attr->{id} eq 'sectionNav'){ $start++; }
         }
         if($attr->{class}){
            if($tagname eq 'div' and $attr->{class} eq 'additionalSection'){ $start++; }
         }
      }
      else{
         # We have already encountered a removable tag so we need to increment for embedded
         $start++;
      }
   }
   elsif($event eq 'end' and $start > 0){
      $end++;
   }
}

sub parse_tag{
   my ($self, $origtext) = @_;

   if($start and $start==$end){
      $html = "";
      $start = $end = 0;
   }
   else{
      $html .= $origtext;
   }
}

Open in new window

0
 
LVL 35

Expert Comment

by:Robert Schutt
ID: 38770469
I'm not getting a grip on the structure yet. What I usually do to debug something like this of which I don't know the exact structure is put a print line in, like in the parse_tag sub as second line:
print "$start/$end/*$origtext*\n";

Open in new window

Then try to figure out why it's not showing some of the text you want to keep. Hasn't worked yet but seeing how you have already made some changes in the right direction maybe this helps you to solve it yourself?
0
 

Accepted Solution

by:
mock5c earned 0 total points
ID: 38770983
I think I have if figured out.  This HTML::Parser module has been a little tricky for me, dealing with events.  As it became more clear to me, I had some better ideas on how to handle it.  Here is my final code:

#!/usr/bin/perl

#-----------------
# Required modules
#-----------------
use strict;
use warnings;
use HTML::Parser;


#----------------------------------------------
# Define tags and IDs to remove here.  All of
# these combinations will be removed from the
# html code.  Nested code will also be removed
# if outer tags have any of the defined combos.
#----------------------------------------------
my @remove = (
   [ 'div','id','topTab' ],
   [ 'div','id','nav' ],
   [ 'div','id','footer' ],
   [ 'ul','id','sectionNav' ],
   [ 'div','class','additionalSection'],
   [ 'script','type','text/javascript'],
);


my $filename = shift;
my $html;
my $open = 0;
my $close = 0;

open(DATA, "$filename");

my $p = HTML::Parser->new(
    'api_version' => 3,
    'empty_element_tags' => 1,
    'start_h'    => [ \&parse_tag, 'self, tagname, attr, event, text' ],
    #'default_h'  => [ \&default_tag, 'self, text' ],
    'text_h'     => [ \&parse_tag, 'self, tagname, attr, event, text' ],
    'end_h'      => [ \&parse_tag, 'self, tagname, attr, event, text' ],
    #'comment_h'  => [ \&default_tag, 'self, tagname, attr, event, text' ],
);
$p->parse( do { local $/; <DATA> } );
$p->eof();
close(DATA);

#$html =~ s#<.+?>##g; # strip comments
#print $html;

#----------------------
# S U B R O U T I N E S
#----------------------
sub parse_tag{
   my ($self, $tagname, $attr, $event, $origtext) = @_;

   if($event eq 'start'){

      # We are not currently in a removable tag (open chunk).
      if(!$open){
         foreach my $row(@remove){
            if($tagname eq $row->[0] and $attr->{$row->[1]} and $attr->{$row->[1]} eq $row->[2]){
               $open++;
               last;
            }
         }
      }
      else{
         # We have already encountered a removable tag and are in an open chunk
         # so we need to increment for embedded tags
         $open++;
      }
   }
   elsif($event eq 'end' and $open > 0){
      $close++;
   }
   elsif($event eq 'text'){
      if(!$open){
         print "$origtext";# unless $origtext =~ /^$/;
      }
      elsif($open and $open==$close){
         $html = "";
         $open = $close = 0;
      }
      else{
         # Do nothing
      }
   }
   else{
      # Default case -- do nothing
   }
}

sub default_tag{
   my ($self, $origtext) = @_;

   if($open and $open==$close){
      $html = "";
      $open = $close = 0;
   }
   else{
      $html .= $origtext;
   }
}

Open in new window

0
 
LVL 35

Expert Comment

by:Robert Schutt
ID: 38771081
Great!

Well, looking at your code the structure still seemed strange to me at first but when I stripped out all parts that aren't used (sub default_tag and var $html) I get the feeling that it's closer to what I had been thinking. Only I was trying more in the line of incrementing and decrementing a $level but it seems to do exactly the same as your $open and $close. Sorry I couldn't help with actual code but maybe you still found my hints helpful.
0
 

Author Comment

by:mock5c
ID: 39435872
This was my own solution.  I was able to figure out how to solve this problem.
0
 

Author Closing Comment

by:mock5c
ID: 39448270
I have posted my final code.  I used $open and $close flags.
0

Featured Post

Free Tool: ZipGrep

ZipGrep is a utility that can list and search zip (.war, .ear, .jar, etc) archives for text patterns, without the need to extract the archive's contents.

One of a set of tools we're offering as a way to say thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

A year or so back I was asked to have a play with MongoDB; within half an hour I had downloaded (http://www.mongodb.org/downloads),  installed and started the daemon, and had a console window open. After an hour or two of playing at the command …
There are many situations when we need to display the data in sorted order. For example: Student details by name or by rank or by total marks etc. If you are working on data driven based projects then you will use sorting techniques very frequently.…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…

856 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question