HTML::Parser to strip tags or delete content

Asked by mock5c in Perl
I am trying to use HTML::Parser to read an HTML file and either remove specific tags together with their content, or strip only the tags and keep the content. HTML comments also need to be removed.

For example, <div id=topTab> topTab content </div> should be removed entirely, tags and content.
For <div id=keep> keep this content </div>, only the tags should be stripped.
This is my <b>body</b> should also have its tags stripped, resulting in "This is my body".

I've posted my code, but I'm not quite getting the result I need. For example, the <div id=keep> tag is deleted. Also, if I move any of the div tags or the ul tag below the <other> tag, I get no output at all. I'm setting $flag to keep track of whether I'm inside a tag that needs to be deleted. I'm not sure that's the proper way to do it.



#!/usr/bin/perl

#-----------------
# Required modules
#-----------------
use strict;
use warnings;
use HTML::Parser;

my $html;
my $flag = 0;

my $p = HTML::Parser->new(
    'api_version' => 3,
    'start_h'    => [ \&read_tag, 'self, tagname, attr, event, text' ],
    'default_h'  => [ \&parse_tag, 'self, text' ],
    'end_h'      => [ \&read_tag, 'self, tagname, attr, event, text' ],
);
$p->parse( do { local $/; <DATA> } );
$p->eof();

$html =~ s#<.+?>##g; # meant to strip comments; note this removes anything between < and >
print $html;


#----------------------
# S U B R O U T I N E S 
#----------------------
sub read_tag{
   my ($self, $tagname, $attr, $event, $origtext) = @_;

   if($event eq 'start'){
      if(!$flag){
         if($tagname eq 'div' and $attr->{id} eq 'topTab'){ $flag = 1; }
         if($tagname eq 'div' and $attr->{id} eq 'nav'){ $flag = 1; }
         if($tagname eq 'div' and $attr->{id} eq 'footer'){ $flag = 1; }
         if($tagname eq 'ul'  and $attr->{id} eq 'sectionNav'){ $flag = 1; }
      }
   }
   elsif($event eq 'end'){
      if($flag){
         if($tagname eq 'div'){ $flag = 0; }
         if($tagname eq 'ul'){ $flag = 0; }
      }
   }
}

sub parse_tag{
   my ($self, $origtext) = @_;

   if($flag){
      $html = "";
      $flag = 0;
   }
   else{
      $html .= $origtext;
   }
}


__DATA__
<html>
This is HTML
<body> 
<div id="topTab"> topTab content </div><!---end topTab--->
<div id="nav"> nav content </div> <!--end nav div-->
<div id="keep"> keep this content </div> <!-- end keep div -->
<div id=footer> footer content </div>
<ul id="sectionNav"> section nav </ul>
   This is my <b>body</b>.
<other>This is some other content</other>
</body>
</html>
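
A note on the flag approach described above: in the posted code, parse_tag sets $html to an empty string whenever $flag is set, which throws away everything collected so far, and it clears $flag on the first piece of text rather than at the matching end tag. One possible way to restructure it is sketched below. This is only a sketch under stated assumptions, not the accepted solution (which is not shown here). The names clean_html, %remove, $skip_tag and $depth are made up for illustration. It registers only start, end and text handlers, so comments never reach the output because no comment or default handler is set, and it uses a depth counter instead of a single flag so that nested tags of the same name inside a removed block do not end the skipping early.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical lookup table: elements to delete together with their
# content, keyed by tag name and then by id attribute.
my %remove = (
    div => { topTab => 1, nav => 1, footer => 1 },
    ul  => { sectionNav => 1 },
);

# Hypothetical helper: returns the text of $input with all tags and
# comments stripped and the elements listed in %remove deleted entirely.
sub clean_html {
    my ($input) = @_;
    my $out      = '';
    my $skip_tag = '';   # tag name of the element currently being deleted
    my $depth    = 0;    # how many $skip_tag elements are open inside it

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tagname, $attr) = @_;
            if ($depth) {
                # Already skipping: only count nested tags of the same name.
                $depth++ if $tagname eq $skip_tag;
                return;
            }
            my $id = $attr->{id};
            if (defined $id and $remove{$tagname} and $remove{$tagname}{$id}) {
                $skip_tag = $tagname;
                $depth    = 1;
            }
        }, 'tagname, attr' ],
        end_h => [ sub {
            my ($tagname) = @_;
            # Leave skip mode only when the matching end tag closes.
            $depth-- if $depth and $tagname eq $skip_tag;
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            # Keep text only when it is outside a deleted element.
            $out .= $text unless $depth;
        }, 'text' ],
        # No comment_h or default_h: comments and declarations are dropped.
    );
    $p->parse($input);
    $p->eof;
    return $out;
}

my $sample = <<'HTML';
<div id="topTab"> topTab content </div><!--end topTab-->
<div id="keep"> keep this content </div> <!-- end keep div -->
<div id=footer> footer content </div>
This is my <b>body</b>.
HTML

print clean_html($sample);
```

Because the tags are never appended to the output in the first place, the final regex pass is not needed. The sketch assumes the markup is reasonably well formed; a removed div that is never closed would cause everything after it to be skipped.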