  • Status: Solved
  • Priority: Medium
  • Security: Public

Perl: Cleaning meta tags using a regex

I need to clean the meta tags after building an HTML page by removing the forward slash at the end of each tag.
I cannot use a global replace such as =~ s{\/>}{>}, because <hr /> and <img src="example.com" /> are both valid, so I need to ensure I'm only removing the slash from the end of the generated meta tags.

#!/usr/bin/perl


use strict; use warnings;
use HTML::TreeBuilder;
use HTML::Element;


my $body =HTML::TreeBuilder->new_from_file(*DATA);
#print $body->as_HTML('<>&','    ',{}) . "\n";

my %meta= (
"Author"=>"J K Rolling",
"title","Harry Potter and the Philosopher's Stone"

);
my $head = $body -> find_by_tag_name('_tag', 'head'); 
for my $m (sort keys %meta)
{
  my $m_el = HTML::Element->new('meta');
    # keep name content in correct order
     $m_el->attr('0name',$m);
     $m_el->attr('1content',$meta{$m});
     $head->push_content($m_el);
}
my $CloneOut = $body->as_HTML('<>&','      ',{});
    # clean up / remove 0 & 1
   $CloneOut =~ s/0name/name/ig;
   $CloneOut =~ s/1content/content/ig;  

   

   while(<$CloneOut>){  ##  Errors here on test script readline() on unopened filehandle
     my $line = $1;
     if ($line =~ m/meta/i){
        $line =~ s{\"\s+\/>}{\">};  ## is this correct?
     }
     print $line;
   }
    


__DATA__
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
  <title>Hello World</title>
  </head>
  <body>
    <h1> Books by J K Rolling</h1>
  </body>
</html>


This outputs:

<meta name="Author" content="J K Rolling" />


I need this:

<meta name="Author" content="J K Rolling">

Asked by: trevor1940

Rgonzo1971 commented:
Hi,

Please try:
\s*\/(>)

Regards

Rgonzo1971 commented:
Maybe try:

my $y = '<meta name="Author" content="J K Rolling" />';
$y =~ s/\s*\/(>)/$1/;

trevor1940 (author) commented:
Your regex seems to work; however, I'm still getting this error:

while(<$CloneOut>){  ##  Errors here readline() on unopened filehandle

$CloneOut isn't a filehandle, it's a scalar. So how do I ensure I'm only changing the meta tags and not the HTML body?

Rgonzo1971 commented:
Sorry, I can't help further; Perl is not my speciality.

FishMonger commented:
Why are you using the diamond operator and why are you using a while loop?

If you remove the diamond operator, that will fix the "readline() on unopened filehandle" error; then you'll need to fix the infinite loop that your while loop creates.
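
If the line-by-line loop is worth keeping, one option is to open an in-memory filehandle on the string, so that the diamond operator has a real handle to read from. The following is only a minimal sketch, not the asker's final code: `clean_meta_lines` is a hypothetical helper name, and it assumes each meta tag sits on its own line, since every `/>` on a matching line is rewritten.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the trailing " /" from meta tags only,
# reading the HTML string line by line through an in-memory filehandle.
sub clean_meta_lines {
    my ($html) = @_;
    my $out = '';

    # Opening a reference to a scalar gives a filehandle over the string.
    open my $fh, '<', \$html or die "Cannot open in-memory filehandle: $!";
    while (my $line = <$fh>) {
        # Only touch lines that contain a meta tag (assumed one tag per line).
        if ($line =~ /<meta\b/i) {
            $line =~ s{\s*/>}{>}g;
        }
        $out .= $line;
    }
    close $fh;
    return $out;
}

my $sample = qq{<meta name="Author" content="J K Rolling" />\n<hr />\n};
print clean_meta_lines($sample);
```

Note that the loop reads each line into `$line` directly; the original test script assigned `$1`, which is a regex capture variable and is not set by reading from a filehandle.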

trevor1940 (author) commented:
"fix the infinite loop that your while loop creates"

How do I do that?

FishMonger commented:
You first need to ask yourself why you are using a loop.

$CloneOut is a scalar that holds a string of HTML. When it is used in the while condition, you're testing it for truth, and since it never changes, it always evaluates to true and the loop never ends.

Instead of the loop, you could simply apply the regex to the scalar (making sure you use the g modifier).  If you want to use a loop, then you need to split the string into separate lines (i.e. turn it into an array or list) and loop over each of them.

trevor1940 (author) commented:
"Instead of the loop, you could simply apply the regex to the scalar"
$CloneOut  =~ s/\s*\/(>)/$1/g;

doesn't work, because <hr /> and <img src="mypic.jpg" /> are both valid.

I'm guessing that to "split the string into separate lines (i.e. turn it into an array or list) and loop over each of them", I'd do something like this?

 
my @CloneOut = split  /$/m, $CloneOut;

FishMonger commented:
You need to make your regex more specific so that it only matches the meta tag.

If you're going to split the string into its separate lines, then do this:
my @CloneOut = split  /\n/, $CloneOut;
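
A loop over the split lines might then look like the sketch below. It is an illustration under assumptions rather than tested production code: `clean_meta_by_split` is a hypothetical helper name, and it assumes each meta tag occupies a single line of the generated HTML.

```perl
use strict;
use warnings;

# Hypothetical helper: split the HTML string on newlines, fix only the
# lines that contain a meta tag, and join the lines back together.
sub clean_meta_by_split {
    my ($html) = @_;
    my @lines = split /\n/, $html;
    for my $line (@lines) {
        # $line aliases the array element, so the substitution edits @lines.
        $line =~ s{\s*/>}{>} if $line =~ /<meta\b/i;
    }
    return join "\n", @lines;
}

print clean_meta_by_split(
    qq{<meta name="Author" content="J K Rolling" />\n<img src="mypic.jpg" />}
), "\n";
```

Because `split /\n/` discards trailing empty fields, any newline at the very end of the string is dropped; append one again if the output needs it.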

FishMonger commented:
$CloneOut =~ s!(<meta [^/]+) /!$1!g;
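
One caveat with the pattern above: `[^/]+` stops at the first slash, so a meta tag whose content contains a slash (a URL, for example) is left unchanged. A possible variation, sketched here under the assumption that attribute values never contain a literal `>`, stops at the tag's closing `>` instead. The helper name `strip_meta_slash` and the sample tags are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the optional whitespace and slash before the
# closing ">" of meta tags only; all other tags are left alone.
sub strip_meta_slash {
    my ($html) = @_;
    # [^>]*? cannot cross the end of the tag, so <hr /> and <img ... />
    # elsewhere in the document are never reached from a <meta match.
    $html =~ s{(<meta\b[^>]*?)\s*/>}{$1>}gi;
    return $html;
}

my $html = join "\n",
    '<meta name="Author" content="J K Rolling" />',
    '<meta name="source" content="http://example.com/books" />',
    '<hr />',
    '<img src="mypic.jpg" />';

print strip_meta_slash($html), "\n";
```

With this sample, both meta tags lose their trailing slash while `<hr />` and `<img src="mypic.jpg" />` are printed as they were.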

trevor1940 (author) commented:
Thank you for your help.

FishMonger commented:
You're welcome, glad I was able to help.

Question has a verified solution.