Solved

perl: Cleaning meta tags using RegEX

Posted on 2016-10-18
Last Modified: 2016-10-19
I need to clean up the meta tags after building an HTML page by removing the forward slash at the end of each tag.
I cannot use a global replace such as =~ s{\/>}{>}g, because <hr /> and <img src="example.com" /> are both valid, so I need to make sure I'm only removing the slash from the end of the generated meta tags.

#!/usr/bin/perl


use strict; use warnings;
use HTML::TreeBuilder;
use HTML::Element;


my $body =HTML::TreeBuilder->new_from_file(*DATA);
#print $body->as_HTML('<>&','    ',{}) . "\n";

my %meta= (
"Author"=>"J K Rolling",
"title","Harry Potter and the Philosopher's Stone"

);
my $head = $body->find_by_tag_name('head');  # find_by_tag_name takes tag names only
for my $m (sort keys %meta)
{
  my $m_el = HTML::Element->new('meta');
    # keep name content in correct order
     $m_el->attr('0name',$m);
     $m_el->attr('1content',$meta{$m});
     $head->push_content($m_el);
}
my $CloneOut = $body->as_HTML('<>&','      ',{});
    # clean up / remove 0 & 1
   $CloneOut =~ s/0name/name/ig;
   $CloneOut =~ s/1content/content/ig;  

   

   while(<$CloneOut>){  ##  Errors here on test script readline() on unopened filehandle
     my $line = $1;
     if ($line =~ m/meta/i){
        $line =~ s{\"\s+\/>}{\">};  ## is this correct?
     }
     print $line;
   }
    


__DATA__
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
  <title>Hello World</title>
  </head>
  <body>
    <h1> Books by J K Rolling</h1>
  </body>
</html>



This outputs:

<meta name="Author" content="J K Rolling" />



I need this:

<meta name="Author" content="J K Rolling">

Question by:trevor1940
12 Comments
 
LVL 52

Expert Comment

by:Rgonzo1971
ID: 41848186
Hi,

Please try:
\s*\/(>)

Regards
 
LVL 52

Expert Comment

by:Rgonzo1971
ID: 41848197
Maybe try:

my $y = '<meta name="Author" content="J K Rolling" />';
$y =~ s/\s*\/(>)/$1/;

 
LVL 1

Author Comment

by:trevor1940
ID: 41848383
Your regex seems to work; however, I'm still getting this error:

while(<$CloneOut>){  ##  Errors here readline() on unopened filehandle

$CloneOut isn't a filehandle, it's a scalar, so how do I ensure I'm only changing the meta data and not the HTML body?
 
LVL 52

Expert Comment

by:Rgonzo1971
ID: 41848519
Sorry, I can't help further; Perl is not my speciality.
 
LVL 28

Expert Comment

by:FishMonger
ID: 41848647
Why are you using the diamond operator and why are you using a while loop?

If you remove the diamond operator, that will fix the "readline() on unopened filehandle" error; then you'll need to fix the infinite loop that your while loop creates.
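
For readers who genuinely want a line-by-line while loop, one option not suggested in this thread is Perl's in-memory filehandle: open a handle on a reference to the string and read from that. A minimal sketch, assuming a small stand-in string for $CloneOut ($fh and $count are illustrative names):

```perl
use strict;
use warnings;

# Stand-in for the HTML string produced by as_HTML (assumed sample data).
my $CloneOut = "<head>\n<meta name=\"Author\" content=\"J K Rolling\" />\n</head>\n";

# Open a read-only filehandle on the string itself; note the reference \$CloneOut.
open my $fh, '<', \$CloneOut or die "Cannot open in-memory handle: $!";

my $count = 0;
while (my $line = <$fh>) {   # <$fh> now reads from a real handle, one line at a time
    $count++;
    print $line;
}
close $fh;
```

The loop ends when the handle reaches the end of the string, so there is no infinite loop and no "unopened filehandle" error.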
 
LVL 1

Author Comment

by:trevor1940
ID: 41848900
"fix the infinite loop that your while loop creates"

How do I do that?
 
LVL 28

Expert Comment

by:FishMonger
ID: 41849049
You first need to ask yourself why you are using a loop.

$CloneOut is a scalar which holds a string of html and when used in the while condition, you're testing for truthfulness and since it never changes, it will always evaluate to true and becomes an infinite loop.

Instead of the loop, you could simply apply the regex to the scalar (making sure you use the g modifier).  If you want to use a loop, then you need to split the string into separate lines (i.e. turn it into an array or list) and loop over each of them.
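
The loop variant described above might look roughly like this. It is only a sketch: the sample string stands in for $CloneOut, @cleaned is an illustrative name, and it assumes each meta tag sits on its own line in the generated HTML.

```perl
use strict;
use warnings;

# Stand-in for the HTML string produced by as_HTML (assumed sample data).
my $CloneOut = join "\n",
    '<head>',
    '<meta name="Author" content="J K Rolling" />',
    '</head>',
    '<body>',
    '<hr />',
    '</body>';

my @cleaned;
for my $line (split /\n/, $CloneOut) {
    my $out = $line;
    # Only touch lines that contain a meta tag; everything else is copied as-is.
    $out =~ s{"\s*/>}{">} if $out =~ /<meta\b/i;
    push @cleaned, $out;
}

print "$_\n" for @cleaned;
```

Because the substitution is guarded by the /<meta\b/i test, the <hr /> in the body keeps its slash.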
 
LVL 1

Author Comment

by:trevor1940
ID: 41849156
"Instead of the loop, you could simply apply the regex to the scalar"

$CloneOut =~ s/\s*\/(>)/$1/g;

That doesn't work, because <hr /> and <img src="mypic.jpg" /> are both valid and must keep their slashes.

"split the string into separate lines (i.e. turn it into an array or list) and loop over each of them."

I'm guessing I'd do something like this?

my @CloneOut = split /$/m, $CloneOut;

 
LVL 28

Assisted Solution

by:FishMonger
FishMonger earned 500 total points
ID: 41849176
You need to make your regex more specific so that it only matches the meta tag.

If you're going to split the string into its separate lines, then do this:
my @CloneOut = split  /\n/, $CloneOut;
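
To see why splitting on /\n/ is the safer choice here, this small sketch compares it with the /$/m attempt on an assumed three-line sample (the variable names are illustrative). With /$/m the split point is a zero-width position before each newline, so the newline characters stay attached to the start of the following element.

```perl
use strict;
use warnings;

# Assumed three-line sample standing in for the generated HTML.
my $html = "<head>\n<title>Hello World</title>\n</head>";

my @by_anchor  = split /$/m, $html;   # zero-width split: the newlines are kept
my @by_newline = split /\n/, $html;   # the newline is the separator and is dropped

print "anchor:  [$by_anchor[1]]\n";
print "newline: [$by_newline[1]]\n";
```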

 
LVL 28

Accepted Solution

by:FishMonger
FishMonger earned 500 total points
ID: 41849283
$CloneOut =~ s!(<meta [^/]+) /!$1!g;
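
As I read it, the pattern above relies on there being no other slash between "<meta " and the closing " /". A meta tag whose value contains a slash (for example content="text/html") would be skipped, and a meta tag that has no trailing slash could let [^/]+ run on into a later tag. If that matters for your pages, a stricter variant can be sketched as follows. This is not the accepted pattern, the sample HTML is assumed, and it will misbehave if an attribute value contains a literal ">".

```perl
use strict;
use warnings;

# Assumed sample: two meta tags plus body tags that must keep their slashes.
my $html = join "\n",
    '<head>',
    '<meta name="Author" content="J K Rolling" />',
    '<meta http-equiv="Content-Type" content="text/html" />',
    '</head>',
    '<body>',
    '<hr />',
    '<img src="mypic.jpg" />',
    '</body>';

# [^>]*? cannot cross a ">", so the match never leaves the meta tag it started in.
(my $cleaned = $html) =~ s{(<meta\b[^>]*?)\s*/>}{$1>}gi;

print "$cleaned\n";
```

Only the two meta tags lose their trailing slash; the <hr /> and <img ... /> tags are left alone.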

 
LVL 1

Author Closing Comment

by:trevor1940
ID: 41849965
Thank you for your help.
 
LVL 28

Expert Comment

by:FishMonger
ID: 41850298
You're welcome, glad I was able to help.