Solved

How to filter out unwanted text using RE

Posted on 2002-05-22
Last Modified: 2010-03-05
I'm trying to filter out unwanted HTML start and end tags that contain no text, or that contain only other empty tags. For example, if the HTML string contained the following:

<FONT FACE="Arial Black" SIZE="+1"><img src="../gd.gif"></FONT>Text 1<P>
<FONT FACE="Arial Narrow" SIZE="+1">Text 2</FONT>
<FONT FACE="Bembo" SIZE="+1"></FONT>
<FONT FACE="Bembo" SIZE="+1"><br>Text 3</FONT>
<FONT FACE="Bembo" SIZE="+1"><br></FONT>

Then I would expect the following returned:

<FONT FACE="Arial Black" SIZE="+1"><img src="../gd.gif"></FONT>Text 1<P>
<FONT FACE="Arial Narrow" SIZE="+1">Text 2</FONT>
<FONT FACE="Bembo" SIZE="+1"><br>Text 3</FONT>

Does anyone have any ideas?

Thanks rj2 and ozo, your solution was bang on! :-)

I should have been clearer in the question. What I really meant was that the HTML could contain any valid tags that need filtering, i.e. table tags with no content, bold tags with no content, etc. (assuming that the document is fully HTML 4.0 compliant). The above was merely an example of what may need filtering.
Question by:pdistant
6 Comments
 
LVL 12

Expert Comment

by:lexxwern
ID: 7026586
Which editor made this crappy piece of HTML...
 
LVL 1

Expert Comment

by:GorGor1
ID: 7028064
The perl script is simple.  Coming up with your filter criteria is the hard part.  It's almost pointless to show the script without the proper filter (i.e. search/rejection criteria).  I'll keep thinking about this one...
 
LVL 10

Expert Comment

by:rj2
ID: 7028223
#!/usr/bin/perl
use strict;
use warnings;

# Sample input from the question (quoted heredoc tag, so nothing is interpolated)
my $html = <<'ENDHTML';
<FONT FACE="Arial Black" SIZE="+1"><img src="../gd.gif"></FONT>Text 1<P>
<FONT FACE="Arial Narrow" SIZE="+1">Text 2</FONT>
<FONT FACE="Bembo" SIZE="+1"></FONT>
<FONT FACE="Bembo" SIZE="+1"><br>Text 3</FONT>
<FONT FACE="Bembo" SIZE="+1"><br></FONT>
ENDHTML

print "Before: $html\n";

$html =~ s/<FONT[^>]*?>(<br>)+<\/FONT>//gi; # first remove font tags with only <br> inside
$html =~ s/<FONT[^>]*?><\/FONT>//gi;        # then remove font tags with nothing inside
$html =~ s/\n{2,}/\n/g;                     # collapse runs of consecutive linefeeds into one

print "After: $html";
LVL 84

Expert Comment

by:ozo
ID: 7028989
or
$html =~ s/<FONT[^>]*>(<br>)*<\/FONT>\s*//gi;
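
For the more general case the asker describes (any element with no content, not only FONT), something along the following lines might work. This is only a sketch under some assumptions that are not from the thread: tag pairs are well formed, "empty" means whitespace, <br> tags or &nbsp; entities only, and the helper name strip_empty_elements is made up for illustration. The substitution is repeated until nothing changes, so an element that becomes empty after an inner one is removed gets removed as well.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): remove every element whose
# content is only whitespace, <br> tags, or &nbsp; entities.
sub strip_empty_elements {
    my ($html) = @_;
    # Loop until no substitution is made, so nested empty elements
    # such as <b><i></i></b> collapse from the inside out.
    1 while $html =~ s{
        <([a-z][a-z0-9]*)\b[^>]*>     # opening tag, element name captured
        (?:\s|<br\s*/?>|&nbsp;)*      # nothing but blank content
        </\1\s*>                      # closing tag for the same element
        [ \t]*\n?                     # trailing blanks and one linefeed
    }{}gix;
    return $html;
}

# Sample input taken from the question
my $html = <<'ENDHTML';
<FONT FACE="Arial Black" SIZE="+1"><img src="../gd.gif"></FONT>Text 1<P>
<FONT FACE="Arial Narrow" SIZE="+1">Text 2</FONT>
<FONT FACE="Bembo" SIZE="+1"></FONT>
<FONT FACE="Bembo" SIZE="+1"><br>Text 3</FONT>
<FONT FACE="Bembo" SIZE="+1"><br></FONT>
ENDHTML

my $cleaned = strip_empty_elements($html);
print "After: $cleaned";
```

On the sample above this should leave the "Text 1", "Text 2" and "Text 3" lines and drop the two empty FONT elements. Regular expressions cannot cope with every valid HTML document (a ">" inside an attribute value, comments, scripts), so for arbitrary input a real HTML parser is probably the safer route. Note also that removing empty table cells can change a table's layout.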
 
LVL 10

Accepted Solution

by:
rj2 earned 100 total points
ID: 7030145
 

Author Comment

by:pdistant
ID: 7038613
That'll do!