pdistant

asked on

How to filter out unwanted text using a regular expression

I'm trying to filter out unwanted HTML start and end tags that do not have text or other empty tags inside them. For example, if the HTML string contained the following:

<FONT FACE="Arial Black" SIZE="+1"><img src="../gd.gif"></FONT>Text 1<P>
<FONT FACE="Arial Narrow" SIZE="+1">Text 2</FONT>
<FONT FACE="Bembo" SIZE="+1"></FONT>
<FONT FACE="Bembo" SIZE="+1"><br>Text 3</FONT>
<FONT FACE="Bembo" SIZE="+1"><br></FONT>

Then I would expect the following returned:

<FONT FACE="Arial Black" SIZE="+1"><img src="../gd.gif"></FONT>Text 1<P>
<FONT FACE="Arial Narrow" SIZE="+1">Text 2</FONT>
<FONT FACE="Bembo" SIZE="+1"><br>Text 3</FONT>

Does anyone have any ideas?

Thanks rj2 and ozo, your solution was bang on! :-)

I should have been clearer in the question. What I really meant was that the HTML text could contain any valid HTML tags that need filtering, i.e. table tags with no content, bold tags with no content, etc. (assuming that the document is fully HTML 4.0 compliant). The above was merely an example of what may need filtering.
lexxwern

Which editor made this crappy piece of HTML...
GorGor1

The perl script is simple.  Coming up with your filter criteria is the hard part.  It's almost pointless to show the script without the proper filter (i.e. search/rejection criteria).  I'll keep thinking about this one...
#!/usr/bin/perl
use strict;
use warnings;

# single-quoted heredoc, so nothing inside the HTML gets interpolated
my $html=<<'ENDHTML';
<FONT FACE="Arial Black" SIZE="+1"><img src="../gd.gif"></FONT>Text 1<P>
<FONT FACE="Arial Narrow" SIZE="+1">Text 2</FONT>
<FONT FACE="Bembo" SIZE="+1"></FONT>
<FONT FACE="Bembo" SIZE="+1"><br>Text 3</FONT>
<FONT FACE="Bembo" SIZE="+1"><br></FONT>
ENDHTML

print "Before: $html\n";

$html =~ s/<FONT[^>]*>(?:<br>)+<\/FONT>//gi; # first remove font tags with only <br> inside
$html =~ s/<FONT[^>]*><\/FONT>//gi; # then remove font tags with nothing inside
$html =~ s/\n{2,}/\n/g; # collapse runs of consecutive linefeeds into one

print "After: $html";
ozo
or
$html =~ s/<FONT[^>]*>(<br>)*<\/FONT>\s*//gi;
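For the more general case mentioned in the question (any tag with no content, not just FONT), the same idea can be stretched with a backreference on the tag name. This is only a rough sketch, not the accepted solution: strip_empty_elements is a made-up helper name, the pattern assumes no attribute value contains a ">" character, and it treats whitespace and <br> as "no content". It will also happily delete elements that are legitimately empty (an empty TD, a named anchor, a TEXTAREA), so the list of tags probably needs restricting for real documents.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): remove any element whose
# content is only whitespace and/or <br> tags, whatever the tag name.
# Assumes attribute values never contain a literal ">".
sub strip_empty_elements {
    my ($html) = @_;
    # \1 forces the closing tag name to match the opening tag name.
    # Repeat until a pass changes nothing, so an element that only
    # becomes empty after an inner removal (<TD><B></B></TD>) goes too.
    1 while $html =~ s{<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(?:\s|<br\s*/?>)*</\1\s*>\s*}{}gi;
    return $html;
}

my $html = <<'ENDHTML';
<FONT FACE="Arial Narrow" SIZE="+1">Text 2</FONT>
<FONT FACE="Bembo" SIZE="+1"><br></FONT>
<TABLE><TR><TD><B></B></TD></TR></TABLE>
<B>Text 4</B>
ENDHTML

print strip_empty_elements($html);
```

With the sample above, only the "Text 2" and "Text 4" lines should be left. For anything beyond tidy, machine-generated markup, a real HTML parser is likely the safer route than a regular expression.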
ASKER CERTIFIED SOLUTION
rj2

pdistant

ASKER

That'll do!