Asked by pdistant
How to filter out unwanted text using a regular expression
I'm trying to filter out unwanted HTML start and end tags that have no text inside them, or that contain only other empty tags. For example, given the following HTML string:
<FONT FACE="Arial Black" SIZE="+1"><img src="../gd.gif"></FONT>Text 1<P>
<FONT FACE="Arial Narrow" SIZE="+1">Text 2</FONT>
<FONT FACE="Bembo" SIZE="+1"></FONT>
<FONT FACE="Bembo" SIZE="+1">
Text 3</FONT>
<FONT FACE="Bembo" SIZE="+1">
</FONT>
Then I would expect the following to be returned:
<FONT FACE="Arial Black" SIZE="+1"><img src="../gd.gif"></FONT>Text 1<P>
<FONT FACE="Arial Narrow" SIZE="+1">Text 2</FONT>
<FONT FACE="Bembo" SIZE="+1">
Text 3</FONT>
Does anyone have any ideas?
Thanks rj2 and ozo, your solution was bang on! :-)
I should have been clearer in the question. What I really meant was that the HTML could contain any valid tags that need filtering, e.g. table tags with no content, bold tags with no content, etc. (assuming the document is fully HTML 4.0 compliant). The above was merely an example of what may need filtering.
Which editor produced this crappy piece of HTML?
The Perl script is simple. Coming up with your filter criteria is the hard part. It's almost pointless to show the script without the proper filter (i.e. the search/rejection criteria). I'll keep thinking about this one...
#!/usr/bin/perl
use strict;
use warnings;

my $html=<<ENDHTML;
<FONT FACE="Arial Black" SIZE="+1"><img src="../gd.gif"></FONT>Text 1<P>
<FONT FACE="Arial Narrow" SIZE="+1">Text 2</FONT>
<FONT FACE="Bembo" SIZE="+1"></FONT>
<FONT FACE="Bembo" SIZE="+1"><br>Text 3</FONT>
<FONT FACE="Bembo" SIZE="+1"><br></FONT>
ENDHTML
print "Before: $html\n";
$html =~ s/<FONT[^>]*?>(<br>)+<\/FONT>//gmi; # first remove font tags with only <br> inside
$html =~ s/<FONT[^>]*?><\/FONT>//gmi; # then remove font tags with nothing inside
$html =~ s/\n\n/\n/gm; # replace two consecutive linefeeds with one
print "After: $html";
or
$html =~ s/<FONT[^>]*>(<br>)*<\/FONT>\s*//gi;
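For the follow-up requirement (any tag, not just FONT), the same idea can be generalized with a backreference so that the closing tag has to match whatever opening tag was captured. This is only a sketch and was not posted in the thread: strip_empty_tags is a hypothetical helper name, and it assumes "empty" means nothing but whitespace and/or <br> tags between a matching start and end tag. The substitution is repeated until nothing changes, so pairs that become empty after an inner pair is removed get stripped too.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): remove any start/end tag pair
# whose content is only whitespace and/or <br> tags, whatever the tag name.
sub strip_empty_tags {
    my ($html) = @_;
    # ([a-z][a-z0-9]*) captures the tag name; \1 forces the end tag to use
    # the same name (case-insensitively, because of /i).
    # Repeat until no substitution happens, so that nested empties such as
    # <TABLE><TR><TD></TD></TR></TABLE> collapse from the inside out.
    1 while $html =~ s{<([a-z][a-z0-9]*)\b[^>]*>(?:\s|<br\s*/?>)*</\1\s*>[ \t]*\n?}{}gi;
    return $html;
}

my $html = <<'ENDHTML';
<FONT FACE="Arial Narrow" SIZE="+1">Text 2</FONT>
<FONT FACE="Bembo" SIZE="+1"></FONT>
<TABLE><TR><TD><B></B></TD></TR></TABLE>
<FONT FACE="Bembo" SIZE="+1"><br>Text 3</FONT>
<FONT FACE="Bembo" SIZE="+1">
</FONT>
ENDHTML

print strip_empty_tags($html);
```

With this sample, only the "Text 2" and "Text 3" lines should remain. Be aware that some empty pairs carry meaning in real pages (an empty <TD> that keeps a table column in place, <A NAME="..."></A> anchors, <TEXTAREA></TEXTAREA>), so you would probably want to exclude those tag names, or use a real HTML parser if the input is less tidy than assumed here.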
ASKER
That'll do!