Need regex to strip all HTML tags, except these

Hi all,

I need a JavaScript function that will accept a string of up to 8000 characters and remove all HTML tags except the following:

<EM></EM>
<STRONG></STRONG>
<U></U>
<em></em>
<strong></strong>
<u></u>

Optionally, the function should also strip attributes from any <P></P> tags, leaving behind the <P> tags without attributes. Example:

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 20pt"><U>Test</U></SPAN><FONT size=3> Hello. This is the </FONT><SPAN style="FONT-SIZE: 8pt"><STRONG>example.</STRONG></SPAN></P>

Becomes:

<P><U>Test</U> Hello. This is the <STRONG>example.</STRONG></P>

I'll increase points for a working solution that includes the <p> attributes portion.

Thanks,
SquareHead

Roonaan commented:
Here you go:

<html>
<head>
<script type="text/javascript">
  function stripHtml(str) {
    var newstr = str;
    var regExp = /<\/?(\w+)(.*?)>/ig;
    var i = 0, mt, oldstr, tag, pars, repl; // declare locals so they do not leak as globals
    while(i++ < 10 && (mt = regExp.exec(str))) {
      oldstr = mt[0];
      tag    = mt[1];
      pars   = mt[2];
      if(tag.match(/em|strong|u|p/i)) {
        repl = oldstr.replace(pars,'');
      } else {
        repl = '';
      }
      newstr = newstr.replace(oldstr, repl);
    }
    return newstr;
  }
</script>
</head>
<body>
<form>
<p>
  <b>In</b>
  <textarea name="myIn" rows="10" cols="60"><P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 20pt"><U>Test</U></SPAN><FONT size=3> Hello. This is the </FONT><SPAN style="FONT-SIZE: 8pt"><STRONG>example.</STRONG></SPAN></P></textarea>
  <br/><input type="button" value="go" onclick="this.form.myText.value=stripHtml(this.form.myIn.value);" />
</p>
<p>
  <b>Out</b>
  <textarea name="myText" rows="10" cols="60"></textarea>
</p>
</form>
</body>
</html>

-r-
SquareHead (author) commented:
Thanks. Close. Tried it and the SPAN tags remain...
Roonaan commented:
Yes, my second regexp was erroneous.

/..|p|../i also matches span, because "span" contains a "p". It should have been /^(..|p|..)$/i

<html>
<head>
<script type="text/javascript">
  function stripHtml(str) {
    var newstr = str;
    var regExp = /<\/?(\w+)(.*?)>/ig;
    var i = 0, mt, oldstr, tag, pars, repl; // declare locals so they do not leak as globals
    while(i++ < 10 && (mt = regExp.exec(str))) {
      oldstr = mt[0];
      tag    = mt[1];
      pars   = mt[2];
      if(tag.match(/^(em|strong|u|p)$/i)) {
        repl = oldstr.replace(pars,'');
      } else {
        repl = '';
      }
      newstr = newstr.replace(oldstr, repl, "g");
    }
    regExp = null;
    return newstr;
  }
</script>
</head>
<body>
<form>
<p>
  <b>In</b>
  <textarea name="myIn" rows="10" cols="60"><P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 20pt"><U>Test</U></SPAN><FONT size=3> Hello. This is the </FONT><SPAN style="FONT-SIZE: 8pt"><STRONG>example.</STRONG></SPAN></P></textarea>
  <br/><input type="button" value="go" onclick="this.form.myText.value=stripHtml(this.form.myIn.value);" />
</p>
<p>
  <b>Out</b>
  <textarea name="myText" rows="10" cols="60"></textarea>
</p>
</form>
</body>
</html>

-r-
SquareHead (author) commented:
Thanks Roonaan, we are getting closer. I tested with your example, and the first opening SPAN tag is removed, but the closing /SPAN tag remains, as do any other SPAN pairs...
Roonaan commented:
Did you also re-copy the below line?
newstr = newstr.replace(oldstr, repl, "g");

-r-
SquareHead (author) commented:
Example:

<P class=MsoNormal style="MARGIN: 0in 0in 0pt" bla="sdjhhs f dhjfs j sdfhksj"><SPAN style="FONT-SIZE: 20pt"><U>Test</U></SPAN><FONT size=3> Hello. This is the </FONT><SPAN style="FONT-SIZE: 8pt"><STRONG>example.</STRONG></SPAN> <span><em><strong>howdy</strong></em></span></P>


Result:

<P><U>Test</U> Hello. This is the <STRONG>example.</STRONG></SPAN> <span><em><strong>howdy</strong></em></span></P>
SquareHead (author) commented:
Yes, that line is in there.
Roonaan commented:
Pff, I'm being stupid.

Please remove the i++<10 part from the while() line.
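
For anyone reading later: the `i++ < 10` guard caps the loop at ten matches, so only the first ten tags in the input are processed and everything after them (the second `</SPAN>` and the lowercase `<span>` pair in the example above) is left alone. Below is a minimal sketch of the same idea written as a single `replace` with a callback, so there is no loop counter to get wrong. It is not the exact code from this thread; it assumes attribute values never contain a `>` character, and it is meant for cleaning up pasted markup, not for sanitizing untrusted input.

```javascript
// Sketch: strip every tag except EM, STRONG, U and P,
// and drop all attributes from the tags that are kept.
function stripHtml(str) {
  var allowed = /^(em|strong|u|p)$/i;
  // Capture group 1: optional "/" of a closing tag; group 2: the tag name.
  // [^>]* swallows any attributes (assumes no ">" inside attribute values).
  return str.replace(/<(\/?)(\w+)[^>]*>/g, function (match, slash, tag) {
    return allowed.test(tag) ? '<' + slash + tag + '>' : '';
  });
}
```

Called on the longer example posted above (the one with the `bla` attribute and the lowercase `span` pair), this should return `<P><U>Test</U> Hello. This is the <STRONG>example.</STRONG> <em><strong>howdy</strong></em></P>`.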

-r-
SquareHead (author) commented:
Perfect -- thanks!
SquareHead (author) commented:
Thanks Roonaan, I increased points from 250 to 500 -- as if you needed them ;-)
Roonaan commented:
Thanx,

Points are nice to get me to my next certification earlier. Even 1000 pts help me work through the 460,390 pts I need for my next certification :-D

-r-