Posted on 2011-10-03
The basic premise is that the string will be markup text - HTML, XML, or something with custom tags.
I want to remove ALL markup from the string - everything enclosed in < and >. However, I want to allow certain tags to remain. <b> and </b>, for instance, are OK, as are <i> and </i>, and a few others.
So, I know it's pretty simple to apply a RegEx that removes everything between < and >. But how do I create a library of tags I want it to ignore?
I would appreciate the answer in code so there's no ambiguity. For the RegEx, I was using this from RegEx Library:
A little long, but it matches tags with or without attribute(s) enclosed in single or double quotes. If you know of a better one for this purpose, please use it.
I would show my code ... but it's a mess. I'm over-thinking it, and it's not working. I know there's a simpler way to do this.
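For illustration, here is a minimal sketch of one simpler way, assuming Python and its standard `re` module (the post does not name a language). The names `ALLOWED_TAGS`, `ANY_TAG`, `TAG_NAME`, and `strip_tags` are hypothetical, and the patterns are deliberately simplified stand-ins, not the RegEx Library pattern mentioned above. The idea is to match every `<...>` span and let a callback decide whether to keep it or drop it.

```python
import re

# Hypothetical allow-list; the names come from the examples in the post.
ALLOWED_TAGS = {"b", "i"}

# Simplified assumption: a "tag" is anything from '<' to the next '>'.
ANY_TAG = re.compile(r"<[^>]*>")
# Pulls the tag name out of an opening or closing tag such as <b> or </b>.
TAG_NAME = re.compile(r"</?\s*([A-Za-z][A-Za-z0-9]*)")

def strip_tags(text, allowed=ALLOWED_TAGS):
    """Remove every <...> span unless its tag name is in the allow-list."""
    def keep_or_drop(match):
        tag = match.group(0)
        name = TAG_NAME.match(tag)
        if name and name.group(1).lower() in allowed:
            return tag  # allowed tag: leave it exactly as written
        return ""       # anything else (including comments) is removed
    return ANY_TAG.sub(keep_or_drop, text)

sample = '<p>Hello <b>bold</b> and <i>it</i> <a href="x">link</a><!-- c --></p>'
print(strip_tags(sample))
```

Note the limits of this sketch: an allowed tag is kept together with any attributes it carries, a `>` inside a quoted attribute value ends the match early, and ordinary text such as `a < b and c > d` would be treated as a tag.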
One thought I had was to change < and > to [ and ] for all the tags I wanted to keep, then run the RegEx replace, and then change them back. HOWEVER, that would also change ordinary [ and ], possibly messing up the original text.
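The swap idea can be made safer by using placeholder characters that should never occur in normal text instead of [ and ]. A rough sketch, again assuming Python; the sentinels `OPEN` and `CLOSE` are Unicode private-use code points that I am assuming are absent from the input, and `strip_tags_with_placeholders` is a hypothetical name:

```python
import re

ALLOWED_TAGS = ("b", "i")  # hypothetical allow-list from the post's examples
# Private-use code points, assumed (not guaranteed) to be absent from the input.
OPEN, CLOSE = "\uE000", "\uE001"

def strip_tags_with_placeholders(text, allowed=ALLOWED_TAGS):
    names = "|".join(re.escape(t) for t in allowed)
    # 1. Hide the allowed tags (bare form only, no attributes) behind sentinels.
    keep = re.compile(r"<(/?(?:%s))>" % names, re.IGNORECASE)
    text = keep.sub(lambda m: OPEN + m.group(1) + CLOSE, text)
    # 2. Remove everything still enclosed in < and >.
    text = re.sub(r"<[^>]*>", "", text)
    # 3. Restore the hidden tags.
    return text.replace(OPEN, "<").replace(CLOSE, ">")

print(strip_tags_with_placeholders("list[0] <p><b>bold</b> <u>u</u></p>"))
```

Ordinary [ and ] in the text are left alone because they are never used as placeholders. This version only keeps the bare forms such as `<b>` and `</b>`; an allowed tag written with attributes is stripped like any other.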