Avatar of elfeffe

Remove all the HTML tags from a string except the tags I want to show

Hi all,
I need to remove all the HTML tags from a string except the tags I want to show.
I am programming in Visual Basic.

Avatar of skiltz

As you are aware, there are heaps of HTML tags to consider. I am pretty sure there are some ready-made DLLs you can download. Other than that, you can use a regular expression to find the matches and replace them.
Avatar of elfeffe


Both solutions should be ok for my needs. The problem is that I can't find any DLL or regular expression which works as I need.
Avatar of elfeffe


That solution doesn't take a "white list" of allowed tags.
e.g. I want to remove all HTML tags except: <h1>,<h1/>, <br>,<br /> , <span>,<span/>, etc...
Thanks anyway.
I created this one.

copy to bin directory  

Add reference to DLL

Dim clean As New CleanHTML.clean
clean.RemoveH1Tags = false

Dim result as String
result = clean(someugulyText)

I haven't tested it too much, but it serves my purpose.
Avatar of elfeffe


Thank you for the effort, but the clean class has some disadvantages:

1- We have to implement a Boolean member for each tag: clean.RemoveH1Tags, clean.RemoveDIVTags, and so on for every tag I want to use.
2- The method should remove all HTML tags except the white list by default; I don't want to set every Boolean to "true". That would be a pain with the current design.

What would be ok is a method like RemoveHtmlTags(text As String, allowedTags As ArrayList), for example.

I appreciate your help but need a more accurate solution.
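
For illustration, here is a minimal sketch of a helper with that shape, written in JavaScript like the snippet later in this thread. The name removeHtmlTags and its parameters are hypothetical and not part of any DLL mentioned here. It assumes the tags are simple enough for a regular expression, i.e. no ">" inside attribute values.

```javascript
// Hypothetical helper: strips every tag whose name is not in allowedTags.
function removeHtmlTags(text, allowedTags) {
  // Escape regex metacharacters so each tag name is treated literally.
  const names = allowedTags.map((t) => t.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
  if (names.length === 0) {
    // Nothing is whitelisted: remove every tag.
    return text.replace(/<[^>]+>/g, "");
  }
  // The negative lookahead skips opening and closing tags whose name is whitelisted;
  // \b makes sure the whole name matches, so "i" does not cover "iframe".
  const pattern = new RegExp("<(?!\\/?(?:" + names.join("|") + ")\\b)[^>]+>", "gi");
  return text.replace(pattern, "");
}

console.log(removeHtmlTags('<div><h1>Title</h1><br /><span class="x">text</span></div>', ["h1", "br", "span"]));
```

Everything is removed by default, and only the names passed in the list survive, so no per-tag Boolean is needed.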

Why don't you wrap skiltz's DLL in your own shared/static wrapper class? That way you can define your own input to it and only have to do the mapping once. The alternative is probably to write your own using regex replaces, as I've not heard of a "selective HTML stripper" in my .NET days/years...
Avatar of Zvonko
I would do it in three steps:


var theText = "<div>i.e: I want to remove all html tags except:<div><p></p><h1>,<h1/>, <br>,<br /> , <span>,<span/>, etc...";

var whiteList = /<((p|h1|br|span)\s*\/?>)/gi;

theText = theText.replace(whiteList, "\x01$1");
theText = theText.replace(/<[^>]+>/g,"");
theText = theText.replace(/\x01/g, "<");


Unless I'm mistaken, this seems to be dead-easy using a regular expression:

It takes a |-separated whitelist; currently div and span are white-listed.
Avatar of elfeffe


mreuring: you are very close to the correct answer, but I have a couple of doubts:

1- Testing the regular expression, I realized the iframe tag passed the filter when it shouldn't. The problem is that I include the <i> tag in my white list, and it probably allows iframe because it begins with the letter "i", doesn't it?
2- How can I remove self-closed tags such as <input type="typebox" id="text1" />? I don't mind if an additional regular expression is required; I can do the filtering in more than one step.

Zvonko: could you provide the source in VB or C# .NET? I need to do the processing on the server side.

Thanks for responses!
Avatar of elfeffe


Forget about the second point. It failed because of the same issue as point 1: the tag begins with "i".
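
The "i" versus "iframe" overlap is typical of an alternation that is not followed by a word boundary. The expression under discussion is not shown in this thread, so both patterns below are illustrative assumptions of that general shape, with "i" and "br" as an example whitelist:

```javascript
// Illustrative patterns only; the thread does not show the original expression.
const withoutBoundary = /<(?!\/?(?:i|br))[^>]+>/gi; // "i" also shields <iframe> and <input>
const withBoundary = /<(?!\/?(?:i|br)\b)[^>]+>/gi;  // \b requires the tag name to end right after "i"

const sample = '<i>ok</i><iframe src="x"></iframe><input type="text" id="text1" />';

const loose = sample.replace(withoutBoundary, "");
const strict = sample.replace(withBoundary, "");
console.log(loose);  // iframe and input survive
console.log(strict); // only the <i> pair survives
```

With the boundary in place the self-closed input tag is removed by the same expression, so no extra step is needed for it.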
Avatar of elfeffe


I am trying to modify mreuring's solution with a two-regular-expression approach, but without success:

The first one just looks for <tag>
e.g. <br>  or <br/>
open + tag + close
Result: it passes

If the first one doesn't pass, try the second:

The second one looks for <tag ....>
e.g. <a href="url"> or <strong class="myCSS">text</strong>
The tag plus at least one blank space
Result: it passes

This way it doesn't confuse tags that begin like other tags (the iframe problem mentioned before).
Maybe this idea helps.
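
The two checks described above could be sketched roughly as follows, in JavaScript like the earlier snippet. The function name stripExceptTwoStep is hypothetical, and letting the first check also accept closing tags such as </strong> is an added assumption, since the description only mentions <tag> and <tag/>.

```javascript
// Hypothetical helper: keeps a tag only if it passes one of the two checks.
function stripExceptTwoStep(text, allowedTags) {
  const names = allowedTags.join("|"); // assumes plain tag names without regex metacharacters
  // Check 1: bare tag such as <br>, <br/> or </strong> (the closing form is an added assumption).
  const bare = new RegExp("^<\\/?(?:" + names + ")\\s*\\/?>$", "i");
  // Check 2: tag name followed by at least one whitespace character, e.g. <a href="url">.
  const withAttributes = new RegExp("^<(?:" + names + ")\\s+[^>]*>$", "i");
  // Visit every tag and keep it only when one of the checks passes.
  return text.replace(/<[^>]+>/g, (tag) =>
    bare.test(tag) || withAttributes.test(tag) ? tag : ""
  );
}

console.log(stripExceptTwoStep('<a href="url">link</a> <iframe src="x">f</iframe>', ["a", "i"]));
```

Because each check requires either ">" or whitespace directly after the tag name, <iframe> never passes for a whitelisted "i".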
Avatar of Zvonko

Avatar of elfeffe



You got it, that is the right answer for my question. Ty very much!!!


I had to change the code because the last replace didn't work as expected. I wrote this in VB:

Imports System.Text.RegularExpressions ' at the top of the file

Dim theText As String = "<i>italic word here</i> lorem ipsum <iframe>iframe here</iframe>"
' Mark the whitelisted tags; VB strings have no escape sequences, so "\x01" is inserted as literal text
theText = Regex.Replace(theText, "<((i|h1|br|span)\s*/?>)", "\x01$1", RegexOptions.IgnoreCase)
' Strip every remaining tag
theText = Regex.Replace(theText, "<[^>]+>", "")
theText = theText.Replace("\x01", "<") 'Here I use a simple string replace because the regex replace didn't work for me

The solution provided failed with the following example:
text = "<i>italic word here</i> lorem ipsum <iframe>iframe here</iframe>"
The result was:
<i>italic word here lorem ipsum iframe here

As you can see, it kept the opening <i> tag but dropped the closing </i>, resulting in malformed markup for my page.
Thanks anyway for trying it, and thanks for the JavaScript -> C# conversion. It's a pity I can't give you the points.

Kind regards.
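
For reference, the lost closing tag comes from the whitelist pattern accepting only "<" followed directly by the tag name, so </i> is never marked and gets stripped in the second step. A minimal sketch of the same three-step idea with the pattern extended to closing tags and attributes (an adjustment that was not posted in this thread; the function name keepWhitelisted is hypothetical):

```javascript
// Sketch: same three steps, with the whitelist extended to closing tags and attributes.
function keepWhitelisted(text) {
  const PLACEHOLDER = "\x01"; // control character assumed not to occur in the input
  // \/? admits </i>; \b stops "i" from matching <iframe>; [^>]* admits attributes.
  const whiteList = /<(\/?(?:i|h1|br|span)\b[^>]*>)/gi;
  return text
    .replace(whiteList, PLACEHOLDER + "$1") // 1. mark allowed tags
    .replace(/<[^>]+>/g, "")                // 2. strip every remaining tag
    .replace(/\x01/g, "<");                 // 3. restore the marked tags
}

console.log(keepWhitelisted("<i>italic word here</i> lorem ipsum <iframe>iframe here</iframe>"));
```

On the failing example this keeps both <i> and </i> and removes the iframe tags.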
In all fairness, if Zvonko's code helped you work towards a solution for your problem, at least a point split would be fair. Please consider it. If you want to split the points, you can request a re-open from Community Support: 
No problem for me at all regarding points. Thank you very much for the feedback. That is worth more to me on EE than billions of points ;-)
Avatar of elfeffe


I am new to EE and really didn't know whether splitting the points was the best choice, since mreuring gave the exact answer. Anyway, I have no problem with splitting the points. It's true that Zvonko tried to resolve the problem, and I appreciate his effort. I'll request a re-open as soon as possible.

I'm not an expert on how to divide points, but I'll give you my reasoning :)

When I ask a question myself, a comment that was useful to me, even though it didn't specifically answer my question, would get a small portion of the points as an assist. It's a simple way to express some gratitude. I'm personally not too fussed; like Zvonko, I'd rather get a written compliment or feedback, so that I know what I did or didn't do right :)

In this case, more specifically, I may have provided the regular expression, but it seems that the translation to C# was quite useful to you as well. Sounds like an assist to me ;)

We solved your problem, we're all happy with the outcome so far, good on ya

Avatar of elfeffe


Done :)
Thanks, elfeffe, for going the extra mile for me and granting me points. I do appreciate your care.

And special thanks to you mreuring. As long as we have Experts like you on EE I am sure that EE will survive :)