Solved

Remove all the HTML tags from a string except the tags I want to show

Posted on 2007-10-02
Medium Priority
Last Modified: 2008-01-09
Hi all,
I need to remove all the HTML tags from a string except the tags I want to show.
I am programming in ASP.NET with Visual Basic.

Thanks.
Question by:elfeffe
21 Comments
 
LVL 7

Expert Comment

by:skiltz
ID: 19997442
As you are aware, there are heaps of HTML tags to consider. I am pretty sure there are some ready-made DLLs you can download. Other than that, you can use a regular expression to find the matches and replace them.
 

Author Comment

by:elfeffe
ID: 19997466
Both solutions would be fine for my needs; the problem is that I can't find any DLL or regular expression which works the way I need.

Author Comment

by:elfeffe
ID: 19997508
That solution doesn't get a "white list" of allowed tags.
e.g.: I want to remove all HTML tags except <h1>, <h1/>, <br>, <br />, <span>, <span/>, etc.
Thanks anyway.
 
LVL 7

Expert Comment

by:skiltz
ID: 19997547
I created this one. www.erealestate.co.nz/cleanHtml.dll

copy to bin directory  

Add reference to DLL

Dim clean As New CleanHTML.clean
clean.RemoveH1Tags = false
etc

Dim result as String
result = clean(someugulyText)

I haven't tested it too much, but it serves my purpose.
 

Author Comment

by:elfeffe
ID: 19998177
Thank you for the effort, but the clean class has some disadvantages:

1- We have to implement a boolean member for each tag: clean.RemoveH1Tags, clean.RemoveDIVTags, and so on for every tag I want to use.
2- The method should remove all HTML tags except the white list by default. I don't want to set each boolean to "true"; that would be a pain the way it works now.

What would be OK is a method like RemoveHtmlTags(text As String, allowedTags As ArrayList), for example.

I appreciate your help, but I need a more accurate solution.
Regards.
 
LVL 7

Expert Comment

by:SimonBlake
ID: 19999189

Why don't you wrap skiltz's DLL in your own shared/static wrapper class? That way you can define your own input to it and only have to do the mapping once. The alternative is probably to write your own using regex replaces, as I've not heard of a "selective HTML stripper" in my .NET days/years...
 
LVL 63

Expert Comment

by:Zvonko
ID: 20001694
I would do it in three steps:

<script>

var theText = "<div>i.e: I want to remove all html tags except:<div><p></p><h1>,<h1/>, <br>,<br /> , <span>,<span/>, etc...";

var whiteList = /<((p|h1|br|span)\s*\/?>)/gi;

theText = theText.replace(whiteList, "\x01$1");
theText = theText.replace(/<[^>]+>/g,"");
theText = theText.replace(/\x01/g, "<");

alert(theText);
</script>

 
LVL 17

Expert Comment

by:mreuring
ID: 20002894
Unless I'm mistaken, this seems to be dead-easy using a regular expression:
<(?!/?(div|span))[^>]+>

It takes a |-separated whitelist; currently div and span are whitelisted.
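
To make the behaviour of that pattern concrete, here is a minimal JavaScript sketch of it applied to a sample string. The sample input, the variable names, and the "g"/"i" flags are assumptions added for illustration; only the pattern itself comes from the comment above.

```javascript
// Negative lookahead: after "<", refuse to match if an optional "/" is
// followed by one of the whitelisted names. Everything else up to ">" is removed.
const stripAllButDivSpan = /<(?!\/?(div|span))[^>]+>/gi;

const sampleInput = '<div><p>Hello <b>world</b></p><span class="x">!</span></div>';
const strippedSample = sampleInput.replace(stripAllButDivSpan, "");

console.log(strippedSample);
```

Only the tags are removed; the text between them stays in place.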
 

Author Comment

by:elfeffe
ID: 20004923
mreuring: you are very close to the correct answer, but I have a couple of doubts:

1- Testing the regular expression, I realized that the iframe tag passes the filter when it shouldn't. The problem is that I include the <i> tag in my white list, and it probably allows iframe because the name begins with the letter "i", doesn't it?
2- How can I remove self-closed tags such as <input type="typebox" id="text1" />? I don't mind if an additional regular expression is required; I can do the filtering in more than one step.

Zvonko: could you provide the source in VB or C# .NET? I need to do the processing on the server side.

Thanks for responses!
 

Author Comment

by:elfeffe
ID: 20005034
Forget about the second point; it failed because of the same issue as point 1: the tag name begins with "i".
 

Author Comment

by:elfeffe
ID: 20005507
I am trying to modify mreuring's solution with a two-regular-expression approach, but without success so far:

The first one just looks for <tag>
e.g. <br> or <br/>
open + tag + close
Result: it passes

If the first one doesn't match, try the second:

The second one looks for <tag ....>
e.g. <a href="url"> or <strong class="myCSS">text</strong>
The tag plus at least one blank space
Result: it passes

This way it doesn't confuse tags whose names begin like other tags (the iframe problem mentioned before).
Maybe this idea helps.
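
The idea above (a whitelisted name only counts when it is followed directly by the end of the tag or by whitespace) can probably be folded into a single pattern. The following is only an illustrative JavaScript sketch, not code from the thread; the whitelist `a|strong|br|i`, the sample string, and the variable names are assumptions chosen for the example.

```javascript
// Whitelisted names, "|"-separated (assumed for this example).
const allowedNames = "a|strong|br|i";

// A tag is kept only if, after an optional "/", a whitelisted name is
// immediately followed by whitespace, "/" or ">". Anything else is stripped.
const stripUnlisted = new RegExp(
  "<(?!\\/?(?:" + allowedNames + ")(?=[\\s/>]))[^>]+>",
  "gi"
);

const mixedInput = '<i>x</i> <iframe src="a">y</iframe> <a href="u">l</a><br/>';
const mixedOutput = mixedInput.replace(stripUnlisted, "");

console.log(mixedOutput);
```

Because the character after the name is checked, `<iframe>` no longer slips through on the strength of the whitelisted `i`.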
 
LVL 63

Assisted Solution

by:Zvonko
Zvonko earned 700 total points
ID: 20009532
Like this:

// requires: using System.Text.RegularExpressions;
string theText = "<div>i.e: I want to remove all html tags except:<div><p></p><h1>,<h1/>, <br>,<br /> , <span>,<span/>, etc...";
theText = Regex.Replace(theText, @"<((p|h1|br|span)\s*/?>)", @"\x01$1", RegexOptions.IgnoreCase);
theText = Regex.Replace(theText, "<[^>]+>", "");
theText = Regex.Replace(theText, @"\x01", "<");


 
LVL 17

Accepted Solution

by:mreuring
mreuring earned 1300 total points
ID: 20009622
You were right, it matched a little too widely. I started using it myself last night on some old documents I wanted to clean up, and I think this one's got it right:
<(?!/?(html|head|meta|title|body|p|b|i)\b)[^>]+>

The difference is that I included a \b (word boundary) after the whitelist. In effect it'll only whitelist <i> or <i/> etc., but not <iframe>.

This way the regular expression matches all opening, closing and empty tags except the ones you whitelist. It won't remove the content within those tags, though.
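
As a rough illustration of how this pattern could sit behind the RemoveHtmlTags(text, allowedTags) signature requested earlier in the thread, here is a JavaScript sketch. The function name `removeHtmlTags`, its parameters, and the sample call are hypothetical and not part of any library; the same pattern string should carry over to .NET's Regex class, but that port is not shown or tested here.

```javascript
// Hypothetical helper: strips every tag whose name is not in allowedTags.
function removeHtmlTags(text, allowedTags) {
  // Escape regex metacharacters so a tag name cannot alter the pattern.
  const names = allowedTags.map((t) => t.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));

  // With an empty whitelist, simply remove every tag.
  if (names.length === 0) {
    return text.replace(/<[^>]+>/g, "");
  }

  // Same shape as the pattern above: optional "/", a whitelisted name, then \b.
  const pattern = new RegExp("<(?!\\/?(?:" + names.join("|") + ")\\b)[^>]+>", "gi");
  return text.replace(pattern, "");
}

const cleanedText = removeHtmlTags(
  "<i>italic word here</i> lorem ipsum <iframe>iframe here</iframe>",
  ["i", "h1", "br", "span"]
);
console.log(cleanedText);
```

Two things to keep in mind with this sketch: \b also sees a boundary before a hyphen, so a name such as `<i-custom>` would be kept when `i` is whitelisted, and attributes on the allowed tags are passed through unchanged.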
 

Author Comment

by:elfeffe
ID: 20012989
mreuring:

You got it, that is the right answer to my question. Thank you very much!!!


Zvonko:

I had to change the code because the last replace didn't work as expected. I wrote this in VB:

Dim theText As String = "<i>italic word here</i> lorem ipsum <iframe>iframe here</iframe>"
theText = System.Text.RegularExpressions.Regex.Replace(theText, "<((i|h1|br|span)\s*/?>)", "\x01$1", RegexOptions.IgnoreCase)
theText = System.Text.RegularExpressions.Regex.Replace(theText, "<[^>]+>", "")
theText = theText.Replace("\x01", "<") 'Here I put a simple string replace because it didn't work for me in the old way

The solution provided failed with the following example:
text = "<i>italic word here</i> lorem ipsum <iframe>iframe here</iframe>"
The result was:
<i>italic word here lorem ipsum iframe here

As you can see, it kept the opening <i> tag but dropped the closing one, resulting in malformed markup for my page.
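
The missing closing tag appears to come from the whitelist pattern itself: `<((i|h1|br|span)\s*/?>)` accepts `<i>` and `<i/>` but has no optional leading slash, so `</i>` is never marked and gets stripped in the second step. Below is a hedged JavaScript sketch of the same three-step placeholder technique with that gap closed. The function name `keepWhitelisted` and its parameters are hypothetical, and the tag names are assumed to be plain words.

```javascript
// Hypothetical helper using the three-step placeholder technique.
function keepWhitelisted(text, allowedTags) {
  const marker = "\u0001"; // control character used as a temporary stand-in for "<"
  const names = allowedTags.join("|"); // assumes plain tag names, no regex metacharacters

  // Step 1: mark whitelisted opening, closing and self-closed tags
  // (attributes allowed) by swapping their "<" for the marker.
  const allowedPattern = new RegExp("<(\\/?(?:" + names + ")\\b[^>]*>)", "gi");
  let result = text.split(marker).join(""); // drop any marker already in the input
  result = result.replace(allowedPattern, marker + "$1");

  // Step 2: remove every tag that still starts with "<".
  result = result.replace(/<[^>]+>/g, "");

  // Step 3: turn the markers back into "<".
  return result.split(marker).join("<");
}

console.log(
  keepWhitelisted(
    "<i>italic word here</i> lorem ipsum <iframe>iframe here</iframe>",
    ["i", "h1", "br", "span"]
  )
);
```

Using a real control character as the marker, rather than the four literal characters `\x01`, avoids restoring the wrong text if the input happens to contain that literal sequence.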
Thanks anyway for trying it, and thanks for the JavaScript -> C# conversion. It's a pity I can't give you the points.

Kind regards.
 
LVL 17

Expert Comment

by:mreuring
ID: 20017588
In all fairness, if Zvonko's code helped you work towards a solution for your problem, at least a point split would be fair. Please consider it; if you want to split the points, you can request a re-open from Community Support: http://www.experts-exchange.com/Community_Support/General/ 
 
LVL 63

Expert Comment

by:Zvonko
ID: 20018307
The points are no problem for me at all. Thank you very much for the feedback. That is worth more to me on EE than a billion points ;-)
 

Author Comment

by:elfeffe
ID: 20020366
I am new to EE, and I really didn't know whether a point split was the best choice since mreuring gave the exact answer. Anyway, I have no problem with splitting the points. It's true that Zvonko tried to solve the problem, and I appreciate his effort. I'll request a re-open as soon as possible.

Cheers.
 
LVL 17

Expert Comment

by:mreuring
ID: 20020583
I'm not an expert on how to divide points, but I'll give you my reasoning :)

When I ask a question myself, a comment that was useful to me, even though it didn't specifically answer my question, gets a small portion of the points as an assist. It's a simple way to express some gratitude. I'm personally not too fussed; like Zvonko, I'd rather get a written compliment or feedback, so that I know what I did or didn't do right :)

In this case, more specifically, I may have provided the regular expression, but it seems that the translation to C# was quite useful to you as well. Sounds like an assist to me ;)

We solved your problem, we're all happy with the outcome so far, good on ya

 Martin
 

Author Comment

by:elfeffe
ID: 20025851
Done :)
 
LVL 63

Expert Comment

by:Zvonko
ID: 20026143
Thanks, elfeffe, for going the extra mile for me and granting me points. I do appreciate your care.

And special thanks to you mreuring. As long as we have Experts like you on EE I am sure that EE will survive :)


Question has a verified solution.
