HTML stripper in Java

errang
Hey,

I have a question about building an HTML stripper in Java. I know that it would involve parsing the HTML code and then removing the "untrusted" elements from it, but how would you detect things like links or attributes that carry JavaScript without being obvious script elements?

Appreciate any help on this.
Top Expert 2016

Commented:
What is an HTML stripper?

Author

Commented:
An HTML stripper is a program that reads in an HTML file and gets rid of the harmful JavaScript in it.

For example, if I somehow put an onmouseover attribute on my text, and it runs a script when someone moves the mouse over it... that kind of thing.

It's basically untrusted user data, because the attacker uses valid HTML and JavaScript to cause problems.
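As a rough illustration of the kind of check being asked about, here is a minimal sketch that assumes the HTML has already been parsed into a W3C DOM tree. The class and method names (`ScriptAttributeCheck`, `isScriptBearing`, `strip`) are made up for this example, and the list of URL attributes is an assumption, not a complete one. A blacklist like this is easy to get around, so a whitelist of allowed elements and attributes is generally the safer design.

```java
import java.util.Locale;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

// Hypothetical helper: flags attributes that can carry script.
final class ScriptAttributeCheck {

    /** True for on* event handlers and for javascript: URLs in URL attributes. */
    static boolean isScriptBearing(String name, String value) {
        String n = name.toLowerCase(Locale.ROOT);
        // Event handlers (onclick, onmouseover, ...) all start with "on".
        // This deliberately over-matches any other attribute starting with "on".
        if (n.startsWith("on")) {
            return true;
        }
        // Assumed (incomplete) set of attributes that hold URLs.
        if (n.equals("href") || n.equals("src") || n.equals("action")) {
            // Remove whitespace and control characters, which can be used
            // to disguise the scheme, before comparing.
            String v = value.replaceAll("[\\s\\p{Cntrl}]", "").toLowerCase(Locale.ROOT);
            return v.startsWith("javascript:");
        }
        return false;
    }

    /** Removes script-bearing attributes from el and all of its descendants. */
    static void strip(Element el) {
        NamedNodeMap attrs = el.getAttributes();
        // Iterate backwards because removing an attribute shrinks the map.
        for (int i = attrs.getLength() - 1; i >= 0; i--) {
            Node a = attrs.item(i);
            if (isScriptBearing(a.getNodeName(), a.getNodeValue())) {
                el.removeAttribute(a.getNodeName());
            }
        }
        for (Node c = el.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                strip((Element) c);
            }
        }
    }
}
```

Calling `ScriptAttributeCheck.strip(document.getDocumentElement())` on a parsed document would remove the flagged attributes in place. Script elements themselves, style attributes, and other vectors are not handled here.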

Top Expert 2016
Commented:
Another solution would be to clean the HTML using JTidy (http://jtidy.sourceforge.net/).

Then apply an XSLT stylesheet to the cleaned HTML so that only the wanted HTML elements are output.

HTML --> Tidy --> XHTML (strict HTML) + XSL --> XSLT --> XHTML without the unwanted tags/attributes etc.

It might be a bit overkill, but once implemented it would be easy to change the XSL stylesheet to add or remove elements without having to recompile anything.
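A minimal sketch of the XSLT step in that pipeline, assuming the Tidy step has already produced well-formed XHTML. The class name `XsltWhitelist`, the method `clean`, and the particular allowed elements and attributes are illustrative choices, not something specified above. It uses only the JDK's javax.xml.transform API. The stylesheet is inlined as a string only to keep the example self-contained; loading it from a file is what would let you change the rules without recompiling. If the Tidy output declares the XHTML namespace, the match patterns would need a namespace prefix, which this sketch leaves out.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.XMLConstants;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example class: whitelist filtering of XHTML with XSLT 1.0.
final class XsltWhitelist {

    // Assumed whitelist; adjust the element and attribute names as needed.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      // Allowed elements are copied; their attributes and children are filtered in turn.
      + "<xsl:template match='div|p|b|i|em|strong|ul|ol|li|a'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      // Allowed attributes: title anywhere, href on links only if http(s).
      + "<xsl:template match='@title'><xsl:copy/></xsl:template>"
      + "<xsl:template match='a/@href[starts-with(., \"http://\") or starts-with(., \"https://\")]'>"
      + "<xsl:copy/></xsl:template>"
      // Every other attribute (onmouseover, style, javascript: links, ...) is dropped.
      + "<xsl:template match='@*'/>"
      // script and style elements are dropped together with their content.
      + "<xsl:template match='script|style'/>"
      // Any other element: the tag is dropped but its children are still processed.
      + "<xsl:template match='*'><xsl:apply-templates select='node()'/></xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the whitelist stylesheet to a well-formed XHTML fragment. */
    static String clean(String xhtml) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("Input could not be transformed", e);
        }
    }
}
```

For example, `XsltWhitelist.clean("<div><p onmouseover=\"x()\">hi</p><script>alert(1)</script></div>")` would keep the div and p elements while dropping the onmouseover attribute and the script element. Because the rules say what is allowed rather than what is forbidden, anything not listed is removed by default.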


Author

Commented:
Thanks
