atifraees

How to build an HTML to text converter in Java

Can somebody please help me build an HTML to plain text converter that takes an HTML file as input and saves it in .txt format?
CEHJ

I can certainly give you a recommendation as to how *I* would do this.

The main problem is that the readily available Java parsers for HTML will usually fall over unless the quality of the markup is high, and it usually is not.

I would therefore get hold of 'Tidy for Java' (JTidy). This will turn the HTML into well-formed XML. It is then simple to parse that XML with SAX and keep only the character data, leaving you with the plain text.
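The SAX step described above can be sketched with the JDK's built-in parser alone. This is only an illustration under assumptions: the input is taken to be already well-formed (for example, after a tidying pass that is not shown here), and to contain no DOCTYPE declaration or named HTML entities such as `&nbsp;`, which a plain XML parser would need extra handling for. The class name `XhtmlTextExtractor` and the method `extractText` are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: extracts character data from well-formed XHTML.
public class XhtmlTextExtractor {

    /** SAX handler that collects character data and ignores every element. */
    static class TextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Returns the text content of the given markup; fails if it is not well-formed. */
    public static String extractText(String wellFormedXhtml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            TextHandler handler = new TextHandler();
            parser.parse(new InputSource(new StringReader(wellFormedXhtml)), handler);
            return handler.text.toString();
        } catch (Exception e) {
            // Ill-formed markup (e.g. unpaired tags) ends up here.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><h1>Title</h1><p>Hello <b>world</b>!</p></body></html>";
        System.out.println(extractText(xhtml));
    }
}
```

Note that the handler does nothing to separate block elements, so headings and paragraphs run together in the output; adding a newline in `endElement` for tags such as `p` or `h1` would be one way to keep some layout.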
antons061400

Do you need some formatted text on input, or do you only need to extract the text in any form?
atifraees

ASKER

Output text either formatted or unformatted is all right for me.
ASKER CERTIFIED SOLUTION
antons061400

This solution is only available to members.
That won't circumvent the problem, antons, which is that of ill-formed markup. HTML is frequently written with unpaired tags, one of the things that makes parsers fall over.

>>System.out.println("------------------------" + java.nio.charset.Charset.availableCharsets());

is Java 1.4 specific, but doesn't seem to be essential.

>>BufferedReader r = new BufferedReader(new InputStreamReader(hs, "windows-1250"));

is encoding-specific. What is this encoding - is it the same as Cp1252?
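On the encoding question: windows-1250 (Central European) and Cp1252, also known as windows-1252 (Western European), are different code pages. They agree on ASCII but assign different characters to many bytes above 0x7F. A small check, assuming both charsets are available in the running JDK (the class name `CodePageCompare` is made up for this example):

```java
import java.nio.charset.Charset;

// Hypothetical demo class: decodes one byte under two Windows code pages.
public class CodePageCompare {

    /** Decodes a single byte using the named charset. */
    public static String decode(byte b, String charsetName) {
        return new String(new byte[] { b }, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE8;
        // Print code points rather than glyphs so the console encoding does not matter.
        System.out.println("windows-1250: U+"
                + Integer.toHexString(decode(b, "windows-1250").charAt(0)));
        System.out.println("windows-1252: U+"
                + Integer.toHexString(decode(b, "windows-1252").charAt(0)));
    }
}
```

So a reader hard-coded to "windows-1250" would likely mis-decode accented characters in a file that was actually saved as Cp1252, and vice versa.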

System.out.println("------------------------" + java.nio.charset.Charset.availableCharsets());
was a mistake; forget this.

The process omits all tags and does not care whether they are paired or not.

Just try it and you will see.
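The accepted code is not visible in this thread, so the following is not that solution. It is only a minimal sketch of the general idea of dropping everything between '<' and '>' without checking whether tags are paired. The class name `TagStripper` and the method `stripTags` are hypothetical.

```java
// Hypothetical helper: removes anything that looks like a tag, paired or not.
public class TagStripper {

    /** Returns the input with every "<...>" run removed. */
    public static String stripTags(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean insideTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                insideTag = true;           // start skipping
            } else if (c == '>' && insideTag) {
                insideTag = false;          // tag ended, resume copying
            } else if (!insideTag) {
                out.append(c);              // ordinary text
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The <b> and <p> tags are never closed; the stripper does not mind.
        System.out.println(stripTags("<p>Hello <b>world<br>unclosed"));
    }
}
```

A scan this simple has clear limits: it leaves entities such as `&amp;` undecoded, keeps the contents of `script` and `style` elements, and is confused by a '>' inside an attribute value or a stray '<' in the text.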
No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:

Accept antons' comment as answer.

Please leave any comments here within the next seven days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!

TimYates
EE Cleanup Volunteer