atifraees
asked on
How to build an HTML to text converter in Java
Can somebody please help me build an HTML to plain text converter that takes an HTML file as input and saves it in txt format?
Do you need the text to keep some formatting, or do you only need to extract the text in any form?
ASKER
The output text can be formatted or unformatted; either is all right for me.
That won't circumvent the problem, antons, which is one of ill-formed markup. HTML is frequently written with unpaired tags, one of the things that makes parsers fall over.
>>System.out.println("---- ---------- ---------- " + java.nio.charset.Charset.availableCharsets());
is Java 1.4 specific, but doesn't seem to be essential.
>>BufferedReader r = new BufferedReader(new InputStreamReader(hs, "windows-1250"));
is encoding-specific. What is this encoding - is it the same as Cp1252?
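On the encoding question: windows-1250 (Cp1250) is the Central European Windows code page, while Cp1252 is the Western European one, so they are not the same. A minimal sketch to check this, assuming both charsets are available in the running JVM; the class and method names here are made up for illustration and are not from the posted solution:

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of the original solution.
class EncodingCheck {
    // Decode raw bytes using the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The same single byte maps to different characters in the two code pages.
        byte[] sample = { (byte) 0xA3 };
        for (String name : new String[] { "windows-1250", "windows-1252" }) {
            String s = decode(sample, name);
            System.out.println(name + " -> U+" + Integer.toHexString(s.charAt(0)).toUpperCase());
        }
    }
}
```

Reading a file with the wrong one of the two will therefore corrupt any non-ASCII characters in the output text.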
System.out.println("------ ---------- --------" + java.nio.charset.Charset.availableCharsets());
was a mistake; forget it.
The process omits all tags and does not care whether they are paired or not.
Just try it and you will see.
atifraees:
This old question needs to be finalized -- accept an answer, split points, or get a refund. For information on your options, please click here-> http:/help/closing.jsp#1
EXPERTS:
Post your closing recommendations! No comment means you don't care.
No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:
Accept antons' comment as answer.
Please leave any comments here within the next seven days.
PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!
TimYates
EE Cleanup Volunteer
The main problem is that the readily available Java parsers for HTML will usually fall over unless the quality of the HTML markup is high, which is usually not the case.
I would therefore get hold of 'Tidy for Java' (JTidy). This will turn the HTML into well-formed XML. It is then simple to parse that XML with SAX, leaving you with the plain text.
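The SAX half of that suggestion might look roughly like the sketch below. It assumes the markup has already been tidied into well-formed XML (the JTidy step is not shown) and that no DOCTYPE or HTML-only entities such as &nbsp; remain; the class name XhtmlTextExtractor and the method extractText are invented for the example:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of the SAX step only; input must already be well-formed XML.
class XhtmlTextExtractor {
    static String extractText(String wellFormedXml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Character data is the only thing we keep; element events are ignored.
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(wellFormedXml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String tidied = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(tidied));
    }
}
```

Overriding startElement or endElement as well would allow a line break to be emitted after block elements such as p or br, if some formatting in the text output is wanted.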