Solved

How to build an HTML to text converter in Java

Posted on 2003-03-02
Last Modified: 2012-06-22
Can somebody please help me build an HTML to plain text converter that takes an HTML file as input and saves it in .txt format?
Question by:atifraees
9 Comments
 
LVL 86

Expert Comment

by:CEHJ
ID: 8053786
I can certainly give you a recommendation as to how *I* would do this.

The main problem is that the readily available Java parsers for HTML will usually fall over unless the quality of the markup is high, which is usually not the case.

I would therefore get hold of 'Tidy for Java' (JTidy). This will turn the HTML into well-formed XML. It is then simple to SAX parse that XML, leaving you with the plain text.
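
As a rough illustration of the SAX step only, here is a minimal sketch. It assumes the markup has already been turned into well-formed XML (for example by JTidy, which is not invoked here) and uses only the JDK's built-in SAX parser. The class name XhtmlTextExtractor and the method extractText are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; the names are illustrative and not part of any library.
class XhtmlTextExtractor {

    /** Returns the character data of a well-formed XML document, ignoring all elements. */
    static String extractText(String wellFormedXml) {
        final StringBuilder text = new StringBuilder();
        // The handler ignores element events and only collects character data.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(wellFormedXml)), handler);
        } catch (Exception e) {
            // Assumption: callers prefer an unchecked exception for malformed input.
            throw new IllegalStateException("could not parse input as XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(extractText(xhtml)); // prints: TitleSome bold text.
    }
}
```

Note that no whitespace is inserted between block elements, and a real document with a DOCTYPE or named entities such as &nbsp; would need extra handling that this sketch leaves out.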
 
LVL 4

Expert Comment

by:antons061400
ID: 8055351
Do you need formatted text as output, or do you only need to extract the text in any form?
 

Author Comment

by:atifraees
ID: 8059705
Output text either formatted or unformatted is all right for me.
 
LVL 4

Accepted Solution

by:
antons061400 earned 120 total points
ID: 8062519
If you don't need formatted text, you can write an input stream that omits the HTML tags.
The following implementation also omits special characters (character entities such as &amp;), but you can modify it.


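import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

/**
 * Wraps another stream and drops everything between '<' and '>' (tags)
 * and between '&' and ';' (character entities).
 */
public class HtmlStream extends InputStream {

    private final InputStream is;
    private boolean inHtmlTag = false;
    private boolean inHtmlChar = false;
    private boolean isFinished = false;

    public HtmlStream(InputStream is) {
        this.is = is;
    }

    public void close() throws IOException {
        is.close();
    }

    public int read() throws IOException {
        if (isFinished) return -1;
        int c = is.read();
        // A tag or entity may be followed directly by another one, hence the outer loop.
        while ((c == '<') || (c == '&')) {
            if (c == '<') inHtmlTag = true;
            else inHtmlChar = true;
            // Skip input up to the terminator ('>' or ';') or the end of the stream.
            while ((inHtmlTag || inHtmlChar) && (c != -1)) {
                c = is.read();
                if (inHtmlTag && (c == '>')) {
                    inHtmlTag = false;
                    c = is.read();
                }
                if (inHtmlChar && (c == ';')) {
                    inHtmlChar = false;
                    c = is.read();
                }
            }
        }
        if (c == -1) isFinished = true;
        return c;
    }

    public static void main(String[] argv) {
        if (argv.length < 1) {
            System.err.println("usage: java HtmlStream <file>");
            return;
        }
        String file = argv[0];
        System.out.println("start test for file " + file);
        // try-with-resources closes the reader and the underlying streams.
        // The encoding is hard-coded; change it to match the input file.
        try (HtmlStream hs = new HtmlStream(new FileInputStream(file));
             BufferedReader r = new BufferedReader(new InputStreamReader(hs, "windows-1250"))) {
            String line = r.readLine();
            while (line != null) {
                System.out.println(line);
                line = r.readLine();
            }
        } catch (IOException e) {
            System.out.println("The error " + e);
            e.printStackTrace();
        }
        // Leftover debug output; not needed for the conversion.
        System.out.println("------------------------" + java.nio.charset.Charset.availableCharsets());
    }
}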
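
One possible modification, sketched here under stated assumptions, is to translate a few common entities instead of dropping them all. The class EntityAwareStripper, its toText method, and the small entity table are hypothetical and deliberately incomplete. The sketch works on a String rather than a stream to stay short, and it maps &nbsp; to an ordinary space as a simplification.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical variant of the filter above; names and the entity table are illustrative.
class EntityAwareStripper {

    // Only a handful of named entities are mapped; any other entity is dropped.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("quot", "\"");
        ENTITIES.put("nbsp", " "); // simplification: non-breaking space becomes a plain space
    }

    static String toText(String html) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<') {
                // Skip to the closing '>' (or to the end if the tag is unterminated).
                int end = html.indexOf('>', i);
                i = (end == -1) ? html.length() : end + 1;
            } else if (c == '&') {
                int end = html.indexOf(';', i);
                if (end == -1) {
                    // No terminator anywhere: treat the ampersand as ordinary text.
                    out.append(c);
                    i++;
                } else {
                    String replacement = ENTITIES.get(html.substring(i + 1, end));
                    if (replacement != null) out.append(replacement);
                    i = end + 1;
                }
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Fish & chips < 5 EUR
        System.out.println(toText("<p>Fish &amp; chips &lt; 5&nbsp;EUR</p>"));
    }
}
```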
 
LVL 86

Expert Comment

by:CEHJ
ID: 8063477
That won't circumvent the problem, antons, which is one of ill-formed markup. HTML is frequently written with unpaired tags, one of the things that makes parsers fall over.

>>System.out.println("------------------------" + java.nio.charset.Charset.availableCharsets());

is Java 1.4 specific, but doesn't seem to be essential.

>>BufferedReader r = new BufferedReader(new InputStreamReader(hs, "windows-1250"));

is encoding-specific. What is this encoding - is it the same as Cp1252?

 
LVL 4

Expert Comment

by:antons061400
ID: 8064018
System.out.println("------------------------" + java.nio.charset.Charset.availableCharsets());
is a mistake; forget it.

The process omits all tags and does not care whether they are paired or not.

Just try it and you will see.
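
A quick way to try the idea without a file is sketched below. It is not the HtmlStream class itself but a simplified stand-in with a single in-tag flag, and the class name UnpairedTagDemo and method stripTags are hypothetical. The first call shows that unpaired tags are skipped like any others; the second shows a limitation of this kind of filter, where a stray '<' in running text is treated as the start of a tag.

```java
// Hypothetical demo; it mirrors the tag-skipping idea with a simple in/out-of-tag flag.
class UnpairedTagDemo {

    static String stripTags(String html) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;          // everything up to the next '>' is skipped
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Unpaired <p>, <br> and <li>: prints firstsecondthird
        System.out.println(stripTags("<p>first<br>second<li>third"));
        // A stray '<' swallows text up to the next '>': prints "1  2" (two spaces)
        System.out.println(stripTags("1 < 2 and 3 > 2"));
    }
}
```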
 

Expert Comment

by:CleanupPing
ID: 9058975
atifraees:
This old question needs to be finalized -- accept an answer, split points, or get a refund.  For information on your options, please click here-> http:/help/closing.jsp#1 
EXPERTS:
Post your closing recommendations!  No comment means you don't care.
 
LVL 35

Expert Comment

by:TimYates
ID: 9721228
No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:

Accept antons' comment as answer.

Please leave any comments here within the next seven days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!

TimYates
EE Cleanup Volunteer