  • Status: Solved
  • Priority: Medium
  • Security: Public

How to build an HTML to text converter in Java

Can somebody please help me build an HTML to plain text converter that takes an HTML file as input and saves the result in .txt format?
Asked by: atifraees
 
CEHJ commented:
I can certainly give you a recommendation as to how *I* would do this.

The main problem is that the readily available Java parsers for HTML will usually fall over unless the quality of the markup is high, which is usually not the case.

I would therefore get hold of 'Tidy for Java' (it may be called JTidy). This will turn the HTML into well-formed XML. It is then simple to SAX-parse that XML, leaving you with the plain text.
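
The SAX step of that approach can be sketched with the JDK alone. This is a minimal sketch, assuming the markup has already been turned into well-formed XML (the Tidy step is not shown); the class name SaxTextExtractor and the method extractText are illustrative names only. Real Tidy output may also carry a DOCTYPE or named entities, which this sketch does not try to handle.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects the character data of a well-formed XML document.
class SaxTextExtractor extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    // SAX may deliver one text node in several chunks, so append each chunk.
    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Parses the given XML string and returns its concatenated text content.
    static String extractText(String wellFormedXml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        SaxTextExtractor handler = new SaxTextExtractor();
        parser.parse(new InputSource(new StringReader(wellFormedXml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><p>Hello <b>plain</b> text</p></body></html>";
        System.out.println(extractText(xhtml)); // prints "Hello plain text"
    }
}
```

Element boundaries are simply ignored here, so block-level tags produce no line breaks; overriding endElement to append a newline for chosen tags would be one way to keep some layout.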
 
antons061400 commented:
Do you need formatted text in the output, or do you only need to extract the text in any form?
 
atifraees (author) commented:
Formatted or unformatted output text are both all right for me.
 
antons061400 commented:
If you don't need formatted text, you can write an input stream that omits the HTML tags.
The following implementation also omits special characters (entities written as &name;), but you can modify it.


import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

/**
 * Wraps another InputStream and drops everything between '<' and '>' (tags)
 * and between '&' and ';' (character entities).
 */
public class HtmlStream extends InputStream {

    private final InputStream is;
    private boolean inHtmlTag = false;
    private boolean inHtmlChar = false;
    private boolean isFinished = false;

    public HtmlStream(InputStream is) {
        this.is = is;
    }

    @Override
    public void close() throws IOException {
        is.close();
    }

    @Override
    public int read() throws IOException {
        if (isFinished) return -1;
        int c = is.read();
        // A '<' or '&' starts a span that is skipped up to its terminator.
        while ((c == '<') || (c == '&')) {
            if (c == '<') inHtmlTag = true;
            else inHtmlChar = true;
            while ((inHtmlTag || inHtmlChar) && (c != -1)) {
                c = is.read();
                if (inHtmlTag && (c == '>')) {
                    inHtmlTag = false;
                    c = is.read();
                }
                if (inHtmlChar && (c == ';')) {
                    inHtmlChar = false;
                    c = is.read();
                }
            }
        }
        if (c == -1) isFinished = true;
        return c;
    }

    public static void main(String[] argv) {
        if (argv.length < 1) {
            System.err.println("usage: java HtmlStream <file.html>");
            return;
        }
        String file = argv[0];
        System.out.println("start test for file " + file);
        // The encoding is hard-coded; change it to match the input file.
        try (HtmlStream hs = new HtmlStream(new FileInputStream(file));
             BufferedReader r = new BufferedReader(new InputStreamReader(hs, "windows-1250"))) {
            String line = r.readLine();
            while (line != null) {
                System.out.println(line);
                line = r.readLine();
            }
        } catch (IOException e) {
            System.out.println("The error " + e);
            e.printStackTrace();
        }
        // Debug output only: lists the charsets this JVM supports.
        System.out.println("------------------------" + java.nio.charset.Charset.availableCharsets());
    }
}
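
To save the result as a .txt file, as the original question asks, the same skipping rule can be applied to a Reader and written to a Writer. This is a minimal sketch, assuming Java 7 or later; the class HtmlToTextFile and its methods strip and convert are hypothetical names, not part of the class above, and the windows-1250 charset in main only mirrors the posted code.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: writes the text of an HTML file to a plain text file.
class HtmlToTextFile {

    // Copies in to out, skipping '<'...'>' spans and '&'...';' spans.
    static void strip(Reader in, Writer out) throws IOException {
        boolean inTag = false;
        boolean inEntity = false;
        int c;
        while ((c = in.read()) != -1) {
            if (inTag) {
                if (c == '>') inTag = false;
            } else if (inEntity) {
                if (c == ';') inEntity = false;
            } else if (c == '<') {
                inTag = true;
            } else if (c == '&') {
                inEntity = true;
            } else {
                out.write(c);
            }
        }
    }

    // Reads html and writes the stripped text to txt, both in the given charset.
    static void convert(Path html, Path txt, Charset cs) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(html, cs);
             BufferedWriter out = Files.newBufferedWriter(txt, cs)) {
            strip(in, out);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java HtmlToTextFile <in.html> <out.txt>");
            return;
        }
        convert(Paths.get(args[0]), Paths.get(args[1]), Charset.forName("windows-1250"));
    }
}
```

Working on characters after decoding, rather than on raw bytes, keeps the stripping independent of the file encoding; only the charset passed to convert has to match the input.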
 
CEHJ commented:
That won't circumvent the problem, antons, which is ill-formed markup. HTML is frequently written with unpaired tags, one of the things that makes parsers fall over.

>>System.out.println("------------------------" + java.nio.charset.Charset.availableCharsets());

is Java 1.4 specific, but doesn't seem to be essential.

>>BufferedReader r = new BufferedReader(new InputStreamReader(hs, "windows-1250"));

is encoding-specific. What is this encoding? Is it the same as Cp1252?
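
For what it's worth, windows-1250 is the Central European Windows code page and Cp1252 is the Western European one, so the two are not the same. A quick check, assuming both charsets are installed in the running JVM (the class CharsetCheck and the method decode are illustrative names only):

```java
import java.nio.charset.Charset;

// Hypothetical check: decodes the same byte with two different charsets.
class CharsetCheck {

    // Decodes a single byte with the named charset and returns the resulting code point.
    static int decode(byte b, String charsetName) {
        return new String(new byte[] { b }, Charset.forName(charsetName)).codePointAt(0);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE8;
        // windows-1250 maps 0xE8 to U+010D (c with caron).
        System.out.println(Integer.toHexString(decode(b, "windows-1250")));
        // windows-1252 maps 0xE8 to U+00E8 (e with grave).
        System.out.println(Integer.toHexString(decode(b, "windows-1252")));
    }
}
```

Reading a Cp1252 file with the hard-coded windows-1250 reader would therefore garble some accented letters, so the charset name should match the actual input.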

 
antons061400 commented:
System.out.println("------------------------" + java.nio.charset.Charset.availableCharsets());
is a mistake; forget it.

The process omits all tags and does not care whether they are paired or not.

Just try it and you will see.
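
The claim is easy to check on a string. The sketch below applies the same skipping rule as the stream above to markup with unpaired tags; it also shows one case where sloppy markup does hurt, namely an unescaped '&' in the text. The class StripDemo and the method strip are hypothetical names used only for this illustration.

```java
// Hypothetical demo of the skipping rule: drop '<'...'>' and '&'...';' spans.
class StripDemo {

    static String strip(String html) {
        StringBuilder out = new StringBuilder();
        boolean inTag = false;
        boolean inEntity = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (c == '>') inTag = false;
            } else if (inEntity) {
                if (c == ';') inEntity = false;
            } else if (c == '<') {
                inTag = true;
            } else if (c == '&') {
                inEntity = true;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Unpaired tags: <p>, <br> and <li> are never closed, yet all text survives.
        System.out.println(strip("<p>one<br>two<li>three")); // prints "onetwothree"
        // A bare '&' is taken as the start of an entity and swallows text up to the next ';'.
        System.out.println(strip("fish & chips; <b>tasty")); // prints "fish  tasty"
    }
}
```

So pairing really does not matter to this approach, but a stray '&' or '<' in the text silently removes content up to the next ';' or '>'.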
 
CleanupPing commented:
atifraees:
This old question needs to be finalized -- accept an answer, split points, or get a refund.  For information on your options, please click here-> http:/help/closing.jsp#1 
EXPERTS:
Post your closing recommendations!  No comment means you don't care.
 
TimYates commented:
No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:

Accept antons' comment as answer.

Please leave any comments here within the next seven days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!

TimYates
EE Cleanup Volunteer