HTMLEditorKit with a String of html text

I am a little confused about how to use this class: javax.swing.text.html.HTMLEditorKit

If I have the html text stored in a string, like this:

String html = "<html><script blah blah>function blah blah{}</script><body>this</body></html>";

and I want to get the parsed text, String parsedText = "this";

I've seen the example on: http://www.javaalmanac.com/egs/javax.swing.text.html/GetText.html, but I'm not sure how to apply it to my application which seems much simpler.
polkadotAsked:
 
GrandSchtroumpfCommented:
You must first make sure that the HTML page is well formed.
To do that, you can use something like "Tidy".
There is a Java implementation, but it is still under development :-(
http://jtidy.sourceforge.net/
 
sigmaconCommented:
The implementation of the kit essentially requires the approach shown in the code sample. To get it working for your case, you need to provide a reader for your string.

Replace line

Reader rd = new InputStreamReader(conn.getInputStream());

with

Reader rd = new StringReader(html);

and delete the three lines before it.

Try the result:

public static String parseText(String html) throws Exception {
        final StringBuffer buf = new StringBuffer(4096);
   
        try {
            // Create an HTML document that appends all text to buf
            HTMLDocument doc = new HTMLDocument() {
                public HTMLEditorKit.ParserCallback getReader(int pos) {
                    return new HTMLEditorKit.ParserCallback() {
                        // This method is called whenever text is encountered in the HTML file
                        public void handleText(char[] data, int pos) {
                            buf.append(data);
                            buf.append('\n');
                        }
                    };
                }
            };
   
            // Create a reader on the HTML content
            Reader rd = new StringReader(html);
   
            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            kit.read(rd, doc, 0);
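        } catch (BadLocationException e) {
            // The insert position passed to read() was outside the document
            e.printStackTrace();
        } catch (IOException e) {
            // Read or parse failure; ChangedCharSetException is a subclass
            e.printStackTrace();
        }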
   
        // Return the text
        return buf.toString();
    }
 
sigmaconCommented:
Sorry, take off either the throws Exception or the try / catch. I did not test the code, so there may be minor syntax errors:


public static String parseText(String html) {
        final StringBuffer buf = new StringBuffer(4096);
   
        try {
            // Create an HTML document that appends all text to buf
            HTMLDocument doc = new HTMLDocument() {
                public HTMLEditorKit.ParserCallback getReader(int pos) {
                    return new HTMLEditorKit.ParserCallback() {
                        // This method is called whenever text is encountered in the HTML file
                        public void handleText(char[] data, int pos) {
                            buf.append(data);
                            buf.append('\n');
                        }
                    };
                }
            };
   
            // Create a reader on the HTML content
            Reader rd = new StringReader(html);
   
            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            kit.read(rd, doc, 0);
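        } catch (BadLocationException e) {
            // The insert position passed to read() was outside the document
            e.printStackTrace();
        } catch (IOException e) {
            // Read or parse failure; ChangedCharSetException is a subclass
            e.printStackTrace();
        }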
   
        // Return the text
        return buf.toString();
    }
 
polkadotAuthor Commented:
some problems:

when I use  your code as is with:  kit.read(rd, doc, 0);
javax.swing.text.ChangedCharSetException


when I change the call to: kit.read(rd, doc, 1);
javax.swing.text.BadLocationException: Invalid location


What do the errors mean and how can I fix them?
 
polkadotAuthor Commented:
The code compiles, but when it runs it returns an empty string (from return buf.toString()) and throws the exceptions above.
 
polkadotAuthor Commented:
Also, I have verified that String html = "<html .... >"
 
sigmaconCommented:
The kit is very picky about parsing. Since the sample code actually tries to create a document, you get:

javax.swing.text.ChangedCharSetException -- This is usually thrown if there is a content-type attribute or a charset attribute, so I don't know why it's thrown here. My interpretation is that the HTMLDocument could not determine the character set of the HTML you're trying to parse. Make sure the HTML is well-formed (it probably needs a head tag, and so on) and declares which character set it uses.

javax.swing.text.BadLocationException -- The last parameter to read determines where in the document you want to insert text. Since the document is empty, 1 is not valid, only 0.
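The offset rule can be checked with a small sketch. The class and method names (InsertOffsetDemo, acceptsOffset) are made up for illustration, and the no-op callback mirrors the getReader override from the sample above so that nothing is actually inserted into the document:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical demo class, not part of the original sample.
final class InsertOffsetDemo {

    /** Returns true if read() accepts the given insert offset for a new, empty document. */
    static boolean acceptsOffset(int pos) {
        HTMLDocument doc = new HTMLDocument() {
            @Override
            public HTMLEditorKit.ParserCallback getReader(int p) {
                // No-op callback: parse the input but ignore everything
                return new HTMLEditorKit.ParserCallback();
            }
        };
        try {
            new HTMLEditorKit().read(
                    new StringReader("<html><body>this</body></html>"), doc, pos);
            return true;
        } catch (BadLocationException e) {
            // read() rejects offsets greater than doc.getLength()
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("offset 0 accepted: " + acceptsOffset(0));
        System.out.println("offset 1 accepted: " + acceptsOffset(1));
    }
}
```

A freshly created HTMLDocument has length 0, which is why only offset 0 gets through.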

Please be aware that, AFAIK, HTMLEditorKit targets HTML 3.2, so it may not help with current HTML versions.
 
objectsCommented:
> when I use  your code as is with:  kit.read(rd, doc, 0);
> javax.swing.text.ChangedCharSetException

add the following:

doc.putProperties("IgnoreCharacterSet", Boolean.TRUE);
 
polkadotAuthor Commented:
Actually, my string is just the HTML source behind a URL; I'm using a NASA web page to test ...
It works on some pages, but others return that error.

Here is the problem I have with it: it didn't filter out all the JavaScript functions. Am I doing something wrong?

objects, I'm sorry, but your snippet of code with no explanation isn't any help; putProperties is not a method of HTMLDocument.
 
objectsCommented:
Sorry for the typo, it should have been:

doc.putProperty("IgnoreCharacterSet", Boolean.TRUE);
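For context, here is a minimal sketch of where that line fits into the parseText sample from earlier in the thread. It has not been run against the NASA pages mentioned above. The names HtmlTextSketch and TextCollector are made up for illustration. The script/style skipping is an assumed workaround for the JavaScript showing up in the output; whether script content reaches handleText at all appears to depend on the JDK version.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical class names; only the Swing types come from the JDK.
final class HtmlTextSketch {

    /** Collects text and skips anything nested inside script or style elements. */
    static final class TextCollector extends HTMLEditorKit.ParserCallback {
        private final StringBuilder buf = new StringBuilder();
        private int skipDepth = 0; // > 0 while inside <script> or <style>

        private static boolean isSkipped(HTML.Tag t) {
            return t == HTML.Tag.SCRIPT || t == HTML.Tag.STYLE;
        }

        @Override
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (isSkipped(t)) skipDepth++;
        }

        @Override
        public void handleEndTag(HTML.Tag t, int pos) {
            if (isSkipped(t) && skipDepth > 0) skipDepth--;
        }

        @Override
        public void handleText(char[] data, int pos) {
            if (skipDepth == 0) buf.append(data).append('\n');
        }

        String text() {
            return buf.toString();
        }
    }

    static String parseText(String html) {
        final TextCollector collector = new TextCollector();
        HTMLDocument doc = new HTMLDocument() {
            @Override
            public HTMLEditorKit.ParserCallback getReader(int pos) {
                return collector;
            }
        };
        // Tell the parser to ignore charset declarations instead of
        // throwing ChangedCharSetException when it meets one
        doc.putProperty("IgnoreCharacterSet", Boolean.TRUE);
        try (Reader rd = new StringReader(html)) {
            new HTMLEditorKit().read(rd, doc, 0); // offset 0: the document is empty
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("Could not parse HTML", e);
        }
        return collector.text();
    }

    public static void main(String[] args) {
        String html = "<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head>"
                + "<script>function f(){}</script><body>this</body></html>";
        System.out.print(parseText(html));
    }
}
```

The property has to be set on the document before kit.read is called, because the kit looks it up when it starts the parse.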
 
GrandSchtroumpfCommented:
<:°)
Question has a verified solution.
