Solved

HTMLEditorKit with a String of html text

Posted on 2004-10-23
Last Modified: 2012-08-14
I am a little confused on how to use  this class: javax.swing.text.html.HTMLEditorKit

If I have the html text stored in a string, like this:

String html = "<html><script blah blah>function blah blah{}</script><body>this</body></html>";

and I want to get the parsed text, String parsedText = "this";

I've seen the example on: http://www.javaalmanac.com/egs/javax.swing.text.html/GetText.html, but I'm not sure how to apply it to my application which seems much simpler.
Question by:polkadot
11 Comments
 
LVL 8

Assisted Solution

by:sigmacon
sigmacon earned 1400 total points
ID: 12390606
The implementation of the kit essentially requires the approach they showed in the code sample. To get it working for your case, you need to provide a reader from your string.

Replace line

Reader rd = new InputStreamReader(conn.getInputStream());

with

Reader rd = new StringReader(html);

and delete the three lines in front of it.

Try the result:

public static String parseText(String html) throws Exception {
    final StringBuffer buf = new StringBuffer(4096);

    try {
        // Create an HTML document that appends all text to buf
        HTMLDocument doc = new HTMLDocument() {
            public HTMLEditorKit.ParserCallback getReader(int pos) {
                return new HTMLEditorKit.ParserCallback() {
                    // This method is called whenever text is encountered in the HTML
                    public void handleText(char[] data, int pos) {
                        buf.append(data);
                        buf.append('\n');
                    }
                };
            }
        };

        // Create a reader on the HTML content
        Reader rd = new StringReader(html);

        // Parse the HTML
        EditorKit kit = new HTMLEditorKit();
        kit.read(rd, doc, 0);
    } catch (BadLocationException e) {
        // Swallowed, as in the original sample: the text collected so far is returned
    } catch (IOException e) {
        // Swallowed, as in the original sample: the text collected so far is returned
    }

    // Return the text
    return buf.toString();
}
 
LVL 8

Expert Comment

by:sigmacon
ID: 12390609
Sorry, either take off "throws Exception" or remove the try / catch. I did not test the code, so there may be minor syntax errors:


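// Imports needed at the top of the source file:
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Place this method inside any class
public static String parseText(String html) {
    final StringBuffer buf = new StringBuffer(4096);

    try {
        // Create an HTML document that appends all text to buf
        HTMLDocument doc = new HTMLDocument() {
            public HTMLEditorKit.ParserCallback getReader(int pos) {
                return new HTMLEditorKit.ParserCallback() {
                    // This method is called whenever text is encountered in the HTML
                    public void handleText(char[] data, int pos) {
                        buf.append(data);
                        buf.append('\n');
                    }
                };
            }
        };

        // Create a reader on the HTML content
        Reader rd = new StringReader(html);

        // Parse the HTML; offset 0 is the only valid position in an empty document
        EditorKit kit = new HTMLEditorKit();
        kit.read(rd, doc, 0);
    } catch (BadLocationException e) {
        // Report the invalid insert position instead of failing silently
        e.printStackTrace();
    } catch (IOException e) {
        // Also covers ChangedCharSetException, which extends IOException
        e.printStackTrace();
    }

    // Return the text
    return buf.toString();
}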
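
For what it's worth, the same callback can also be handed straight to the parser, without subclassing HTMLDocument. This is only a sketch of that variant, not the approach from the linked example; the class name TextOnlyParser and the method name extractText are made up for illustration. The `true` passed to parse() asks the parser to ignore character-set declarations in the markup.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name is illustrative only.
final class TextOnlyParser {

    // Collects every run of text the parser reports, one run per line.
    static String extractText(String html) throws IOException {
        final StringBuilder buf = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                buf.append(data).append('\n');
            }
        };

        try (Reader rd = new StringReader(html)) {
            // Third argument: ignore charset declarations found in the HTML
            new ParserDelegator().parse(rd, callback, true);
        }
        return buf.toString();
    }
}
```

Calling TextOnlyParser.extractText("<html><body>this</body></html>") should return a string containing "this".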
 

Author Comment

by:polkadot
ID: 12390680
some problems:

when I use  your code as is with:  kit.read(rd, doc, 0);
javax.swing.text.ChangedCharSetException


when I switch the bit: kit.read(rd, doc, 1);
javax.swing.text.BadLocationException: Invalid location


What do the errors mean and how can I fix them?

Author Comment

by:polkadot
ID: 12390703
The code compiles, but when it runs it produces an empty string (return buf.toString()) and raises the exceptions above.
 

Author Comment

by:polkadot
ID: 12390707
Also, I have verified that String html = "<html .... >"
 
LVL 8

Expert Comment

by:sigmacon
ID: 12390746
The kit is very picky about parsing. Since the sample code actually tries to create a document, you get:

javax.swing.text.ChangedCharSetException -- This is usually thrown if there is a content-type attribute or a charset attribute, so I don't know why it's thrown here. My interpretation is that the HTMLDocument could not determine the character set of the HTML you're trying to parse. Make sure the HTML is well-formed (it probably needs a head tag, and so on) and declares which character set it uses.

javax.swing.text.BadLocationException -- The last parameter to read determines where in the document you want to insert text. Since the document is empty, 1 is not valid, only 0.

Please be aware that, AFAIK, HTMLEditorKit targets HTML 3.2, so it may not help with current HTML versions.
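
To see both exceptions in isolation, a small sketch like the following may help. It assumes that a meta tag declaring a content type with a charset is what triggers ChangedCharSetException; the class name KitErrorDemo, the method tryRead and the sample markup are invented for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical demo class; names and sample markup are illustrative only.
final class KitErrorDemo {

    static final String PLAIN = "<html><body>this</body></html>";

    // Assumption: a content-type meta tag with a charset triggers the exception.
    static final String WITH_META =
        "<html><head><meta http-equiv=\"Content-Type\" "
        + "content=\"text/html; charset=utf-8\"></head><body>this</body></html>";

    // Reads html into a fresh, empty document at the given offset and reports what happened.
    static String tryRead(String html, int pos) {
        HTMLDocument doc = new HTMLDocument();
        try {
            new HTMLEditorKit().read(new StringReader(html), doc, pos);
            return "ok";
        } catch (ChangedCharSetException e) {
            // Must be caught before IOException, which is its superclass
            return "ChangedCharSetException: " + e.getCharSetSpec();
        } catch (BadLocationException e) {
            return "BadLocationException at offset " + e.offsetRequested();
        } catch (IOException e) {
            return "IOException: " + e.getMessage();
        }
    }
}
```

With this sketch, tryRead(PLAIN, 1) should report the BadLocationException because offset 1 lies past the end of the empty document, and tryRead(WITH_META, 0) should report the ChangedCharSetException.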
 
LVL 92

Expert Comment

by:objects
ID: 12390775
> when I use  your code as is with:  kit.read(rd, doc, 0);
> javax.swing.text.ChangedCharSetException

add the following:

doc.putProperties("IgnoreCharacterSet", Boolean.TRUE);
 

Author Comment

by:polkadot
ID: 12390888
Actually, my string is just the HTML source behind a URL; I'm using a NASA web page to test.
It works on some pages, but others return that error.

Here is the problem I have with it: it didn't filter out all the JavaScript functions. Am I doing something wrong?




objects, I'm sorry, but your snippet of code with no explanation isn't any help; putProperties is not a method of HTMLDocument.
 
LVL 30

Accepted Solution

by:
GrandSchtroumpf earned 600 total points
ID: 12390946
You must first make sure that the HTML page is well-formed.
To do that, you can use something like "Tidy".
There is a Java implementation, though it is still under development :-(
http://jtidy.sourceforge.net/
 
LVL 92

Expert Comment

by:objects
ID: 12391127
Sorry for the typo, it should have been:

doc.putProperty("IgnoreCharacterSet", Boolean.TRUE);
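
In context, the property has to be set on the document before kit.read() is called. A minimal sketch of where the line fits, assuming the parseText approach from earlier in the thread (the class name CharsetTolerantExtractor is hypothetical):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical wrapper class; the name is illustrative only.
final class CharsetTolerantExtractor {

    static String parseText(String html) throws IOException, BadLocationException {
        final StringBuilder buf = new StringBuilder();

        // Document whose reader just collects the text the parser reports
        HTMLDocument doc = new HTMLDocument() {
            @Override
            public HTMLEditorKit.ParserCallback getReader(int pos) {
                return new HTMLEditorKit.ParserCallback() {
                    @Override
                    public void handleText(char[] data, int pos) {
                        buf.append(data).append('\n');
                    }
                };
            }
        };

        // Tell the kit to ignore charset declarations in the page;
        // this must happen before read() is called
        doc.putProperty("IgnoreCharacterSet", Boolean.TRUE);

        try (Reader rd = new StringReader(html)) {
            new HTMLEditorKit().read(rd, doc, 0);
        }
        return buf.toString();
    }
}
```

With the property set, a page that carries a content-type meta tag with a charset should parse without ChangedCharSetException being thrown.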
 
LVL 30

Expert Comment

by:GrandSchtroumpf
ID: 12392601
<:°)