Solved

HTMLEditorKit with a String of html text

Posted on 2004-10-23
Last Modified: 2012-08-14
I am a little confused about how to use this class: javax.swing.text.html.HTMLEditorKit

If I have the html text stored in a string, like this:

String html = "<html><script blah blah>function blah blah{}</script><body>this</body></html>";

and I want to get the parsed text, String parsedText = "this";

I've seen the example on: http://www.javaalmanac.com/egs/javax.swing.text.html/GetText.html, but I'm not sure how to apply it to my application which seems much simpler.
Question by:polkadot
11 Comments
 
LVL 8

Assisted Solution

by:sigmacon
sigmacon earned 1400 total points
ID: 12390606
The kit's implementation essentially requires the approach shown in that code sample. To get it working in your case, you need to provide a Reader over your string.

Replace line

Reader rd = new InputStreamReader(conn.getInputStream());

with

Reader rd = new StringReader(html);

and delete the three lines in front of it.

Try the result:

public static String parseText(String html) throws Exception {
        final StringBuffer buf = new StringBuffer(4096);
   
        try {
            // Create an HTML document that appends all text to buf
            HTMLDocument doc = new HTMLDocument() {
                public HTMLEditorKit.ParserCallback getReader(int pos) {
                    return new HTMLEditorKit.ParserCallback() {
                        // This method is called whenever text is encountered in the HTML file
                        public void handleText(char[] data, int pos) {
                            buf.append(data);
                            buf.append('\n');
                        }
                    };
                }
            };
   
            // Create a reader on the HTML content
            Reader rd = new StringReader(html);
   
            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            kit.read(rd, doc, 0);
        } catch (BadLocationException e) {
            // The offset passed to read() lies outside the document
            e.printStackTrace();
        } catch (IOException e) {
            // Read failures, including ChangedCharSetException
            e.printStackTrace();
        }
   
        // Return the text
        return buf.toString();
    }
 
LVL 8

Expert Comment

by:sigmacon
ID: 12390609
Sorry, either take off the throws Exception or the try/catch. I did not test the code, so there may be minor syntax errors:


import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public static String parseText(String html) {
        final StringBuffer buf = new StringBuffer(4096);
   
        try {
            // Create an HTML document that appends all text to buf
            HTMLDocument doc = new HTMLDocument() {
                public HTMLEditorKit.ParserCallback getReader(int pos) {
                    return new HTMLEditorKit.ParserCallback() {
                        // This method is called whenever text is encountered in the HTML file
                        public void handleText(char[] data, int pos) {
                            buf.append(data);
                            buf.append('\n');
                        }
                    };
                }
            };
   
            // Create a reader on the HTML content
            Reader rd = new StringReader(html);
   
            // Parse the HTML, inserting at offset 0 of the empty document
            EditorKit kit = new HTMLEditorKit();
            kit.read(rd, doc, 0);
        } catch (BadLocationException e) {
            // The offset passed to read() lies outside the document
            e.printStackTrace();
        } catch (IOException e) {
            // Read failures, including ChangedCharSetException
            e.printStackTrace();
        }
   
        // Return the text
        return buf.toString();
    }
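
For comparison, here is a minimal sketch of the same idea that skips the HTMLDocument subclass and hands the callback straight to the parser. The class name HtmlTextExtractor and the choice to drop text inside script, style and title elements are my own assumptions, not part of the javaalmanac sample. As far as I know the Swing parser reports script bodies through handleComment rather than handleText, so the tag tracking mainly affects title and style text.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named here only for illustration
final class HtmlTextExtractor {

    /** Returns the text of the page, one parsed text run per line. */
    static String parseText(String html) throws IOException {
        final StringBuilder buf = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // Greater than zero while inside <script>, <style> or <title>
            private int skipDepth = 0;

            private boolean isSkipped(HTML.Tag t) {
                return t == HTML.Tag.SCRIPT || t == HTML.Tag.STYLE || t == HTML.Tag.TITLE;
            }

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (isSkipped(t)) {
                    skipDepth++;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (isSkipped(t) && skipDepth > 0) {
                    skipDepth--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Keep only text that is outside the skipped elements
                if (skipDepth == 0) {
                    buf.append(data).append('\n');
                }
            }
        };

        // The input is already a decoded String, so charset directives
        // in the page are ignored (third argument = true).
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return buf.toString();
    }
}
```

Calling HtmlTextExtractor.parseText(html) on the string from the question should then return the body text without the script source, assuming the page is well-formed enough for the parser.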
 

Author Comment

by:polkadot
ID: 12390680
Some problems:

when I use your code as is with: kit.read(rd, doc, 0);
javax.swing.text.ChangedCharSetException


when I switch the bit: kit.read(rd, doc, 1);
javax.swing.text.BadLocationException: Invalid location


What do the errors mean and how can I fix them?

Author Comment

by:polkadot
ID: 12390703
The code compiles, but when it runs it returns an empty string (return buf.toString)

and produces the exceptions above.
 

Author Comment

by:polkadot
ID: 12390707
Also, I have verified that String html = "<html .... >"
 
LVL 8

Expert Comment

by:sigmacon
ID: 12390746
The kit is very picky about parsing. Since the sample code actually tries to create a document, you get:

javax.swing.text.ChangedCharSetException -- This is usually thrown if there is a content-type attribute or a charset attribute, so I don't know why it's thrown here. My interpretation is that the HTMLDocument could not determine the character set of the HTML you're trying to parse. Make sure the HTML is well-formed (it probably needs a head tag and so on) and declares which character set it uses.

javax.swing.text.BadLocationException -- The last parameter to read determines where in the document you want to insert text. Since the document is empty, 1 is not valid, only 0.

Please be aware that, AFAIK, the HTMLEditorKit is for HTML 3.2, so it may not help with current HTML versions.
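
To illustrate the BadLocationException point, here is a small sketch, assuming a freshly created HTMLDocument (length 0) and a made-up helper class called OffsetCheck. It only reports whether read() rejects a given insert offset; it is not meant as a complete parser setup.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class, named here only for illustration
final class OffsetCheck {

    /** Returns true if read() rejects the insert offset for a new, empty document. */
    static boolean rejectsOffset(int pos) throws IOException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = new HTMLDocument(); // empty, so doc.getLength() == 0
        try {
            kit.read(new StringReader("<html><body>this</body></html>"), doc, pos);
            return false;
        } catch (BadLocationException e) {
            // The offset is past the end of the document
            return true;
        }
    }
}
```

With an empty document, any offset greater than 0 is past the end, which is why kit.read(rd, doc, 1) fails before any parsing happens.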
 
LVL 92

Expert Comment

by:objects
ID: 12390775
> when I use your code as is with: kit.read(rd, doc, 0);
> javax.swing.text.ChangedCharSetException

add the following:

doc.putProperties("IgnoreCharacterSet", Boolean.TRUE);
 

Author Comment

by:polkadot
ID: 12390888
Actually, my string is just the HTML code behind a URL; I'm using a NASA web page to test.
It works on some pages, but others return that error.

Here is the problem I have with it: it didn't filter out all the JavaScript functions. Am I doing something wrong?




objects, I'm sorry, but your snippet of code with no explanation isn't any help; putProperties is not a method of HTMLDocument.
 
LVL 30

Accepted Solution

by:
GrandSchtroumpf earned 600 total points
ID: 12390946
You must first make sure that the HTML page is well-formed.
To do that, you can use something like "Tidy".
There is an implementation for Java, though it is still under development :-(
http://jtidy.sourceforge.net/
 
LVL 92

Expert Comment

by:objects
ID: 12391127
Sorry for the typo, it should have been:

doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
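
A sketch of how that property fits into the parseText approach from earlier in the thread. It assumes the key that HTMLEditorKit.read consults is "IgnoreCharsetDirective", which is my understanding of the JDK implementation. The class name CharsetDirectiveDemo and the boolean switch are made up for the demonstration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class, named here only for illustration
final class CharsetDirectiveDemo {

    static String parseText(String html, boolean ignoreCharsetDirective)
            throws IOException, BadLocationException {
        final StringBuilder buf = new StringBuilder();

        // Document whose reader only collects text, as in the thread's sample
        HTMLDocument doc = new HTMLDocument() {
            @Override
            public HTMLEditorKit.ParserCallback getReader(int pos) {
                return new HTMLEditorKit.ParserCallback() {
                    @Override
                    public void handleText(char[] data, int pos) {
                        buf.append(data).append('\n');
                    }
                };
            }
        };

        if (ignoreCharsetDirective) {
            // Assumed key: tells read() not to fail on <meta> charset directives
            doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        }

        new HTMLEditorKit().read(new StringReader(html), doc, 0);
        return buf.toString();
    }
}
```

Since the HTML is already held in a Java String, the charset declared in the page no longer matters for decoding, so ignoring the directive looks like a reasonable choice here. A page with a meta Content-Type tag that names a charset should raise ChangedCharSetException when the flag is false and parse normally when it is true.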
 
LVL 30

Expert Comment

by:GrandSchtroumpf
ID: 12392601
<:°)