Sarge516

asked on

Is there a way to convert an HTML string to text using plain Java?

I need to take an HTML-formatted string and convert it to plain text. I need to do this in Java and would like to do it natively (no extra jars, no GUI) if possible.

Any suggestions? The stub below 'works' but the resulting string is not accessible outside the class.

Reader reader = new StringReader(
      "  <html><p>A <foo>xx</foo><a href=test>link</a>");

String yo = "";
try
{
   HTMLEditorKit.ParserCallback callback =
      new HTMLEditorKit.ParserCallback()
      {
         public void handleText(char[] data, int pos)
         {
            System.out.println(data);
            // yo = data.toString();
         }
      };
   new ParserDelegator().parse(reader, callback, false);
}
catch (IOException e)
{
   e.printStackTrace();
}
 
a_b

"not accessible outside the class." I am not sure I follow. Can you please explain?

ASKER

If I try to use the "yo" string inside the HTMLEditorKit inner class, I get an error in Eclipse:

Cannot access a non-final variable from an inner-class ....

public class Test2 {
      String test = "TESTING";

      public static void main(String args[]) {
            new Test2().hello();
      }

      private  void hello() {
            new InnerClass().sayHello();
            
      }

      class InnerClass {
            public void sayHello() {
                  System.out.println(Test2.this.test);
            }
      }

}
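
For background on the Eclipse error quoted above: an anonymous inner class can only read local variables that are final (from Java 8 on, effectively final), so assigning to a local String from inside the callback is rejected. The variable has to stay fixed, but the object it refers to may still change, so a final reference to a mutable StringBuilder is one way around it. A minimal sketch follows; the names CaptureDemo and collect are made up for illustration and do not come from the thread.

```java
// Hypothetical demo class; not part of the original posts.
class CaptureDemo {

    // Joins the words with single spaces, doing the work inside an anonymous class.
    static String collect(final String[] words) {
        // The reference is final, but the StringBuilder it points to is mutable.
        final StringBuilder buffer = new StringBuilder();

        Runnable appender = new Runnable() {
            @Override
            public void run() {
                for (String word : words) {
                    buffer.append(word).append(' ');
                }
                // buffer = new StringBuilder(); // would not compile: buffer is final
            }
        };
        appender.run();

        return buffer.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(collect(new String[] {"A", "xx", "link"}));
    }
}
```

Fields of the enclosing instance, as in the Test2 example, are not subject to this restriction, which is why moving the variable up to class level also works.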
Sorry if I was vague. Here is my complete code with the runtime error. I can't seem to return a value this way: it doesn't like return data.toString(). The callback method returns void, and I'm not sure how to work around that.

The task is to convert the HTML line to text. I am open to other ways if this route is not possible.


import java.io.Reader;
import java.io.StringReader;

import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlToText {

    /**
     * @param args
     */

    private String getText(String yo) {
        Reader reader = new StringReader(yo);
       
       
        try {
            {

                HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {

                    public void handleText(char[] data, int pos) {
                        System.out.println(data);
                        return data.toString();
                    //    return "OK";
                    }
                };
                new ParserDelegator().parse(reader, callback, false);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return "OK1";
    }

    public static void main(String[] args) {

        HtmlToText ht = new HtmlToText();
        System.out.println(ht.getText("<html><p>A <foo>xx</foo><a href=test>link</a>"));
    }

}

Exception in thread "main" java.lang.Error: Unresolved compilation problem:
    Void methods cannot return a value

    at HtmlToText.getText(HtmlToText.java:24)
    at HtmlToText.main(HtmlToText.java:39)
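
The compiler message above follows from the method signature: handleText is declared void in HTMLEditorKit.ParserCallback, so the text has to leave the callback some other way than a return statement. One option, sketched here under the assumption that a single space between text runs is acceptable, is to append each run to a final local StringBuilder and return its contents once parse has finished. The class name HtmlStripper and the method strip are placeholders, not names from this thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Placeholder class name; a sketch only.
class HtmlStripper {

    static String strip(String html) {
        // Final reference to a mutable buffer, so the anonymous class may use it.
        final StringBuilder out = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // new String(data) copies the characters. Calling data.toString()
                // on a char[] would give an identity string, not the text.
                out.append(new String(data).trim()).append(' ');
            }
        };

        try {
            // ignoreCharSet = true: a charset declared in the markup is ignored,
            // which is reasonable here because the input is already a String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("<html><p>A <foo>xx</foo><a href=test>link</a>"));
    }
}
```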




ASKER CERTIFIED SOLUTION
a_b

Taking the visibility a little higher does solve the major problem I was having. Thanks for a creative solution!
The code still needed a couple more tweaks, as the result was returning only the final word in the HTML. I'm posting the final version to help others in the future.

import java.io.Reader;
import java.io.StringReader;

import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HTML_to_Text {

      // Declared as a field so the anonymous callback below can append to it.
      StringBuilder text = new StringBuilder();

      public String getText(String yo) {
            Reader reader = new StringReader(yo);
            // Clear anything left over from a previous call on this instance.
            text.setLength(0);

            try {
                  HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {

                        // Append each piece of text, separated by a single space.
                        public void handleText(char[] data, int pos) {
                              text.append(new String(data).trim()).append(" ");
                        }
                  };
                  new ParserDelegator().parse(reader, callback, false);
            } catch (Exception e) {
                  e.printStackTrace();
            }
            return text.toString().trim();
      }

      public static void main(String[] args) {

            HTML_to_Text ht = new HTML_to_Text();
            System.out.println(ht
                        .getText("<html><p>A <foo>xx</foo><a href=test>link</a>"));
      }

}
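
On the "only the final word" symptom mentioned above: it is consistent with handleText being invoked more than once per document, once for each run of text between tags, so a plain assignment inside the callback keeps only the last run. A small sketch that records every call separately makes this visible. ChunkLister and chunks are made-up names used only for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper; shows what each handleText call receives.
class ChunkLister {

    static List<String> chunks(String html) {
        final List<String> chunks = new ArrayList<String>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // One list entry per callback invocation.
                chunks.add(new String(data).trim());
            }
        };

        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        for (String chunk : chunks("<html><p>A <foo>xx</foo><a href=test>link</a>")) {
            System.out.println("[" + chunk + "]");
        }
    }
}
```

Appending each chunk to a buffer, as the posted code does, keeps all of them instead of just the last.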



Thanks for the help! Another set of eyes and brains is what I needed. :}