Open Source API for HTML, RTF formats

sudhakar_koundinya
Last Modified: 2012-06-21
Hi all,

Can anybody provide links to an open source API that converts RTF to text and HTML to text?

thanks
Sudhakar

Check Apache POI

You can find that at http://www.apache.org/dyn/closer.cgi/jakarta/poi/



Regards,
Muruga

Author

Commented:
Hi

POI doesn't support the RTF format. It supports OLE documents.

Thanks for helping me
Sudha
Mick Barry, Java Developer
CERTIFIED EXPERT
Top Expert 2010

Commented:
Java comes with parsers already in HTMLEditorKit and RTFEditorKit.
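
For anyone wanting a starting point, here is a minimal sketch of the HTMLEditorKit route. The class name HtmlKitText and the method toText are made up for illustration, and the input is assumed to be an in-memory string rather than a file. Setting the "IgnoreCharsetDirective" property is a precaution so that a charset meta tag in the page does not abort the parse.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class, not part of the JDK.
class HtmlKitText {
    // Loads the HTML into a Swing document and returns the document's plain text.
    static String toText(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();
        // Ignore <meta> charset declarations; the input is already decoded text.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            kit.read(new StringReader(html), doc, 0);
            return doc.getText(0, doc.getLength());
        } catch (IOException | BadLocationException e) {
            // Surface the failure instead of hiding it.
            throw new IllegalStateException("Could not parse HTML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toText("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

The returned text may contain extra line breaks where block elements start, so trim it if that matters.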

Author

Commented:
I tried them, but I failed to parse the text. Could I have some sample examples?

Thanks
Sudha
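// Crude tag stripper: "<" and ">" are returned as tokens so we can track
// whether we are inside a tag, and only text outside tags is kept.
StringTokenizer strtok = new StringTokenizer(inputString, "<>", true);
StringBuilder text = new StringBuilder();
boolean inTag = false;
while (strtok.hasMoreTokens()) {
   String token = strtok.nextToken();
   if (token.equals("<")) inTag = true;
   else if (token.equals(">")) inTag = false;
   else if (!inTag) text.append(token);
}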
Short and sweet! :-)
Mayank S, Principal Technologist
CERTIFIED EXPERT

Commented:
By the way, I guess that RTFs can be directly read using readers/ writers.

There are several sub-versions of the RTF standard. RTFs created with Word probably won't work; otherwise you should be able to read and write RTFs directly.

Or else, you can try these:

http://java.sun.com/j2se/1.4.2/docs/api/javax/swing/text/rtf/RTFEditorKit.html

http://api.openoffice.org
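
As a rough illustration of the RTFEditorKit link above, the sketch below feeds RTF through a java.io.Reader. The class name RtfReaderDemo, the method toText, and the sample RTF string are all invented for this example, and how well it copes with Word-generated RTF is not something this sketch checks.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper class, not part of the JDK.
class RtfReaderDemo {
    // Parses RTF from any Reader and returns the plain text of the document.
    static String toText(Reader rtf) {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        try {
            kit.read(rtf, doc, 0);
            return doc.getText(0, doc.getLength());
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException("Could not parse RTF", e);
        }
    }

    public static void main(String[] args) {
        // Small hand-written RTF sample (an assumption, not taken from the thread).
        String sample = "{\\rtf1\\ansi{\\fonttbl\\f0\\fswiss Helvetica;}"
                + "\\f0\\pard This is some {\\b bold} text.\\par}";
        System.out.println(toText(new StringReader(sample)));
    }
}
```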
Mayank S, Principal Technologist
CERTIFIED EXPERT

Commented:
>> directly read using readers/ writers

= directly read/ written using readers/ writers

Author

Commented:
And I am now able to parse the RTF document as well.

Here is the code:

import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;
import java.io.FileInputStream;
import java.io.InputStream;

class RTF2Text
{
      public static void main(String[] args) throws Exception
      {
            System.out.println(getText(args[0]));
      }

      // Reads the RTF file at the given path and returns its plain text.
      public static String getText(String file) throws Exception
      {
            RTFEditorKit kit = new RTFEditorKit();
            Document doc = kit.createDefaultDocument();
            // try-with-resources closes the stream even if parsing fails
            try (InputStream stream = new FileInputStream(file)) {
                  kit.read(stream, doc, 0);
            }
            return doc.getText(0, doc.getLength());
      }
}


Thanks to all of the guys. Especially objects.

Thanks One And All

Sudhakar
Commented:
Hi guys:

I tried the code shown below.

 public static String getText(String uriStr) {
        final StringBuffer buf = new StringBuffer(1000);
   
        try {
            // Create an HTML document that appends all text to buf
            HTMLDocument doc = new HTMLDocument() {
                public HTMLEditorKit.ParserCallback getReader(int pos) {
                    return new HTMLEditorKit.ParserCallback() {
                        // This method is called whenever text is encountered in the HTML file
                        public void handleText(char[] data, int pos) {
                            buf.append(data);
                            buf.append('\n');
                        }
                    };
                }
            };
   
            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            Reader rd = new InputStreamReader(conn.getInputStream());
   
            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            kit.read(rd, doc, 0);
        } catch (MalformedURLException e) {
        } catch (URISyntaxException e) {
        } catch (BadLocationException e) {
        } catch (IOException e) {
        }
   
        // Return the text
        return buf.toString();
    }


The above code does not work for some reason. The handleText() callback is somehow never called by the HTMLEditorKit. Does anyone know of another or better way to parse an HTML file and get ONLY the TEXT, WITHOUT any HTML tags?

Your help is very much appreciated.
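
One alternative worth trying is to skip HTMLDocument and drive the parser directly through ParserDelegator. The sketch below assumes the HTML is available through a Reader; the class name HtmlTextExtractor and the method extract are hypothetical. It rethrows any I/O failure instead of using empty catch blocks, because empty catch blocks hide whatever exception is stopping the parse. Passing true as the last argument tells the parser to ignore charset directives in the page; whether that is what breaks the code above is only a guess.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the JDK.
class HtmlTextExtractor {
    // Collects every text run the parser reports, one run per line.
    static String extract(Reader html) {
        final StringBuilder buf = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                buf.append(data).append('\n');
            }
        };
        try {
            // Third argument: ignore charset declarations found in the document.
            new ParserDelegator().parse(html, callback, true);
        } catch (IOException e) {
            // Report the failure rather than silently returning partial text.
            throw new UncheckedIOException(e);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        String page = "<html><body><h1>Title</h1><p>Some <i>text</i></p></body></html>";
        System.out.print(extract(new StringReader(page)));
    }
}
```

To read from a URL as in the code above, wrap the connection's input stream in an InputStreamReader and pass it to extract.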
Mayank S, Principal Technologist
CERTIFIED EXPERT

Commented:
Ask your own problems in your own question-pages. Don't disturb questions which are already closed. You get your question-points only for that purpose. Don't try to save them this way. Use them correctly.