Open Source API for HTML, RTF formats

Hi all,

Can anybody provide me with links to an open source API that parses RTF to text and HTML to text?

Thanks
Sudhakar
sudhakar_koundinyaAsked:
 
objectsCommented:
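   import java.io.IOException;
   import java.io.InputStreamReader;
   import java.io.Reader;
   import java.net.MalformedURLException;
   import java.net.URI;
   import java.net.URISyntaxException;
   import java.net.URL;
   import java.net.URLConnection;
   import javax.swing.text.BadLocationException;
   import javax.swing.text.EditorKit;
   import javax.swing.text.html.HTMLDocument;
   import javax.swing.text.html.HTMLEditorKit;

   public static String getText(String uriStr) {
        final StringBuffer buf = new StringBuffer(1000);
   
        try {
            // Create an HTML document that appends all text to buf
            HTMLDocument doc = new HTMLDocument() {
                public HTMLEditorKit.ParserCallback getReader(int pos) {
                    return new HTMLEditorKit.ParserCallback() {
                        // This method is called whenever text is encountered in the HTML file
                        public void handleText(char[] data, int pos) {
                            buf.append(data);
                            buf.append('\n');
                        }
                    };
                }
            };
   
            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            Reader rd = new InputStreamReader(conn.getInputStream());
   
            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            kit.read(rd, doc, 0);
        } catch (MalformedURLException e) {
            // Report failures instead of silently returning partial text
            System.err.println("Bad URL: " + e);
        } catch (URISyntaxException e) {
            System.err.println("Bad URI syntax: " + e);
        } catch (BadLocationException e) {
            System.err.println("Bad document position: " + e);
        } catch (IOException e) {
            System.err.println("Could not read or parse the page: " + e);
        }
   
        // Return the text
        return buf.toString();
    }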
 
mmuruganandamCommented:
Check Apache POI

You can find that at http://www.apache.org/dyn/closer.cgi/jakarta/poi/



Regards,
Muruga
 
sudhakar_koundinyaAuthor Commented:
Hi

POI doesn't support the RTF format. It supports OLE documents.

Thanks for helping me
Sudha
 
objectsCommented:
Java comes with parsers already in HTMLEditorKit and RTFEditorKit.
 
sudhakar_koundinyaAuthor Commented:
I tried them.

I failed to extract the text. Could you give me some sample code?

Thanks
Sudha
 
Tommy BraasCommented:
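// Keep only the text outside <...>; comments, scripts and entities are not handled
StringTokenizer strtok = new StringTokenizer(inputString, "<>", true);
StringBuffer text = new StringBuffer();
boolean inTag = false;
while (strtok.hasMoreTokens()) {
   String token = strtok.nextToken();
   if (token.equals("<")) inTag = true;
   else if (token.equals(">")) inTag = false;
   else if (!inTag) text.append(token);
}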
 
Tommy BraasCommented:
Short and sweet! :-)
 
Mayank SAssociate Director - Product EngineeringCommented:
By the way, I guess that RTFs can be directly read using readers/ writers.

There are several sub-versions of the RTF standard. RTF files created with Word probably won't work; otherwise you should be able to read/write RTF files directly.

Or else, you can try these:

http://java.sun.com/j2se/1.4.2/docs/api/javax/swing/text/rtf/RTFEditorKit.html

http://api.openoffice.org
 
Mayank SAssociate Director - Product EngineeringCommented:
>> directly read using readers/ writers

= directly read/ written using readers/ writers
 
sudhakar_koundinyaAuthor Commented:
And I am now able to parse the RTF document as well.

Here is the code:

import java.io.FileInputStream;
import java.io.InputStream;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

class RTF2Text
{
      public static void main(String[] args) throws Exception
      {
            System.out.println(getText(args[0]));
      }

      // Reads the RTF file at the given path and returns its plain text
      public static String getText(String file) throws Exception
      {
            // try-with-resources closes the stream even if parsing fails
            try (InputStream stream = new FileInputStream(file))
            {
                  RTFEditorKit kit = new RTFEditorKit();
                  Document doc = kit.createDefaultDocument();
                  kit.read(stream, doc, 0);
                  return doc.getText(0, doc.getLength());
            }
      }
}


Thanks to all of the guys. Especially objects.

Thanks One And All

Sudhakar
 
VenkicCommented:
Hi guys:

I tried the code posted by objects, as shown below.

 public static String getText(String uriStr) {
        final StringBuffer buf = new StringBuffer(1000);
   
        try {
            // Create an HTML document that appends all text to buf
            HTMLDocument doc = new HTMLDocument() {
                public HTMLEditorKit.ParserCallback getReader(int pos) {
                    return new HTMLEditorKit.ParserCallback() {
                        // This method is called whenever text is encountered in the HTML file
                        public void handleText(char[] data, int pos) {
                            buf.append(data);
                            buf.append('\n');
                        }
                    };
                }
            };
   
            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            Reader rd = new InputStreamReader(conn.getInputStream());
   
            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            kit.read(rd, doc, 0);
        } catch (MalformedURLException e) {
        } catch (URISyntaxException e) {
        } catch (BadLocationException e) {
        } catch (IOException e) {
        }
   
        // Return the text
        return buf.toString();
    }


The above code does not work for some reason. The handleText() callback is somehow never called by the HTMLEditorKit. Does anyone know of another or better way to parse an HTML file and get ONLY the TEXT, WITHOUT any HTML tags?

Your help is very much appreciated.
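
One possible cause, which is an assumption and not something confirmed in this thread: if the page declares a charset in a meta tag, the Swing parser can abort with a ChangedCharSetException. That is an IOException, so the empty catch block would swallow it and the method would return little or no text. A minimal sketch of an alternative is to hand the callback straight to ParserDelegator and ask it to ignore the charset directive. The class and method names (HtmlTextSketch, extractText) are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name is illustrative, not from the thread.
class HtmlTextSketch {

    // Feeds the reader to Swing's HTML parser and collects only the text runs.
    static String extractText(Reader html) {
        final StringBuilder buf = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                buf.append(data).append('\n');
            }
        };
        try {
            // true = ignore a charset declared in a <meta> tag, so the parser
            // does not stop with a ChangedCharSetException.
            new ParserDelegator().parse(html, callback, true);
        } catch (IOException e) {
            // Surface the failure instead of hiding it in an empty catch block.
            throw new UncheckedIOException(e);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>";
        System.out.print(extractText(new StringReader(html)));
    }
}
```

To read from a URL instead of a string, the Reader built in the original getText method can be passed to extractText unchanged. This only collects what the parser reports through handleText; it makes no attempt to preserve layout.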
 
Mayank SAssociate Director - Product EngineeringCommented:
Ask your own problems in your own question-pages. Don't disturb questions which are already closed. You get your question-points only for that purpose. Don't try to save them this way. Use them correctly.
Question has a verified solution.
