Open Source API for HTML, RTF formats

Hi all,

Can anybody provide links to an open source API that converts RTF to text and HTML to text?

thanks
Sudhakar

mmuruganandam

Check Apache POI

You can find that at http://www.apache.org/dyn/closer.cgi/jakarta/poi/

Regards,
Muruga

sudhakar_koundinya

ASKER

Hi

POI doesn't support RTF. It supports OLE documents.

Thanks for helping me
Sudha

Mick Barry

Java comes with parsers already in HTMLEditorKit and RTFEditorKit.
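
A minimal sketch of how the built-in HTML parser could be used to pull out plain text. The class name Html2Text and the getText(Reader) helper are made up for illustration; only HTMLEditorKit.ParserCallback and ParserDelegator come from the JDK. It collects every text run the parser reports and ignores the tags. Entities and whitespace are handled however the Swing parser chooses, so treat the output as approximate.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the JDK.
class Html2Text {

    // Parses HTML from the reader and returns only the text runs, one per line.
    static String getText(Reader in) throws IOException {
        final StringBuilder buf = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called by the parser for each run of text between tags
                buf.append(data).append('\n');
            }
        };

        // Third argument true: ignore charset directives in <meta> tags
        // instead of aborting with a ChangedCharSetException.
        new ParserDelegator().parse(in, callback, true);
        return buf.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.print(getText(new StringReader(html)));
    }
}
```

Passing a Reader rather than a file name keeps the helper usable for files, URLs, and in-memory strings alike.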

sudhakar_koundinya

ASKER

I tried them, but I failed to parse the text. Could I have some sample code?

Thanks
Sudha

Tommy Braas

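// Crude tag stripper (requires java.util.StringTokenizer): keeps only text outside <...>.
StringTokenizer strtok = new StringTokenizer(inputString, "<>", true);
StringBuffer text = new StringBuffer();
boolean inTag = false;
while (strtok.hasMoreTokens()) {
    String token = strtok.nextToken();
    if (token.equals("<")) inTag = true;
    else if (token.equals(">")) inTag = false;
    else if (!inTag) text.append(token);
}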

Tommy Braas

Short and sweet! :-)

Mayank S

By the way, I guess that RTFs can be directly read using readers/ writers.

There are several versions of the RTF specification. RTF files created with Word may not work, but otherwise you should be able to read and write RTF directly.

Or else, you can try these:

http://java.sun.com/j2se/1.4.2/docs/api/javax/swing/text/rtf/RTFEditorKit.html

http://api.openoffice.org

ASKER CERTIFIED SOLUTION

Mick Barry

Mayank S

>> directly read using readers/ writers

= directly read/ written using readers/ writers

sudhakar_koundinya

ASKER

And I am able to parse the RTF document as well.

Here is the code:

import java.io.FileInputStream;
import java.io.InputStream;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

class RTF2Text
{
    public static void main(String[] args) throws Exception
    {
        System.out.println(getText(args[0]));
    }

    // Reads the RTF file at the given path and returns its content as plain text.
    public static String getText(String file) throws Exception
    {
        // try-with-resources closes the stream even if parsing fails
        try (InputStream stream = new FileInputStream(file)) {
            RTFEditorKit kit = new RTFEditorKit();
            Document doc = kit.createDefaultDocument();
            kit.read(stream, doc, 0);
            return doc.getText(0, doc.getLength());
        }
    }
}

Thanks to all of you, especially objects.

Thanks One And All

Sudhakar

Mick Barry

:-)

http://www.objects.com.au/staff/mick

Venkic

Hi guys:

I tried the code shown below.

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLConnection;
import javax.swing.text.BadLocationException;
import javax.swing.text.EditorKit;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

public static String getText(String uriStr) {
    final StringBuffer buf = new StringBuffer(1000);

    try {
        // Create an HTML document that appends all text to buf
        HTMLDocument doc = new HTMLDocument() {
            public HTMLEditorKit.ParserCallback getReader(int pos) {
                return new HTMLEditorKit.ParserCallback() {
                    // This method is called whenever text is encountered in the HTML file
                    public void handleText(char[] data, int pos) {
                        buf.append(data);
                        buf.append('\n');
                    }
                };
            }
        };

        // Create a reader on the HTML content
        URL url = new URI(uriStr).toURL();
        URLConnection conn = url.openConnection();
        Reader rd = new InputStreamReader(conn.getInputStream());

        // Parse the HTML
        EditorKit kit = new HTMLEditorKit();
        kit.read(rd, doc, 0);
    } catch (MalformedURLException e) {
    } catch (URISyntaxException e) {
    } catch (BadLocationException e) {
    } catch (IOException e) {
    }

    // Return the text
    return buf.toString();
}

The above code does not work for some reason. The handleText() callback is somehow never called by the HTMLEditorKit. Does anyone know of another or better way to parse an HTML file and get ONLY the TEXT, WITHOUT any HTML tags?

Your help is very much appreciated.
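
One possible explanation, offered as a guess rather than a diagnosis: every catch block in the code above is empty, so any exception disappears silently. HTMLEditorKit throws a ChangedCharSetException (a subclass of IOException) when a page declares its charset in a meta tag, unless the document property "IgnoreCharsetDirective" is set to true. The sketch below sets that property, lets exceptions propagate, and reads the text back from the document instead of using a callback. The class name HtmlDocText and its getText(Reader) helper are hypothetical.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class, not part of the JDK.
class HtmlDocText {

    // Loads HTML into a Swing document and returns the document's plain text.
    static String getText(Reader in) throws Exception {
        HTMLEditorKit kit = new HTMLEditorKit();
        Document doc = kit.createDefaultDocument();

        // Without this, a <meta> charset declaration aborts parsing
        // with a ChangedCharSetException.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        // Exceptions are not caught here, so a failure is visible to the caller.
        kit.read(in, doc, 0);
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><head>"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">"
                + "</head><body><p>Hello <b>World</b></p></body></html>";
        System.out.println(getText(new StringReader(html)));
    }
}
```

If the text still comes back empty, printing the stack trace in each catch block of the original code should show which step fails.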

Mayank S

Ask about your own problems in your own question pages. Don't disturb questions that are already closed. You get your question points for that purpose; don't try to save them this way. Use them correctly.