sudhakar_koundinya
asked on
Open Source API for HTML,RTF formats
Hi all,
Can anybody provide links to an open source API that converts RTF to text and HTML to text?
thanks
Sudhakar
ASKER
Hi
POI doesn't support the RTF format. It supports OLE documents.
Thanks for helping me
Sudha
Java comes with parsers already in HTMLEditorKit and RTFEditorKit.
ASKER
I tried them, but I failed to parse the text. Can I have some sample examples?
Thanks
Sudha
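// Split on '<' and '>' and return them as tokens, so we can track whether we are inside a tag
StringTokenizer strtok = new StringTokenizer(inputString, "<>", true);
StringBuffer text = new StringBuffer();
boolean inTag = false;
while (strtok.hasMoreTokens()) {
    String token = strtok.nextToken();
    if (token.equals("<")) inTag = true;
    else if (token.equals(">")) inTag = false;
    else if (!inTag) text.append(token);  // keep only text outside tags
}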
Short and sweet! :-)
By the way, I guess that RTFs can be directly read using readers/ writers.
There are several versions of the RTF standard. RTF files created with Word probably won't work; otherwise you should be able to read and write RTF files directly.
Or else, you can try these:
http://java.sun.com/j2se/1.4.2/docs/api/javax/swing/text/rtf/RTFEditorKit.html
http://api.openoffice.org
>> directly read using readers/ writers
= directly read/ written using readers/ writers
ASKER
And I am now able to parse the RTF document as well.
Here is the code:
import javax.swing.text.rtf.RTFEditorKit;
import javax.swing.text.*;
import java.io.*;

class RTF2Text
{
    public static void main(String[] args) throws Exception
    {
        System.out.println(getText(args[0]));
    }

    public static String getText(String file) throws Exception
    {
        // Close the stream when done, even if parsing fails
        try (FileInputStream stream = new FileInputStream(file)) {
            RTFEditorKit kit = new RTFEditorKit();
            javax.swing.text.Document doc = kit.createDefaultDocument();
            // Parse the RTF into the document model, then pull out the plain text
            kit.read(stream, doc, 0);
            return doc.getText(0, doc.getLength());
        }
    }
}
Thanks to all of the guys. Especially objects.
Thanks One And All
Sudhakar
Hi guys:
I tried the posted code, as shown below.
import java.io.*;
import java.net.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

public static String getText(String uriStr) {
    final StringBuffer buf = new StringBuffer(1000);
    try {
        // Create an HTML document that appends all text to buf
        HTMLDocument doc = new HTMLDocument() {
            public HTMLEditorKit.ParserCallback getReader(int pos) {
                return new HTMLEditorKit.ParserCallback() {
                    // This method is called whenever text is encountered in the HTML file
                    public void handleText(char[] data, int pos) {
                        buf.append(data);
                        buf.append('\n');
                    }
                };
            }
        };
        // Create a reader on the HTML content
        URL url = new URI(uriStr).toURL();
        URLConnection conn = url.openConnection();
        Reader rd = new InputStreamReader(conn.getInputStream());
        // Parse the HTML
        EditorKit kit = new HTMLEditorKit();
        kit.read(rd, doc, 0);
    } catch (MalformedURLException e) {
        e.printStackTrace();  // report failures instead of silently swallowing them
    } catch (URISyntaxException e) {
        e.printStackTrace();
    } catch (BadLocationException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    // Return the text
    return buf.toString();
}
The above code does not work for some reason. The callback to the handleText() method is somehow not being called by the HTMLEditorKit. Does anyone know of another or better way to parse an HTML file and get ONLY the TEXT, WITHOUT any HTML tags?
Your help is very much appreciated.
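One possible alternative, as a minimal sketch only: skip HTMLDocument and feed a ParserCallback straight to the JDK's ParserDelegator. The class name HtmlText and the method name extract are made up for illustration, and the input is an in-memory string rather than a URL. Passing true as the third argument asks the parser to ignore charset directives in the page; whether that is what went wrong in the code above is not established here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named for illustration only.
class HtmlText {
    // Collects every run of text the parser reports, one run per line.
    static String extract(Reader reader) throws IOException {
        final StringBuilder buf = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                buf.append(data).append('\n');
            }
        };
        // Third argument true: ignore charset directives found in the document.
        new ParserDelegator().parse(reader, callback, true);
        return buf.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample input assumed for the demo; any Reader over HTML would do.
        String html = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.print(extract(new StringReader(html)));
    }
}
```

To read from a URL instead, the same extract method could be handed an InputStreamReader over the connection's input stream, as in the code above.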
Ask your own problems in your own question-pages. Don't disturb questions which are already closed. You get your question-points only for that purpose. Don't try to save them this way. Use them correctly.
You can find that at http://www.apache.org/dyn/closer.cgi/jakarta/poi/
Regards,
Muruga