Solved

Open Source API for HTML,RTF formats

Posted on 2004-04-05
13
985 Views
Last Modified: 2012-06-21
Hi all,

Can any body provide me links to open source API that parses RTF to text and HTML to  text

thanks
Sudhakar
0
Comment
Question by:sudhakar_koundinya
  • 3
  • 3
  • 3
  • +3
13 Comments
 
LVL 9

Expert Comment

by:mmuruganandam
ID: 10763245
Check Apache POI

You can find that at http://www.apache.org/dyn/closer.cgi/jakarta/poi/



Regards,
Muruga
0
 
LVL 14

Author Comment

by:sudhakar_koundinya
ID: 10763247
Hi

POI doesn't support RTF formats. It supports OLE Documents

Thanks to helping me
Sudha
0
 
LVL 92

Expert Comment

by:objects
ID: 10763251
Java comes with parsers already in HTMLEditorKit and RTFEditorKit.
0
Courses: Start Training Online With Pros, Today

Brush up on the basics or master the advanced techniques required to earn essential industry certifications, with Courses. Enroll in a course and start learning today. Training topics range from Android App Dev to the Xen Virtualization Platform.

 
LVL 14

Author Comment

by:sudhakar_koundinya
ID: 10763255
I tried with them

I have failed to parse the text. Can I have sample examples

Thanks
Sudha
0
 
LVL 14

Expert Comment

by:Tommy Braas
ID: 10763257
StringTokenizer strtok = new StringTokenizer(inputString);
StringBuffer text = new StringBuffer();
while (strtok.hasMoreTokens()) {
   strtok.nextToken(">");
   text.append(strtok.nextToken("<"));
}
0
 
LVL 14

Expert Comment

by:Tommy Braas
ID: 10763258
Short and sweet! :-)
0
 
LVL 30

Expert Comment

by:Mayank S
ID: 10763269
By the way, I guess that RTFs can be directly read using readers/ writers.

There are several sub-versions for the RTF standard. Probably the RTFs created with Word don't work, otherwise you should be able to read/ write with RTFs directly.

Or else, you can try these:

http://java.sun.com/j2se/1.4.2/docs/api/javax/swing/text/rtf/RTFEditorKit.html

http://api.openoffice.org
0
 
LVL 92

Accepted Solution

by:
objects earned 50 total points
ID: 10763272
   public static String getText(String uriStr) {
        final StringBuffer buf = new StringBuffer(1000);
   
        try {
            // Create an HTML document that appends all text to buf
            HTMLDocument doc = new HTMLDocument() {
                public HTMLEditorKit.ParserCallback getReader(int pos) {
                    return new HTMLEditorKit.ParserCallback() {
                        // This method is whenever text is encountered in the HTML file
                        public void handleText(char[] data, int pos) {
                            buf.append(data);
                            buf.append('\n');
                        }
                    };
                }
            };
   
            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            Reader rd = new InputStreamReader(conn.getInputStream());
   
            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            kit.read(rd, doc, 0);
        } catch (MalformedURLException e) {
        } catch (URISyntaxException e) {
        } catch (BadLocationException e) {
        } catch (IOException e) {
        }
   
        // Return the text
        return buf.toString();
    }
0
 
LVL 30

Expert Comment

by:Mayank S
ID: 10763275
>> directly read using readers/ writers

= directly read/ written using readers/ writers
0
 
LVL 14

Author Comment

by:sudhakar_koundinya
ID: 10763520
And i am to parse the RTF document also

here is the code

import javax.swing.text.rtf.RTFEditorKit;
import javax.swing.text.*;
import java.io.*;
class RTF2Text
{
      public static void main(String[] args) throws Exception
      {
            System.out.println(getText(args[0]));
      }
      public static String getText(String file) throws Exception
      {
            FileInputStream stream = new FileInputStream(file);
            RTFEditorKit kit = new RTFEditorKit();
            javax.swing.text.Document doc = kit.createDefaultDocument();
            kit.read(stream, doc, 0);

            String plainText = doc.getText(0, doc.getLength());
            return plainText;



      }
}


Thanks to all of the guys. Especially objects.

Thanks One And All

Sudhakar
0
 
LVL 92

Expert Comment

by:objects
ID: 10763536
0
 

Expert Comment

by:Venkic
ID: 11157303
Hi guys:

I tried the code posted by as shown below.

 public static String getText(String uriStr) {
        final StringBuffer buf = new StringBuffer(1000);
   
        try {
            // Create an HTML document that appends all text to buf
            HTMLDocument doc = new HTMLDocument() {
                public HTMLEditorKit.ParserCallback getReader(int pos) {
                    return new HTMLEditorKit.ParserCallback() {
                        // This method is whenever text is encountered in the HTML file
                        public void handleText(char[] data, int pos) {
                            buf.append(data);
                            buf.append('\n');
                        }
                    };
                }
            };
   
            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            Reader rd = new InputStreamReader(conn.getInputStream());
   
            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            kit.read(rd, doc, 0);
        } catch (MalformedURLException e) {
        } catch (URISyntaxException e) {
        } catch (BadLocationException e) {
        } catch (IOException e) {
        }
   
        // Return the text
        return buf.toString();
    }


The above code does not work for some reason. The callback to handleText( ) method is somehow not being called by the HTMLEditor. Does anyone know of another way or some better way to parse an HTML file and get ONLY the TEXT WITHOUT any HTML-TAGS ?

Your helps is very much appreciated.
0
 
LVL 30

Expert Comment

by:Mayank S
ID: 11162264
Ask your own problems in your own question-pages. Don't disturb questions which are already closed. You get your question-points only for that purpose. Don't try to save them this way. Use them correctly.
0

Featured Post

Gigs: Get Your Project Delivered by an Expert

Select from freelancers specializing in everything from database administration to programming, who have proven themselves as experts in their field. Hire the best, collaborate easily, pay securely and get projects done right.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
javap not working 8 43
type mismatch (Object[] to double[] 4 23
servlet web applications   metadata-complete="true" or false 3 36
eclipse buid path vs tomcat lib path 10 22
Java had always been an easily readable and understandable language.  Some relatively recent changes in the language seem to be changing this pretty fast, and anyone that had not seen any Java code for the last 5 years will possibly have issues unde…
This was posted to the Netbeans forum a Feb, 2010 and I also sent it to Verisign. Who didn't help much in my struggles to get my application signed. ------------------------- Start The idea here is to target your cell phones with the correct…
Viewers learn how to read error messages and identify possible mistakes that could cause hours of frustration. Coding is as much about debugging your code as it is about writing it. Define Error Message: Line Numbers: Type of Error: Break Down…
This tutorial explains how to use the VisualVM tool for the Java platform application. This video goes into detail on the Threads, Sampler, and Profiler tabs.

813 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

14 Experts available now in Live!

Get 1:1 Help Now