Solved

Opening an online HTML page and parsing the text

Posted on 2004-10-09
Last Modified: 2010-03-31
Which Java classes do this? Is there any sample code?

For example, I want to be able to open www.experts-exchange.com and have my program read it in as a string that I could parse with StringTokenizer or something like that.
Question by:polkadot
 
LVL 14

Accepted Solution

by:
sudhakar_koundinya earned 500 total points
ID: 12267121
    // This method takes a URI, which can be either a file (e.g. file:///c:/dir/file.html)
    // or a URL (e.g. http://host.com/page.html), and returns all text in the document.
    public static String getText(String uriStr) {
        final StringBuffer buf = new StringBuffer(1000);
   
        try {
            // Create an HTML document that appends all text to buf
            HTMLDocument doc = new HTMLDocument() {
                public HTMLEditorKit.ParserCallback getReader(int pos) {
                    return new HTMLEditorKit.ParserCallback() {
                        // This method is called whenever text is encountered in the HTML file
                        public void handleText(char[] data, int pos) {
                            buf.append(data);
                            buf.append('\n');
                        }
                    };
                }
            };
   
            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            Reader rd = new InputStreamReader(conn.getInputStream());
   
            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            kit.read(rd, doc, 0);
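        } catch (MalformedURLException e) {
            e.printStackTrace(); // the URI cannot be converted to a URL
        } catch (URISyntaxException e) {
            e.printStackTrace(); // uriStr is not a valid URI
        } catch (BadLocationException e) {
            e.printStackTrace(); // the insert position in the document is invalid
        } catch (IOException e) {
            e.printStackTrace(); // the page could not be read
        }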
   
        // Return the text
        return buf.toString();
    }
 
LVL 14

Expert Comment

by:sudhakar_koundinya
ID: 12267123
The above example parses the HTML into plain text.
 
LVL 14

Expert Comment

by:sudhakar_koundinya
ID: 12267131
The example above needs these imports:

import javax.swing.text.*;
import javax.swing.text.html.*;
 
LVL 14

Expert Comment

by:sudhakar_koundinya
ID: 12267137
>>that I could parse with stringtokenizer or something like that .

What do you want to extract from the downloaded HTML once it has been parsed into text?
 

Author Comment

by:polkadot
ID: 12267167
So the above code removes the tags?

I just want to be able to get the text out of the page and search through it.
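
A minimal sketch of that search step, assuming the page text has already been extracted with getText above. The class TextSearch and the method linesContaining are hypothetical names used only for illustration, and the sample string in main is a stand-in for real page text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the accepted solution: finds the lines of
// already-extracted page text that mention a keyword, ignoring case.
class TextSearch {

    static List<String> linesContaining(String text, String keyword) {
        List<String> hits = new ArrayList<>();
        String needle = keyword.toLowerCase(Locale.ROOT);
        // getText() appends '\n' after each run of text, so split on newlines
        for (String line : text.split("\n")) {
            if (line.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(line.trim());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by getText(...)
        String text = "Experts Exchange\nAsk a Java question\nBrowse java topics\n";
        for (String hit : linesContaining(text, "java")) {
            System.out.println(hit);
        }
    }
}
```

With the sample text, the two lines that mention "java" are printed. For a single yes/no check, String.contains or String.indexOf on the whole text would be enough.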
 

Author Comment

by:polkadot
ID: 12267170
By the way, thanks so much for answering all my questions today. I feel like I have a personal tutor by my side :)
 
LVL 14

Expert Comment

by:sudhakar_koundinya
ID: 12267181
>>so the above code removes tags?

Yes
 
LVL 14

Expert Comment

by:sudhakar_koundinya
ID: 12267214
I forgot to mention that you should also import:

java.io.*;
java.net.*;

regards
Sudhakar
 

Author Comment

by:polkadot
ID: 12267226
Also, can you explain what a URI (Uniform Resource Identifier) is and how it differs from a URL? I didn't really get the definition in the API docs.

What is the relationship between the two, as in this line:     URL url = new URI(uriStr).toURL();
 
LVL 14

Expert Comment

by:sudhakar_koundinya
ID: 12267255
In the XML world, a URI simply denotes a globally unique identifier. It need not be a web page or a service. In fact, the processors (a.k.a. parsers) will not attempt to resolve URI references. Being able to parse a document on a computer that is not connected to the Internet is the best evidence for this behaviour. An example of a URI is http://www.abcdefgh.com/ijklmnop/qrstuvwzyz.rtf

A URL, on the other hand, is a URI that also says how to locate the resource, so it can be resolved by a standard web browser. A URL typically points to some data (html, htm, jsp, doc, txt, etc.) that can be downloaded to the client machine using a standard application protocol like ftp or http.

Hope that clarifies it. Again, my explanation is based purely on what I understand about URI/URL in the context of XML. There may be more to it in the grand scheme of things. I'd love to hear from the gurus of the web world.
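
A small sketch of that difference as Java sees it, offered as an illustration rather than a full definition. It relies only on java.net.URI and its toURL() method; the class name UriVsUrl, the method isLocatable and the sample strings are made up for this example. The idea is that every string here is a syntactically valid URI, but only an absolute one with a protocol the JVM knows can become a URL.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;

class UriVsUrl {

    // Returns true if the string is a URI that Java can also turn into a URL,
    // i.e. it is absolute and uses a protocol the JVM has a handler for.
    // No network connection is made; only the conversion is attempted.
    static boolean isLocatable(String s) {
        try {
            new URI(s).toURL();
            return true;
        } catch (URISyntaxException e) {
            return false; // not even a valid URI
        } catch (IllegalArgumentException e) {
            return false; // valid URI, but relative (no scheme)
        } catch (MalformedURLException e) {
            return false; // valid URI, but no handler for its scheme
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.experts-exchange.com/", // URI and URL
            "urn:isbn:0451450523",              // URI that only names something
            "docs/page.html"                    // relative URI reference
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isLocatable(s));
        }
    }
}
```

Only the http sample converts; the urn and the relative reference are rejected by toURL(), which is why the accepted code has to catch MalformedURLException as well as URISyntaxException.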
 

Author Comment

by:polkadot
ID: 12267257
A little problem with the above code: it missed JavaScript functions such as this one on www.yahoo.com:

function PBopenWindow(){
window.open('http://us.ard.yahoo.com/SIG=129f215mu/M=294867.4949874.6085259.1288581/D=yahoo_top/S=2716149:PB/_ylt=Alwro7wCtUFqJslafEat85X1cSkA/EXP=1097431572/A=2359840/R=0/SIG=10tt88gbl/*http://poweredby.hpidea.com','NewWin','height=590,width=790');
}

Is there a parameter in the editor kit that I can set to look for functions, or even just to look for anything contained in {}?

Otherwise it's great! I was just hoping for a little more explanation of URL and URI.
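
One possible workaround that does not depend on any editor kit setting is to read the raw page into a String, tags and scripts included, and search it with a regular expression. The sketch below is only that, a sketch: ScriptFinder, readAll and functionNames are hypothetical names, the pattern only recognises plain "function name(" declarations, and main uses a StringReader as a stand-in for the InputStreamReader on conn.getInputStream() from the code above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptFinder {

    // Matches "function name(" and captures the name
    private static final Pattern FUNCTION =
            Pattern.compile("\\bfunction\\s+([A-Za-z_$][A-Za-z0-9_$]*)\\s*\\(");

    // Reads everything from the reader into one String, markup included
    static String readAll(Reader rd) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[4096];
        int n;
        while ((n = rd.read(chunk)) != -1) {
            sb.append(chunk, 0, n);
        }
        return sb.toString();
    }

    // Returns the names of all function declarations found in the raw page
    static List<String> functionNames(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = FUNCTION.matcher(html);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for new InputStreamReader(conn.getInputStream())
        Reader rd = new StringReader(
                "<script>function PBopenWindow(){ window.open('x'); }</script>");
        System.out.println(functionNames(readAll(rd)));
    }
}
```

The same raw string can also be handed to StringTokenizer or String.indexOf if the braces themselves are what you are after.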
 

Author Comment

by:polkadot
ID: 12267272
Thanks so much!

If you have any good references for XML, URI and URL, that would be great too!
 
LVL 14

Expert Comment

by:sudhakar_koundinya
ID: 12267273
>>URL url = new URI(uriStr).toURL();

Can be elaborated as

String uriStr = "http://abcd.com";
URI uri = new URI(uriStr);   // creates the URI object for that string
URL url = uri.toURL();       // creates the URL object from the URI object