Opening an online HTML page and parsing the text

What are the Java classes that do this? Is there any sample code?

For example, I want to be able to open www.experts-exchange.com and have my program read it in as a string that I could parse with StringTokenizer or something like that.
polkadotAsked:
 
sudhakar_koundinyaCommented:
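    // Requires: import java.io.*; import java.net.*;
    //           import javax.swing.text.*; import javax.swing.text.html.*;
    //
    // This method takes a URI, which can be either a file URI (e.g. file:///c:/dir/file.html)
    // or a URL (e.g. http://host.com/page.html), and returns all text in the document.
    public static String getText(String uriStr) {
        final StringBuffer buf = new StringBuffer(1000);

        try {
            // Create an HTML document that appends all text to buf
            HTMLDocument doc = new HTMLDocument() {
                public HTMLEditorKit.ParserCallback getReader(int pos) {
                    return new HTMLEditorKit.ParserCallback() {
                        // This method is called whenever text is encountered in the HTML file
                        public void handleText(char[] data, int pos) {
                            buf.append(data);
                            buf.append('\n');
                        }
                    };
                }
            };

            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            Reader rd = new InputStreamReader(conn.getInputStream());

            try {
                // Parse the HTML; the callback above collects the text
                EditorKit kit = new HTMLEditorKit();
                kit.read(rd, doc, 0);
            } finally {
                // Always release the connection's stream
                rd.close();
            }
        } catch (MalformedURLException e) {
            // Report problems instead of silently returning partial text
            System.err.println("Not a usable URL: " + e.getMessage());
        } catch (URISyntaxException e) {
            System.err.println("Invalid URI syntax: " + e.getMessage());
        } catch (BadLocationException e) {
            System.err.println("Bad document position: " + e.getMessage());
        } catch (IOException e) {
            System.err.println("Could not read the page: " + e.getMessage());
        }

        // Return the text collected so far
        return buf.toString();
    }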
 
sudhakar_koundinyaCommented:
The above example parses the HTML and returns only its text.
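
If the raw markup is wanted as one string instead of the stripped text, a minimal sketch along these lines may help. The class name PageReader and the method readAll are hypothetical names chosen for illustration, and UTF-8 is an assumption; a real page may declare a different charset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PageReader {
    // Reads everything the URL serves into one String, line by line.
    // UTF-8 is assumed here; a real page may use another charset.
    static String readAll(URL url) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java PageReader <absolute-url>");
            return;
        }
        // Works for http: URLs and also for file: URLs, which is handy offline.
        System.out.print(readAll(new URI(args[0]).toURL()));
    }
}
```

The returned string still contains all tags, so it suits searching the page source rather than the visible text.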
 
sudhakar_koundinyaCommented:
import javax.swing.text.*;
import javax.swing.text.html.*;

These imports are needed for the above example.
 
sudhakar_koundinyaCommented:
>>that I could parse with stringtokenizer or something like that .

What do you want to extract from the downloaded HTML once it has been converted to text?
 
polkadotAuthor Commented:
So the above code removes the tags?

I just want to be able to get the text out of the page and search through it.
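
Searching through the extracted text could be sketched as below. This assumes the text is already in a String (for instance the value returned by getText above); TextSearch, countWord and containsPhrase are hypothetical names, not part of any library.

```java
import java.util.StringTokenizer;

class TextSearch {
    // Counts whitespace-separated tokens equal to the word, ignoring case.
    static int countWord(String text, String word) {
        int count = 0;
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            if (st.nextToken().equalsIgnoreCase(word)) {
                count++;
            }
        }
        return count;
    }

    // True if the phrase occurs anywhere in the text, ignoring case.
    static boolean containsPhrase(String text, String phrase) {
        return text.toLowerCase().contains(phrase.toLowerCase());
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a page
        String text = "Welcome to Experts Exchange\nAsk the experts\n";
        System.out.println(countWord(text, "experts"));      // prints 2
        System.out.println(containsPhrase(text, "ask the")); // prints true
    }
}
```

StringTokenizer splits on whitespace only, so a word followed by punctuation ("experts,") would not match; the substring check does not have that limitation.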
 
polkadotAuthor Commented:
By the way, thanks so much for answering all my questions today. I feel like I have a personal tutor by my side :)
 
sudhakar_koundinyaCommented:
>>so the above code removes tags?

Yes
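
To see the tag removal without going to the network, here is a small sketch that feeds an in-memory HTML string to the Swing HTML parser. It uses javax.swing.text.html.parser.ParserDelegator directly rather than the HTMLDocument subclass from the answer above, so treat it as a shortcut for experimenting; TagStripper and stripTags are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class TagStripper {
    // Parses the HTML and keeps only the text handed to handleText.
    static String stripTags(String html) throws IOException {
        final StringBuilder buf = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // One line per run of text, as in getText above
                buf.append(data).append('\n');
            }
        };
        // The third argument tells the parser to ignore charset directives
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return buf.toString();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.print(stripTags(html));
    }
}
```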
 
sudhakar_koundinyaCommented:
I forgot to mention: you should also import

java.io.*;
java.net.*;

Regards,
Sudhakar
 
polkadotAuthor Commented:
Also, can you explain what a URI (Uniform Resource Identifier) reference is and how it differs from a URL? I didn't really understand the definition in the API docs.

What is the relationship between the two, as in this line:     URL url = new URI(uriStr).toURL();
 
sudhakar_koundinyaCommented:
In the XML world, a URI simply denotes a globally unique identifier. It need not point to a web page or a service. In fact, processors (i.e. parsers) will not attempt to resolve URI references; being able to parse a document on a computer that is not connected to the Internet is the best evidence of this behaviour. An example of a URI is http://www.abcdefgh.com/ijklmnop/qrstuvwzyz.rtf

A URL, on the other hand, is a URI that also says how to locate the resource: it names an access mechanism (a scheme such as http or ftp) that a client such as a web browser can use to retrieve the data (html, htm, jsp, doc, txt, etc.) to the client machine.

I hope that clarifies things. Again, my explanation is based purely on what I understand about URIs and URLs in the context of XML. There may be more to it in the grand scheme of things. I'd love to hear from the gurus of the web world.
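
The distinction shows up in java.net as well. The sketch below checks whether a string is a URI that Java can also turn into a URL; UriVsUrl and isLocatable are hypothetical names, and the urn:isbn value is only an illustrative identifier.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;

class UriVsUrl {
    // True if the string parses as a URI and can also become a java.net.URL,
    // i.e. it is absolute and Java has a protocol handler for its scheme.
    static boolean isLocatable(String s) {
        try {
            new URI(s).toURL();
            return true;
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            // Bad syntax, unknown protocol, or a relative URI
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isLocatable("http://host.com/page.html")); // prints true
        System.out.println(isLocatable("urn:isbn:0451450523"));       // prints false
        System.out.println(isLocatable("page.html"));                 // prints false
    }
}
```

All three strings are valid URIs; only the first names a scheme that Java knows how to fetch, which is why only it converts to a URL.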
 
polkadotAuthor Commented:
A small problem with the above code: it misses JavaScript functions, such as this one on www.yahoo.com:

function PBopenWindow(){
window.open('http://us.ard.yahoo.com/SIG=129f215mu/M=294867.4949874.6085259.1288581/D=yahoo_top/S=2716149:PB/_ylt=Alwro7wCtUFqJslafEat85X1cSkA/EXP=1097431572/A=2359840/R=0/SIG=10tt88gbl/*http://poweredby.hpidea.com','NewWin','height=590,width=790');
}

Is there a parameter in the editor kit that I can set to look for functions, or even just for anything contained in {}?

Otherwise it's great! I was just hoping for a little more explanation of URL and URI.
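
One possible workaround, sketched here as an assumption rather than an editor kit setting, is to take the raw page source as a string and pull out the bodies of the script elements with a regular expression. ScriptExtractor and scriptBodies are hypothetical names, and a regex is not a full HTML parser, so unusual markup may slip through.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptExtractor {
    // Matches <script ...> ... </script>, case-insensitively and across lines
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>(.*?)</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed body of every script element found in the markup.
    static List<String> scriptBodies(String html) {
        List<String> bodies = new ArrayList<>();
        Matcher m = SCRIPT.matcher(html);
        while (m.find()) {
            bodies.add(m.group(1).trim());
        }
        return bodies;
    }

    public static void main(String[] args) {
        String html = "<html><head><script type=\"text/javascript\">\n"
                + "function f(){ window.open('x'); }\n"
                + "</script></head><body>Hi</body></html>";
        for (String body : scriptBodies(html)) {
            System.out.println(body); // prints function f(){ window.open('x'); }
        }
    }
}
```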
 
polkadotAuthor Commented:
Thanks so much!

If you have any good references for XML, URI, and URL, that would be great too!
 
sudhakar_koundinyaCommented:
>>URL url = new URI(uriStr).toURL();

This can be written out as

String uriStr = "http://abcd.com";
URI uri = new URI(uriStr);   // creates the URI object for that string
URL url = uri.toURL();       // creates the URL object from the URI object
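
One practical reason to go through URI first is that its constructor checks the syntax of the string before any URL is built. A small sketch, with the hypothetical names UriSteps and describe and made-up example addresses:

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class UriSteps {
    // Performs the two steps separately and reports what each object holds.
    static String describe(String uriStr)
            throws URISyntaxException, MalformedURLException {
        URI uri = new URI(uriStr); // step 1: parse and validate the syntax
        URL url = uri.toURL();     // step 2: needs an absolute URI with a known protocol
        return "scheme=" + uri.getScheme()
                + " host=" + uri.getHost()
                + " path=" + uri.getPath()
                + " protocol=" + url.getProtocol();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(describe("http://abcd.com/index.html"));
        try {
            // A raw space is not legal URI syntax, so step 1 fails
            describe("http://abcd.com/my page.html");
        } catch (URISyntaxException e) {
            System.out.println("rejected: " + e.getReason());
        }
    }
}
```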
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.