Solved

Opening an online HTML page and parsing the text

Posted on 2004-10-09
Last Modified: 2010-03-31
Which Java classes do this? Is there any sample code?

For example, I want to be able to open www.experts-exchange.com and have my program read it in as a string that I could parse with StringTokenizer or something similar.
Question by:polkadot
15 Comments
 
LVL 14

Accepted Solution

by:
sudhakar_koundinya earned 500 total points
ID: 12267121
    // Requires: import java.io.*; import java.net.*;
    //           import javax.swing.text.*; import javax.swing.text.html.*;
    //
    // This method takes a URI, which can be either a file (e.g. file:///c:/dir/file.html)
    // or a URL (e.g. http://host.com/page.html), and returns all text in the document.
    public static String getText(String uriStr) {
        final StringBuffer buf = new StringBuffer(1000);

        try {
            // Create an HTML document that appends all text to buf
            HTMLDocument doc = new HTMLDocument() {
                public HTMLEditorKit.ParserCallback getReader(int pos) {
                    return new HTMLEditorKit.ParserCallback() {
                        // This method is called whenever text is encountered in the HTML file
                        public void handleText(char[] data, int pos) {
                            buf.append(data);
                            buf.append('\n');
                        }
                    };
                }
            };

            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            try (Reader rd = new InputStreamReader(conn.getInputStream())) {
                // Parse the HTML; the reader is closed when this block exits
                EditorKit kit = new HTMLEditorKit();
                kit.read(rd, doc, 0);
            }
        } catch (URISyntaxException | BadLocationException | IOException e) {
            // Report the failure instead of silently swallowing it
            // (MalformedURLException is a subclass of IOException)
            System.err.println("getText failed for " + uriStr + ": " + e);
        }

        // Return the text collected so far (empty if nothing could be read)
        return buf.toString();
    }
 
LVL 14

Expert Comment

by:sudhakar_koundinya
ID: 12267123
The above example parses the HTML into plain text.
 
LVL 14

Expert Comment

by:sudhakar_koundinya
ID: 12267131
import javax.swing.text.*;
import javax.swing.text.html.*;

These imports are needed for the above example.
 
LVL 14

Expert Comment

by:sudhakar_koundinya
ID: 12267137
>>that I could parse with stringtokenizer or something like that .

What do you want to extract from the downloaded HTML after converting it to text?
 

Author Comment

by:polkadot
ID: 12267167
So the above code removes the tags?

I just want to be able to get the text out of the page and search through it.
 

Author Comment

by:polkadot
ID: 12267170
btw, thanks so much for answering all my questions today, I feel like I have a personal tutor by my side :)
 
LVL 14

Expert Comment

by:sudhakar_koundinya
ID: 12267181
>>so the above code removes tags?

Yes
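
To see the tag removal without needing a network connection, here is a minimal sketch that feeds an in-memory HTML string to the same kind of ParserCallback and then searches the result. The class name StripTagsDemo, the method extractText and the sample HTML are made up for illustration; it uses ParserDelegator directly rather than going through HTMLDocument as the accepted solution does.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class StripTagsDemo {
    // Hypothetical helper: collects every text node of the HTML, one per line.
    public static String extractText(Reader html) throws IOException {
        final StringBuilder buf = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once for each run of text between tags
                buf.append(data).append('\n');
            }
        };
        // Third argument true: ignore any charset directive in the document
        new ParserDelegator().parse(html, callback, true);
        return buf.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample input, invented for this sketch
        String html = "<html><body><h1>Hello</h1><p>Some <b>bold</b> text</p></body></html>";
        String text = extractText(new StringReader(html));

        System.out.println(text.contains("<b>"));  // false: the tags are gone
        System.out.println(text.contains("bold")); // true: the text is still there
    }
}
```

Once the text is in a String, plain String methods such as contains or indexOf are usually enough for searching it.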
 
LVL 14

Expert Comment

by:sudhakar_koundinya
ID: 12267214
Forgot to mention, you should also import

java.io.*;
java.net.*;

regards
Sudhakar
 

Author Comment

by:polkadot
ID: 12267226
Also, can you explain what a URI (Uniform Resource Identifier) reference is and how it differs from a URL? I didn't really get the definition in the API docs.

What is the relationship between the two, as in this line:     URL url = new URI(uriStr).toURL();
 
LVL 14

Expert Comment

by:sudhakar_koundinya
ID: 12267255
In the XML world, a URI simply denotes a globally unique identifier. It need not point to a web page or a service. In fact, the processors (a.k.a. parsers) will not attempt to resolve URI references. Being able to parse a document on a computer that is not connected to the Internet is the best evidence of this behaviour. An example of a URI is http://www.abcdefgh.com/ijklmnop/qrstuvwzyz.rtf

A URL, on the other hand, is a URI that can be resolved by a standard web browser. The URL typically points to some data (html, htm, jsp, doc, txt, etc.) that can be downloaded to the client machine using a standard application protocol such as ftp or http.

Hope that clarifies things. Again, my explanation is based purely on what I understand about URI/URL in the context of XML. There may be more to it in the grand scheme of things. I'd love to hear from the gurus of the web world.
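
As a rough illustration of the difference in Java terms, the sketch below checks which strings java.net.URI can turn into a java.net.URL. The class name UriVsUrlDemo, the method isLocatable and the sample strings are invented for this example; it only shows how the two JDK classes behave and is not a full definition of either term.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;

public class UriVsUrlDemo {
    // Hypothetical helper: true if the string is a URI that Java can also treat as a URL.
    public static boolean isLocatable(String s) {
        try {
            new URI(s).toURL();
            return true;
        } catch (URISyntaxException e) {
            return false; // not even a syntactically valid URI
        } catch (MalformedURLException e) {
            return false; // valid URI, but no protocol handler to fetch it with
        } catch (IllegalArgumentException e) {
            return false; // relative URI: identifies something, but has no scheme
        }
    }

    public static void main(String[] args) {
        // An http URI names a resource and also says how to retrieve it
        System.out.println(isLocatable("http://www.experts-exchange.com/")); // true
        // A URN is a valid URI, but it only names something
        System.out.println(isLocatable("urn:isbn:0451450523"));              // false
        // A relative reference parses as a URI but cannot become a URL
        System.out.println(isLocatable("page.html"));                        // false
    }
}
```

No network access is involved here: toURL only checks that the URI is absolute and that Java knows a handler for its scheme.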
 

Author Comment

by:polkadot
ID: 12267257
A little problem with the above code: it missed JavaScript functions such as this one on www.yahoo.com:

function PBopenWindow(){
window.open('http://us.ard.yahoo.com/SIG=129f215mu/M=294867.4949874.6085259.1288581/D=yahoo_top/S=2716149:PB/_ylt=Alwro7wCtUFqJslafEat85X1cSkA/EXP=1097431572/A=2359840/R=0/SIG=10tt88gbl/*http://poweredby.hpidea.com','NewWin','height=590,width=790');
}

Is there a parameter in the EditorKit that I can set to look for functions, or even just to look for anything contained in {}?

Otherwise it's great! I was just hoping for a little more explanation of URL and URI.
 

Author Comment

by:polkadot
ID: 12267272
Thanks so much!

If you have any good references for XML, URI and URL, that would be great too!
 
LVL 14

Expert Comment

by:sudhakar_koundinya
ID: 12267273
>>URL url = new URI(uriStr).toURL();

This can be written out as:

String uriStr = "http://abcd.com";
URI uri = new URI(uriStr);   // creates the URI object for that string
URL url = uri.toURL();       // creates the URL object from the URI object