Solved

Opening an online HTML page and parsing its text

Posted on 2004-10-09
Last Modified: 2010-03-31
What are the Java classes that do this? Is there any sample code?

For example, I want to be able to open www.experts-exchange.com and have my program read it in as a string that I could parse with StringTokenizer or something like that.
Question by:polkadot
 

Accepted Solution

by:
sudhakar_koundinya earned 500 total points
ID: 12267121
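    // Requires: import java.io.*; import java.net.*;
    //           import javax.swing.text.*; import javax.swing.text.html.*;
    //
    // This method takes a URI, which can be either a file (e.g. file:///c:/dir/file.html)
    // or a URL (e.g. http://host.com/page.html), and returns all text in the document.
    public static String getText(String uriStr) {
        final StringBuffer buf = new StringBuffer(1000);

        try {
            // Create an HTML document that appends all text to buf
            HTMLDocument doc = new HTMLDocument() {
                public HTMLEditorKit.ParserCallback getReader(int pos) {
                    return new HTMLEditorKit.ParserCallback() {
                        // This method is called whenever text is encountered in the HTML file
                        public void handleText(char[] data, int pos) {
                            buf.append(data);
                            buf.append('\n');
                        }
                    };
                }
            };

            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            Reader rd = new InputStreamReader(conn.getInputStream());

            // Parse the HTML, closing the reader when done
            try {
                EditorKit kit = new HTMLEditorKit();
                kit.read(rd, doc, 0);
            } finally {
                rd.close();
            }
        } catch (MalformedURLException e) {
            // The URI has no usable protocol (e.g. not http or file)
            System.err.println("Bad URL: " + e.getMessage());
        } catch (URISyntaxException e) {
            // The string is not a syntactically valid URI
            System.err.println("Bad URI: " + e.getMessage());
        } catch (BadLocationException e) {
            // The insert position passed to read() was invalid
            System.err.println("Bad document position: " + e.getMessage());
        } catch (IOException e) {
            // The page could not be opened or read
            System.err.println("Could not read page: " + e.getMessage());
        }

        // Return the text collected so far (empty if an error occurred early)
        return buf.toString();
    }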

Expert Comment

by:sudhakar_koundinya
ID: 12267123
The above example parses the HTML into plain text.

Expert Comment

by:sudhakar_koundinya
ID: 12267131
You will need these imports for the above example:

import javax.swing.text.*;
import javax.swing.text.html.*;

Expert Comment

by:sudhakar_koundinya
ID: 12267137
>>that I could parse with stringtokenizer or something like that .

What do you want to parse out of the downloaded HTML once it has been converted to text?
 

Author Comment

by:polkadot
ID: 12267167
So the above code removes tags?

I just want to be able to get the text out of the page and search through it.
 

Author Comment

by:polkadot
ID: 12267170
btw, thanks so much for answering all my questions today, I feel like I have a personal tutor by my side :)

Expert Comment

by:sudhakar_koundinya
ID: 12267181
>>so the above code removes tags?

Yes
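
To see the tag stripping without going over the network, here is a minimal sketch that runs the same kind of callback over an in-memory HTML string and then searches the result. It uses ParserDelegator from javax.swing.text.html.parser rather than the HTMLEditorKit/HTMLDocument pair in the accepted answer; the class name TagStripDemo, the method name textOf, and the sample markup are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical demo class; not part of the original answer.
class TagStripDemo {

    // Returns the text content of an HTML string, one text run per line.
    static String textOf(String html) {
        final StringBuilder buf = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                buf.append(data).append('\n');
            }
        };
        try {
            // true = ignore any charset directive found in the markup
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrow unchecked
            throw new UncheckedIOException(e);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Welcome</h1>"
                + "<p>Search <b>this</b> text.</p></body></html>";
        String text = textOf(html);
        System.out.println(text);
        // Plain String methods are enough for a simple search.
        System.out.println("Mentions 'search': " + text.toLowerCase().contains("search"));
    }
}
```

Once the text is in a String, indexOf, contains, or a StringTokenizer can be used to search it, as the question suggests.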

Expert Comment

by:sudhakar_koundinya
ID: 12267214
I forgot to mention: you should also import

java.io.*;
java.net.*;

Regards,
Sudhakar
 

Author Comment

by:polkadot
ID: 12267226
Also, can you explain what a URI (Uniform Resource Identifier) reference is and how it differs from a URL? I didn't really get the definition in the API docs.


What is the relationship between the two, as in this line:     URL url = new URI(uriStr).toURL();

Expert Comment

by:sudhakar_koundinya
ID: 12267255
In the XML world, a URI simply denotes a globally unique identifier. It need not be a web page or a service. In fact, XML processors (parsers) will not attempt to resolve URI references; being able to parse a document on a computer that is not connected to the Internet is the best evidence of this behaviour. An example of a URI is http://www.abcdefgh.com/ijklmnop/qrstuvwzyz.rtf

A URL, on the other hand, is a URI that also says where the resource lives and how to fetch it, so a standard web browser can resolve it. A URL typically points to some data (html, htm, jsp, doc, txt, etc.) that can be downloaded to the client machine using a standard application protocol such as FTP or HTTP.

Hope that clarifies. Again, my explanation is purely based on what I understand about URI/URL in the context of XML. There may be more to it in the grand scheme of things. I'd love to hear from the gurus of the web world.
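
In Java terms, the distinction shows up roughly like this: java.net.URI only checks syntax, while URI.toURL() additionally needs an absolute URI whose scheme has a protocol handler. The following is a small sketch of that behaviour, not an authoritative definition; the class name UriVsUrl, the method name describe, and the sample strings are invented for illustration.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; the sample identifiers are illustrative only.
class UriVsUrl {

    // Describes what happens when a string is parsed as a URI
    // and then converted to a URL.
    static String describe(String s) {
        try {
            URI uri = new URI(s);          // syntax check only, no network access
            return "URL: " + uri.toURL();  // needs an absolute URI with a known protocol
        } catch (URISyntaxException e) {
            return "not a valid URI";
        } catch (IllegalArgumentException e) {
            return "valid URI, but not absolute";
        } catch (MalformedURLException e) {
            return "valid URI, but no URL protocol handler";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("http://host.com/page.html")); // converts to a URL
        System.out.println(describe("urn:isbn:0451450523"));       // identifier only
        System.out.println(describe("docs/page.html"));            // relative reference
        System.out.println(describe("http://host.com/a b"));       // illegal space
    }
}
```

So every string that survives toURL() was a valid URI first, but not every valid URI can be turned into something you can open a connection on.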
 

Author Comment

by:polkadot
ID: 12267257
A little problem with the above code: it misses JavaScript functions such as this one on www.yahoo.com:

function PBopenWindow(){
window.open('http://us.ard.yahoo.com/SIG=129f215mu/M=294867.4949874.6085259.1288581/D=yahoo_top/S=2716149:PB/_ylt=Alwro7wCtUFqJslafEat85X1cSkA/EXP=1097431572/A=2359840/R=0/SIG=10tt88gbl/*http://poweredby.hpidea.com','NewWin','height=590,width=790');
}

Is there a parameter in the EditorKit that I can set to look for functions, or even just for stuff contained in {}?

Otherwise it's great! I was just hoping for a little more explanation of URL and URI.
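
The thread does not answer this one. Since the callback in the accepted answer only collects what the parser reports as text, one possible workaround is to read the raw HTML into a String and pull the script bodies out yourself. The sketch below does that with a regular expression, which is fragile on real-world pages and is offered only as a starting point; the class name ScriptBodies, the method name extract, and the sample markup are assumptions, and fetching the raw HTML is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a regex-based sketch, not a full HTML parser.
class ScriptBodies {

    // Matches <script ...> ... </script>, ignoring case and spanning line breaks.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>(.*?)</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the non-empty inline script bodies found in the raw HTML.
    static List<String> extract(String rawHtml) {
        List<String> bodies = new ArrayList<String>();
        Matcher m = SCRIPT.matcher(rawHtml);
        while (m.find()) {
            String body = m.group(1).trim();
            if (!body.isEmpty()) {
                bodies.add(body);
            }
        }
        return bodies;
    }

    public static void main(String[] args) {
        String rawHtml = "<html><head><script type=\"text/javascript\">\n"
                + "function PBopenWindow(){ window.open('http://example.com'); }\n"
                + "</script></head><body><p>Hello</p></body></html>";
        for (String body : extract(rawHtml)) {
            System.out.println(body);
        }
    }
}
```

Scripts loaded through a src attribute have no inline body, so they are skipped here.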
 

Author Comment

by:polkadot
ID: 12267272
Thanks so much!

If you have any good references for XML, URI, and URL, that would be great too!

Expert Comment

by:sudhakar_koundinya
ID: 12267273
>>URL url = new URI(uriStr).toURL();

This can be expanded as

String uriStr = "http://abcd.com";
URI uri = new URI(uriStr);   // creates the URI object for that string
URL url = uri.toURL();       // creates the URL object from the URI object
