  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 206
  • Last Modified:

Opening an online HTML page and parsing its text

What are the Java classes that do this? Is there any sample code?

For example, I want to be able to open www.experts-exchange.com and have my program read it in as a string that I could parse with StringTokenizer or something like that.
Asked by: polkadot
1 Solution
 
sudhakar_koundinyaCommented:
    // This method takes a URI, which can be either a file (e.g. file:///c:/dir/file.html)
    // or a URL (e.g. http://host.com/page.html), and returns all text in the document.
    public static String getText(String uriStr) {
        final StringBuffer buf = new StringBuffer(1000);
   
        try {
            // Create an HTML document that appends all text to buf
            HTMLDocument doc = new HTMLDocument() {
                public HTMLEditorKit.ParserCallback getReader(int pos) {
                    return new HTMLEditorKit.ParserCallback() {
                        // This method is called whenever text is encountered in the HTML file
                        public void handleText(char[] data, int pos) {
                            buf.append(data);
                            buf.append('\n');
                        }
                    };
                }
            };
   
            // Create a reader on the HTML content
            URL url = new URI(uriStr).toURL();
            URLConnection conn = url.openConnection();
            Reader rd = new InputStreamReader(conn.getInputStream());
   
            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            kit.read(rd, doc, 0);
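        } catch (MalformedURLException e) {
            // Report the failure instead of silently returning partial text
            System.err.println("Bad URL: " + e);
        } catch (URISyntaxException e) {
            System.err.println("Bad URI syntax: " + e);
        } catch (BadLocationException e) {
            System.err.println("Bad document position: " + e);
        } catch (IOException e) {
            System.err.println("Could not read the page: " + e);
        }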
   
        // Return the text
        return buf.toString();
    }
0
 
sudhakar_koundinyaCommented:
The example above parses the HTML and returns only its text.
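For trying the same idea without a network connection, here is a minimal sketch that feeds an in-memory string to the Swing HTML parser through ParserDelegator. The class name HtmlTextSketch and the method extractText are made up for illustration and are not part of the answer above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named only for this sketch.
class HtmlTextSketch {

    // Collects every piece of text the Swing HTML parser reports, one per line.
    static String extractText(Reader html) throws IOException {
        final StringBuilder buf = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                buf.append(data).append('\n');
            }
        };
        // The last argument (true) tells the parser to ignore charset directives.
        new ParserDelegator().parse(html, callback, true);
        return buf.toString();
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.print(extractText(new StringReader(page)));
    }
}
```

Any Reader works here, so the InputStreamReader from the URLConnection in the answer could be passed in the same way.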
0
 
sudhakar_koundinyaCommented:
The example above needs these imports:

import javax.swing.text.*;
import javax.swing.text.html.*;
0
 
sudhakar_koundinyaCommented:
>>that I could parse with stringtokenizer or something like that .

What do you want to extract from the downloaded HTML once it has been converted to text?
0
 
polkadotAuthor Commented:
so the above code removes tags?

I just want to be able to get the text out of the page and search through it.
0
 
polkadotAuthor Commented:
btw, thanks so much for answering all my questions today, I feel like I have a personal tutor by my side :)
0
 
sudhakar_koundinyaCommented:
>>so the above code removes tags?

Yes
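Once the tags are gone, the result is an ordinary String, so searching it needs nothing special. A rough sketch, assuming a simple word count and a case-insensitive phrase check are what is wanted; the class TextSearchSketch and its methods are hypothetical names.

```java
import java.util.StringTokenizer;

// Hypothetical helper class, named only for this sketch.
class TextSearchSketch {

    // Counts tokens equal to the given word, ignoring case.
    // Tokens are split on whitespace and a few common punctuation marks.
    static int countWord(String text, String word) {
        StringTokenizer tokens = new StringTokenizer(text, " \t\n\r\f.,;:!?()\"");
        int count = 0;
        while (tokens.hasMoreTokens()) {
            if (tokens.nextToken().equalsIgnoreCase(word)) {
                count++;
            }
        }
        return count;
    }

    // Returns true if the phrase occurs anywhere in the text, ignoring case.
    static boolean containsPhrase(String text, String phrase) {
        return text.toLowerCase().contains(phrase.toLowerCase());
    }

    public static void main(String[] args) {
        String text = "Java is fun.\nI like java, really";
        System.out.println(countWord(text, "java"));
        System.out.println(containsPhrase(text, "LIKE JAVA"));
    }
}
```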
0
 
sudhakar_koundinyaCommented:
I forgot to mention that you should also import:

import java.io.*;
import java.net.*;

Regards,
Sudhakar
0
 
polkadotAuthor Commented:
Also, can you explain what a URI (Uniform Resource Identifier) reference is and how it differs from a URL? I didn't really understand the definition in the API docs.


What is the relationship between the two in this line:     URL url = new URI(uriStr).toURL();
0
 
sudhakar_koundinyaCommented:
In the XML world, a URI simply denotes a globally unique identifier. It need not be a web page or a service. In fact, processors (i.e. parsers) will not attempt to resolve URI references. The best evidence for this behaviour is that you can parse a document on a computer that is not connected to the Internet. An example of a URI is http://www.abcdefgh.com/ijklmnop/qrstuvwzyz.rtf

A URL, on the other hand, is a URI that also says how to locate the resource, so a standard web browser can resolve it. A URL typically points to some data (html, htm, jsp, doc, txt, etc.) that can be downloaded to the client machine using a standard application protocol such as ftp or http.

Hope that clarifies. Again, my explanation is based purely on what I understand about URIs and URLs in the context of XML. There may be more to it in the grand scheme of things. I'd love to hear from the gurus of the web world.
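In Java terms the difference shows up when a URI is converted with toURL(). A small sketch: the class UriVsUrlSketch and the method isUrl are hypothetical names, and the urn:isbn string is only an illustrative identifier that names something without saying where to fetch it.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class, named only for this sketch.
class UriVsUrlSketch {

    // Returns true if the string is a valid URI that Java can also turn into a URL.
    static boolean isUrl(String s) {
        try {
            new URI(s).toURL();
            return true;
        } catch (URISyntaxException e) {
            return false; // not even a well-formed URI
        } catch (IllegalArgumentException e) {
            return false; // a relative URI such as "page.html" has no scheme
        } catch (MalformedURLException e) {
            return false; // a URI whose scheme Java has no protocol handler for
        }
    }

    public static void main(String[] args) {
        System.out.println(isUrl("http://www.experts-exchange.com/"));
        System.out.println(isUrl("urn:isbn:0451450523"));
        System.out.println(isUrl("page.html"));
    }
}
```

All three strings parse as URIs, but only the http one is expected to convert to a URL.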
0
 
polkadotAuthor Commented:
One small problem with the code above: it missed JavaScript functions, such as this one on www.yahoo.com:

function PBopenWindow(){
window.open('http://us.ard.yahoo.com/SIG=129f215mu/M=294867.4949874.6085259.1288581/D=yahoo_top/S=2716149:PB/_ylt=Alwro7wCtUFqJslafEat85X1cSkA/EXP=1097431572/A=2359840/R=0/SIG=10tt88gbl/*http://poweredby.hpidea.com','NewWin','height=590,width=790');
}

Is there a parameter in the editor kit that I can set to look for functions, or even just for anything contained in {}?

Otherwise it's great! I was just hoping for a little more explanation of URL and URI.
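One alternative that does not rely on the editor kit at all is to scan the raw, unparsed HTML for script elements with a regular expression. This is a rough sketch of that swapped-in technique, not something proposed in the thread; the class ScriptBodiesSketch is a hypothetical name, and the pattern will be fooled by a literal "</script>" inside a JavaScript string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this sketch.
class ScriptBodiesSketch {

    // Matches <script ...> ... </script>, ignoring case and spanning line breaks.
    private static final Pattern SCRIPT = Pattern.compile(
            "<script\\b[^>]*>(.*?)</script\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed body of every script element found in the raw HTML.
    static List<String> scriptBodies(String rawHtml) {
        List<String> bodies = new ArrayList<>();
        Matcher m = SCRIPT.matcher(rawHtml);
        while (m.find()) {
            bodies.add(m.group(1).trim());
        }
        return bodies;
    }

    public static void main(String[] args) {
        String html = "<html><head><script type=\"text/javascript\">\n"
                + "function f(){ return 1; }\n"
                + "</script></head><body>Hi</body></html>";
        for (String body : scriptBodies(html)) {
            System.out.println(body);
        }
    }
}
```

The input has to be the page source as downloaded, before any tag stripping.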
0
 
polkadotAuthor Commented:
thanks so much!

If you have any good references for XML, URIs, and URLs, that would be great too!
0
 
sudhakar_koundinyaCommented:
>>URL url = new URI(uriStr).toURL();

This can be expanded to:

String uriStr = "http://abcd.com";
URI uri = new URI(uriStr);   // creates the URI object from that string
URL url = uri.toURL();       // creates the URL object from the URI object
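With the URL object in hand, the raw page can be read into a String, which is what the original question asked for. A minimal sketch, assuming the page is UTF-8 encoded (a real page may declare another charset); the class PageFetchSketch and the method fetch are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named only for this sketch.
class PageFetchSketch {

    // Reads everything behind the URI into a String, tags included.
    // Assumption: the content is UTF-8 text.
    static String fetch(String uriStr) throws URISyntaxException, IOException {
        URL url = new URI(uriStr).toURL();
        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws Exception {
        // Pass a URI on the command line, e.g. http://abcd.com or a file: URI.
        if (args.length > 0) {
            System.out.print(fetch(args[0]));
        }
    }
}
```

The try-with-resources block closes the stream even if reading fails partway through.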
0
 
