Solved

reading html pages, url, uri ... doesn't work now?

Posted on 2004-10-09
Last Modified: 2008-02-01
My original question was asked:

http://www.experts-exchange.com/Programming/Programming_Languages/Java/Q_21162135.html

I used the same code a few hours ago and it worked, what is going on?
No compile errors, but it doesn't print out any text, why?
I am using it like this:

import javax.swing.text.*;
import javax.swing.text.html.*;
import java.io.*;
import java.net.*;

public class testurl{

public static void main(String[] args)
{

            System.out.println(getText("http://www.cnn.com"));

}


      public static String getText(String uriStr) {
                  final StringBuffer buf = new StringBuffer(1000);

                  try {
                        // Create an HTML document that appends all text to buf
                        HTMLDocument doc = new HTMLDocument() {
                              public HTMLEditorKit.ParserCallback getReader(int pos) {
                                    return new HTMLEditorKit.ParserCallback() {
                                          // This method is called whenever text is encountered in the HTML file
                                          public void handleText(char[] data, int pos) {
                                                buf.append(data);
                                                buf.append('\n');
                                          }
                                    };
                              }
                        };

                        // Create a reader on the HTML content
                        URL url = new URI(uriStr).toURL();
                        URLConnection conn = url.openConnection();
                        Reader rd = new InputStreamReader(conn.getInputStream());

                        // Parse the HTML
                        EditorKit kit = new HTMLEditorKit();
                        kit.read(rd, doc, 0);
                  } catch (MalformedURLException e) {
                  } catch (URISyntaxException e) {
                  } catch (BadLocationException e) {
                  } catch (IOException e) {
                  }

                  // Return the text
                  return buf.toString();
            }
}

Question by:polkadot

LVL 86

Expert Comment

by:CEHJ
You certainly won't know why if you have empty exception blocks! Fill them with printStackTrace

Author Comment

by:polkadot
don't know what printstacktrace is

LVL 86

Expert Comment

by:CEHJ
Replace

>>

                   EditorKit kit = new HTMLEditorKit();
                   kit.read(rd, doc, 0);
              } catch (MalformedURLException e) {
              } catch (URISyntaxException e) {
              } catch (BadLocationException e) {
              } catch (IOException e) {
              }
>>

with


                   EditorKit kit = new HTMLEditorKit();
                   kit.read(rd, doc, 0);
              } catch (Exception e) {
                  e.printStackTrace();
              }

Author Comment

by:polkadot
well I tried system.out.println and the error i get is  java.net.UnknownHostException: www.cnn.com

but what does that mean? and how do i fix it?

Author Comment

by:polkadot
java.net.UnknownHostException: www.cnn.com
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:153)
        at java.net.Socket.connect(Socket.java:452)
        at java.net.Socket.connect(Socket.java:402)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:139)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:402)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:618)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:306)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:267)
        at sun.net.www.http.HttpClient.New(HttpClient.java:339)
        at sun.net.www.http.HttpClient.New(HttpClient.java:320)
        at sun.net.www.http.HttpClient.New(HttpClient.java:315)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:521)
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:498)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:626)
        at testurl.getText(testurl.java:36)
        at testurl.main(testurl.java:11)

LVL 86

Expert Comment

by:CEHJ
It means the host name couldn't be resolved to an IP address. Try

http://www.cnn.com

LVL 14

Expert Comment

by:sudhakar_koundinya
javax.swing.text.ChangedCharSetException
        at javax.swing.text.html.parser.DocumentParser.handleEmptyTag(DocumentParser.java:169)
        at javax.swing.text.html.parser.Parser.startTag(Parser.java:372)
        at javax.swing.text.html.parser.Parser.parseTag(Parser.java:1846)
        at javax.swing.text.html.parser.Parser.parseContent(Parser.java:1881)
        at javax.swing.text.html.parser.Parser.parse(Parser.java:2047)
        at javax.swing.text.html.parser.DocumentParser.parse(DocumentParser.java:106)
        at javax.swing.text.html.parser.ParserDelegator.parse(ParserDelegator.java:78)
        at javax.swing.text.html.HTMLEditorKit.read(HTMLEditorKit.java:230)
        at testurrl.getText(testurrl.java:41)
        at testurrl.main(testurrl.java:12)

and finally get this result
CNN.com



Author Comment

by:polkadot
did you see the source code, that's what i have!

I swear it worked earlier today. Why now? and cnn site is up and running

LVL 86

Expert Comment

by:CEHJ
I don't know where you got the code from with those empty catch blocks, but you should know that the Java classes you're using are quite flaky

Author Comment

by:polkadot
nope! same errors as for cnn.com

I've tried other sites, same thing, same error

Author Comment

by:polkadot
can you suggest some other code for what I want to do?

Author Comment

by:polkadot
sudhakar_koundinya I'm not sure what you are suggesting

LVL 86

Expert Comment

by:CEHJ
>>can you suggest some other code for what I want to do?

Not from personal experience. All i can tell you is that the library classes are flaky, and/or intolerant to imprecision in markup. Try the Neko html parser

LVL 14

Expert Comment

by:sudhakar_koundinya
>>All i can tell you is that the library classes are flaky

They are from standard HTML library


The reason for not getting the text is because of java.net.UnknownHostException: www.cnn.com, which is related to the java.net API



LVL 14

Expert Comment

by:sudhakar_koundinya
and when I try to parse the text from my side

the downloaded text that I get is "CNN.com"


LVL 14

Expert Comment

by:sudhakar_koundinya
So we need to find out why it is not downloading the entire HTML. It has nothing to do with the HTML parser. That is perfect code.

polkadot,
  bear with me for some time. I will try to find out why it is not downloading

LVL 2

Expert Comment

by:gen718
I just tried the code as is. It worked fine. Perhaps the machine you are running it from has a network problem. Sounds like from your java.net.UnknownHostException that the machine's network configuration/state is wrong. It can't resolve www.cnn.com .  Try opening up a shell on the machine you're running the program on and do a ping to www.cnn.com. If the ping fails then your network on that machine is not working.
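
Since the program is Java, the same name lookup can also be checked from code instead of a shell. What follows is only a minimal sketch, assuming nothing beyond the standard java.net.InetAddress class; the class name ResolveCheck and the helper canResolve are made up for illustration. It asks the JVM to resolve a host name, which is the step that fails with UnknownHostException in the trace:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class ResolveCheck {

    // Returns true if the JVM can resolve the host name to an IP address.
    static boolean canResolve(String host) {
        try {
            InetAddress addr = InetAddress.getByName(host);
            System.out.println(host + " resolves to " + addr.getHostAddress());
            return true;
        } catch (UnknownHostException e) {
            // Same exception type as in the stack trace: the name lookup failed.
            System.out.println("Cannot resolve " + host + ": " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // Default host is just the one from this thread; pass another as the first argument.
        canResolve(args.length > 0 ? args[0] : "www.cnn.com");
    }
}
```

If this reports that the name cannot be resolved while a browser on the same machine loads the site, one possible explanation is that the browser goes through a proxy that the JVM has not been configured to use.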

Good Luck :)

Author Comment

by:polkadot
gen718, how do I do a ping?

My Mozilla browser and IE are both working ok with the sites I've tried to use in parser.

LVL 86

Expert Comment

by:CEHJ
>>That is perfect code.

So how do you account for this?:

>>javax.swing.text.ChangedCharSetException

etc




>>The reason for not getting the text is because of java.net.UnknownHostException

That's true. I suggested adding a file to the url as it might help. There could be many reasons for a momentary error of this kind. That doesn't detract from the fact the html classes are flaky though


LVL 14

Expert Comment

by:sudhakar_koundinya
try this,

working fine my side


import javax.swing.text.*;
import javax.swing.text.html.*;
import java.io.*;
import java.net.*;

public class testurl{
   
    public static void main(String[] args) {
       
        System.out.println(getText("http://www.cnn.com/"));
       
    }
   
   
    public static String getText(String uriStr) {
        final StringBuffer buf = new StringBuffer(1000);
       
        try {
            // Create an HTML document that appends all text to buf
            HTMLDocument doc = new HTMLDocument() {
                public HTMLEditorKit.ParserCallback getReader(int pos) {
                    return new HTMLEditorKit.ParserCallback() {
                        // This method is called whenever text is encountered in the HTML file
                        public void handleText(char[] data, int pos) {
                            System.err.println("Hello :"+new String(data));
                            buf.append(data);
                            buf.append('\n');
                        }
                    };
                }
            };
           
            // Create a reader on the HTML content
            //URL url = new URI(uriStr).toURL();
           // URLConnection conn = url.openConnection();
            //Reader rd = new InputStreamReader(conn.getInputStream());
            ByteArrayInputStream stream=new ByteArrayInputStream(getHTML(uriStr).getBytes());
            Reader rd=new InputStreamReader(stream);
           
            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            kit.read(rd, doc, 0);
        } catch (Exception e) {
            e.printStackTrace();
        }
       
        // Return the text
        return buf.toString();
    }
   
    static String getHTML(String _url) {
        StringBuffer sb=new StringBuffer();
        try {
            // Create a URL for the desired page
            URL url = new URL(_url);
           
            // Read all the text returned by the server
            BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
            String str;
            while ((str = in.readLine()) != null) {
                sb.append(str).append("\r\n");
            }
            in.close();
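        } catch (MalformedURLException e) {
            // Report the problem instead of silently returning an empty string
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }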
        return sb.toString();
    }
}

LVL 14

Expert Comment

by:sudhakar_koundinya
>>So how do you account for this?:


That is obvious because any html parser can't parse non html strings

Before downloading the url content,  I just get (CNN.com) which is non HTML String

LVL 86

Expert Comment

by:CEHJ
This is what i get with that code:

Hello :CNN.com
javax.swing.text.ChangedCharSetException
        at javax.swing.text.html.parser.DocumentParser.handleEmptyTag(Unknown Source)
        at javax.swing.text.html.parser.Parser.startTag(Unknown Source)
        at javax.swing.text.html.parser.Parser.parseTag(Unknown Source)
        at javax.swing.text.html.parser.Parser.parseContent(Unknown Source)
        at javax.swing.text.html.parser.Parser.parse(Unknown Source)
        at javax.swing.text.html.parser.DocumentParser.parse(Unknown Source)
        at javax.swing.text.html.parser.ParserDelegator.parse(Unknown Source)
        at javax.swing.text.html.HTMLEditorKit.read(Unknown Source)
        at testurl.getText(testurl.java:42)
        at testurl.main(testurl.java:10)
CNN.com
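
The trace is thrown from DocumentParser.handleEmptyTag, which is where the Swing parser reacts to a <meta> tag that declares a charset. The parser has a flag to ignore such directives. Below is a minimal sketch, not a drop-in fix: it calls ParserDelegator directly with that flag set, and the class name CharsetTolerantText, the helper extractText and the sample page are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class CharsetTolerantText {

    // Collects every text node the Swing HTML parser reports for the input.
    static String extractText(Reader rd) {
        final StringBuilder buf = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                buf.append(data).append('\n');
            }
        };
        try {
            // Third argument true: ignore charset directives in <meta> tags
            // instead of throwing ChangedCharSetException.
            new ParserDelegator().parse(rd, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // Made-up sample page with the kind of <meta> tag that triggers the exception.
        String html = "<html><head>"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">"
                + "<title>Sample</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.print(extractText(new StringReader(html)));
    }
}
```

When going through HTMLEditorKit.read as in the code in this thread, the commonly cited equivalent is to call doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE) on the document before kit.read.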

Author Comment

by:polkadot
yeah that's what I get too ...

I thought it may have been my firewall (which is a bit flaky) but I have it turned off now, hope pc doesn't explode :o

LVL 86

Expert Comment

by:CEHJ
>>That is obvious because any html parser can't parse non html strings

Well it certainly doesn't make IE fall over does it? Or wget for that matter, which tends to support what i was saying earlier

LVL 14

Expert Comment

by:sudhakar_koundinya
>> This is what i get with that code:

With my new code??

It is working fine my side

LVL 86

Expert Comment

by:CEHJ
>>With my new code??

Yes

Author Comment

by:polkadot
um ... maybe cnn has blocked us somehow, because i'm getting it to work for less popular sites:

http://archives.math.utk.edu/
http://squid-docs.sourceforge.net/latest/book-full.html#AEN16

etc....

cnn yahoo att don't work

Author Comment

by:polkadot
CEHJ, on some level I appreciate your remarks, but you're really not helping :)

LVL 86

Expert Comment

by:CEHJ
>>maybe cnn has blocked us somehow,

Well can you get it with your browser?

LVL 86

Expert Comment

by:CEHJ
>>but you're really not helping

Well i'm trying to. I just asked you a question by way of trying to help

Author Comment

by:polkadot
yes, I can get it with my browser, that's what  I said before ...

Author Comment

by:polkadot
Sudhakar, it does work sometimes. But what determines when it works and when it doesn't? I may actually consider using the google API and just retrieving documents from its cache.


LVL 14

Expert Comment

by:sudhakar_koundinya
try this

But you need to download HttpClient Api from
http://jakarta.apache.org/commons/httpclient/downloads.html

Let me know the result. This is also working fine my side.

Make sure to check which status code you are getting here



import java.io.FileOutputStream;
import java.io.IOException;
import java.io.*;
import java.net.*;
import javax.swing.text.html.*;
import javax.swing.text.*;

import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.GetMethod;

public class URLDownload {

    private static String url =
         "http://www.cnn.com";

    public static String getHtmlText(String url) {

        //Instantiate an HttpClient
        HttpClient client = new HttpClient();

        //Instantiate a GET HTTP method
        HttpMethod method = new GetMethod(url);

        //Body to return; stays empty if the request fails
        String body = "";
        try{
            int statusCode = client.executeMethod(method);

            System.out.println("Status Text>>>"
                  +HttpStatus.getStatusText(statusCode));

            //Get data as a String
            String response = method.getResponseBodyAsString();
            if (response != null) {
                body = response;
            }
            System.out.println(body);
        }
        catch(IOException e) {
            e.printStackTrace();
        }
        finally {
            //release connection, even if the request failed
            method.releaseConnection();
        }
        return body;
    }
   
   
   
   
   
    public static String getText(String uriStr) {
        final StringBuffer buf = new StringBuffer(1000);
       
        try {
            // Create an HTML document that appends all text to buf
            HTMLDocument doc = new HTMLDocument() {
                public HTMLEditorKit.ParserCallback getReader(int pos) {
                    return new HTMLEditorKit.ParserCallback() {
                        // This method is called whenever text is encountered in the HTML file
                        public void handleText(char[] data, int pos) {
                            System.err.println("Hello :"+new String(data));
                            buf.append(data);
                            buf.append('\n');
                        }
                    };
                }
            };
           
            // Create a reader on the HTML content
            //URL url = new URI(uriStr).toURL();
           // URLConnection conn = url.openConnection();
            //Reader rd = new InputStreamReader(conn.getInputStream());
            ByteArrayInputStream stream=new ByteArrayInputStream(getHtmlText(uriStr).getBytes());
            Reader rd=new InputStreamReader(stream);
           
            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            kit.read(rd, doc, 0);
        } catch (Exception e) {
            e.printStackTrace();
        }
       
        // Return the text
        return buf.toString();
    }
}

LVL 86

Expert Comment

by:CEHJ
That's a better suggestion, but of course won't help if there's a network problem to cnn. You can only make an educated guess as to the reasons for that after gathering as much data as possible

Author Comment

by:polkadot
im sorry, im really tired and really confused, what exactly am I downloading, because there are a few things on that site

Author Comment

by:polkadot
actually, can you tell me how to just get the html source of a page? I can't seem to get the code to just extract the source

Author Comment

by:polkadot
like can I just get a string s= "... <html> ... <font> blah blah ...</html> ..." from a site like cnn.com

LVL 14

Accepted Solution

by:sudhakar_koundinya, earned 500 total points
using java.net api and java.io api

you can test this, for checking whether page is downloading or not

 static String getHTML(String _url) {
        StringBuffer sb=new StringBuffer();
        try {
            // Create a URL for the desired page
            URL url = new URL(_url);
           
            // Read all the text returned by the server
            BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
            String str;
            while ((str = in.readLine()) != null) {
                sb.append(str).append("\r\n");
            }
            in.close();
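        } catch (MalformedURLException e) {
            // Report the problem instead of silently returning an empty string
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }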
        return sb.toString();
    }
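
To tell a download problem apart from a parser problem, it can help to look at the HTTP status as well as the body. The following is a rough sketch using only java.net and java.io, not a tested replacement for the method above. The class name PageSource, the helpers readAll and fetch, the 10 second timeouts and the ISO-8859-1 charset are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PageSource {

    // Reads the whole stream into a String using the given charset.
    static String readAll(InputStream in, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader rd = new BufferedReader(new InputStreamReader(in, cs))) {
            String line;
            while ((line = rd.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Downloads the page and prints the HTTP status before returning the body.
    static String fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        conn.setConnectTimeout(10000); // assumed values, in milliseconds
        conn.setReadTimeout(10000);
        System.err.println("HTTP status: " + conn.getResponseCode());
        // Assumption: the page is ISO-8859-1; a real client would take the
        // charset from the Content-Type response header instead.
        return readAll(conn.getInputStream(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch(args.length > 0 ? args[0] : "http://www.cnn.com/"));
    }
}
```

If the status is printed and the HTML source comes back intact, the download itself is fine and any remaining trouble is on the parsing side.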

LVL 16

Expert Comment

by:krakatoa
Sites like CNN - essentially a news service - change all the time, and this probably affects the way your parser is performing. Point your code at a less dynamic site and see if it suddenly becomes more stable.

Author Comment

by:polkadot
yes, krakatoa I think you are right, other sites are working

LVL 14

Expert Comment

by:sudhakar_koundinya
try one of the individual article urls, something like this, and test with it

"http://www.cnn.com/2004/WORLD/asiapcf/10/09/afghanistan.elections/index.html"

Author Comment

by:polkadot
the getHTML code works to extract the source html of the cnn page and all the other pages ok, maybe it is the parser after all

I will try the Apache HttpClient, and let you know how it worked

LVL 16

Expert Comment

by:krakatoa

            protected timerT ttaskk;

                ttaskk = new timerT();
                timerr = new java.util.Timer();
                timerr.schedule(ttaskk, 0, 15000);










    //--------------------- **t i m e r T** INNER CLASS ----------------------------------//
    public class timerT extends TimerTask {


        private int deltaa,yeltaa;
        protected String urlpointer;
        protected String questiondata;
        protected String eeqpoints;
        protected String eeqtitle;
        protected String oldeeqpoints;
        protected String oldeeqtitle;
        protected String eemembername;

        public timerT() {

            super();
            URLcontent = "";
            keystring = "";
          backkeystring="";      
            questiondata = "";
            eeqpoints = "";
            eeqtitle = "";
            eemembername = "";
            oldeeqpoints = "";
            oldeeqtitle = "";
            urlpointer = txtField.getText().trim();

        }//end constructor


        public void run() {

            System.gc();

            if (urlpointer.equals("")) {
                alertURLchange = false;
                mymenuitemdownload.setLabel("Alert URL changes");
            }

            if (alertURLchange == false) {
                timerr.cancel();
                System.gc();
            }//end if

            txtArea.setText("");
            try {
                URL u = new URL(urlpointer);
                try {
                    Object o = u.getContent();


                    if (o instanceof InputStream) {
                        showTextt((InputStream) o);
                    } else {

                        showTextt(urlpointer);

                    }

                } catch (IOException e) {
                    e.printStackTrace();
                    showTextt("Could not connect to " + u.getHost());
                } catch (NullPointerException e) {
                    e.printStackTrace();
                    showTextt("There was a problem with the content." + '\n');
                }

            } catch (MalformedURLException e) {
                e.printStackTrace();
                showTextt(urlpointer + " is not a valid URL" + '\n');
            }


            if (URLcontent.indexOf(keystring) < 0) {


                if (!((eeqpoints.equals(oldeeqpoints)) && (eeqtitle.equals(oldeeqtitle)))) {

                    generalDialog am = new generalDialog(thisxcomms, "NEW QUESTION ARRIVED", false);
                    am.setSize(new Dimension(600, 47));
                    am.setLayout(new BorderLayout());
                    JLabel msgtf1 = new JLabel("Question worth " + eeqpoints + " points just posted : ");
                    JLabel msgtf2 = new JLabel(eeqtitle.trim());
                    msgtf2.setForeground(java.awt.Color.blue);
                    msgtf2.setBackground(java.awt.Color.blue);

                    am.add(msgtf1, BorderLayout.WEST);
                    am.add(msgtf2, BorderLayout.CENTER);
                    //msgtf1.setEditable(false);
                    //msgtf2.setEditable(false);

                    am.show();

                    txtArea.appendText(keystring + '\n');//"URL changed "+new java.util.GregorianCalendar().getTime()+'\n');

                }//end if

            }

            oldeeqpoints = eeqpoints;
            oldeeqtitle = eeqtitle;

            keystring = "";
          backkeystring="";
            questiondata = "";
            //eeqpoints="";
            //eeqtitle="";
            eemembername = "";

            URLcontent = URLtext.toString().trim();

        }//end run timerT


        public void showTextt(InputStream is) {

            String nextline = null;
            URLtext = new StringBuffer();


            try {
                DataInputStream dis = new DataInputStream(is);

                while ((nextline = dis.readLine()) != null) {
                    URLtext.append(nextline.trim());

                }

            } catch (IOException e) {
                e.printStackTrace();
                txtArea.appendText(e.toString());
            }
            try {

//this keystring locates the Questions Awaiting Answers string.

                //keystring = URLtext.substring(URLtext.indexOf("Questions Awaiting Answers:"), (URLtext.indexOf("</nobr>", URLtext.indexOf("<nobr>Questions Awaiting Answers:"))));//old EE format
            backkeystring=URLtext.substring(URLtext.indexOf("Questions Awaiting Answers:"));
            keystring = URLtext.substring(URLtext.indexOf("title="));

//this extracts the guts of the Q data from the keystring.

                //questiondata = URLtext.substring(URLtext.indexOf("<b>Member</b>", URLtext.indexOf(keystring)), URLtext.indexOf("viewMember", URLtext.indexOf("<b>Member</b>", URLtext.indexOf(keystring))));//old ee format
            questiondata = keystring.substring(keystring.indexOf("title=")+7,keystring.indexOf((char)34,keystring.indexOf("title=")+7));

            String[] bits = questiondata.split("&nbsp;");
            StringBuffer stb = new StringBuffer("");
            for(int a=0;a<bits.length;a++){stb.append(bits[a]);}
            if(!stb.toString().equals("")){
            questiondata = stb.toString();
            }            

            //System.out.println(questiondata);


//this isolates the points for the Q.



                //eeqpoints = questiondata.substring(questiondata.indexOf(">", questiondata.indexOf("center")) + 1, questiondata.indexOf("<", questiondata.indexOf("center")));//old ee format

            eeqpoints = backkeystring.substring(backkeystring.indexOf((char)34+" align=center>")+15,backkeystring.indexOf("</td>",backkeystring.indexOf((char)34+" align=center>")+15));

                //eeqtitle = questiondata.substring(questiondata.indexOf("title") + 7, questiondata.indexOf(">", questiondata.indexOf("title") + 7) - 1);//old ee format

            eeqtitle = questiondata;

//System.out.println(eeqpoints);

//System.out.println(eeqtitle);

            } catch (StringIndexOutOfBoundsException sio) {
                if (URLcontent.equals("")) {
                    txtArea.appendText("No initial input yet");
                }
            }
        }




        public void showTextt(String s) {

            String nextline = null;

            txtArea.setText("");

            try {
                RandomAccessFile myRandomAccessFile = new RandomAccessFile(s, "r");//open file

                while ((nextline = myRandomAccessFile.readLine()) != null) {
                    txtArea.appendText(nextline + "\n");
                }

            } catch (Exception e) {
                e.printStackTrace();
                txtArea.appendText(e.toString());
            }
        }



    }//end class timerT
