polkadot

reading html pages, url, uri ... doesn't work now?

My original question was asked here:

https://www.experts-exchange.com/questions/21162135/opening-an-online-web-document-html-page-and-parsing-the-text.html

I used the same code a few hours ago and it worked. What is going on?
There are no compile errors, but it doesn't print out any text. Why?
I am using it like this:

import javax.swing.text.*;
import javax.swing.text.html.*;
import java.io.*;
import java.net.*;

public class testurl{

public static void main(String[] args)
{

            System.out.println(getText("http://www.cnn.com"));

}


      public static String getText(String uriStr) {
                  final StringBuffer buf = new StringBuffer(1000);

                  try {
                        // Create an HTML document that appends all text to buf
                        HTMLDocument doc = new HTMLDocument() {
                              public HTMLEditorKit.ParserCallback getReader(int pos) {
                                    return new HTMLEditorKit.ParserCallback() {
                                          // This method is called whenever text is encountered in the HTML file
                                          public void handleText(char[] data, int pos) {
                                                buf.append(data);
                                                buf.append('\n');
                                          }
                                    };
                              }
                        };

                        // Create a reader on the HTML content
                        URL url = new URI(uriStr).toURL();
                        URLConnection conn = url.openConnection();
                        Reader rd = new InputStreamReader(conn.getInputStream());

                        // Parse the HTML
                        EditorKit kit = new HTMLEditorKit();
                        kit.read(rd, doc, 0);
                  } catch (MalformedURLException e) {
                  } catch (URISyntaxException e) {
                  } catch (BadLocationException e) {
                  } catch (IOException e) {
                  }

                  // Return the text
                  return buf.toString();
            }
}

CEHJ

You certainly won't know why if you have empty exception blocks! Fill them with printStackTrace
polkadot

ASKER

don't know what printstacktrace is
Replace

>>

                   EditorKit kit = new HTMLEditorKit();
                   kit.read(rd, doc, 0);
              } catch (MalformedURLException e) {
              } catch (URISyntaxException e) {
              } catch (BadLocationException e) {
              } catch (IOException e) {
              }
>>

with


                   EditorKit kit = new HTMLEditorKit();
                   kit.read(rd, doc, 0);
              } catch (Exception e) {
                  e.printStackTrace();
              }
Well, I tried System.out.println and the error I get is java.net.UnknownHostException: www.cnn.com

But what does that mean, and how do I fix it?
java.net.UnknownHostException: www.cnn.com
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:153)
        at java.net.Socket.connect(Socket.java:452)
        at java.net.Socket.connect(Socket.java:402)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:139)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:402)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:618)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:306)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:267)
        at sun.net.www.http.HttpClient.New(HttpClient.java:339)
        at sun.net.www.http.HttpClient.New(HttpClient.java:320)
        at sun.net.www.http.HttpClient.New(HttpClient.java:315)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:521)
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:498)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:626)
        at testurl.getText(testurl.java:36)
        at testurl.main(testurl.java:11)
It means the host name in the address couldn't be resolved. Try

http://www.cnn.com
javax.swing.text.ChangedCharSetException
        at javax.swing.text.html.parser.DocumentParser.handleEmptyTag(DocumentParser.java:169)
        at javax.swing.text.html.parser.Parser.startTag(Parser.java:372)
        at javax.swing.text.html.parser.Parser.parseTag(Parser.java:1846)
        at javax.swing.text.html.parser.Parser.parseContent(Parser.java:1881)
        at javax.swing.text.html.parser.Parser.parse(Parser.java:2047)
        at javax.swing.text.html.parser.DocumentParser.parse(DocumentParser.java:106)
        at javax.swing.text.html.parser.ParserDelegator.parse(ParserDelegator.java:78)
        at javax.swing.text.html.HTMLEditorKit.read(HTMLEditorKit.java:230)
        at testurrl.getText(testurrl.java:41)
        at testurrl.main(testurrl.java:12)

and finally get this result
CNN.com


Did you see the source code? That's what I have!

I swear it worked earlier today. Why now? And the CNN site is up and running.
I don't know where you got the code from with those empty catch blocks, but you should know that the Java classes you're using are quite flaky
nope! same errors as for cnn.com

I've tried other sites, same thing, same error
can you suggest some other code for what I want to do?
sudhakar_koundinya I'm not sure what you are suggesting
>>can you suggest some other code for what I want to do?

Not from personal experience. All i can tell you is that the library classes are flaky, and/or intolerant of imprecise markup. Try the Neko HTML parser.
>>All i can tell you is that the library classes are flaky

They are from the standard HTML library.


The reason for not getting the text is because of java.net.UnknownHostException: www.cnn.com, which is related to the java.net API.


And when I try to parse the text on my side,

the downloaded text I get is "CNN.com".

So we need to find out why it is not downloading the entire HTML. It has nothing to do with the HTML parser. That is perfect code.

polkadot,
  bear with me for some time. I will try to find out why it is not downloading.
I just tried the code as is. It worked fine. Perhaps the machine you are running it from has a network problem. Judging by your java.net.UnknownHostException, the machine's network configuration or state is wrong: it can't resolve www.cnn.com. Try opening a shell on the machine you're running the program on and ping www.cnn.com. If the ping fails, the network on that machine is not working.

Good Luck :)
gen718, how do I do a ping?

My Mozilla browser and IE are both working OK with the sites I've tried to use in the parser.
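
For the name-resolution check suggested above: from a command prompt, `ping www.cnn.com` is the quickest test. The same check can be made from inside Java, which is useful because the JVM does not necessarily share the browser's settings (a browser configured to use a proxy is one possible reason why pages load there but not from Java; that is a guess, not something established in this thread). A minimal sketch follows; the class name ResolveCheck and the method canResolve are made up for illustration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class ResolveCheck {

    // Returns true if the JVM can turn the host name into an IP address.
    // InetAddress.getByName throws UnknownHostException when it cannot,
    // which is the same exception seen in the stack trace above.
    public static boolean canResolve(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Host taken from the command line, defaulting to the one in the question.
        String host = args.length > 0 ? args[0] : "www.cnn.com";
        System.out.println(host + " resolves: " + canResolve(host));
    }
}
```

If this prints false while the browser can open the site, the problem is in how the JVM reaches the network rather than in the parsing code.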
>>That is perfect code.

So how do you account for this?:

>>javax.swing.text.ChangedCharSetException

etc




>>The reason for not getting the text is because of java.net.UnknownHostException

That's true. I suggested adding a file to the url as it might help. There could be many reasons for a momentary error of this kind. That doesn't detract from the fact the html classes are flaky though

Try this.

It is working fine on my side.


import javax.swing.text.*;
import javax.swing.text.html.*;
import java.io.*;
import java.net.*;

public class testurl{
   
    public static void main(String[] args) {
       
        System.out.println(getText("http://www.cnn.com/"));
       
    }
   
   
    public static String getText(String uriStr) {
        final StringBuffer buf = new StringBuffer(1000);
       
        try {
            // Create an HTML document that appends all text to buf
            HTMLDocument doc = new HTMLDocument() {
                public HTMLEditorKit.ParserCallback getReader(int pos) {
                    return new HTMLEditorKit.ParserCallback() {
                        // This method is called whenever text is encountered in the HTML file
                        public void handleText(char[] data, int pos) {
                            System.err.println("Hello :"+new String(data));
                            buf.append(data);
                            buf.append('\n');
                        }
                    };
                }
            };
           
            // Create a reader on the HTML content
            //URL url = new URI(uriStr).toURL();
           // URLConnection conn = url.openConnection();
            //Reader rd = new InputStreamReader(conn.getInputStream());
            ByteArrayInputStream stream=new ByteArrayInputStream(getHTML(uriStr).getBytes());
            Reader rd=new InputStreamReader(stream);
           
            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            kit.read(rd, doc, 0);
        } catch (Exception e) {
            e.printStackTrace();
        }
       
        // Return the text
        return buf.toString();
    }
   
    static String getHTML(String _url) {
        StringBuffer sb=new StringBuffer();
        try {
            // Create a URL for the desired page
            URL url = new URL(_url);
           
            // Read all the text returned by the server
            BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
            String str;
            while ((str = in.readLine()) != null) {
                sb.append(str).append("\r\n");
            }
            in.close();
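        } catch (MalformedURLException e) {
            // Report the failure instead of silently returning an empty string
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }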
        return sb.toString();
    }
}
>>So how do you account for this?:


That is obvious because any html parser can't parse non html strings

Before downloading the URL content, I just get (CNN.com), which is a non-HTML string.
This is what i get with that code:

Hello :CNN.com
javax.swing.text.ChangedCharSetException
        at javax.swing.text.html.parser.DocumentParser.handleEmptyTag(Unknown Source)
        at javax.swing.text.html.parser.Parser.startTag(Unknown Source)
        at javax.swing.text.html.parser.Parser.parseTag(Unknown Source)
        at javax.swing.text.html.parser.Parser.parseContent(Unknown Source)
        at javax.swing.text.html.parser.Parser.parse(Unknown Source)
        at javax.swing.text.html.parser.DocumentParser.parse(Unknown Source)
        at javax.swing.text.html.parser.ParserDelegator.parse(Unknown Source)
        at javax.swing.text.html.HTMLEditorKit.read(Unknown Source)
        at testurl.getText(testurl.java:42)
        at testurl.main(testurl.java:10)
CNN.com
yeah that's what I get too ...

I thought it may have been my firewall (which is a bit flaky) but I have it turned off now, hope pc doesn't explode :o
>>That is obvious because any html parser can't parse non html strings

Well it certainly doesn't make IE fall over does it? Or wget for that matter, which tends to support what i was saying earlier
>> This is what i get with that code:

With my new code??

It is working fine on my side.
>>With my new code??

Yes
um ... maybe cnn has blocked us somehow, because i'm getting it to work for less popular sites:

http://archives.math.utk.edu/
http://squid-docs.sourceforge.net/latest/book-full.html#AEN16

etc....

cnn yahoo att don't work
CEHJ, on some level I appreciate your remarks, but you're really not helping :)
>>maybe cnn has blocked us somehow,

Well can you get it with your browser?
>>but you're really not helping

Well i'm trying to. I just asked you a question by way of trying to help
yes, I can get it with my browser, that's what  I said before ...
Sudhakar, it does work sometimes. But what determines when it works and when it doesn't? I may actually consider using the Google API and just retrieving documents from its cache.

Try this.

You need to download the HttpClient API from
http://jakarta.apache.org/commons/httpclient/downloads.html

Let me know the result. This is also working fine on my side.

Check which status code you are getting here.



import java.io.FileOutputStream;
import java.io.IOException;
import java.io.*;
import java.net.*;
import javax.swing.text.html.*;
import javax.swing.text.*;

import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.GetMethod;

public class URLDownload {

    private static String url =
         "http://www.cnn.com";

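    public static String getHtmlText(String url) {

        //Instantiate an HttpClient
        HttpClient client = new HttpClient();

        //Instantiate a GET HTTP method
        HttpMethod method = new GetMethod(url);

        //Holds the page source; stays empty if the request fails
        String body = "";

        try{
            int statusCode = client.executeMethod(method);

            System.out.println("Status Text>>>"
                  +HttpStatus.getStatusText(statusCode));

            //Get data as a String
            String response = method.getResponseBodyAsString();
            if (response != null) {
                body = response;
            }
            System.out.println(body);
        }
        catch(IOException e) {
            e.printStackTrace();
        }
        finally {
            //release connection, even if the request failed
            method.releaseConnection();
        }

        //Return the page source so getText can parse it
        return body;
    }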
   
   
   
   
   
    public static String getText(String uriStr) {
        final StringBuffer buf = new StringBuffer(1000);
       
        try {
            // Create an HTML document that appends all text to buf
            HTMLDocument doc = new HTMLDocument() {
                public HTMLEditorKit.ParserCallback getReader(int pos) {
                    return new HTMLEditorKit.ParserCallback() {
                        // This method is called whenever text is encountered in the HTML file
                        public void handleText(char[] data, int pos) {
                            System.err.println("Hello :"+new String(data));
                            buf.append(data);
                            buf.append('\n');
                        }
                    };
                }
            };
           
            // Create a reader on the HTML content
            //URL url = new URI(uriStr).toURL();
           // URLConnection conn = url.openConnection();
            //Reader rd = new InputStreamReader(conn.getInputStream());
            ByteArrayInputStream stream=new ByteArrayInputStream(getHtmlText(uriStr).getBytes());
            Reader rd=new InputStreamReader(stream);
           
            // Parse the HTML
            EditorKit kit = new HTMLEditorKit();
            kit.read(rd, doc, 0);
        } catch (Exception e) {
            e.printStackTrace();
        }
       
        // Return the text
        return buf.toString();
    }
}
That's a better suggestion, but of course it won't help if there's a network problem reaching cnn. You can only make an educated guess as to the reasons for that after gathering as much data as possible.
im sorry, im really tired and really confused, what exactly am I downloading, because there are a few things on that site
actually, can you tell me how to just get the html source of a page? I can't seem to get the code to just extract the source
like can I just get a string s= "... <html> ... <font> blah blah ...</html> ..." from a site like cnn.com
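
On getting just the page source as a String: the getHTML method posted earlier in the thread already does this. Below is a minimal, self-contained sketch of the same idea in more recent Java. It is not the accepted solution (that is not visible here); the class name PageSource and the method fetch are made up for illustration, and the sketch assumes the page is encoded as UTF-8, which will not hold for every site.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PageSource {

    // Reads everything the URL returns and hands it back as one String,
    // with line breaks normalised to '\n'.
    // Assumption: the content is UTF-8 encoded.
    public static String fetch(URL url) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // URL taken from the command line, defaulting to the one in the question.
        String address = args.length > 0 ? args[0] : "http://www.cnn.com/";
        System.out.println(fetch(URI.create(address).toURL()));
    }
}
```

Unlike getHTML, this version lets the IOException propagate, so a network failure shows up at the call site instead of as an empty string.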
ASKER CERTIFIED SOLUTION
sudhakar_koundinya

Sites like CNN - essentially a news service - change all the time, and this probably affects the way your parser is performing. Point your code at a less dynamic site and see if it suddenly becomes more stable.
yes, krakatoa I think you are right, other sites are working
Try one of the individual article URLs, something like this, and test with it:

"http://www.cnn.com/2004/WORLD/asiapcf/10/09/afghanistan.elections/index.html"
The getHTML code works to extract the source HTML of the cnn page and all the other pages OK, so maybe it is the parser after all.

I will try the Apache library and let you know how it works.
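
A note on the javax.swing.text.ChangedCharSetException seen in the traces above. The Swing parser throws it when it meets a <meta http-equiv="Content-Type"> tag that declares a charset, unless it has been told to ignore charset directives. That would fit the symptoms in this thread (the title "CNN.com" is handled, then parsing stops), though it has not been confirmed here as the cause. With the original code, one workaround is to call doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE) before kit.read(...). Another is to drive the parser directly, as in the sketch below; the class name CharsetTolerantText and the sample page are made up for illustration and are not CNN's real markup.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class CharsetTolerantText {

    // Collects every run of text in the HTML, one run per line.
    public static String extractText(Reader html) throws IOException {
        final StringBuilder buf = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                buf.append(data).append('\n');
            }
        };
        // Third argument true = ignore charset directives in <meta> tags,
        // so the parser does not throw ChangedCharSetException.
        new ParserDelegator().parse(html, callback, true);
        return buf.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample page with a charset directive in the head.
        String page = "<html><head>"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">"
                + "<title>CNN.com</title></head>"
                + "<body><p>Top story</p></body></html>";
        System.out.print(extractText(new StringReader(page)));
    }
}
```

Ignoring the directive means the characters are decoded with whatever charset the Reader was created with, so text in other encodings may come out garbled.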

            protected timerT ttaskk;

                ttaskk = new timerT();
                timerr = new java.util.Timer();
                timerr.schedule(ttaskk, 0, 15000);










    //--------------------- **t i m e r T** INNER CLASS ----------------------------------//
    public class timerT extends TimerTask {


        private int deltaa,yeltaa;
        protected String urlpointer;
        protected String questiondata;
        protected String eeqpoints;
        protected String eeqtitle;
        protected String oldeeqpoints;
        protected String oldeeqtitle;
        protected String eemembername;

        public timerT() {

            super();
            URLcontent = "";
            keystring = "";
          backkeystring="";      
            questiondata = "";
            eeqpoints = "";
            eeqtitle = "";
            eemembername = "";
            oldeeqpoints = "";
            oldeeqtitle = "";
            urlpointer = txtField.getText().trim();

        }//end constructor


        public void run() {

            System.gc();

            if (urlpointer.equals("")) {
                alertURLchange = false;
                mymenuitemdownload.setLabel("Alert URL changes");
            }

            if (alertURLchange == false) {
                timerr.cancel();
                System.gc();
            }//end if

            txtArea.setText("");
            try {
                URL u = new URL(urlpointer);
                try {
                    Object o = u.getContent();


                    if (o instanceof InputStream) {
                        showTextt((InputStream) o);
                    } else {

                        showTextt(urlpointer);

                    }

                } catch (IOException e) {
                    e.printStackTrace();
                    showTextt("Could not connect to " + u.getHost());
                } catch (NullPointerException e) {
                    e.printStackTrace();
                    showTextt("There was a problem with the content." + '\n');
                }

            } catch (MalformedURLException e) {
                e.printStackTrace();
                showTextt(urlpointer + " is not a valid URL" + '\n');
            }


            if (URLcontent.indexOf(keystring) < 0) {


                if (!((eeqpoints.equals(oldeeqpoints)) && (eeqtitle.equals(oldeeqtitle)))) {

                    generalDialog am = new generalDialog(thisxcomms, "NEW QUESTION ARRIVED", false);
                    am.setSize(new Dimension(600, 47));
                    am.setLayout(new BorderLayout());
                    JLabel msgtf1 = new JLabel("Question worth " + eeqpoints + " points just posted : ");
                    JLabel msgtf2 = new JLabel(eeqtitle.trim());
                    msgtf2.setForeground(java.awt.Color.blue);
                    msgtf2.setBackground(java.awt.Color.blue);

                    am.add(msgtf1, BorderLayout.WEST);
                    am.add(msgtf2, BorderLayout.CENTER);
                    //msgtf1.setEditable(false);
                    //msgtf2.setEditable(false);

                    am.show();

                    txtArea.appendText(keystring + '\n');//"URL changed "+new java.util.GregorianCalendar().getTime()+'\n');

                }//end if

            }

            oldeeqpoints = eeqpoints;
            oldeeqtitle = eeqtitle;

            keystring = "";
          backkeystring="";
            questiondata = "";
            //eeqpoints="";
            //eeqtitle="";
            eemembername = "";

            URLcontent = URLtext.toString().trim();

        }//end run timerT


        public void showTextt(InputStream is) {

            String nextline = null;
            URLtext = new StringBuffer();


            try {
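                // DataInputStream.readLine() is deprecated and does not convert bytes to characters properly;
                // a BufferedReader (java.io) reads the lines instead
                BufferedReader dis = new BufferedReader(new InputStreamReader(is));

                while ((nextline = dis.readLine()) != null) {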
                    URLtext.append(nextline.trim());

                }

            } catch (IOException e) {
                e.printStackTrace();
                txtArea.appendText(e.toString());
            }
            try {

//this keystring locates the Questions Awaiting Answers string.

                //keystring = URLtext.substring(URLtext.indexOf("Questions Awaiting Answers:"), (URLtext.indexOf("</nobr>", URLtext.indexOf("<nobr>Questions Awaiting Answers:"))));//old EE format
            backkeystring=URLtext.substring(URLtext.indexOf("Questions Awaiting Answers:"));
            keystring = URLtext.substring(URLtext.indexOf("title="));

//this extracts the guts of the Q data from the keystring.

                //questiondata = URLtext.substring(URLtext.indexOf("<b>Member</b>", URLtext.indexOf(keystring)), URLtext.indexOf("viewMember", URLtext.indexOf("<b>Member</b>", URLtext.indexOf(keystring))));//old ee format
            questiondata = keystring.substring(keystring.indexOf("title=")+7,keystring.indexOf((char)34,keystring.indexOf("title=")+7));

            String[] bits = questiondata.split("&nbsp;");
            StringBuffer stb = new StringBuffer("");
            for(int a=0;a<bits.length;a++){stb.append(bits[a]);}
            if(!stb.toString().equals("")){
            questiondata = stb.toString();
            }            

            //System.out.println(questiondata);


//this isolates the points for the Q.



                //eeqpoints = questiondata.substring(questiondata.indexOf(">", questiondata.indexOf("center")) + 1, questiondata.indexOf("<", questiondata.indexOf("center")));//old ee format

            eeqpoints = backkeystring.substring(backkeystring.indexOf((char)34+" align=center>")+15,backkeystring.indexOf("</td>",backkeystring.indexOf((char)34+" align=center>")+15));

                //eeqtitle = questiondata.substring(questiondata.indexOf("title") + 7, questiondata.indexOf(">", questiondata.indexOf("title") + 7) - 1);//old ee format

            eeqtitle = questiondata;

//System.out.println(eeqpoints);

//System.out.println(eeqtitle);

            } catch (StringIndexOutOfBoundsException sio) {
                if (URLcontent.equals("")) {
                    txtArea.appendText("No initial input yet");
                }
            }
        }




        public void showTextt(String s) {

            String nextline = null;

            txtArea.setText("");

            try {
                RandomAccessFile myRandomAccessFile = new RandomAccessFile(s, "r");//open file

                while ((nextline = myRandomAccessFile.readLine()) != null) {
                    txtArea.appendText(nextline + "\n");
                }

            } catch (Exception e) {
                e.printStackTrace();
                txtArea.appendText(e.toString());
            }
        }



    }//end class timerT