reading html pages, url, uri ... doesn't work now?

My original question was asked here:

https://www.experts-exchange.com/questions/21162135/opening-an-online-web-document-html-page-and-parsing-the-text.html

I used the same code a few hours ago and it worked. What is going on?
There are no compile errors, but it doesn't print out any text. Why?
I am using it like this:

import javax.swing.text.*;
import javax.swing.text.html.*;
import java.io.*;
import java.net.*;

public class testurl{

public static void main(String[] args)
{

            System.out.println(getText("http://www.cnn.com"));

}

      public static String getText(String uriStr) {
                  final StringBuffer buf = new StringBuffer(1000);

                  try {
                        // Create an HTML document that appends all text to buf
                        HTMLDocument doc = new HTMLDocument() {
                              public HTMLEditorKit.ParserCallback getReader(int pos) {
                                    return new HTMLEditorKit.ParserCallback() {
                                          // This method is called whenever text is encountered in the HTML file
                                          public void handleText(char[] data, int pos) {
                                                buf.append(data);
                                                buf.append('\n');
                                          }
                                    };
                              }
                        };

                        // Create a reader on the HTML content
                        URL url = new URI(uriStr).toURL();
                        URLConnection conn = url.openConnection();
                        Reader rd = new InputStreamReader(conn.getInputStream());

                        // Parse the HTML
                        EditorKit kit = new HTMLEditorKit();
                        kit.read(rd, doc, 0);
                  } catch (MalformedURLException e) {
                  } catch (URISyntaxException e) {
                  } catch (BadLocationException e) {
                  } catch (IOException e) {
                  }

                  // Return the text
                  return buf.toString();
            }
}

CEHJ

You certainly won't know why if you have empty exception blocks! Fill them with printStackTrace

polkadot

ASKER

don't know what printstacktrace is

CEHJ

Replace

>>

EditorKit kit = new HTMLEditorKit();
kit.read(rd, doc, 0);
} catch (MalformedURLException e) {
} catch (URISyntaxException e) {
} catch (BadLocationException e) {
} catch (IOException e) {
}
>>

with

EditorKit kit = new HTMLEditorKit();
kit.read(rd, doc, 0);
} catch (Exception e) {
e.printStackTrace();
}
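
For anyone unsure what printStackTrace does: it prints the exception's class name, its message, and the chain of method calls that led to the failure. Below is a minimal sketch, separate from the code in this thread, that captures that output into a String so you can see what it contains. The class name StackTraceDemo and the helper describe are made up for illustration.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.net.UnknownHostException;

class StackTraceDemo {
    // Hypothetical helper: returns the text that t.printStackTrace() would print.
    static String describe(Throwable t) {
        StringWriter sw = new StringWriter();
        PrintWriter pw = new PrintWriter(sw);
        t.printStackTrace(pw);
        pw.flush();
        return sw.toString();
    }

    public static void main(String[] args) {
        try {
            // Stand-in for a failing network call.
            throw new UnknownHostException("www.example.invalid");
        } catch (Exception e) {
            // An empty catch block here would hide the failure completely.
            System.out.print(describe(e));
        }
    }
}
```

The first line of the output names the exception and its message; the lines that follow show where in the code it was thrown.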

polkadot

ASKER

well I tried system.out.println and the error i get is java.net.UnknownHostException: www.cnn.com

but what does that mean? and how do i fix it?

polkadot

ASKER

java.net.UnknownHostException: www.cnn.com
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:153)
at java.net.Socket.connect(Socket.java:452)
at java.net.Socket.connect(Socket.java:402)
at sun.net.NetworkClient.doConnect(NetworkClient.java:139)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:402)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:618)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:306)
at sun.net.www.http.HttpClient.<init>(HttpClient.java:267)
at sun.net.www.http.HttpClient.New(HttpClient.java:339)
at sun.net.www.http.HttpClient.New(HttpClient.java:320)
at sun.net.www.http.HttpClient.New(HttpClient.java:315)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:521)
at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:498)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:626)
at testurl.getText(testurl.java:36)
at testurl.main(testurl.java:11)

CEHJ

It means the address isn't valid. Try

http://www.cnn.com

sudhakar_koundinya

javax.swing.text.ChangedCharSetException
at javax.swing.text.html.parser.DocumentParser.handleEmptyTag(DocumentParser.java:169)
at javax.swing.text.html.parser.Parser.startTag(Parser.java:372)
at javax.swing.text.html.parser.Parser.parseTag(Parser.java:1846)
at javax.swing.text.html.parser.Parser.parseContent(Parser.java:1881)
at javax.swing.text.html.parser.Parser.parse(Parser.java:2047)
at javax.swing.text.html.parser.DocumentParser.parse(DocumentParser.java:106)
at javax.swing.text.html.parser.ParserDelegator.parse(ParserDelegator.java:78)
at javax.swing.text.html.HTMLEditorKit.read(HTMLEditorKit.java:230)
at testurrl.getText(testurrl.java:41)
at testurrl.main(testurrl.java:12)

and finally get this result
CNN.com

CEHJ

Sorry - i meant

http://www.cnn.com/index.html

polkadot

ASKER

did you see the source code, that's what i have!

I swear it worked earlier today. Why now? and cnn site is up and running

CEHJ

I don't know where you got the code from with those empty catch blocks, but you should know that the Java classes you're using are quite flaky

polkadot

ASKER

nope! same errors as for cnn.com

I've tried other sites, same thing, same error

polkadot

ASKER

can you suggest some other code for what I want to do?

polkadot

ASKER

sudhakar_koundinya I'm not sure what you are suggesting

CEHJ

>>can you suggest some other code for what I want to do?

Not from personal experience. All i can tell you is that the library classes are flaky, and/or intolerant to imprecision in markup. Try the Neko html parser

sudhakar_koundinya

>>All i can tell you is that the library classes are flaky

They are from the standard HTML library.

The reason for not getting the text is the java.net.UnknownHostException: www.cnn.com, which comes from the java.net API, not from the parser.

sudhakar_koundinya

And when I try to parse the text on my side,

the only text I get back is "CNN.com"

sudhakar_koundinya

So we need to find out why it is not downloading the entire HTML. This has nothing to do with the HTML parser. That is perfect code.

polkadot,
bear with me for some time. I will try to find out why it is not downloading.

gen718

I just tried the code as is. It worked fine. Perhaps the machine you are running it from has a network problem. Sounds like from your java.net.UnknownHostException that the machine's network configuration/state is wrong. It can't resolve www.cnn.com . Try opening up a shell on the machine you're running the program on and do a ping to www.cnn.com. If the ping fails then your network on that machine is not working.

Good Luck :)
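
Besides running ping www.cnn.com from a command prompt, you can ask the JVM itself whether it can resolve the name, since that is exactly the step that fails with UnknownHostException. A minimal sketch follows; the class ResolveCheck and the method canResolve are hypothetical names, not part of the code posted above.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

class ResolveCheck {
    // Hypothetical helper: true if the JVM can turn the host name into an IP address.
    static boolean canResolve(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the host from this thread when no argument is given.
        String host = args.length > 0 ? args[0] : "www.cnn.com";
        System.out.println(host + (canResolve(host) ? " resolves" : " does NOT resolve"));
    }
}
```

If the browser can reach the site but this check fails, one possible explanation (an assumption, not something confirmed in this thread) is that the browser goes through a proxy or a different DNS setup than the Java process.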

polkadot

ASKER

gen718, how do I do a ping?

My Mozilla browser and IE are both working OK with the sites I've tried to use in the parser.

CEHJ

>>That is perfect code.

So how do you account for this?:

>>javax.swing.text.ChangedCharSetException

etc

>>The reason for not getting the text is because of java.net.UnknownHostException

That's true. I suggested adding a file to the url as it might help. There could be many reasons for a momentary error of this kind. That doesn't detract from the fact the html classes are flaky though

sudhakar_koundinya

try this,

working fine on my side

import javax.swing.text.*;
import javax.swing.text.html.*;
import java.io.*;
import java.net.*;

public class testurl{

public static void main(String[] args) {

System.out.println(getText("http://www.cnn.com/"));

}

public static String getText(String uriStr) {
final StringBuffer buf = new StringBuffer(1000);

try {
// Create an HTML document that appends all text to buf
HTMLDocument doc = new HTMLDocument() {
public HTMLEditorKit.ParserCallback getReader(int pos) {
return new HTMLEditorKit.ParserCallback() {
// This method is called whenever text is encountered in the HTML file
public void handleText(char[] data, int pos) {
System.err.println("Hello :"+new String(data));
buf.append(data);
buf.append('\n');
}
};
}
};

// Create a reader on the HTML content
//URL url = new URI(uriStr).toURL();
// URLConnection conn = url.openConnection();
//Reader rd = new InputStreamReader(conn.getInputStream());
ByteArrayInputStream stream=new ByteArrayInputStream(getHTML(uriStr).getBytes());
Reader rd=new InputStreamReader(stream);

// Parse the HTML
EditorKit kit = new HTMLEditorKit();
kit.read(rd, doc, 0);
} catch (Exception e) {
e.printStackTrace();
}

// Return the text
return buf.toString();
}

static String getHTML(String _url) {
StringBuffer sb=new StringBuffer();
try {
// Create a URL for the desired page
URL url = new URL(_url);

// Read all the text returned by the server
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String str;
while ((str = in.readLine()) != null) {
sb.append(str).append("\r\n");
}
in.close();
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
return sb.toString();
}
}

sudhakar_koundinya

>>So how do you account for this?:

That is obvious, because no HTML parser can parse non-HTML strings.

Before the URL content is downloaded, all I get is "CNN.com", which is a non-HTML string.

CEHJ

This is what i get with that code:

Hello :CNN.com
javax.swing.text.ChangedCharSetException
at javax.swing.text.html.parser.DocumentParser.handleEmptyTag(Unknown Source)
at javax.swing.text.html.parser.Parser.startTag(Unknown Source)
at javax.swing.text.html.parser.Parser.parseTag(Unknown Source)
at javax.swing.text.html.parser.Parser.parseContent(Unknown Source)
at javax.swing.text.html.parser.Parser.parse(Unknown Source)
at javax.swing.text.html.parser.DocumentParser.parse(Unknown Source)
at javax.swing.text.html.parser.ParserDelegator.parse(Unknown Source)
at javax.swing.text.html.HTMLEditorKit.read(Unknown Source)
at testurl.getText(testurl.java:42)
at testurl.main(testurl.java:10)
CNN.com
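
A note on the trace above: as far as I know, the Swing HTML parser throws ChangedCharSetException when it meets a <meta> tag that declares a charset, unless the document property "IgnoreCharsetDirective" is set to true. That would explain why parsing stops right after the title text. The sketch below sets that property before parsing; the class CharsetTolerantText and the method extractText are hypothetical names, and the rest mirrors the getText approach from the question.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

class CharsetTolerantText {
    // Hypothetical helper: collects all text nodes, ignoring <meta> charset directives.
    static String extractText(Reader html) {
        final StringBuilder buf = new StringBuilder();
        HTMLDocument doc = new HTMLDocument() {
            @Override
            public HTMLEditorKit.ParserCallback getReader(int pos) {
                return new HTMLEditorKit.ParserCallback() {
                    @Override
                    public void handleText(char[] data, int pos) {
                        buf.append(data).append('\n');
                    }
                };
            }
        };
        // Tell the parser not to abort when the page declares its own charset.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        try {
            new HTMLEditorKit().read(html, doc, 0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
        return buf.toString();
    }
}
```

Ignoring the directive means the text is decoded with whatever charset the Reader was created with, so characters outside that charset may come out wrong.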

polkadot

ASKER

yeah that's what I get too ...

I thought it may have been my firewall (which is a bit flaky) but I have it turned off now, hope pc doesn't explode :o

CEHJ

>>That is obvious, because no HTML parser can parse non-HTML strings.

Well it certainly doesn't make IE fall over does it? Or wget for that matter, which tends to support what i was saying earlier

sudhakar_koundinya

>> This is what i get with that code:

With my new code??

It is working fine my side

CEHJ

>>With my new code??

Yes

polkadot

ASKER

um ... maybe cnn has blocked us somehow, because i'm getting it to work for less popular sites:

http://archives.math.utk.edu/
http://squid-docs.sourceforge.net/latest/book-full.html#AEN16

etc....

cnn yahoo att don't work

polkadot

ASKER

CEHJ, on some level I appreciate your remarks, but you're really not helping :)

CEHJ

>>maybe cnn has blocked us somehow,

Well can you get it with your browser?

CEHJ

>>but you're really not helping

Well i'm trying to. I just asked you a question by way of trying to help

polkadot

ASKER

yes, I can get it with my browser, that's what I said before ...

polkadot

ASKER

Sudhakar, it does work sometimes. But what determines when it works and when it doesn't? I may actually consider using the Google API and just retrieving documents from its cache.

sudhakar_koundinya

try this

But you need to download the HttpClient API from
http://jakarta.apache.org/commons/httpclient/downloads.html

Let me know the result. This is also working fine on my side.

Check which status code you are getting here.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.*;
import java.net.*;
import javax.swing.text.html.*;
import javax.swing.text.*;

import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.GetMethod;

public class URLDownload {

private static String url =
"http://www.cnn.com";

public static String getHtmlText(String url) {

//Instantiate an HttpClient
HttpClient client = new HttpClient();

//Instantiate a GET HTTP method
HttpMethod method = new GetMethod(url);

//Holds the page source; stays empty if the request fails
String body = "";

try{
int statusCode = client.executeMethod(method);

System.out.println("Status Text>>>"
+HttpStatus.getStatusText(statusCode));

//Get data as a String
String response = method.getResponseBodyAsString();
if (response != null) {
body = response;
}
}
catch(IOException e) {
e.printStackTrace();
}
finally {
//release connection
method.releaseConnection();
}
return body;
}

public static String getText(String uriStr) {
final StringBuffer buf = new StringBuffer(1000);

try {
// Create an HTML document that appends all text to buf
HTMLDocument doc = new HTMLDocument() {
public HTMLEditorKit.ParserCallback getReader(int pos) {
return new HTMLEditorKit.ParserCallback() {
// This method is called whenever text is encountered in the HTML file
public void handleText(char[] data, int pos) {
System.err.println("Hello :"+new String(data));
buf.append(data);
buf.append('\n');
}
};
}
};

// Create a reader on the HTML content
//URL url = new URI(uriStr).toURL();
// URLConnection conn = url.openConnection();
//Reader rd = new InputStreamReader(conn.getInputStream());
ByteArrayInputStream stream=new ByteArrayInputStream(getHtmlText(uriStr).getBytes());
Reader rd=new InputStreamReader(stream);

// Parse the HTML
EditorKit kit = new HTMLEditorKit();
kit.read(rd, doc, 0);
} catch (Exception e) {
e.printStackTrace();
}

// Return the text
return buf.toString();
}
}

CEHJ

That's a better suggestion, but of course it won't help if there's a network problem reaching cnn. You can only make an educated guess as to the reasons for that after gathering as much data as possible

polkadot

ASKER

I'm sorry, I'm really tired and really confused. What exactly am I downloading? There are a few things on that site

polkadot

ASKER

actually, can you tell me how to just get the html source of a page? I can't seem to get the code to just extract the source

sudhakar_koundinya

OK

this is url

http://apache.247available.com/jakarta/commons/httpclient/binary/commons-httpclient-2.0.1.zip

polkadot

ASKER

like can I just get a string s= "... <html> ... blah blah ...</html> ..." from a site like cnn.com
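
One way to get just the page source as a String, using only the JDK, is sketched below. The class PageSource and its methods readAll and fetch are hypothetical names; decoding as UTF-8 is an assumption, since a real page may declare a different charset. It also prints the HTTP status code, which helps tell a blocked or redirected request from a parser problem.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class PageSource {
    // Reads everything from the reader into one String, one '\n' after each line.
    static String readAll(Reader in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(in)) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Fetches the raw HTML of a page. UTF-8 is assumed here, not detected.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        System.err.println("HTTP status: " + conn.getResponseCode());
        return readAll(new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        String s = fetch(args.length > 0 ? args[0] : "http://www.cnn.com/");
        System.out.println(s);
    }
}
```

Keeping readAll separate from fetch means the same String-building step can be reused on a file or a saved copy of the page when the network is the thing under suspicion.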

ASKER CERTIFIED SOLUTION

sudhakar_koundinya


krakatoa

Sites like CNN - essentially a news service - change all the time, and this probably affects the way your parser is performing. Point your code at a less dynamic site and see if it suddenly becomes more stable.

polkadot

ASKER

yes, krakatoa I think you are right, other sites are working

sudhakar_koundinya

Try one of the individual article URLs, something like this, and test with it:

"http://www.cnn.com/2004/WORLD/asiapcf/10/09/afghanistan.elections/index.html"

polkadot

ASKER

The getHTML code extracts the source HTML of the cnn page and all the other pages OK, so maybe it is the parser after all.

I will try the Apache HttpClient and let you know how it worked.

krakatoa

protected timerT ttaskk;

ttaskk = new timerT();
timerr = new java.util.Timer();
timerr.schedule(ttaskk, 0, 15000);

//--------------------- **t i m e r T** INNER CLASS ----------------------------------//
public class timerT extends TimerTask {

private int deltaa,yeltaa;
protected String urlpointer;
protected String questiondata;
protected String eeqpoints;
protected String eeqtitle;
protected String oldeeqpoints;
protected String oldeeqtitle;
protected String eemembername;

public timerT() {

super();
URLcontent = "";
keystring = "";
 backkeystring="";
questiondata = "";
eeqpoints = "";
eeqtitle = "";
eemembername = "";
oldeeqpoints = "";
oldeeqtitle = "";
urlpointer = txtField.getText().trim();

}//end constructor

public void run() {

System.gc();

if (urlpointer.equals("")) {
alertURLchange = false;
mymenuitemdownload.setLabel("Alert URL changes");
}

if (alertURLchange == false) {
timerr.cancel();
System.gc();
}//end if

txtArea.setText("");
try {
URL u = new URL(urlpointer);
try {
Object o = u.getContent();

if (o instanceof InputStream) {
showTextt((InputStream) o);
} else {

showTextt(urlpointer);

}

} catch (IOException e) {
e.printStackTrace();
showTextt("Could not connect to " + u.getHost());
} catch (NullPointerException e) {
e.printStackTrace();
showTextt("There was a problem with the content." + '\n');
}

} catch (MalformedURLException e) {
e.printStackTrace();
showTextt(urlpointer + " is not a valid URL" + '\n');
}

if (URLcontent.indexOf(keystring) < 0) {

if (!((eeqpoints.equals(oldeeqpoints)) && (eeqtitle.equals(oldeeqtitle)))) {

generalDialog am = new generalDialog(thisxcomms, "NEW QUESTION ARRIVED", false);
am.setSize(new Dimension(600, 47));
am.setLayout(new BorderLayout());
JLabel msgtf1 = new JLabel("Question worth " + eeqpoints + " points just posted : ");
JLabel msgtf2 = new JLabel(eeqtitle.trim());
msgtf2.setForeground(java.awt.Color.blue);
msgtf2.setBackground(java.awt.Color.blue);

am.add(msgtf1, BorderLayout.WEST);
am.add(msgtf2, BorderLayout.CENTER);
//msgtf1.setEditable(false);
//msgtf2.setEditable(false);

am.show();

txtArea.appendText(keystring + '\n');//"URL changed "+new java.util.GregorianCalendar().getTime()+'\n');

}//end if

}

oldeeqpoints = eeqpoints;
oldeeqtitle = eeqtitle;

keystring = "";
 backkeystring="";
questiondata = "";
//eeqpoints="";
//eeqtitle="";
eemembername = "";

URLcontent = URLtext.toString().trim();

}//end run timerT

public void showTextt(InputStream is) {

String nextline = null;
URLtext = new StringBuffer();

try {
DataInputStream dis = new DataInputStream(is);

while ((nextline = dis.readLine()) != null) {
URLtext.append(nextline.trim());

}

} catch (IOException e) {
e.printStackTrace();
txtArea.appendText(e.toString());
}
try {

//this keystring locates the Questions Awaiting Answers string.

//keystring = URLtext.substring(URLtext.indexOf("Questions Awaiting Answers:"), (URLtext.indexOf("", URLtext.indexOf("Questions Awaiting Answers:"))));//old EE format
 backkeystring=URLtext.substring(URLtext.indexOf("Questions Awaiting Answers:"));
 keystring = URLtext.substring(URLtext.indexOf("title="));

//this extracts the guts of the Q data from the keystring.

//questiondata = URLtext.substring(URLtext.indexOf("Member", URLtext.indexOf(keystring)), URLtext.indexOf("viewMember", URLtext.indexOf("Member", URLtext.indexOf(keystring))));//old ee format
 questiondata = keystring.substring(keystring.indexOf("title=")+7,keystring.indexOf((char)34,keystring.indexOf("title=")+7));

 String[] bits = questiondata.split(" ");
 StringBuffer stb = new StringBuffer("");
 for(int a=0;a<bits.length;a++){stb.append(bits[a]);}
 if(!stb.toString().equals("")){
 questiondata = stb.toString();
 }

 //System.out.println(questiondata);

//this isolates the points for the Q.

//eeqpoints = questiondata.substring(questiondata.indexOf(">", questiondata.indexOf("center")) + 1, questiondata.indexOf("<", questiondata.indexOf("center")));//old ee format

 eeqpoints = backkeystring.substring(backkeystring.indexOf((char)34+" align=center>")+15,backkeystring.indexOf("</td>",backkeystring.indexOf((char)34+" align=center>")+15));

//eeqtitle = questiondata.substring(questiondata.indexOf("title") + 7, questiondata.indexOf(">", questiondata.indexOf("title") + 7) - 1);//old ee format

 eeqtitle = questiondata;

//System.out.println(eeqpoints);

//System.out.println(eeqtitle);

} catch (StringIndexOutOfBoundsException sio) {
if (URLcontent.equals("")) {
txtArea.appendText("No initial input yet");
}
}
}

public void showTextt(String s) {

String nextline = null;

txtArea.setText("");

try {
RandomAccessFile myRandomAccessFile = new RandomAccessFile(s, "r");//open file

while ((nextline = myRandomAccessFile.readLine()) != null) {
txtArea.appendText(nextline + "\n");
}

} catch (Exception e) {
e.printStackTrace();
txtArea.appendText(e.toString());
}
}

}//end class timerT