CarlosScheidecker

asked on

Download the source for an HTML page and forcing it to UTF-8

Hello,

How can I download the source code from a web page and force it to be in UTF-8 format?

If I do this to avoid the https error:

System.setProperty("java.protocol.handler.pkgs","com.sun.net.ssl.internal.www.protocol");
                Security.addProvider(new com.sun.net.ssl.internal.ssl.Provider());

                u = new URL(url);

Then I have a URL object. How can I save its content to the file system and make sure it is always in UTF-8?

Also, if the webpage source says:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />


I wonder if I should do a String replace to change iso-8859-1 to UTF-8.

My problem is that I am trying to use JTidy to convert the page to XML. If the page is in UTF-8 then I am OK; if not, it crashes.

Here is what I am doing:

u = new URL(url);
                  // http://www.enniselectric.com/bid_list/index.htm

                  // Create input and output streams
                  in = new BufferedInputStream(u.openStream());
                  out = new FileOutputStream(outFileName);
                  
                  // Convert files
                  tidy.parse(in, out);

                  // Clean up
                  in.close();
                  out.close();

The error that I have is com.ximpleware.EOFException: permature EOF reached, XML document incomplete

I have tried to add a -1 to the bytes[] array but that crashes.

I also tried to save it in UTF-8, like out = new OutputStreamWriter(new FileOutputStream(fileName),"UTF-8")

I also tried to read it as UTF-8 with: scanner = new Scanner(new FileInputStream(fileName), "UTF-8");

The problem is that when I pass the resulting XML to my VTD parser, it crashes with the exception above.

If the page is originally in UTF-8, then all is good.

How can I download a page regardless of its encoding and force it to be saved in UTF-8 before I send it to JTidy?
CEHJ

I'm wondering why you need to transcode the whole page - why not (at most) just the data of the page you're interested in?
Java Strings are Unicode strings; that's why it works when the page is UTF-8 encoded.
When it is not, you need to know what the original encoding is, then download the page into a byte array and convert it into a String, specifying the original encoding.
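As a rough sketch of the byte-array approach described above (not code from this thread): decode the raw bytes with the page's original charset, then re-encode the resulting String as UTF-8. The class name `Transcode` and the method `toUtf8` are hypothetical, and the sketch assumes the source charset is already known.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Transcode {
    // Decode the raw bytes with the page's original charset,
    // then re-encode the resulting String as UTF-8.
    static byte[] toUtf8(byte[] raw, String sourceCharset) {
        String text = new String(raw, Charset.forName(sourceCharset));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is 4 bytes in ISO-8859-1 and 5 bytes in UTF-8.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = toUtf8(latin1, "ISO-8859-1");
        System.out.println(latin1.length + " -> " + utf8.length); // prints "4 -> 5"
    }
}
```

The UTF-8 bytes could then be written to a file or wrapped in a ByteArrayInputStream before being handed to a parser. Replacing only the charset name in the meta tag would not change the bytes themselves.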
CarlosScheidecker

ASKER

CEHJ,

Exactly. What I have been doing is removing the table tags and doing only those.

Hegemon,

I had the same thought, and I am extracting the encoding from the page. I guess I then have to save it in the proper format and read it with the proper encoding. After that, I extract the parts I am interested in, that is, everything inside a table.

For that, as I am converting to XML, I will use a simple regex to get the character encoding.
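A minimal sketch of such a regex, assuming the charset is declared in a meta tag like the one quoted in the question. The class name `CharsetSniffer`, the method `findCharset`, and the fallback parameter are hypothetical. Since the meta tag itself is plain ASCII, the first part of the page could be decoded as ISO-8859-1 just for this lookup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {
    // Matches charset=... inside a <meta> tag, with or without quotes, ignoring case.
    private static final Pattern CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or the fallback if none is declared.
    static String findCharset(String html, String fallback) {
        Matcher m = CHARSET.matcher(html);
        return m.find() ? m.group(1) : fallback;
    }

    public static void main(String[] args) {
        String html = "<meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=iso-8859-1\" />";
        System.out.println(findCharset(html, "UTF-8")); // prints "iso-8859-1"
    }
}
```

A regex like this is only a heuristic. The HTTP Content-Type response header, when present, may also declare a charset and is worth checking first.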
ASKER CERTIFIED SOLUTION
CEHJ

This solution is only available to members.
CEHJ,

That seems good, but the code does not work. I have created a Maven project, added the httpunit dependency to it, as well as servlet-api and JTidy, then compiled (which works) and tried to run (which fails). The error is:


at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at com.meterware.httpunit.ParsedHTML.<clinit>(ParsedHTML.java:724)
      at com.meterware.httpunit.WebResponse.getReceivedPage(WebResponse.java:1300)
      at com.meterware.httpunit.WebResponse.getFrames(WebResponse.java:1285)
      at com.meterware.httpunit.WebResponse.getFrameRequests(WebResponse.java:1024)
      at com.meterware.httpunit.FrameHolder.updateFrames(FrameHolder.java:179)
      at com.meterware.httpunit.WebWindow.updateFrameContents(WebWindow.java:315)
      at com.meterware.httpunit.WebClient.updateFrameContents(WebClient.java:526)
      at com.meterware.httpunit.WebWindow.updateWindow(WebWindow.java:201)
      at com.meterware.httpunit.WebWindow.getSubframeResponse(WebWindow.java:183)
      at com.meterware.httpunit.WebWindow.getResponse(WebWindow.java:158)
      at com.meterware.httpunit.WebWindow.getResponse(WebWindow.java:125)
      at com.meterware.httpunit.WebClient.getResponse(WebClient.java:96)
      at myapp.extractor.table.TableModel.App.runIt(App.java:23)
      at myapp.extractor.table.TableModel.App.main(App.java:49)
Caused by: java.lang.ClassNotFoundException: org.mozilla.javascript.Scriptable
      at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
You're missing a dependency - probably something like js-1.6R5.jar. I get something like the following output (an extract)
| Job Name | Bid Date & Time |
| Dulles Jet Center | February 21, 2011 @ 2:00 P.M. |
| NIST Cooling Tower Replacement | February 22, 2011 @ 2:00 P.M. |
| Catharpin Irrigation | February 22, 2011 @ 1:00 P.M. |
| Mt. Weather - Emergency Ops Center | February 21, 2011 @ 2:00 P.M. |
| AWG Firing Range, Ft. Meade | February 28, 2011 @ 2:00 P.M |
| Center for Strategic & International Studies | March 7, 2011 @ 2:00 P.M. |
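For the ClassNotFoundException on org.mozilla.javascript.Scriptable diagnosed above, one quick way to confirm whether the missing jar has made it onto the runtime classpath is a small probe like the following. This is only an illustrative sketch; the class name `ClasspathCheck` and the method `isOnClasspath` are hypothetical.

```java
public class ClasspathCheck {
    // Returns true if the named class can be located by this class's loader.
    // The class is looked up without being initialized.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class named in the "Caused by" line of the stack trace.
        System.out.println("Rhino present: "
                + isOnClasspath("org.mozilla.javascript.Scriptable"));
    }
}
```

If this reports false when run with the same classpath as the application, the dependency still needs to be added to the project.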


That is nice, CEHJ. I have changed the code a little so that it prints the actual table in XML format, which is what I want.

And here is the code:


import java.io.IOException;

import org.xml.sax.SAXException;

import org.w3c.dom.Attr;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

import com.meterware.httpunit.*;

public class App 
{
	// Escapes the five XML special characters; a null input yields an empty string.
	public static String escapeXML(String s) {
		StringBuilder str = new StringBuilder();
		int len = (s != null) ? s.length() : 0;
		for (int i = 0; i < len; i++) {
			char ch = s.charAt(i);
			switch (ch) {
			case '<': str.append("&lt;"); break;
			case '>': str.append("&gt;"); break;
			case '&': str.append("&amp;"); break;
			case '"': str.append("&quot;"); break;
			case '\'': str.append("&apos;"); break;
			default: str.append(ch);
			}
		}
		return str.toString();
	}
	
	public static String print(Node node) {
		String result = "";
	    int type = node.getNodeType();
	    switch (type) {
	      case Node.ELEMENT_NODE:
	        result += "<" + node.getNodeName();
	        NamedNodeMap attrs = node.getAttributes();
	        int len = attrs.getLength();
	        for (int i=0; i<len; i++) {
	            Attr attr = (Attr)attrs.item(i);
	            result += " " + attr.getNodeName() + "=\"" +escapeXML(attr.getNodeValue()) + "\"";
	        }
	        result += ">";
	        NodeList children = node.getChildNodes();
	        len = children.getLength();
	        for (int i=0; i<len; i++)
	          result += print(children.item(i));
	        result += "</" + node.getNodeName() + ">";
	        break;
	      case Node.ENTITY_REFERENCE_NODE:
	    	  result += "&" + node.getNodeName() + ";";
	        break;
	      case Node.CDATA_SECTION_NODE:
	    	  result += "<![CDATA[" + node.getNodeValue() + "]]>";
	        break;
	      case Node.TEXT_NODE:
	    	  result += escapeXML(node.getNodeValue());
	        break;
	      case Node.PROCESSING_INSTRUCTION_NODE:
	    	  result += "<?" + node.getNodeName();
	        String data = node.getNodeValue();
	        if (data!=null && data.length()>0)
	        	result += " " + data;
	        result += "?>";
	        break;
	    }
	    return result;
	  }

	
	
	public static void runIt(String url) {
    	HttpUnitOptions.setScriptingEnabled(false);

		WebConversation wc = new WebConversation();
		WebResponse resp;
		try {
			resp = wc.getResponse(url);
			WebTable[] tables = resp.getTables();
			System.out.printf("Found %d table(s) in the response\n", tables.length);
			for (int i = 0; i < tables.length; i++) {
				System.out.println("Table no "+(i+1));
				System.out.println(print(tables[i].getNode()));
				System.out.println("");
			}
		} catch (IOException e) {
			// Network or I/O failure while fetching the page
			e.printStackTrace();
		} catch (SAXException e) {
			// The response could not be parsed
			e.printStackTrace();
		} catch (Exception e) {
			// Anything else, e.g. a runtime failure inside HttpUnit
			e.printStackTrace();
		}
	}
	
    public static void main( String[] args )
    {
    	String url = "http://www.asu.edu/purchasing/bids/";
    	runIt(url);
    }
}


If you just need to strip out the tags then this may help:

http://helpdesk.objects.com.au/java/how-do-i-extract-just-the-text-form-a-html-document-ie-strip-out-all-the-html-tags

Though stripping out the tags will make it a lot harder to convert to XML.
That's good :)
Objects, I am converting to XML above already.
I need the XML. For extracting just the text I like to do it via JTidy.
You could avoid having to manually build the XML, and just do a transformation.
You are right, but in this case it is a requirement.
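For reference, the transformation suggested above could look roughly like this: an identity transform from the JDK's javax.xml.transform package serializes a DOM Node to a String and handles the escaping itself. This is a sketch, not code from the thread; the class name `NodeSerializer` and the method `toXml` are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class NodeSerializer {
    // Serializes a DOM node with an identity transform (no stylesheet).
    static String toXml(Node node) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize node", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny document standing in for a table cell node.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element td = doc.createElement("td");
        td.setTextContent("Bid Date & Time");
        doc.appendChild(td);
        System.out.println(toXml(doc));
    }
}
```

In the App class above, a call such as toXml(tables[i].getNode()) would take the place of the hand-written print method, assuming the node HttpUnit returns is a standard org.w3c.dom.Node.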