• Status: Solved

Downloading the source of an HTML page and forcing it to UTF-8

Hello,

How can I download the source code from a web page and force it to be in UTF-8 format?

If I do this to avoid the https error:

System.setProperty("java.protocol.handler.pkgs","com.sun.net.ssl.internal.www.protocol");
                Security.addProvider(new com.sun.net.ssl.internal.ssl.Provider());

                u = new URL(url);

Then I have a URL object. How can I save it to the file system and make sure it is always in UTF-8?

Also, if the webpage source says:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />


I wonder if I should do a String replace to change iso-8859-1 to UTF-8.

My problem is that I am trying to use JTidy to convert the page to XML. If the page is in UTF-8 then I am OK; if not, it crashes.

Here is what I am doing:

u = new URL(url);
                  // http://www.enniselectric.com/bid_list/index.htm

                  // Create input and output streams
                  in = new BufferedInputStream(u.openStream());
                  out = new FileOutputStream(outFileName);
                  
                  // Convert files
                  tidy.parse(in, out);

                  // Clean up
                  in.close();
                  out.close();

The error that I get is com.ximpleware.EOFException: permature EOF reached, XML document incomplete

I have tried to add a -1 to the bytes[] array but that crashes.

I also tried to save it as UTF-8 with out = new OutputStreamWriter(new FileOutputStream(fileName),"UTF-8")

I also tried to read it as UTF-8 with scanner = new Scanner(new FileInputStream(fileName), "UTF-8");

The problem is that when I pass the resulting XML to my VTD parser, it crashes with the exception above.

If the page is in UTF-8 originally, then all is good.

How can I download a page regardless of its encoding and force it to be saved in UTF-8 before I send it to JTidy?
Asked:
CarlosScheidecker
 
CEHJ commented:
I'm wondering why you need to transcode the whole page - why not (at most) just the data of the page you're interested in?
 
Hegemon commented:
Java Strings are Unicode strings, which is why it works when the page is UTF-8 encoded.
When it is not, you need to know what the original encoding is, then download the page into a byte array and convert it into a String, specifying the original encoding.
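
The approach described above could be sketched roughly as follows. This is only an illustration, not code from the thread: the class name Transcoder and the methods readAll and toUtf8 are made up for the example, and it assumes the original encoding name (here "ISO-8859-1") is already known.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: names are illustrative, not from any library.
class Transcoder {

    // Reads the whole stream into a byte array without interpreting the bytes.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    // Decodes the raw bytes using the page's original charset,
    // then re-encodes the resulting String as UTF-8.
    static byte[] toUtf8(byte[] raw, String sourceCharset) {
        String text = new String(raw, Charset.forName(sourceCharset));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: four bytes, the last one is 0xE9.
        byte[] latin1 = {(byte) 0x63, (byte) 0x61, (byte) 0x66, (byte) 0xE9};
        byte[] utf8 = toUtf8(latin1, "ISO-8859-1");
        // 5: the single byte 0xE9 becomes the two bytes 0xC3 0xA9
        System.out.println(utf8.length);
    }
}
```

In the asker's setup, the bytes from readAll(u.openStream()) would go through toUtf8 and the result would then be written to the file or handed to JTidy. The charset declared in the page's meta tag would also need updating, otherwise the declaration and the actual bytes disagree.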
 
CarlosScheidecker (Author) commented:
CEHJ,

Exactly. What I have been doing is pulling out the table tags and processing only those.

Hegemon,

I am extracting the encoding scheme from the page; I thought the same. I guess I then have to save it in the proper format and read it with the proper encoding. Then I extract the parts I am interested in, that is, everything inside a table.

For that, as I am converting to XML, I will write a simple regex to get the character encoding.
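
A regex along those lines might look like the sketch below. It is an assumption about what such a regex could be, not the one actually used; CharsetSniffer and detect are hypothetical names. It only looks at the meta tag, so an encoding sent in the HTTP Content-Type header would need separate handling. The input can be obtained by decoding the first few kilobytes of the page as ISO-8859-1, which leaves the ASCII markup readable whatever the real encoding is.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names are illustrative, not from any library.
class CharsetSniffer {

    // Finds charset=... inside a <meta> tag, covering both
    //   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
    // and the shorter <meta charset="utf-8"> form.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-:.]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or the fallback if none is declared.
    static String detect(String html, String fallback) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1) : fallback;
    }

    public static void main(String[] args) {
        String head = "<meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=iso-8859-1\" />";
        System.out.println(detect(head, "UTF-8")); // prints iso-8859-1
    }
}
```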
 
CEHJ commented:
You'd actually be better off using a higher-level API such as HttpUnit (which uses tidying internally); then you can output the table data in whatever encoding you want:
import com.meterware.httpunit.*;


public class Tables {

	public static void main(String[] args) throws Exception {
		HttpUnitOptions.setScriptingEnabled(false);

		String START = "http://www.enniselectric.com/bid_list/index.htm";
		WebConversation wc = new WebConversation();
		WebResponse resp = wc.getResponse(START);
		WebTable[] tables = resp.getTables();
		System.out.printf("Found %d table(s) in the response\n", tables.length);

		for (int r = 0; r < tables[0].getRowCount(); r++) {
			for (int c = 0; c < tables[0].getColumnCount(); c++) {
				//TableCell cell = tables[0].getTableCell(r, c);
				//System.out.printf("%s ", cell.getText());
				System.out.printf("%s ", tables[0].getCellAsText(r, c).trim());
			}
			System.out.println();
		}
	}
}

 
CarlosScheidecker (Author) commented:
CEHJ,

That seems good but the code does not work. I have created a Maven project, added the httpunit dependency to it, along with servlet-api and JTidy, and then compiled (which works) and tried to run it (which fails). The error is:


at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at com.meterware.httpunit.ParsedHTML.<clinit>(ParsedHTML.java:724)
      at com.meterware.httpunit.WebResponse.getReceivedPage(WebResponse.java:1300)
      at com.meterware.httpunit.WebResponse.getFrames(WebResponse.java:1285)
      at com.meterware.httpunit.WebResponse.getFrameRequests(WebResponse.java:1024)
      at com.meterware.httpunit.FrameHolder.updateFrames(FrameHolder.java:179)
      at com.meterware.httpunit.WebWindow.updateFrameContents(WebWindow.java:315)
      at com.meterware.httpunit.WebClient.updateFrameContents(WebClient.java:526)
      at com.meterware.httpunit.WebWindow.updateWindow(WebWindow.java:201)
      at com.meterware.httpunit.WebWindow.getSubframeResponse(WebWindow.java:183)
      at com.meterware.httpunit.WebWindow.getResponse(WebWindow.java:158)
      at com.meterware.httpunit.WebWindow.getResponse(WebWindow.java:125)
      at com.meterware.httpunit.WebClient.getResponse(WebClient.java:96)
      at myapp.extractor.table.TableModel.App.runIt(App.java:23)
      at myapp.extractor.table.TableModel.App.main(App.java:49)
Caused by: java.lang.ClassNotFoundException: org.mozilla.javascript.Scriptable
      at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
 
CEHJ commented:
You're missing a dependency - probably something like js-1.6R5.jar. I get something like the following output (an extract):
| Job Name | Bid Date & Time |
| Dulles Jet Center | February 21, 2011 @ 2:00 P.M. |
| NIST Cooling Tower Replacement | February 22, 2011 @ 2:00 P.M. |
| Catharpin Irrigation | February 22, 2011 @ 1:00 P.M. |
| Mt. Weather - Emergency Ops Center | February 21, 2011 @ 2:00 P.M. |
| AWG Firing Range, Ft. Meade | February 28, 2011 @ 2:00 P.M |
| Center for Strategic & International Studies | March 7, 2011 @ 2:00 P.M. |
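
One quick way to confirm whether the missing class is visible at run time is a small check like the one below. It is only a diagnostic sketch; ClasspathCheck and isAvailable are made-up names, and the class name passed in is the one reported in the ClassNotFoundException above.

```java
// Hypothetical diagnostic: names are illustrative, not from any library.
class ClasspathCheck {

    // Returns true if the named class can be located by this class's
    // class loader. The class is looked up but not initialized.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Reports whether the class HttpUnit failed to load is on the classpath.
        System.out.println("Scriptable available: "
                + isAvailable("org.mozilla.javascript.Scriptable"));
    }
}
```

If this prints false when run with the same classpath as the application, the jar providing that class still needs to be added as a dependency.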

 
CarlosScheidecker (Author) commented:
That is nice, CEHJ. I have changed the code a little so that it prints the actual table in XML format, which is what I want.

And here is the code:


import java.io.IOException;

import org.xml.sax.SAXException;

import org.w3c.dom.Attr;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

import com.meterware.httpunit.*;

public class App 
{
	public static String escapeXML(String s) {
	    StringBuilder str = new StringBuilder();
	    int len = (s != null) ? s.length() : 0;
	    for (int i=0; i<len; i++) {
	       char ch = s.charAt(i);
	       switch (ch) {
	       case '<': str.append("&lt;"); break;
	       case '>': str.append("&gt;"); break;
	       case '&': str.append("&amp;"); break;
	       case '"': str.append("&quot;"); break;
	       case '\'': str.append("&apos;"); break;
	       default: str.append(ch);
	     }
	    }
	    return str.toString();
	  }
	
	public static String print(Node node) {
		String result = "";
	    int type = node.getNodeType();
	    switch (type) {
	      case Node.ELEMENT_NODE:
	        result += "<" + node.getNodeName();
	        NamedNodeMap attrs = node.getAttributes();
	        int len = attrs.getLength();
	        for (int i=0; i<len; i++) {
	            Attr attr = (Attr)attrs.item(i);
	            result += " " + attr.getNodeName() + "=\"" +escapeXML(attr.getNodeValue()) + "\"";
	        }
	        result += ">";
	        NodeList children = node.getChildNodes();
	        len = children.getLength();
	        for (int i=0; i<len; i++)
	          result += print(children.item(i));
	        result += "</" + node.getNodeName() + ">";
	        break;
	      case Node.ENTITY_REFERENCE_NODE:
	    	  result += "&" + node.getNodeName() + ";";
	        break;
	      case Node.CDATA_SECTION_NODE:
	    	  result += "<![CDATA[" + node.getNodeValue() + "]]>";
	        break;
	      case Node.TEXT_NODE:
	    	  result += escapeXML(node.getNodeValue());
	        break;
	      case Node.PROCESSING_INSTRUCTION_NODE:
	    	  result += "<?" + node.getNodeName();
	        String data = node.getNodeValue();
	        if (data!=null && data.length()>0)
	        	result += " " + data;
	        result += "?>";
	        break;
	    }
	    return result;
	  }

	
	
	public static void runIt(String url) {
    	HttpUnitOptions.setScriptingEnabled(false);

		WebConversation wc = new WebConversation();
		WebResponse resp;
		try {
			resp = wc.getResponse(url);
			WebTable[] tables = resp.getTables();
			System.out.printf("Found %d table(s) in the response\n", tables.length);
			for (int i = 0; i < tables.length; i++) {
				System.out.println("Table no "+(i+1));
				System.out.println(print(tables[i].getNode()));
				System.out.println("");
			}
		} catch (IOException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} catch (SAXException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
	
    public static void main( String[] args )
    {
    	String url = "http://www.asu.edu/purchasing/bids/";
    	runIt(url);
    }
}

 
objects commented:
If you just need to strip out the tags then this may help:

http://helpdesk.objects.com.au/java/how-do-i-extract-just-the-text-form-a-html-document-ie-strip-out-all-the-html-tags

Note, though, that stripping out the tags will make it a lot harder to convert to XML.
 
CEHJ commented:
That's good :)
 
CarlosScheidecker (Author) commented:
Objects, I am converting to XML above already.
 
CarlosScheidecker (Author) commented:
I need the XML. For extracting just the text I like to do it via JTidy.
 
objects commented:
You could avoid having to build the XML manually, and just do a transformation.
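
For example, an identity transformation with the standard javax.xml.transform API can serialize a DOM Node (such as the one returned by tables[i].getNode() in the code above) without hand-written escaping. This is a minimal sketch under that assumption; NodeSerializer, toXml and demo are illustrative names, and the element built in demo is just sample data.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper: names are illustrative, not from any library.
class NodeSerializer {

    // Serializes a DOM node to an XML string using an identity transform.
    // Escaping of &, < and > in text and attributes is handled by the serializer.
    static String toXml(Node node) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize node", e);
        }
    }

    // Builds a one-element sample document and serializes it.
    static String demo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element td = doc.createElement("td");
            td.setTextContent("Bid Date & Time");
            doc.appendChild(td);
            return toXml(td);
        } catch (Exception e) {
            throw new IllegalStateException("Could not build sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

When writing to a file instead of a String, passing an OutputStream to StreamResult lets the ENCODING property control the bytes that are actually written.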
 
CarlosScheidecker (Author) commented:
You are right, but in this case it is a requirement.