Go Premium for a chance to win a PS4. Enter to Win

x
?
Solved

Download the source for an HTML page and forcing it to UTF-8

Posted on 2011-03-01
13
Medium Priority
?
431 Views
Last Modified: 2012-05-11
Hello,

How can I download the source code from a web page and force it to be in UTF-8 format?

If I do this to avoid the https error:

System.setProperty("java.protocol.handler.pkgs","com.sun.net.ssl.internal.www.protocol");
                Security.addProvider(new com.sun.net.ssl.internal.ssl.Provider());

                u = new URL(url);

Than I have an URL object. How can I save that to the file system and make sure it is on UTF-8 all the time.

Also, if the webpage source says:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />


I wonder if I should do a String replace to replace iso-8859-1 to UTF-8.

My problem is that I am trying to use JTidy to convert it to XML. If the page is on UTF-8 then I am OK, if not it crashes.

Here is what I am doing:

u = new URL(url);
                  // http://www.enniselectric.com/bid_list/index.htm

                  // Create input and output streams
                  in = new BufferedInputStream(u.openStream());
                  out = new FileOutputStream(outFileName);
                  
                  // Convert files
                  tidy.parse(in, out);

                  // Clean up
                  in.close();
                  out.close();

The error that I have is com.ximpleware.EOFException: permature EOF reached, XML document incomplete

I have tried to add a -1 to the bytes[] array but that crashes.

I also tried to save it on UTF-8 like  out = new OutputStreamWriter(new FileOutputStream(fileName),"UTF-8")

I also tried to read it in UTF-8 doing: scanner = new Scanner(new FileInputStream(fileName), "UTF-8");

The problem is that the resulting XML that it creates if I pass it to my VTD parser it crashes due to the exception above.

Now if the page is on UTF-8 originally then all is good.

How can I download a page regardless of its enconding and force it to save in UTF-8 before I send it to JTidy?
0
Comment
Question by:CarlosScheidecker
  • 6
  • 4
  • 2
  • +1
13 Comments
 
LVL 86

Expert Comment

by:CEHJ
ID: 35015785
I'm wondering why you need to transcode the whole page - why not (at most) just the data of the page you're interested in?
0
 
LVL 10

Expert Comment

by:Hegemon
ID: 35018668
Java Strings are Unicode strings, that's why it is working when the page is UTF-8 encoded.
When it is not, you need to know what the original encoding is, then download the page into a byte array and convert into String specifying the original encoding.
0
 
LVL 1

Author Comment

by:CarlosScheidecker
ID: 35018945
CEHJ,

Exactly. What I have been doing is removing the table tags and doing only those.

Hegemon,

I am extracting the encoding schema from the page, I thought the same. I guess, then I have to save it in the proper format and read it with the proper encoding. Then, I extract its parts I am interested on, that is everything inside a table.

For that, as I am converting to XML, I will do an easy regex so that I can get the char encoding.
0
VIDEO: THE CONCERTO CLOUD FOR HEALTHCARE

Modern healthcare requires a modern cloud. View this brief video to understand how the Concerto Cloud for Healthcare can help your organization.

 
LVL 86

Accepted Solution

by:
CEHJ earned 2000 total points
ID: 35020263
You'd actually be better off using a higher level API such as HttpUnit (which uses tidying internally), then you can output the table data in whatever encoding you want
import com.meterware.httpunit.*;


public class Tables {

	public static void main(String[] args) throws Exception {
		HttpUnitOptions.setScriptingEnabled(false);

		String START = "http://www.enniselectric.com/bid_list/index.htm";
		WebConversation wc = new WebConversation();
		WebResponse resp = wc.getResponse(START);
		WebTable[] tables = resp.getTables();
		System.out.printf("Found %d table(s) in the response\n", tables.length);

		for (int r = 0; r < tables[0].getRowCount(); r++) {
			for (int c = 0; c < tables[0].getColumnCount(); c++) {
				//TableCell cell = tables[0].getTableCell(r, c);
				//System.out.printf("%s ", cell.getText());
				System.out.printf("%s ", tables[0].getCellAsText(r, c).trim());
			}
			System.out.println();
		}
	}
}

Open in new window

0
 
LVL 1

Author Comment

by:CarlosScheidecker
ID: 35021534
CEHJ,

That seems good bu the code does not work. I have created a Maven project, added the httpunit dependency to it, also the servlet-api and Jtidy and then complied (which wokrs) and tried to run (which fails). The error is:


at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at com.meterware.httpunit.ParsedHTML.<clinit>(ParsedHTML.java:724)
      at com.meterware.httpunit.WebResponse.getReceivedPage(WebResponse.java:1300)
      at com.meterware.httpunit.WebResponse.getFrames(WebResponse.java:1285)
      at com.meterware.httpunit.WebResponse.getFrameRequests(WebResponse.java:1024)
      at com.meterware.httpunit.FrameHolder.updateFrames(FrameHolder.java:179)
      at com.meterware.httpunit.WebWindow.updateFrameContents(WebWindow.java:315)
      at com.meterware.httpunit.WebClient.updateFrameContents(WebClient.java:526)
      at com.meterware.httpunit.WebWindow.updateWindow(WebWindow.java:201)
      at com.meterware.httpunit.WebWindow.getSubframeResponse(WebWindow.java:183)
      at com.meterware.httpunit.WebWindow.getResponse(WebWindow.java:158)
      at com.meterware.httpunit.WebWindow.getResponse(WebWindow.java:125)
      at com.meterware.httpunit.WebClient.getResponse(WebClient.java:96)
      at myapp.extractor.table.TableModel.App.runIt(App.java:23)
      at myapp.extractor.table.TableModel.App.main(App.java:49)
Caused by: java.lang.ClassNotFoundException: org.mozilla.javascript.Scriptable
      at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 35021766
You're missing a dependency - probably something like js-1.6R5.jar. I get something like the following output (an extract)
| Job Name 
                   | Bid Date & Time 
                
                

                   | Dulles Jet Center 
                   | February 21, 2011 @ 2:00 P.M. 
                
                

                   | NIST Cooling Tower Replacement
                   | February 22, 2011 @ 2:00 P.M. 
                
                

                   | Catharpin Irrigation 
                   | February 22, 2011 @ 1:00 P.M. 
                
                

                   | Mt. Weather - Emergency Ops Center
                   | February 21, 2011 @ 2:00 P.M. 
                
                

                   | AWG Firing Range, Ft. Meade 
                   | February 28, 2011 @ 2:00 P.M 
                
                

                   | Center for Strategic & International Studies
                   | March 7, 2011 @ 2:00 P.M. 
 
|

Open in new window

0
 
LVL 1

Author Comment

by:CarlosScheidecker
ID: 35022362
That is nice CEHJ, I have changed the code a little bit so that it would print the actual table in XML format which is what I want.

And here is the code:


import java.io.IOException;

import org.xml.sax.SAXException;

import org.w3c.dom.Attr;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

import com.meterware.httpunit.*;

public class App 
{
	public static String escapeXML(String s) {
	    StringBuffer str = new StringBuffer();
	    int len = (s != null) ? s.length() : 0;
	    for (int i=0; i<len; i++) {
	       char ch = s.charAt(i);
	       switch (ch) {
	       case '<': str.append("&lt;"); break;
	       case '>': str.append("&gt;"); break;
	       case '&': str.append("&amp;"); break;
	       case '"': str.append("&quot;"); break;
	       case '\'': str.append("&apos;"); break;
	       default: str.append(ch);
	     }
	    }
	    return str.toString();
	  }
	
	public static String print(Node node) {
		String result = "";
	    int type = node.getNodeType();
	    switch (type) {
	      case Node.ELEMENT_NODE:
	        result += "<" + node.getNodeName();
	        NamedNodeMap attrs = node.getAttributes();
	        int len = attrs.getLength();
	        for (int i=0; i<len; i++) {
	            Attr attr = (Attr)attrs.item(i);
	            result += " " + attr.getNodeName() + "=\"" +escapeXML(attr.getNodeValue()) + "\"";
	        }
	        result += ">";
	        NodeList children = node.getChildNodes();
	        len = children.getLength();
	        for (int i=0; i<len; i++)
	          result += print(children.item(i));
	        result += "</" + node.getNodeName() + ">";
	        break;
	      case Node.ENTITY_REFERENCE_NODE:
	    	  result += "&" + node.getNodeName() + ";";
	        break;
	      case Node.CDATA_SECTION_NODE:
	    	  result += "<![CDATA[" + node.getNodeValue() + "]]>";
	        break;
	      case Node.TEXT_NODE:
	    	  result += escapeXML(node.getNodeValue());
	        break;
	      case Node.PROCESSING_INSTRUCTION_NODE:
	    	  result += "<?" + node.getNodeName();
	        String data = node.getNodeValue();
	        if (data!=null && data.length()>0)
	        	result += " " + data;
	        result += "?>";
	        break;
	    }
	    return result;
	  }

	
	
	public static void runIt(String url) {
    	HttpUnitOptions.setScriptingEnabled(false);

		WebConversation wc = new WebConversation();
		WebResponse resp;
		try {
			resp = wc.getResponse(url);
			WebTable[] tables = resp.getTables();
			System.out.printf("Found %d table(s) in the response\n", tables.length);
			for (int i = 0; i < tables.length; i++) {
				System.out.println("Table no "+(i+1));
				System.out.println(print(tables[i].getNode()));
				System.out.println("");
			}
		} catch (IOException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} catch (SAXException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
	
    public static void main( String[] args )
    {
    	String url = "http://www.asu.edu/purchasing/bids/";
    	runIt(url);
    }
}

Open in new window

0
 
LVL 92

Expert Comment

by:objects
ID: 35022378
if you just need to strip out the tags then this may help

http://helpdesk.objects.com.au/java/how-do-i-extract-just-the-text-form-a-html-document-ie-strip-out-all-the-html-tags

though striping out the tags will make it a lot harder to convert to xml.
0
 
LVL 86

Expert Comment

by:CEHJ
ID: 35022386
That's good :)
0
 
LVL 1

Author Comment

by:CarlosScheidecker
ID: 35022399
Objects, I am converting to XML above already.
0
 
LVL 1

Author Comment

by:CarlosScheidecker
ID: 35022409
I need the XML. For extracting just the text I like to do it via JTidy.
0
 
LVL 92

Expert Comment

by:objects
ID: 35022431
you could avoid having to manually build the xml, and just do a transformation
0
 
LVL 1

Author Comment

by:CarlosScheidecker
ID: 35022675
You are right, but in this case is a requirement.
0

Featured Post

How to Use the Help Bell

Need to boost the visibility of your question for solutions? Use the Experts Exchange Help Bell to confirm priority levels and contact subject-matter experts for question attention.  Check out this how-to article for more information.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Introduction This article is the second of three articles that explain why and how the Experts Exchange QA Team does test automation for our web site. This article covers the basic installation and configuration of the test automation tools used by…
Introduction This article is the last of three articles that explain why and how the Experts Exchange QA Team does test automation for our web site. This article covers our test design approach and then goes through a simple test case example, how …
Viewers learn about the “for” loop and how it works in Java. By comparing it to the while loop learned before, viewers can make the transition easily. You will learn about the formatting of the for loop as we write a program that prints even numbers…
This tutorial explains how to use the VisualVM tool for the Java platform application. This video goes into detail on the Threads, Sampler, and Profiler tabs.
Suggested Courses

963 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question