Solved

Downloading the source of an HTML page and forcing it to UTF-8

Posted on 2011-03-01
Last Modified: 2012-05-11
Hello,

How can I download the source code from a web page and force it to be in UTF-8 format?

If I do this to avoid the https error:

System.setProperty("java.protocol.handler.pkgs", "com.sun.net.ssl.internal.www.protocol");
Security.addProvider(new com.sun.net.ssl.internal.ssl.Provider());

u = new URL(url);

Then I have a URL object. How can I save its content to the file system and make sure it is always in UTF-8?

Also, if the webpage source says:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />


I wonder whether I should do a String replace to change iso-8859-1 to UTF-8.

My problem is that I am trying to use JTidy to convert the page to XML. If the page is in UTF-8 then I am OK; if not, it crashes.

Here is what I am doing:

u = new URL(url);
// http://www.enniselectric.com/bid_list/index.htm

// Create input and output streams
in = new BufferedInputStream(u.openStream());
out = new FileOutputStream(outFileName);

// Convert files
tidy.parse(in, out);

// Clean up
in.close();
out.close();

The error that I have is com.ximpleware.EOFException: permature EOF reached, XML document incomplete

I have tried to add a -1 to the bytes[] array but that crashes.

I also tried to save it in UTF-8, like this: out = new OutputStreamWriter(new FileOutputStream(fileName),"UTF-8")

I also tried to read it as UTF-8 with: scanner = new Scanner(new FileInputStream(fileName), "UTF-8");

The problem is that when I pass the resulting XML to my VTD parser, it crashes with the exception above.

If the page is in UTF-8 originally, then all is good.

How can I download a page regardless of its encoding and force it to be saved in UTF-8 before I send it to JTidy?
Question by:CarlosScheidecker
13 Comments
 

Expert Comment

by:CEHJ
ID: 35015785
I'm wondering why you need to transcode the whole page - why not (at most) just the data of the page you're interested in?

Expert Comment

by:Hegemon
ID: 35018668
Java Strings are Unicode strings, which is why it works when the page is UTF-8 encoded.
When it is not, you need to know what the original encoding is, then download the page into a byte array and convert it into a String, specifying the original encoding.
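
As a rough illustration of the byte-array approach described above, here is a minimal sketch. The class name Transcoder and the method toUtf8 are made up for this example, and it assumes the source charset name is already known (for instance from the page's meta tag) and that the raw bytes have already been downloaded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JTidy or any library mentioned in this thread.
class Transcoder {

    /** Decodes raw bytes with the page's declared charset and re-encodes them as UTF-8. */
    static byte[] toUtf8(byte[] raw, String sourceCharset) {
        // Decoding with the correct source charset is the important step;
        // once the text is a Java String it can be written out in any encoding.
        String text = new String(raw, Charset.forName(sourceCharset));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "café" encoded as ISO-8859-1: the é is the single byte 0xE9
        byte[] latin1 = {(byte) 0x63, (byte) 0x61, (byte) 0x66, (byte) 0xE9};
        byte[] utf8 = toUtf8(latin1, "ISO-8859-1");
        System.out.println(utf8.length); // prints 5, because é takes two bytes in UTF-8
    }
}
```

The resulting byte array could then be written to a file, or wrapped in a ByteArrayInputStream and handed to the tidying step. Note that the charset declared in the page's meta tag would still say iso-8859-1 unless it is changed as well.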

Author Comment

by:CarlosScheidecker
ID: 35018945
CEHJ,

Exactly. What I have been doing is removing the table tags and doing only those.

Hegemon,

I am extracting the encoding scheme from the page; I thought the same. I guess I then have to save it in the proper format and read it with the proper encoding. Then I extract the parts I am interested in, that is, everything inside a table.

For that, as I am converting to XML, I will use a simple regex to get the character encoding.
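
A simple regex of the kind mentioned above might look like the following sketch. The class name MetaCharset, the method find, and the fallback parameter are hypothetical. It also assumes the page has already been read into a String just for sniffing; decoding the raw bytes as ISO-8859-1 is one way to do that, since every byte maps to a character and the ASCII meta tag survives intact.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
class MetaCharset {

    // Matches charset=... with optional quotes, e.g. in
    // <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the first declared charset name found in the HTML, or the fallback if none is found. */
    static String find(String html, String fallback) {
        Matcher m = CHARSET.matcher(html);
        return m.find() ? m.group(1) : fallback;
    }

    public static void main(String[] args) {
        String html = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\" />";
        System.out.println(find(html, "UTF-8")); // prints iso-8859-1
    }
}
```

This only looks at the markup. A server may also send a charset in the HTTP Content-Type header, which this sketch does not check.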

Accepted Solution

by:
CEHJ earned 500 total points
ID: 35020263
You'd actually be better off using a higher-level API such as HttpUnit (which uses tidying internally). Then you can output the table data in whatever encoding you want:
import com.meterware.httpunit.*;


public class Tables {

	public static void main(String[] args) throws Exception {
		HttpUnitOptions.setScriptingEnabled(false);

		String START = "http://www.enniselectric.com/bid_list/index.htm";
		WebConversation wc = new WebConversation();
		WebResponse resp = wc.getResponse(START);
		WebTable[] tables = resp.getTables();
		System.out.printf("Found %d table(s) in the response\n", tables.length);

		for (int r = 0; r < tables[0].getRowCount(); r++) {
			for (int c = 0; c < tables[0].getColumnCount(); c++) {
				//TableCell cell = tables[0].getTableCell(r, c);
				//System.out.printf("%s ", cell.getText());
				System.out.printf("%s ", tables[0].getCellAsText(r, c).trim());
			}
			System.out.println();
		}
	}
}


Author Comment

by:CarlosScheidecker
ID: 35021534
CEHJ,

That seems good, but the code does not work. I have created a Maven project, added the httpunit dependency to it, as well as servlet-api and JTidy, and then compiled (which works) and tried to run (which fails). The error is:


at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at com.meterware.httpunit.ParsedHTML.<clinit>(ParsedHTML.java:724)
      at com.meterware.httpunit.WebResponse.getReceivedPage(WebResponse.java:1300)
      at com.meterware.httpunit.WebResponse.getFrames(WebResponse.java:1285)
      at com.meterware.httpunit.WebResponse.getFrameRequests(WebResponse.java:1024)
      at com.meterware.httpunit.FrameHolder.updateFrames(FrameHolder.java:179)
      at com.meterware.httpunit.WebWindow.updateFrameContents(WebWindow.java:315)
      at com.meterware.httpunit.WebClient.updateFrameContents(WebClient.java:526)
      at com.meterware.httpunit.WebWindow.updateWindow(WebWindow.java:201)
      at com.meterware.httpunit.WebWindow.getSubframeResponse(WebWindow.java:183)
      at com.meterware.httpunit.WebWindow.getResponse(WebWindow.java:158)
      at com.meterware.httpunit.WebWindow.getResponse(WebWindow.java:125)
      at com.meterware.httpunit.WebClient.getResponse(WebClient.java:96)
      at myapp.extractor.table.TableModel.App.runIt(App.java:23)
      at myapp.extractor.table.TableModel.App.main(App.java:49)
Caused by: java.lang.ClassNotFoundException: org.mozilla.javascript.Scriptable
      at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)

Expert Comment

by:CEHJ
ID: 35021766
You're missing a dependency, probably something like js-1.6R5.jar. I get something like the following output (an extract):
| Job Name 
                   | Bid Date & Time 
                
                

                   | Dulles Jet Center 
                   | February 21, 2011 @ 2:00 P.M. 
                
                

                   | NIST Cooling Tower Replacement
                   | February 22, 2011 @ 2:00 P.M. 
                
                

                   | Catharpin Irrigation 
                   | February 22, 2011 @ 1:00 P.M. 
                
                

                   | Mt. Weather - Emergency Ops Center
                   | February 21, 2011 @ 2:00 P.M. 
                
                

                   | AWG Firing Range, Ft. Meade 
                   | February 28, 2011 @ 2:00 P.M 
                
                

                   | Center for Strategic & International Studies
                   | March 7, 2011 @ 2:00 P.M. 
 
|


Author Comment

by:CarlosScheidecker
ID: 35022362
That is nice, CEHJ. I have changed the code a little so that it prints the actual table in XML format, which is what I want.

Here is the code:


import java.io.IOException;

import org.xml.sax.SAXException;

import org.w3c.dom.Attr;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

import com.meterware.httpunit.*;

public class App 
{
	// Escapes the five XML special characters; a null input yields an empty string.
	public static String escapeXML(String s) {
	    // StringBuilder is sufficient here because the buffer never leaves this method
	    StringBuilder str = new StringBuilder();
	    int len = (s != null) ? s.length() : 0;
	    for (int i=0; i<len; i++) {
	       char ch = s.charAt(i);
	       switch (ch) {
	       case '<': str.append("&lt;"); break;
	       case '>': str.append("&gt;"); break;
	       case '&': str.append("&amp;"); break;
	       case '"': str.append("&quot;"); break;
	       case '\'': str.append("&apos;"); break;
	       default: str.append(ch);
	     }
	    }
	    return str.toString();
	  }
	
	public static String print(Node node) {
		String result = "";
	    int type = node.getNodeType();
	    switch (type) {
	      case Node.ELEMENT_NODE:
	        result += "<" + node.getNodeName();
	        NamedNodeMap attrs = node.getAttributes();
	        int len = attrs.getLength();
	        for (int i=0; i<len; i++) {
	            Attr attr = (Attr)attrs.item(i);
	            result += " " + attr.getNodeName() + "=\"" +escapeXML(attr.getNodeValue()) + "\"";
	        }
	        result += ">";
	        NodeList children = node.getChildNodes();
	        len = children.getLength();
	        for (int i=0; i<len; i++)
	          result += print(children.item(i));
	        result += "</" + node.getNodeName() + ">";
	        break;
	      case Node.ENTITY_REFERENCE_NODE:
	    	  result += "&" + node.getNodeName() + ";";
	        break;
	      case Node.CDATA_SECTION_NODE:
	    	  result += "<![CDATA[" + node.getNodeValue() + "]]>";
	        break;
	      case Node.TEXT_NODE:
	    	  result += escapeXML(node.getNodeValue());
	        break;
	      case Node.PROCESSING_INSTRUCTION_NODE:
	    	  result += "<?" + node.getNodeName();
	        String data = node.getNodeValue();
	        if (data!=null && data.length()>0)
	        	result += " " + data;
	        result += "?>";
	        break;
	    }
	    return result;
	  }

	
	
	public static void runIt(String url) {
    	HttpUnitOptions.setScriptingEnabled(false);

		WebConversation wc = new WebConversation();
		WebResponse resp;
		try {
			resp = wc.getResponse(url);
			WebTable[] tables = resp.getTables();
			System.out.printf("Found %d table(s) in the response\n", tables.length);
			for (int i = 0; i < tables.length; i++) {
				System.out.println("Table no "+(i+1));
				System.out.println(print(tables[i].getNode()));
				System.out.println("");
			}
		} catch (IOException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} catch (SAXException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
	
    public static void main( String[] args )
    {
    	String url = "http://www.asu.edu/purchasing/bids/";
    	runIt(url);
    }
}


Expert Comment

by:objects
ID: 35022378
If you just need to strip out the tags then this may help:

http://helpdesk.objects.com.au/java/how-do-i-extract-just-the-text-form-a-html-document-ie-strip-out-all-the-html-tags

Stripping out the tags will make it a lot harder to convert to XML, though.

Expert Comment

by:CEHJ
ID: 35022386
That's good :)

Author Comment

by:CarlosScheidecker
ID: 35022399
Objects, I am converting to XML above already.

Author Comment

by:CarlosScheidecker
ID: 35022409
I need the XML. To extract just the text, I prefer to use JTidy.

Expert Comment

by:objects
ID: 35022431
You could avoid having to build the XML manually and just do a transformation.
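
One way to read this suggestion is an identity transform from the standard javax.xml.transform API, which serializes a DOM node without hand-written escaping. The sketch below is an interpretation rather than code from the thread. The class NodeSerializer and its methods are hypothetical, and a small parsed document stands in for the Node that tables[i].getNode() returns in the code above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
class NodeSerializer {

    /** Serializes a DOM node to an XML string using an identity transform. */
    static String toXml(Node node) {
        try {
            // newTransformer() with no stylesheet copies the source to the result unchanged
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize node", e);
        }
    }

    /** Parses an XML string into a DOM document; used here only to build sample input. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<table><tr><td>A &amp; B</td></tr></table>");
        // The serializer takes care of escaping special characters such as &
        System.out.println(toXml(doc.getDocumentElement()));
    }
}
```

With this approach the escapeXML and print methods would no longer be needed, and the output encoding can be chosen via OutputKeys.ENCODING when writing to a byte stream.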

Author Comment

by:CarlosScheidecker
ID: 35022675
You are right, but in this case it is a requirement.