Solved

Downloading the source of an HTML page and forcing it to UTF-8

Posted on 2011-03-01
Last Modified: 2012-05-11
Hello,

How can I download the source code from a web page and force it to be in UTF-8 format?

If I do this to avoid the HTTPS error:

System.setProperty("java.protocol.handler.pkgs", "com.sun.net.ssl.internal.www.protocol");
Security.addProvider(new com.sun.net.ssl.internal.ssl.Provider());

u = new URL(url);

Then I have a URL object. How can I save it to the file system and make sure it is always in UTF-8?

Also, if the webpage source says:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />


I wonder if I should do a String replace to replace iso-8859-1 with UTF-8.
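A caveat on the string replace: changing the declaration alone does not change the bytes, so it only makes sense after the page has been decoded with its real charset and is about to be written out as UTF-8. A minimal sketch of that rewrite step, assuming the HTML is already held in a Java String (the class name `MetaCharsetRewriter` and method `declareUtf8` are hypothetical, not part of JTidy or any library):

```java
import java.util.regex.Pattern;

final class MetaCharsetRewriter {

    // Group 1 captures everything from "<meta" up to and including "charset=" (plus an
    // optional opening quote); the charset name that follows is what gets replaced.
    private static final Pattern DECLARED = Pattern.compile(
            "(<meta[^>]*?charset\\s*=\\s*[\"']?)[A-Za-z0-9._:-]+",
            Pattern.CASE_INSENSITIVE);

    /** Rewrites the first declared charset in already-decoded HTML to UTF-8. */
    static String declareUtf8(String html) {
        return DECLARED.matcher(html).replaceFirst("$1UTF-8");
    }

    public static void main(String[] args) {
        String meta = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\" />";
        System.out.println(declareUtf8(meta));
    }
}
```

Input without a charset declaration is returned unchanged.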

My problem is that I am trying to use JTidy to convert it to XML. If the page is in UTF-8 then I am OK; if not, it crashes.

Here is what I am doing:

u = new URL(url);
// http://www.enniselectric.com/bid_list/index.htm

// Create input and output streams
in = new BufferedInputStream(u.openStream());
out = new FileOutputStream(outFileName);

// Convert files
tidy.parse(in, out);

// Clean up
in.close();
out.close();

The error that I have is com.ximpleware.EOFException: permature EOF reached, XML document incomplete

I tried appending a -1 to the byte[] array, but that crashes too.

I also tried to save it in UTF-8 with out = new OutputStreamWriter(new FileOutputStream(fileName),"UTF-8")

I also tried to read it in UTF-8 with: scanner = new Scanner(new FileInputStream(fileName), "UTF-8");

The problem is that when I pass the resulting XML to my VTD parser, it crashes with the exception above.

If the page is in UTF-8 originally, all is good.

How can I download a page regardless of its encoding and force it to be saved in UTF-8 before I send it to JTidy?
Question by:CarlosScheidecker

Expert Comment

by:CEHJ
I'm wondering why you need to transcode the whole page - why not (at most) just the data of the page you're interested in?

Expert Comment

by:Hegemon
Java Strings are Unicode strings, which is why it works when the page is UTF-8 encoded.
When it is not, you need to know the original encoding, then download the page into a byte array and convert it into a String, specifying that original encoding.
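The byte-array approach described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `Utf8Transcoder` and its methods are hypothetical, and the original charset has to be supplied by the caller (for example from the HTTP headers or the page's meta tag).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class Utf8Transcoder {

    /** Reads the stream to its end and returns the raw, undecoded bytes. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    /** Decodes the bytes with the page's original charset, then re-encodes them as UTF-8. */
    static byte[] toUtf8(byte[] raw, Charset original) {
        String text = new String(raw, original);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: java Utf8Transcoder <url> <originalCharset> <outFile>");
            return;
        }
        byte[] raw;
        try (InputStream in = new URL(args[0]).openStream()) {
            raw = readAll(in);
        }
        // The file written here is UTF-8 regardless of how the server encoded the page
        Files.write(Paths.get(args[2]), toUtf8(raw, Charset.forName(args[1])));
    }
}
```

Working on bytes until the charset is known avoids decoding the page with the platform default, which is the usual way such text gets corrupted before a parser ever sees it.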

Author Comment

by:CarlosScheidecker
CEHJ,

Exactly. What I have been doing is extracting the table tags and processing only those.

Hegemon,

I thought the same, and I am extracting the encoding from the page. I guess I then have to read it with that encoding and save it in the proper format. Then I extract the parts I am interested in, that is, everything inside a table.

For that, as I am converting to XML, I will use a simple regex to get the character encoding.
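A regex along those lines might look like the sketch below. It is an assumption-laden illustration, not a full HTML encoding sniffer: the class name `CharsetSniffer`, the 4096-byte scan window, and the fallback parameter are all hypothetical choices. When the server sends a charset in the HTTP Content-Type header, that normally takes precedence over the meta tag.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetSniffer {

    // Matches both <meta http-equiv=... content="text/html; charset=x"> and <meta charset="x">
    private static final Pattern CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared near the top of the page, or the fallback if none is usable. */
    static Charset sniff(byte[] raw, Charset fallback) {
        // ISO-8859-1 maps every byte to exactly one char, so this decode cannot fail
        // and the ASCII markup we search for is preserved whatever the real encoding is.
        int limit = Math.min(raw.length, 4096);
        String head = new String(raw, 0, limit, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: fall through to the fallback
            }
        }
        return fallback;
    }
}
```

The returned Charset can then be used to decode the downloaded bytes before anything is handed to JTidy.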

Accepted Solution

by:CEHJ (earned 500 total points)
You'd actually be better off using a higher-level API such as HttpUnit (which uses tidying internally). Then you can output the table data in whatever encoding you want:
import com.meterware.httpunit.*;


public class Tables {

	public static void main(String[] args) throws Exception {
		// Scripting is not needed just to read table cells
		HttpUnitOptions.setScriptingEnabled(false);

		String START = "http://www.enniselectric.com/bid_list/index.htm";
		WebConversation wc = new WebConversation();
		WebResponse resp = wc.getResponse(START);
		WebTable[] tables = resp.getTables();
		System.out.printf("Found %d table(s) in the response\n", tables.length);
		if (tables.length == 0) {
			return; // nothing to print; avoids indexing an empty array below
		}

		// Print the first table row by row, one trimmed cell at a time
		WebTable table = tables[0];
		for (int r = 0; r < table.getRowCount(); r++) {
			for (int c = 0; c < table.getColumnCount(); c++) {
				//TableCell cell = table.getTableCell(r, c);
				//System.out.printf("%s ", cell.getText());
				System.out.printf("%s ", table.getCellAsText(r, c).trim());
			}
			System.out.println();
		}
	}
}


Author Comment

by:CarlosScheidecker
CEHJ,

That seems good, but the code does not work. I created a Maven project, added the httpunit dependency along with servlet-api and JTidy, then compiled (which works) and tried to run it (which fails). The error is:


at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at java.lang.ClassLoader.defineClass1(Native Method)
      at java.lang.ClassLoader.defineClass(ClassLoader.java:634)
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:277)
      at java.net.URLClassLoader.access$000(URLClassLoader.java:73)
      at java.net.URLClassLoader$1.run(URLClassLoader.java:212)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
      at com.meterware.httpunit.ParsedHTML.<clinit>(ParsedHTML.java:724)
      at com.meterware.httpunit.WebResponse.getReceivedPage(WebResponse.java:1300)
      at com.meterware.httpunit.WebResponse.getFrames(WebResponse.java:1285)
      at com.meterware.httpunit.WebResponse.getFrameRequests(WebResponse.java:1024)
      at com.meterware.httpunit.FrameHolder.updateFrames(FrameHolder.java:179)
      at com.meterware.httpunit.WebWindow.updateFrameContents(WebWindow.java:315)
      at com.meterware.httpunit.WebClient.updateFrameContents(WebClient.java:526)
      at com.meterware.httpunit.WebWindow.updateWindow(WebWindow.java:201)
      at com.meterware.httpunit.WebWindow.getSubframeResponse(WebWindow.java:183)
      at com.meterware.httpunit.WebWindow.getResponse(WebWindow.java:158)
      at com.meterware.httpunit.WebWindow.getResponse(WebWindow.java:125)
      at com.meterware.httpunit.WebClient.getResponse(WebClient.java:96)
      at myapp.extractor.table.TableModel.App.runIt(App.java:23)
      at myapp.extractor.table.TableModel.App.main(App.java:49)
Caused by: java.lang.ClassNotFoundException: org.mozilla.javascript.Scriptable
      at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
      at java.security.AccessController.doPrivileged(Native Method)
      at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
      at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
      at java.lang.ClassLoader.loadClass(ClassLoader.java:266)

Expert Comment

by:CEHJ
You're missing a dependency - probably something like js-1.6R5.jar. I get something like the following output (an extract)
| Job Name | Bid Date & Time
| Dulles Jet Center | February 21, 2011 @ 2:00 P.M.
| NIST Cooling Tower Replacement | February 22, 2011 @ 2:00 P.M.
| Catharpin Irrigation | February 22, 2011 @ 1:00 P.M.
| Mt. Weather - Emergency Ops Center | February 21, 2011 @ 2:00 P.M.
| AWG Firing Range, Ft. Meade | February 28, 2011 @ 2:00 P.M
| Center for Strategic & International Studies | March 7, 2011 @ 2:00 P.M.
|
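The ClassNotFoundException for org.mozilla.javascript.Scriptable in the trace above points at a missing jar on the runtime classpath. One quick way to confirm that before rerunning the whole program is a small probe like this sketch (the class name `ClasspathCheck` and method `isOnClasspath` are hypothetical helpers, not part of HttpUnit):

```java
final class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String rhino = "org.mozilla.javascript.Scriptable";
        System.out.println(rhino + " available: " + isOnClasspath(rhino));
    }
}
```

Run it with the same classpath as the failing application; if it reports false, the dependency is not being picked up at runtime even though compilation succeeded.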


Author Comment

by:CarlosScheidecker
That is nice, CEHJ. I have changed the code a little so that it prints the actual table in XML format, which is what I want.

And here is the code:


import java.io.IOException;

import org.xml.sax.SAXException;

import org.w3c.dom.Attr;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

import com.meterware.httpunit.*;

public class App 
{
	// Escapes the five XML special characters; a null input yields an empty string
	public static String escapeXML(String s) {
		StringBuilder str = new StringBuilder();
		int len = (s != null) ? s.length() : 0;
		for (int i = 0; i < len; i++) {
			char ch = s.charAt(i);
			switch (ch) {
			case '<': str.append("&lt;"); break;
			case '>': str.append("&gt;"); break;
			case '&': str.append("&amp;"); break;
			case '"': str.append("&quot;"); break;
			case '\'': str.append("&apos;"); break;
			default: str.append(ch);
			}
		}
		return str.toString();
	}
	
	public static String print(Node node) {
		String result = "";
	    int type = node.getNodeType();
	    switch (type) {
	      case Node.ELEMENT_NODE:
	        result += "<" + node.getNodeName();
	        NamedNodeMap attrs = node.getAttributes();
	        int len = attrs.getLength();
	        for (int i=0; i<len; i++) {
	            Attr attr = (Attr)attrs.item(i);
	            result += " " + attr.getNodeName() + "=\"" +escapeXML(attr.getNodeValue()) + "\"";
	        }
	        result += ">";
	        NodeList children = node.getChildNodes();
	        len = children.getLength();
	        for (int i=0; i<len; i++)
	          result += print(children.item(i));
	        result += "</" + node.getNodeName() + ">";
	        break;
	      case Node.ENTITY_REFERENCE_NODE:
	    	  result += "&" + node.getNodeName() + ";";
	        break;
	      case Node.CDATA_SECTION_NODE:
	    	  result += "<![CDATA[" + node.getNodeValue() + "]]>";
	        break;
	      case Node.TEXT_NODE:
	    	  result += escapeXML(node.getNodeValue());
	        break;
	      case Node.PROCESSING_INSTRUCTION_NODE:
	    	  result += "<?" + node.getNodeName();
	        String data = node.getNodeValue();
	        if (data!=null && data.length()>0)
	        	result += " " + data;
	        result += "?>";
	        break;
	    }
	    return result;
	  }

	
	
	public static void runIt(String url) {
		HttpUnitOptions.setScriptingEnabled(false);

		WebConversation wc = new WebConversation();
		try {
			WebResponse resp = wc.getResponse(url);
			WebTable[] tables = resp.getTables();
			System.out.printf("Found %d table(s) in the response\n", tables.length);
			// Serialize each table's DOM node as XML
			for (int i = 0; i < tables.length; i++) {
				System.out.println("Table no " + (i + 1));
				System.out.println(print(tables[i].getNode()));
				System.out.println("");
			}
		} catch (IOException e) {
			e.printStackTrace();
		} catch (SAXException e) {
			e.printStackTrace();
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
	
    public static void main( String[] args )
    {
    	String url = "http://www.asu.edu/purchasing/bids/";
    	runIt(url);
    }
}


Expert Comment

by:objects
If you just need to strip out the tags, then this may help:

http://helpdesk.objects.com.au/java/how-do-i-extract-just-the-text-form-a-html-document-ie-strip-out-all-the-html-tags

Stripping out the tags will make it a lot harder to convert to XML, though.

Expert Comment

by:CEHJ
That's good :)

Author Comment

by:CarlosScheidecker
Objects, I am converting to XML above already.

Author Comment

by:CarlosScheidecker
I need the XML. For extracting just the text I like to do it via JTidy.

Expert Comment

by:objects
You could avoid having to build the XML manually and just do a transformation.
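For illustration, such a transformation could be an identity transform from the standard javax.xml.transform API, which serializes a DOM node and handles escaping itself. This is a sketch under assumptions: the class name `NodeSerializer` and the `demo` helper are hypothetical, and the sample DOM is built by hand here rather than taken from HttpUnit.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class NodeSerializer {

    /** Serializes a DOM node to an XML string using an identity transform. */
    static String toXml(Node node) {
        try {
            // newTransformer() with no stylesheet copies the source to the result unchanged
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("could not serialize node", e);
        }
    }

    /** Builds a tiny table by hand (a stand-in for a real page's table) and serializes it. */
    static String demo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element table = doc.createElement("table");
            Element row = doc.createElement("tr");
            Element cell = doc.createElement("td");
            cell.setTextContent("A & B"); // the serializer escapes the ampersand
            row.appendChild(cell);
            table.appendChild(row);
            return toXml(table);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the code posted earlier in the thread, `toXml(tables[i].getNode())` could presumably stand in for the hand-written `print` method, since `getNode()` is already used there to obtain the table's DOM node.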

Author Comment

by:CarlosScheidecker
You are right, but in this case it is a requirement.