HTML to XHTML

How would I convert an HTML string to XHTML before parsing it? The code below works fine, but if I remove the quotes around 'white' the markup is no longer well-formed XML and the parse fails. Would Tidy do the job? If so, how would I code it?

<%@ page import="java.io.*,java.net.*,java.text.*,java.util.*,javax.xml.parsers.*,javax.xml.xpath.*,org.w3c.dom.*,org.xml.sax.*" %>
<%
// Sample markup; it parses only because it happens to be well-formed XML.
String htm;

htm = "<html>" +
      "<body bgcolor='white'>" +
      "<head>" +
      "<title>Hello World</title>" +
      "</head>" +
      "</body>" +
      "</html>";

// Parse the string with the standard (non-validating) XML parser.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
factory.setIgnoringElementContentWhitespace(true);
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(htm)));
document.getDocumentElement().normalize();

// Select the text of every <title> element.
XPath xpath = XPathFactory.newInstance().newXPath();
NodeList nodeList = (NodeList) xpath.evaluate("//title/text()", document, XPathConstants.NODESET);

if (nodeList.getLength() > 0) {
  for (int i = 0; i < nodeList.getLength(); i++) {
    // getNodeValue() returns the text itself; toString() on a DOM node is implementation-specific.
    out.print(nodeList.item(i).getNodeValue());
  }
} else {
  out.print("not found");
}
%>
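
The failure described above can be reproduced outside the JSP with only the JDK's XML parser. The following is a minimal sketch; the class name WellFormedCheck and the helper parsesAsXml are made up for illustration. It shows that the quoted attribute parses, while the unquoted one is rejected as not well-formed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original page.
class WellFormedCheck {

    // Returns true when the markup parses as XML, false when the parser rejects it.
    static boolean parsesAsXml(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors but keeps the parser from logging them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Well-formedness violations surface as SAXParseException, a SAXException subclass.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String quoted = "<html><body bgcolor='white'><head><title>Hello World</title></head></body></html>";
        String unquoted = "<html><body bgcolor=white><head><title>Hello World</title></head></body></html>";

        System.out.println("quoted:   " + parsesAsXml(quoted));   // prints "quoted:   true"
        System.out.println("unquoted: " + parsesAsXml(unquoted)); // prints "unquoted: false"
    }
}
```

So the XML parser itself cannot be told to accept the unquoted form; the markup has to be repaired before it reaches builder.parse.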

CEHJ

>>Would Tidy do the job?

Yes. JTidy's Tidy class has a setXHTML(boolean) setter that tells it to emit XHTML. See:

http://jtidy.sourceforge.net/apidocs/org/w3c/tidy/Tidy.html#setXHTML(boolean)
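
As a rough sketch of how that setter might be used (this is not the accepted solution from the thread), the snippet below cleans the HTML with JTidy first and then hands the result to the same DocumentBuilder/XPath code as in the question. It assumes the JTidy jar is on the classpath. The class name TidyToXhtml, the helper toXhtml, and the sample markup (with head moved before body) are illustrative. Omitting the DOCTYPE is a choice made here so the XML parser does not try to fetch the XHTML DTD. The sample is plain ASCII; for other input the encoding settings, which differ between JTidy versions, would need attention.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;   // third-party: JTidy
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
class TidyToXhtml {

    // Runs JTidy over the input and returns the cleaned-up XHTML as a string.
    static String toXhtml(String html) {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);          // emit XHTML (quoted attributes, closed tags)
        tidy.setQuiet(true);          // no summary banner
        tidy.setShowWarnings(false);  // no per-problem warnings
        tidy.setDocType("omit");      // assumption: skip the DOCTYPE so no DTD is fetched later
        tidy.setNumEntities(true);    // numeric entities, which an XML parser understands

        ByteArrayInputStream in = new ByteArrayInputStream(html.getBytes(StandardCharsets.US_ASCII));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        tidy.parse(in, out);
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws Exception {
        // Unquoted attribute value: rejected by the XML parser unless tidied first.
        String htm = "<html><head><title>Hello World</title></head>"
                   + "<body bgcolor=white></body></html>";

        String xhtml = toXhtml(htm);

        // From here on, the same parsing steps as in the question.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList titles = (NodeList) xpath.evaluate("//title/text()", document, XPathConstants.NODESET);
        for (int i = 0; i < titles.getLength(); i++) {
            System.out.println(titles.item(i).getNodeValue());
        }
    }
}
```

In a JSP the body of main would go inside the scriptlet, with out.print in place of System.out.println. Note that the "//title" expression relies on the factory's default of not being namespace-aware; if namespace awareness is switched on, the XHTML namespace that Tidy adds to the html element has to be accounted for in the XPath.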

arichexe

ASKER

How would I modify my code to utilize Tidy?

ASKER CERTIFIED SOLUTION

CEHJ

This solution is only available to members.

CEHJ

:-)