Solved

problem in JTIdy

Posted on 2007-03-30
7
269 Views
Last Modified: 2013-11-19
I am using a jtidy to convert an HTML page to XML document. But the coverted page header looks like this:
"<html>
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1" />
"

so I can not parse it as regular XML doucument  because it did not contain a regular xml header??

I am using the following java code to use Jtidy

import java.net.URL;
import java.io.*;
import org.w3c.tidy.Tidy;

public class TestHTML2XML{
private String url;
private String outFileName;
private String errOutFileName;




public TestHTML2XML(String url, String outFileName, String
errOutFileName) {

this.url = url;
this.outFileName = outFileName;
this.errOutFileName = errOutFileName;
}



public void convert() {
URL u;
BufferedInputStream in;
FileOutputStream out;

Tidy tidy = new Tidy();

//Tell Tidy to convert HTML to XML
tidy.setXmlOut(true);

try {
//Set file for error messages
tidy.setErrout(new PrintWriter(new FileWriter(errOutFileName), true));
u = new URL(url);

//Create input and output streams
in = new BufferedInputStream(u.openStream());
out = new FileOutputStream(outFileName);

//Convert files
tidy.parse(in, out);

//Clean up
in.close();
out.close();

} catch (IOException e) {
System.out.println(this.toString() + e.toString());
}
}
public static void main(String[] args) {
/*
* Parameters are:
* URL of HTML file
* Filename of output file
* Filename of error file
*/
TestHTML2XML t = new TestHTML2XML(args[0], args[1], args[2]);
t.convert();
}
}

0
Comment
Question by:badour_ma
  • 3
  • 2
7 Comments
 
LVL 30

Expert Comment

by:mayankeagle
Comment Utility
>> so I can not parse it as regular XML doucument  because it did not contain a regular xml header??

It will have an <html> root node. You should be able to parse it.

Do you mean it does not contain the <?xml version....> header?

Did you try parsing it as it is?
0
 

Author Comment

by:badour_ma
Comment Utility
yes I traied and it give my an error!!
0
 
LVL 30

Expert Comment

by:mayankeagle
Comment Utility
Which parser did you use? I guess DOM parser will not give you that error
0
 

Author Comment

by:badour_ma
Comment Utility
I do not know which parser i use because i use Jtidy classes only
0
 
LVL 30

Accepted Solution

by:
mayankeagle earned 500 total points
Comment Utility
It will use DOM. I meant to ask if you tried with the plain and simple DOM parser without using JTidy
0

Featured Post

How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

Join & Write a Comment

What is Node.js? Node.js is a server side scripting language much like PHP or ASP but is used to implement the complete package of HTTP webserver and application framework. The difference is that Node.js’s execution engine is asynchronous and event…
Many times as a report developer I've been asked to display normalized data such as three rows with values Jack, Joe, and Bob as a single comma-separated string such as 'Jack, Joe, Bob', and vice versa.  Here's how to do it. 
This tutorial will introduce the viewer to VisualVM for the Java platform application. This video explains an example program and covers the Overview, Monitor, and Heap Dump tabs.
The viewer will learn the basics of jQuery, including how to invoke it on a web page. Reference your jQuery libraries: (CODE) Include your new external js/jQuery file: (CODE) Write your first lines of code to setup your site for jQuery.: (CODE)

762 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

7 Experts available now in Live!

Get 1:1 Help Now