Solved

Problem in JTidy

Posted on 2007-03-30
Last Modified: 2013-11-19
I am using JTidy to convert an HTML page to an XML document, but the header of the converted page looks like this:
"<html>
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1" />
"

So I cannot parse it as a regular XML document, because it does not contain a regular XML header?

I am using the following Java code to run JTidy:

import java.net.URL;
import java.io.*;
import org.w3c.tidy.Tidy;

public class TestHTML2XML {
    private String url;
    private String outFileName;
    private String errOutFileName;

    public TestHTML2XML(String url, String outFileName, String errOutFileName) {
        this.url = url;
        this.outFileName = outFileName;
        this.errOutFileName = errOutFileName;
    }

    public void convert() {
        URL u;
        BufferedInputStream in;
        FileOutputStream out;

        Tidy tidy = new Tidy();

        // Tell Tidy to convert HTML to XML
        tidy.setXmlOut(true);

        try {
            // Set file for error messages
            tidy.setErrout(new PrintWriter(new FileWriter(errOutFileName), true));
            u = new URL(url);

            // Create input and output streams
            in = new BufferedInputStream(u.openStream());
            out = new FileOutputStream(outFileName);

            // Convert files
            tidy.parse(in, out);

            // Clean up
            in.close();
            out.close();

        } catch (IOException e) {
            System.out.println(this.toString() + e.toString());
        }
    }

    public static void main(String[] args) {
        /*
         * Parameters are:
         * URL of HTML file
         * Filename of output file
         * Filename of error file
         */
        TestHTML2XML t = new TestHTML2XML(args[0], args[1], args[2]);
        t.convert();
    }
}

Question by:badour_ma

Expert Comment

by:Mayank S
ID: 18825702
>> So I cannot parse it as a regular XML document, because it does not contain a regular XML header?

It will have an <html> root node. You should be able to parse it.

Do you mean it does not contain the <?xml version....> header?

Did you try parsing it as it is?
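A quick way to check this, as a minimal sketch: the XML declaration is optional in XML 1.0, so the JDK's own DOM parser should accept a well-formed document that starts directly with <html>. The class name NoDeclarationParse and the sample text are made up for illustration; the sample is the header from the question, closed off so that it is a complete document.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class NoDeclarationParse {

    // Parses the text with the JDK's DOM parser and returns the root element name.
    // Checked parser exceptions are wrapped so callers need no throws clause.
    static String rootName(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getNodeName();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not parseable as XML: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: no <?xml ...?> line at the top.
        String sample =
                "<html>\n"
              + "<head>\n"
              + "<meta name=\"generator\" content=\"HTML Tidy, see www.w3.org\" />\n"
              + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\" />\n"
              + "<title>sample</title>\n"
              + "</head>\n"
              + "<body><p>hello</p></body>\n"
              + "</html>\n";
        System.out.println(rootName(sample)); // prints "html"
    }
}
```

If this works but the real output file does not, the cause is probably something other than the missing header.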

Author Comment

by:badour_ma
ID: 18828375
Yes, I tried, and it gives me an error!

Expert Comment

by:Mayank S
ID: 18829120
Which parser did you use? I would expect that a DOM parser will not give you that error.

Author Comment

by:badour_ma
ID: 18835348
I do not know which parser I am using, because I only use the JTidy classes.

Accepted Solution

by:
Mayank S earned 500 total points
ID: 19046648
It will use DOM. I meant to ask whether you tried the plain and simple DOM parser, without using JTidy.
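One way to try that, sketched under some assumptions: feed the file that JTidy wrote to the JDK's DOM parser and print whatever error comes back, with its line and column. The thread never says what the error was, so this only shows how to surface it. The class name CheckTidyOutput and the method check are made up for illustration; the file name is expected as the first command-line argument.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class CheckTidyOutput {

    // Tries to parse the stream as XML and returns a one-line report.
    static String check(InputStream in) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(in);
            return "OK: root element is <" + doc.getDocumentElement().getNodeName() + ">";
        } catch (SAXParseException e) {
            // The position tells you which part of the output the parser rejected.
            return "Parse error at line " + e.getLineNumber()
                    + ", column " + e.getColumnNumber() + ": " + e.getMessage();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return "Could not parse: " + e;
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: the output file written by JTidy (for example, args[1] of TestHTML2XML)
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(check(in));
        }
    }
}
```

Posting the reported line, column, and message would make it much easier to tell whether the problem is the missing header or something else in the output.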