Solved

Problem in JTidy

Posted on 2007-03-30
Last Modified: 2013-11-19
I am using JTidy to convert an HTML page to an XML document, but the header of the converted page looks like this:
"<html>
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1" />
"

So I cannot parse it as a regular XML document, because it does not contain a regular XML header?

I am using the following Java code to run JTidy:

import java.net.URL;
import java.io.*;
import org.w3c.tidy.Tidy;

public class TestHTML2XML {
    private String url;
    private String outFileName;
    private String errOutFileName;

    public TestHTML2XML(String url, String outFileName, String errOutFileName) {
        this.url = url;
        this.outFileName = outFileName;
        this.errOutFileName = errOutFileName;
    }

    public void convert() {
        URL u;
        BufferedInputStream in;
        FileOutputStream out;

        Tidy tidy = new Tidy();

        // Tell Tidy to convert HTML to XML
        tidy.setXmlOut(true);

        try {
            // Set file for error messages
            tidy.setErrout(new PrintWriter(new FileWriter(errOutFileName), true));
            u = new URL(url);

            // Create input and output streams
            in = new BufferedInputStream(u.openStream());
            out = new FileOutputStream(outFileName);

            // Convert files
            tidy.parse(in, out);

            // Clean up
            in.close();
            out.close();
        } catch (IOException e) {
            System.out.println(this.toString() + e.toString());
        }
    }

    public static void main(String[] args) {
        /*
         * Parameters are:
         * URL of HTML file
         * Filename of output file
         * Filename of error file
         */
        TestHTML2XML t = new TestHTML2XML(args[0], args[1], args[2]);
        t.convert();
    }
}

Question by:badour_ma
LVL 30

Expert Comment

by:Mayank S
ID: 18825702
>> So I cannot parse it as a regular XML document, because it does not contain a regular XML header?

It will have an <html> root node. You should be able to parse it.

Do you mean it does not contain the <?xml version....> header?

Did you try parsing it as it is?
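
If the `<?xml version...>` header really is required downstream, one possible approach is to re-serialize the converted markup with the JDK's own `Transformer`, which can be told to write the declaration. This is only a sketch under assumptions: the class name `AddXmlDeclaration` and the method `withDeclaration` are made up for illustration, the sample input is a hand-written stand-in for JTidy's output, and it assumes that output is already well-formed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of JTidy.
public class AddXmlDeclaration {

    // Re-serializes well-formed markup so that it starts with an XML declaration.
    static String withDeclaration(String xml) {
        try {
            // An identity transformer copies the input to the output unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            // Request XML output explicitly: the root element is named "html",
            // and a serializer may otherwise choose HTML output for such a root.
            identity.setOutputProperty(OutputKeys.METHOD, "xml");
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
            StringWriter out = new StringWriter();
            identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not re-serialize input", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the converted page (assumption: no DOCTYPE, no named entities).
        String converted = "<html><head><title>t</title></head><body><p>hello</p></body></html>";
        System.out.println(withDeclaration(converted));
    }
}
```

The declaration itself is optional in XML 1.0, so adding it mainly helps tools that insist on seeing it.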

Author Comment

by:badour_ma
ID: 18828375
Yes, I tried, and it gives me an error!
LVL 30

Expert Comment

by:Mayank S
ID: 18829120
Which parser did you use? I guess a DOM parser will not give you that error.

Author Comment

by:badour_ma
ID: 18835348
I do not know which parser I use, because I use only the JTidy classes.
LVL 30

Accepted Solution

by:
Mayank S earned 500 total points
ID: 19046648
It will use DOM. I meant to ask whether you tried parsing the output with the plain and simple DOM parser, without using JTidy.
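
A minimal sketch of that check, using only the standard JAXP DOM parser from the JDK. The class name `PlainDomParse`, the method `rootName`, and the sample markup are assumptions made for illustration; the sample imitates the header from the question but is not the asker's actual file. Real JTidy output may also contain a DOCTYPE or named entities such as `&nbsp;`, which this sketch does not cover.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper: parses markup with the plain DOM parser, no JTidy involved.
public class PlainDomParse {

    // Returns the name of the root element, or throws if the input is not well-formed.
    static String rootName(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException("Not well-formed XML: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        // No <?xml ...?> declaration here; the document starts directly with <html>.
        String sample =
                "<html>\n"
              + "<head>\n"
              + "<meta name=\"generator\" content=\"HTML Tidy, see www.w3.org\" />\n"
              + "<title>sample</title>\n"
              + "</head>\n"
              + "<body><p>hello</p></body>\n"
              + "</html>";
        System.out.println(rootName(sample)); // prints "html"
    }
}
```

If this parses but the real output file does not, the exception message from the parser should point at the line that is actually causing the error.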