mikechen

Validate an XML document using a DTD

Hi,

I have a question about using DTD to validate an XML data feed. Here is what I need to do.

1. I need to retrieve an XML file from a website, say http://ABCD.COM/sample.xml.

2. This XML is formatted according to a DTD. The DTD is defined externally on the same website. Here is a sample of the XML file:

<?xml version="1.0" encoding="ISO8859-1"?>
<!DOCTYPE index SYSTEM "/dtds/format1.dtd">
<Foo>
    <Foo1>
    </Foo1>
</Foo>

3. When I retrieve the XML file (or after I retrieve it), I need to check whether it is valid according to the DTD file.


Any ideas? Sample code would be really appreciated.

Thanks.


girionis

 Take a look here: http://java.sun.com/xml/jaxp/dist/1.1/docs/tutorial/sax/index.html

  At the bottom of the page it has links to various things you can do with java and XML.

  Hope it helps.
MikaelHK

First of all:

Use the URL class in the java.net package.
It allows you to get an InputStream for your resource (remember to wrap it in a BufferedInputStream for good measure).

Now to parse the xml:

Use the javax.xml.parsers package (JAXP) to have a factory create a validating parser for you. If all you want is to know whether there are any errors in the document, simply make a SAXParser and pass it a subclass of the SAX2 DefaultHandler class that overrides the error, fatalError and warning methods to get the error information. If that doesn't do enough and you aren't comfortable with SAX, I suggest you get hold of a DOM parser, which is far easier to use (it represents your document as a tree of nodes) but is also expensive in memory and processing compared with SAX.
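
A minimal sketch of the approach described above, assuming a JAXP parser is on the classpath. The class name DtdCheck, the helper validate and the nested CollectingHandler are made up for illustration; only the JAXP and SAX calls are real API. It opens the URL as a buffered stream, parses with validation switched on, and collects whatever the parser reports through warning, error and fatalError.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class DtdCheck {
    /** Hypothetical handler that records everything the parser reports. */
    static class CollectingHandler extends DefaultHandler {
        final List<String> problems = new ArrayList<>();

        @Override
        public void warning(SAXParseException e) {
            problems.add("warning: " + e.getMessage());
        }

        @Override
        public void error(SAXParseException e) {
            // Validity errors arrive here; parsing continues afterwards.
            problems.add("error: " + e.getMessage());
        }

        @Override
        public void fatalError(SAXParseException e) throws SAXException {
            // The document is not well-formed; record it and stop.
            problems.add("fatal: " + e.getMessage());
            throw e;
        }
    }

    /** Parses the source with DTD validation on and returns the reported problems. */
    public static List<String> validate(InputSource source) {
        CollectingHandler handler = new CollectingHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(source, handler);
        } catch (SAXParseException e) {
            // Already recorded by fatalError above.
        } catch (SAXException | ParserConfigurationException | IOException e) {
            handler.problems.add("could not parse: " + e);
        }
        return handler.problems;
    }

    public static void main(String[] args) throws IOException {
        URL url = URI.create(args[0]).toURL();
        try (InputStream in = new BufferedInputStream(url.openStream())) {
            InputSource source = new InputSource(in);
            source.setSystemId(url.toString()); // base for relative DTD paths
            List<String> problems = validate(source);
            System.out.println(problems.isEmpty() ? "File is valid" : "Problems: " + problems);
        }
    }
}
```

Setting the system identifier on the InputSource matters when the document is read from a stream, because the parser otherwise has no base URL for resolving the DTD reference.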

According to the sample XML on the page, your document contains a DOCTYPE with a SYSTEM reference. This requires that the DTD is present in the parsing system (filesystem). If possible, you should change this to a PUBLIC "http://abcd.com/dtds/format1.dtd", which will allow the parser to load the DTD from the server (in fact, some parsers are now smart enough to cache these DTDs for later reuse).

Hope it was helpful.
 Mikael, please do not propose answers, as this locks the question and makes it difficult for other people to see it and add their comments. Post comments instead; comments can still be accepted as answers.
mikechen

ASKER

Thanks for the responses. The XML is a data feed from another website, so I cannot change it.

So is there a way to achieve what I want?

Could somebody provide some sample code?

Thanks.
Mike,

Yes, there is probably a way to achieve what you want, and it's not too difficult. However, it depends on properly defining the path to the DTD in the XML document. The external DTD subset ("/dtds/format1.dtd") is resolved in the context of the document entity ("http://abcd.com/sample.xml"), so your DTD must be accessible at http://abcd.com/dtds/format1.dtd.
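
The resolution rule can be checked with java.net.URI, which applies the same relative-reference rules. This is only an illustration; the class name ResolveDtd and the helper resolve are made up.

```java
import java.net.URI;

public class ResolveDtd {
    /** Resolves a DOCTYPE system identifier against the URI of the document that contains it. */
    static URI resolve(String documentUri, String systemId) {
        return URI.create(documentUri).resolve(systemId);
    }

    public static void main(String[] args) {
        // prints http://abcd.com/dtds/format1.dtd
        System.out.println(resolve("http://abcd.com/sample.xml", "/dtds/format1.dtd"));
    }
}
```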

Here's some sample code using JAXP. Note, however, that Crimson (the default parser with JDK 1.4) has a bug causing it to incorrectly resolve the DTD URL. You'll need to get a different validating parser such as Xerces (http://xml.apache.org/xerces2-j).

import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class validate {
    static public void main(String[] args) {
        // Check the arguments before creating the parser.
        if (args.length < 1) {
            System.err.println(
                "Usage: java validate http://abcd.com/sample.xml");
            System.exit(1);
        }

        try {
            SAXParserFactory saxfactory = SAXParserFactory.newInstance();
            saxfactory.setValidating(true);
            SAXParser saxparser = saxfactory.newSAXParser();

            // Validity errors are recoverable, so the parser only reports
            // them to the handler. Rethrow them so they reach the catch below.
            DefaultHandler handler = new DefaultHandler() {
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
            };

            saxparser.parse(args[0], handler);
            System.out.println("File is valid");
        }
        catch (SAXException e) {
            System.out.println("File is not valid: " + e.getMessage());
        }
        catch (Exception e) {
            System.out.println("Error parsing document:");
            e.printStackTrace();
        }
    }
}
Hi, here is what I plan to do.

1. Get the XML file.
2. Replace <!DOCTYPE index SYSTEM "/dtds/format1.dtd"> with <!DOCTYPE index SYSTEM "http://abcd.com/dtds/format1.dtd">
3. Parse it.

Do you think this is good enough?

Here is where I need help, since I am still mostly a C++/C# programmer.

1. To get the XML file, here is what I did:

<<
DocumentBuilderFactory docBuilderFactory;
docBuilderFactory = DocumentBuilderFactory.newInstance();
docBuilderFactory.setValidating(false);

DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document xmlDoc = docBuilder.parse(uri);
...
>>



But I got an error like this:
<<
Exception in thread "main" java.lang.InternalError
        at org.apache.crimson.parser.Parser2.parseSystemId(Parser2.java:2636)
        at org.apache.crimson.parser.Parser2.maybeExternalID(Parser2.java:2605)
        at org.apache.crimson.parser.Parser2.maybeDoctypeDecl(Parser2.java:1116)

        at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:488)
        at org.apache.crimson.parser.Parser2.parse(Parser2.java:304)
        at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:433)

        at org.apache.crimson.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:179)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:134)
>>

Why did I get this error?

2. Once I have the xmlDoc, I assume I can manipulate it and change the DTD reference. What is the best way?

3. After I change the DTD reference to an absolute path, how should I parse the document again?


Thanks a lot.


BTW, I am using JDK 1.4.
Mike, your code isn't working because it's trying to parse the DTD before you change it. I can tell you how to fix it, but it shouldn't be necessary. As I stated in my previous comment, your XML file and the system identifier for the DTD external subset are valid. You're probably having problems just because of the Crimson bug. Just use a different parser and try the sample code I posted. Let us know if it doesn't work.
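
For reference, one way to point the parser at a different DTD location without editing the document text is a SAX EntityResolver. This is a minimal sketch and an assumption on my part, not something tested against the feed in question; the class name RedirectDtd and the helper isValid are hypothetical, while resolveEntity and the other calls are standard SAX/JAXP.

```java
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class RedirectDtd {
    /** Validates the document, handing the parser the given DTD source for format1.dtd. */
    public static boolean isValid(InputSource document, final InputSource dtd) {
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // The parser asks here before fetching any external entity.
                if (systemId != null && systemId.endsWith("/dtds/format1.dtd")) {
                    return dtd;
                }
                return null; // fall back to the parser's default behaviour
            }

            @Override
            public void error(SAXParseException e) throws SAXException {
                throw e; // treat validity errors as failures
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(document, handler);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        InputSource document = new InputSource(args[0]);
        InputSource dtd = new InputSource("http://abcd.com/dtds/format1.dtd");
        System.out.println(isValid(document, dtd) ? "File is valid" : "File is not valid");
    }
}
```

Because the resolver supplies the DTD, the DOCTYPE line in the document can stay exactly as the feed delivers it.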
Hi, One question here.

When I call docBuilderFactory.setValidating(false);
is it still going to validate the xml against the dtd ?

If not, then why did I still get that error ?

Thanks.
 No it should not validate it...

  Getting this error means that there might be a problem with your parser, or, even with your Servlet engine. Tomcat 4.0 is known to have such problems. What Servlet Engine are you using?
If you have copied your XML document elsewhere, then the problem is that the DTD external subset is specified but doesn't exist. That's not only a validation error, that's a well-formedness error too. Parsers check for well-formedness even when validation is off.

If you really need to save the document elsewhere and rewrite the DTD, consider using SAX2. The SAX2 API has two features, "external-general-entities" and "external-parameter-entities" that allow you to skip external entities. You can use the default Crimson parser, but you'll lose the comments in your document. If you want to preserve comments you should use a parser that supports the SAX2 extensions. Piccolo (http://piccolo.sourceforge.net), Xerces, and a few others (check http://www.saxproject.org/?selected=links) will work fine.
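
A rough sketch of switching those two features off, using their full SAX2 feature URIs. The class name SkipExternalEntities and its helpers are made up for illustration. Whether a given parser honours the features, in particular for the external DTD subset, varies by implementation, so treat this as a starting point rather than a guarantee.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class SkipExternalEntities {
    static final String GENERAL =
        "http://xml.org/sax/features/external-general-entities";
    static final String PARAMETER =
        "http://xml.org/sax/features/external-parameter-entities";

    /** Creates a non-validating reader with both external-entity features turned off. */
    public static XMLReader newReader() {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setFeature(GENERAL, false);
            reader.setFeature(PARAMETER, false);
            return reader;
        } catch (ParserConfigurationException | SAXException e) {
            // Thrown when the parser does not recognise or support a feature.
            throw new IllegalStateException("features not supported by this parser", e);
        }
    }

    /** Reads a feature back so the setting can be confirmed. */
    public static boolean isEnabled(XMLReader reader, String feature) {
        try {
            return reader.getFeature(feature);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        XMLReader reader = newReader();
        System.out.println("general entities: " + isEnabled(reader, GENERAL));
        System.out.println("parameter entities: " + isEnabled(reader, PARAMETER));
    }
}
```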
HI, Yoren,

Can you post the sample code using Xerces (http://xml.apache.org/xerces2-j)?

Thanks.
Mike,

One of the great things about JAXP and SAX is that you can switch parsers without changing any code. You can use the code listed in my previous comment.

To have the program use Xerces instead of the default Crimson parser, you'll need to:

1. Download Xerces and place the .jar files in your [JAVA_HOME]/jre/lib/ext directory.

2. Tell Java to use Xerces as the default parser by creating the file, [JAVA_HOME]/jre/lib/jaxp.properties, and putting in these two lines:

javax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl
javax.xml.parsers.DocumentBuilderFactory=org.apache.xerces.jaxp.DocumentBuilderFactoryImpl
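
To confirm which implementation JAXP actually picks up after these changes, a small check like the following can help. The class name WhichParser is made up; it only prints the factory classes that newInstance() returns. If the names still contain "crimson", the properties file or the .jar files are not being found.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

public class WhichParser {
    /** Name of the SAX factory implementation JAXP selected. */
    public static String saxFactoryName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    /** Name of the DOM factory implementation JAXP selected. */
    public static String domFactoryName() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("SAX factory: " + saxFactoryName());
        System.out.println("DOM factory: " + domFactoryName());
    }
}
```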
Hi, Yoren,

I followed what you said, but I still got the error below.
It seems it is still using Crimson.

Any idea ?

Thanks.

<<
        at org.apache.crimson.parser.Parser2.parseSystemId(Parser2.java:2636)
        at org.apache.crimson.parser.Parser2.maybeExternalID(Parser2.java:2605)
        at org.apache.crimson.parser.Parser2.maybeDoctypeDecl(Parser2.java:1116)

        at org.apache.crimson.parser.Parser2.parseInternal(Parser2.java:488)
        at org.apache.crimson.parser.Parser2.parse(Parser2.java:304)
        at org.apache.crimson.parser.XMLReaderImpl.parse(XMLReaderImpl.java:433)

        at javax.xml.parsers.SAXParser.parse(SAXParser.java:346)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:232)
        at validate.main(validate.java:18)
>>
Try specifying the parser on the command line:

java -Djavax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl  validate http://abcd.com/sample.xml
I got an error
File is not valid: The encoding "ISO8859-1" is not supported.
Ah, I didn't see that typo before. That's not a valid encoding. Instead, it should be "ISO-8859-1".
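
If the feed itself cannot be corrected, one possible workaround (my own sketch, not something tried in this thread) is to decode the bytes yourself and hand the parser a character stream. The SAX InputSource documentation says a parser given a character stream reads it directly and disregards the encoding declaration. The class name ForceEncoding and the helper isWellFormed are hypothetical, and the sketch assumes the feed really is ISO-8859-1.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ForceEncoding {
    /** Parses the stream as ISO-8859-1 text, whatever its XML declaration says. */
    public static boolean isWellFormed(InputStream in, String systemId) {
        try {
            Reader reader = new InputStreamReader(in, StandardCharsets.ISO_8859_1);
            InputSource source = new InputSource(reader);
            source.setSystemId(systemId); // base for relative references, may be null
            SAXParserFactory.newInstance().newSAXParser()
                .parse(source, new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = URI.create(args[0]).toURL().openStream()) {
            System.out.println(isWellFormed(in, args[0]) ? "parsed" : "not well-formed");
        }
    }
}
```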
I put a space between -D and javax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl, and then I got an error:

Exception in thread "main" java.lang.NoClassDefFoundError: javax/xml/parsers/SAXParserFactory=org/apache/xerces/jaxp/SAXParserFactoryImpl

Did you ever hear the joke about the guy who walks into a doctor's office and says "it hurts when I raise my arm like this"? Well, like the doctor told him,

Don't do that.
ASKER CERTIFIED SOLUTION
yoren

>Did you ever hear the joke about the guy who walks into a doctor's office and says "it hurts when I
>raise my arm like this"? Well, like the doctor told him,
>
>Don't do that.

  LOL. :-)

  mikechen, the java.lang.NoClassDefFoundError means that the VM cannot find the class you want. Make sure that the path to this class (or its jar file) is in the classpath. If you have put the jar file inside the /ext folder, the VM should pick it up automatically. If you still get the NoClassDefFoundError, make sure that the class you are looking for is in the correct jar file.

  Sometimes you even have to restart your server.

  Hope it helps :-)
The NoClassDefFoundError is occurring because of the space after the -D flag. It thinks you're trying to run a program called "javax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl " instead of setting the system property. Remove the space to fix the problem.
 Aha, I see. To be honest, I have never needed to set a system property on the command line. Thanks for the information.
OK, I am back from some chaos.

Hi, Yoren,
How do I fix
"File is not valid: The encoding "ISO8859-1" is not supported." ?

 Change this: "ISO8859-1" to this: "ISO-8859-1"

  Hope it helps.
Thanks for your help, MikaelHK.