
Validation against XML Schema

Is there a way of validating an XML file against a Schema so that when the XML file has an element declared like this:

<some-field/>

where the schema defines it like:

<xsd:element name="some-field" type="xsd:nonNegativeInteger" />

as opposed to:

<xsd:element name="some-field" type="xsd:nonNegativeInteger" nillable="true" />

the parser will fire an exception?

When the element is not defined as "nillable", I don't think the XML file can have it as <some-field/>.

I am using Xerces for C++, but I guess this is ultimately a case of correctly understanding XML concepts. In code, the parser doesn't find the error and I wonder if that's normal behaviour or there is a bug somewhere.

Any help would be appreciated.
TIA

dualsoul

Hm... maybe you should explicitly state in the instance document that <some-field> has a nil value. Try it like this:

<some-field xsi:nil="true"/>

rdcpro

That seems like a bug somewhere, possibly in your implementation. I'd be surprised that Xerces wouldn't catch it. This XML:

<?xml version="1.0" encoding="UTF-8"?>
<some-field xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="E:\Projects\Pro Bono\temp\nillable.xsd"/>

fails to validate against:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified">
<xs:element name="some-field" type="xs:nonNegativeInteger" /></xs:schema>

using XML Spy. Parsers are not required to validate by default...are you sure you've set it to validate, and you're able to read errors?

void setDoSchema(const bool)
true: Enable the parser's schema support.
false: Disable the parser's schema support.
default: false
note: If set to true, namespace processing must also be turned on.
see: setDoNamespaces

void setValidationSchemaFullChecking(const bool)
true: Enable full schema constraint checking, including checking which may be time-consuming or memory intensive. Currently, particle unique attribution constraint checking and particle derivation restriction checking are controlled by this option.
false: Disable full schema constraint checking.
default: false
note: This feature checks the Schema grammar itself for additional errors that are time-consuming or memory intensive. It does not affect the level of checking performed on document instances that use Schema grammars.
see: setDoSchema

void setDoNamespaces(const bool)
true: Perform Namespace processing.
false: Do not perform Namespace processing.
default: false
note: If the validation scheme is set to Val_Always or Val_Auto, then the document must contain a grammar that supports the use of namespaces.
see: setValidationScheme

Regards,
Mike Sharp

savalou

Hey, Mensana, have you tried using the SAX parser?
SAXParser* parser = new SAXParser;
parser->setValidationScheme(valScheme);
parser->setDoNamespaces(doNamespaces);
parser->setDoSchema(doSchema);
parser->setValidationSchemaFullChecking(schemaFullChecking);

And in your instance document, do you specify the schema location? And is it findable? Otherwise the parser may not validate and it won't say anything.

Mensana

ASKER

Thank you all for your replies. Here are my answers:

(1) dualsoul: I didn't define the XML Schema and I don't generate the XML files/messages. I am only supposed to process them. Because errors can occur while the files/messages are generated or transferred, I need a means to validate them, and this is where the XML Schema should come into the picture.

(2) rdcpro: Here is how I create my validating parser

XercesDOMParser *CreateFullValidatingParser( const XMLCh *schema )
{
    XercesDOMParser *parser = new XercesDOMParser;
    parser->setValidationScheme( XercesDOMParser::Val_Always );
    parser->setDoNamespaces( true );
    parser->setDoSchema( true );
    parser->setExitOnFirstFatalError( true );
    parser->setValidationConstraintFatal( true );
    // parser->setDoValidation( true );
    parser->setValidationSchemaFullChecking( true );
    parser->setExternalNoNamespaceSchemaLocation( schema );

    return parser;
}

I saw in the docs that "setDoValidation" is a deprecated function and that's why it is commented out in my code. I tried to uncomment it and it still didn't work.

(3) savalou: I just reposted this question after you answered the other one because I thought that I should reformulate it. Like I said before, I do not generate these XML messages so it is not my task to make sure they are correct. I only need to validate them against the XML validation schema (again, not specified by me). I just rechecked and the "schema" parameter has the complete path and the file name for the XML Schema. Hey, it had to be, because in my testing project I copied and pasted it from the Windows Explorer.
I will try creating a SAX Parser, see if it makes any difference.

As you all can see, I am still lost in the XML jungle. Thanks for your help anyway. Keep it coming.

Eddie

Mensana

ASKER

I tried with the SAXParser and still the error is not caught. I remembered something that I noticed with the XercesDOMParser as well. After you parse the file/message you can call the "getErrorCount()" method of the parser. In my case this function returns 1, but I don't know how to get to that error. My "CXMLErrorHandler" class (inherited from Xerces' ErrorHandler) doesn't report anything. Maybe that's where the problem is. An instance of this class is passed in to the parser's "setErrorHandler" method.

Mensana

ASKER

OK, now I am getting somewhere. I noticed that ErrorHandler's "error" interface method is called when the document is parsed. The location of the error takes me to where my empty tag is declared (<some-field/>). In the documentation, this is described as a recoverable error and it is suggested that you should keep parsing the document till its end. I don't take any action here (I only do it when the "fatalError" method is called). Originally, I treated this as an error, but I changed that recently, after I noticed that some files were not processed although there was absolutely nothing wrong with them. The location of the error didn't give me any indication of what went wrong, so I decided to ignore all the recoverable errors.
This being the case, my question now becomes: in which cases is such a recoverable error detected? Does anyone here know where these types of errors are described?

rdcpro

About all I can advise you here is that validation errors might be recoverable so that you can (in some circumstances) continue parsing to find *all* validation errors in one pass. Not all parsers will do this, though. In your case, the error has to do with the datatype, not the XML document structure, so this should be recoverable. That is, the invalid element doesn't affect other elements. If the validation failed because, for example, a complex type was not correct, this might not be recoverable, because now the very structure of the document is in question.

If a recoverable error occurs, but the document has nothing wrong with it, maybe it's because the parser has not resolved an external of some kind? I sometimes see validation errors that go something like "this document is not valid, but it could be valid as part of another document".

Regards,
Mike Sharp

Mensana

ASKER

Well, where does that leave me then? Like I said, I have no control over the schema and I do not generate the XML messages. All I need to do is open them, validate them against the schema and save them in a database.
Normally I read a node's value like this:

DOMText *pTextNode = MyFunction2FindTextNode( pNode, "some-field" );
objMyRecordset.strSOME_FIELD_ = pTextNode->getData();

Some nodes (those that are nillable or have minOccurs=0, according to the XML Schema) can be absent, and there I check the DOMText pointer for NULL:

DOMText *pTextNode = MyFunction2FindTextNode( pNode, "some-field" );
if( pTextNode )
    objMyRecordset.strSOME_FIELD_ = pTextNode->getData();
else
    objMyRecordset.SetFieldNull( &objMyRecordset.strSOME_FIELD_ );

I don't want to do that for every field (there are thousands of them); I would rather rely on the validation process to catch all the illegal values (those that do not conform to the schema). Checking the DOMText pointer for NULL everywhere looks to me as if I am defeating the very purpose of the validation process.
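If the NULL check cannot be avoided entirely, it can at least be written once instead of at every call site. The following is only a minimal sketch of that idea: TextNode and Field are hypothetical stand-ins so the example compiles on its own, not the Xerces DOMText class or the recordset type from the snippets above.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a DOM text node (not the Xerces DOMText class).
struct TextNode {
    std::string data;
    const std::string& getData() const { return data; }
};

// Hypothetical stand-in for one nullable database field.
struct Field {
    bool isNull = true;
    std::string value;
};

// Copies the node's text into the field, or marks the field as NULL when
// the lookup returned no text node (element absent or empty).
inline void assignField(Field& field, const TextNode* node) {
    if (node != nullptr) {
        field.isNull = false;
        field.value = node->getData();
    } else {
        field.isNull = true;
        field.value.clear();
    }
}
```

Every call site then becomes a single assignField( field, node ) call, and the decision about what an empty element means lives in one place.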

rdcpro

I think you trap the recoverable errors as they occur...a Fatal error means the document is not only invalid, but bad enough that you can't really continue checking for errors. As I said, the document invalidates when I check it, but with MSXML I can only find the first instance of an error. It is my understanding that in Xerces you can parse the entire document, and gather up all the recoverable errors as you go. I'm not really a C++ programmer, but wouldn't you use try-catch blocks, and store the exception information in an array or something as you go? As long as they're recoverable, they're probably going to be type errors like you're getting.

Regards,
Mike Sharp

savalou

You can set a handler in Xerces to process errors in your own special way. Handlers can be set for errors, fatal errors and warnings.

I guess I don't understand the problem. The document doesn't validate against the schema using Xerces-C, I thought we'd established that?

Mensana

ASKER

I guess I confused you completely. Let's do a recap.

I can tune the code of my error_handler class so that:

(1) it can find non-recoverable errors only;
or
(2) it can find non-recoverable and recoverable errors;

Just as a side note (if you're not familiar with C++), the parse method of the AbstractDOMParser class will not throw an exception for these errors but will call methods on the error handler that is associated with the parser. The user needs to write their own class inheriting from Xerces’ ErrorHandler. Among others, the interface of this abstract class has two functions called “error” and “fatalError”. These methods are called during the parsing process when recoverable and non-recoverable errors, respectively, are encountered.

In case (1) an error such as the one described in my original posting will not be found. This would cause my code to crash because I would try to use a pointer that's NULL. In the code snippet inserted in my previous posting, pTextNode is NULL for a field declared as "<some-field/>" (regardless of what the XML schema says). The alternative would be to check for NULL in every place. I already did that for those fields that were nillable or had minOccurs=0, but not for all of them. I don't want to do that because first, there are thousands of places where this modification would have to be made and second, I would rather rely on the parser to catch this kind of error (after all, what's the purpose of validating a message if not to find errors, right?). Just so that you understand, in this case no error would be found, yet my code would break.

In case (2) I would find all these errors but unfortunately some others too that are recoverable. For example I remember a recoverable error when there was a space (char 32) inserted in the document. I would have something like this “…</field> </structure>…”. Because of the blank space inserted in between, the parser calls MyErrorHandler::error signalling a recoverable error and this makes me throw the message away.

My code was designed to implement the exception class according to plan (2) but then along came the message with that extra space. I read the documentation and I changed the code so that despite recoverable errors, it would continue to parse the document - plan (1). Evidently, my application started to crash due to empty fields described in case (1). At that time I didn't know that empty fields would be a recoverable error and thought that the parser had a bug.

I guess that now I am wondering whether there is a middle way in which some recoverable errors can be ignored while others cannot.

That's all, really.
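One possible shape for such a middle way is to record every recoverable error while parsing and decide afterwards which ones are acceptable. The sketch below is written without Xerces so it stands on its own; ReportedError and RecoverableErrorPolicy are hypothetical names, and matching on fragments of the message text is an assumption about how errors could be told apart, not something the Xerces documentation prescribes.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record of one recoverable error, as a handler's error()
// callback might capture it. Not a Xerces type.
struct ReportedError {
    long line;
    long column;
    std::string message;
};

// Sorts recoverable errors into "ignored" and "blocking" so the caller can
// finish parsing first and decide afterwards whether to reject the document.
class RecoverableErrorPolicy {
public:
    explicit RecoverableErrorPolicy(std::vector<std::string> toleratedFragments)
        : tolerated_(std::move(toleratedFragments)) {}

    // Call once per recoverable error reported during the parse.
    void report(const ReportedError& error) {
        if (isTolerated(error.message)) {
            ignored_.push_back(error);
        } else {
            blocking_.push_back(error);
        }
    }

    // True when at least one error was not on the tolerated list.
    bool shouldReject() const { return !blocking_.empty(); }

    const std::vector<ReportedError>& blocking() const { return blocking_; }
    const std::vector<ReportedError>& ignored() const { return ignored_; }

private:
    // An error is tolerated when its message contains a tolerated fragment.
    bool isTolerated(const std::string& message) const {
        for (const std::string& fragment : tolerated_) {
            if (!fragment.empty() &&
                message.find(fragment) != std::string::npos) {
                return true;
            }
        }
        return false;
    }

    std::vector<std::string> tolerated_;
    std::vector<ReportedError> blocking_;
    std::vector<ReportedError> ignored_;
};
```

In a real handler, the "error" method would forward the line, column and message it receives to report(), and the caller would check shouldReject() once parsing has finished. Which messages belong on the tolerated list has to be taken from the messages the parser actually emits.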

savalou

Why would spaces cause an error?

Mensana

ASKER

I don't know, but I get a recoverable error at column X, line Y. When I go there in the document I find a space between two tags like this:

. . .</field> </structure>. . .

For some reason the parser complains there. If you ignore this recoverable error, the process goes on without any problem and saves everything in the database.

Weird, right?

rdcpro

I believe MSXML is the only mainstream parser that strips whitespace by default. Xerces-C probably leaves it in. I think you can set Xerces up to strip whitespace-only nodes during the parse. IIRC, it's "include-ignorable-whitespace" which is set to false. But I think Xerces has some peculiarities there, meaning it must have a DTD or schema to identify what's ignorable. But it may work for you... This might help in case 2.

Failing that, you can use XSLT to strip whitespace-only text nodes from an XML document with a simple identity transform:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="xml" version="1.0" indent="no"/>
    <xsl:strip-space elements="*"/>
    <xsl:template match="/|*|@*|processing-instruction()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>

This seems like overkill, though, since you should be able to set Xerces to strip whitespace upon parsing.

Regards,
Mike Sharp

Mensana

ASKER

What you say is interesting. I'll try to find out how to ignore whitespace in the XML documents when using Xerces' parser.
Thanks for your suggestion.

Mensana

ASKER

Another thing that puzzled me today: I spent the whole day trying to figure out why my code ceased to find recoverable errors in the case of an element that is declared like this:
...
<Some-Field/>
...

It is the same problem as described before. I thought I had sorted this out, but today it came back to haunt me. Eventually I managed to find out what it was. For elements declared like this:

<xsd:element name="Some-Field" type="xsd:nonNegativeInteger"/>
or
<xsd:element name="Some-Field" type="xsd:positiveInteger"/>

a recoverable error is found. However, if the element is a string

<xsd:element name="Some-Field" type="xsd:string"/>

then a tag "<Some-Field/>" doesn't generate that error anymore. Any idea what the difference is between numbers and strings as far as schema validation is concerned? Could it be because a string can be empty, but a number must be at least "0"?

I am still digging for that "ignore whitespace" attribute.

TIA

rdcpro

I think you're right: an empty string is still a valid xsd:string. But an empty value is not a valid number. The integer types require at least one digit, so <Some-Field/> fails the datatype check for them.
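To make the difference concrete, here is a small stand-alone sketch of the lexical rule for xsd:nonNegativeInteger (an optional sign followed by at least one digit, with "-" allowed only for zero), next to xsd:string, which accepts any text. The function names are made up for this illustration and the code does not call into Xerces; a validating parser performs this kind of check internally.

```cpp
#include <cassert>
#include <string>

// True when the text is a valid lexical form of xsd:nonNegativeInteger:
// surrounding XML whitespace is ignored, then an optional sign must be
// followed by one or more decimal digits. A '-' sign is only allowed when
// every digit is zero (for example "-0").
inline bool isNonNegativeIntegerLexical(const std::string& raw) {
    const std::string ws = " \t\r\n";
    const std::string::size_type first = raw.find_first_not_of(ws);
    if (first == std::string::npos) return false;  // empty or whitespace only
    const std::string::size_type last = raw.find_last_not_of(ws);

    std::string::size_type i = first;
    bool negative = false;
    if (raw[i] == '+' || raw[i] == '-') {
        negative = (raw[i] == '-');
        ++i;
    }
    if (i > last) return false;  // a sign with no digits after it

    for (; i <= last; ++i) {
        if (raw[i] < '0' || raw[i] > '9') return false;
        if (negative && raw[i] != '0') return false;
    }
    return true;
}

// xsd:string puts no further constraint on element text, so every value,
// including the empty string, is accepted.
inline bool isStringLexical(const std::string&) { return true; }
```

With these, the content of <Some-Field/> (the empty string) is rejected by isNonNegativeIntegerLexical and accepted by isStringLexical, which matches the behaviour described above.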

Here's what Apache says for the Xerces Java parser:

http://apache.org/xml/features/dom/include-ignorable-whitespace

True: Includes text nodes that can be considered "ignorable whitespace" in the DOM tree.
False: Does not include ignorable whitespace in the DOM tree.
Default: true
Note: The only way that the parser can determine if text is ignorable is by reading the associated grammar and having a content model for the document. When ignorable whitespace text nodes are included in the DOM tree, they will be flagged as ignorable. The ignorable flag can be queried by calling the TextImpl#isIgnorableWhitespace():boolean method.

Apparently you need a schema in Xerces to even attempt to exclude whitespace...which you have. Now, how you actually set the feature...in Java you would:

SAXParser p = new SAXParser();
try {
    // false: leave ignorable whitespace out of the DOM tree
    p.setFeature("http://apache.org/xml/features/dom/include-ignorable-whitespace", false);
} catch (SAXException e) {
    System.out.println("error in setting up parser feature");
}

In C++, the syntax is different, but the feature is the same:

void setIncludeIgnorableWhitespace(const bool)

true: Include text nodes that can be considered "ignorable whitespace" in the DOM tree.
false: Do not include ignorable whitespace in the DOM tree.
default: true
note: The only way that the parser can determine if text is ignorable is by reading the associated grammar and having a content model for the document. When ignorable whitespace text nodes are included in the DOM tree, they will be flagged as ignorable; and the method DOMText::isIgnorableWhitespace() will return true for those text nodes.

Regards,
Mike Sharp

Mensana

ASKER

Hi Mike,

I tried to use the "setIncludeIgnorableWhitespace" method. My function that creates the parser now looks like this:

XercesDOMParser *MyFullValidatingParser( const XMLCh *schema )
{
    XercesDOMParser *parser = new XercesDOMParser;
    parser->setValidationScheme( XercesDOMParser::Val_Always );
    parser->setDoNamespaces( true );
    parser->setDoSchema( true );
    parser->setExitOnFirstFatalError( true );
    parser->setValidationConstraintFatal( true );
    parser->setDoValidation( true ); // deprecated function
    parser->setIncludeIgnorableWhitespace(false);
    parser->setValidationSchemaFullChecking( true );
    parser->setExternalNoNamespaceSchemaLocation( schema );

    return parser;
}

and yet I still can't ignore the whitespace(s).

You got me confused with the
setFeature("http://apache.org/xml/features/dom/include-ignorable-whitespace", true );
method.
First of all, that link doesn't take me anywhere and second, the XercesDOMParser doesn't have a setFeature method.

I posted this question on a different forum:

http://alphaworks.ibm.com/forum/xml4c.nsf/current?OpenView&Count=30

There are several old postings on this subject, but none solve my problem. Anyway, over there you don't get replies as fast as here.
I am really annoyed by this whole matter and I feel I am wasting too much time to investigate something that might not even have a solution.

Thanks for your help anyway. I should probably give you the points.

Regards,
Eddie
