Solved

XML validation against a subset of a schema

Posted on 2012-03-12
Last Modified: 2013-12-13
Hi,

I have XML documents that use a subset, e.g. 90%, of all the elements available in an official industry schema (which cannot be modified). The business requirements for the project exclude about, say, 100 fields from this schema. What is the best way to validate these XML documents to make sure they do not contain any of the 100 "banished" fields/XPaths?

One option (not preferable) would be to create a custom schema that is a subset of the original industry schema without the 100 fields and validate the XML documents against that trimmed down schema. Creating/maintaining a separate schema was ruled out, however, as it would be too complex and not maintainable.

I am then interested in the approach where these are given: 1) XML documents 2) schema 3) list of excluded fields... and a technical solution is needed to validate the XML documents against 2 + 3. Ideally the solution is efficient and maintainable.
[I am thinking some XSLT perhaps that would somehow loop through a list of 100 Xpath expressions (perhaps maintained in an external configuration file) and check for the presence of banished elements, if that makes any sense...]

I would appreciate detailed guidance / insights from a few experts on this. A code snippet to illustrate would be great. Thanks for your expertise.

Best Regards,
Lyteck
Question by:lyteck

Expert Comment

by:Geert Bormans
ID: 37708631
Hi Lyteck,

Allow me to say that your question is too vague for detailed examples. I do understand the requirement in general, however, and it is a common use case.
Given the schema and what you actually need, I see two approaches to this problem:
1. Import the schema and redefine some of the definitions. This will work in some cases, but it requires a somewhat deeper understanding of the inheritance mechanisms of XML Schema.
The advantage is that you end up with one single schema.
2. Restrict using Schematron. Schematron is a rule-based schema language. You validate against the existing schema first and then restrict your XML further in a Schematron validation. This is what I do in over 80% of the cases where I have your requirement. It adds a second validation step, but it gives you full flexibility. It also comes close to your suggestion of 1) XML, 2) schema and 3) a list of excluded elements (the Schematron schema is such a list, albeit with a specific syntax). Highly efficient, highly flexible and highly maintainable.

Needless to say, my advice is to learn Schematron (it takes less than two hours) and go down that route:
http://www.schematron.com/
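
As a rough sketch of how the list of excluded fields could look in Schematron: each banished XPath becomes a rule context, and a report that always fires flags every match. The element names (order, legacyCode, customer, faxNumber) are made up for illustration and do not come from any real industry schema. If the industry schema uses a namespace, a Schematron ns declaration and prefixed names would be needed as well.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
    <pattern>
        <!-- One rule per banished field; all element names are hypothetical -->
        <rule context="order/legacyCode">
            <!-- test="true()" fires for every node matched by the context -->
            <report test="true()">Element <name/> is excluded by the project profile</report>
        </rule>
        <rule context="order/customer/faxNumber">
            <report test="true()">Element <name/> is excluded by the project profile</report>
        </rule>
    </pattern>
</schema>
```

Adding or removing an excluded field then means adding or removing one rule, without touching the industry schema.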

Accepted Solution

by:
Geert Bormans earned 500 total points
ID: 37708682
Here is an example to make the story less vague

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
    <xs:element name="example">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="foo" maxOccurs="unbounded"/>
                <xs:element name="bar" minOccurs="0"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>

The <example> element in this schema can have one or more <foo> and then an optional <bar>. No restrictions on the content of the <foo> element

Here is a valid XML document

<?xml version="1.0" encoding="UTF-8"?>
<example xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="example.xsd">
    <foo>fo content</foo>
    <foo>foo2 content</foo>
    <bar>bar content</bar>
</example>

Now, basically without changing the schema, you want:
- a maximum of one foo
- no bar
- the foo element to contain a "foo" in the string content

Here is the schematron that gives you that. Very clear and flexible, no?

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
    <pattern>
        <rule context="example">
            <assert test="count(foo) = 1">More than one foo is not allowed in example</assert>
            <assert test="count(bar) = 0">Bar is not allowed in example</assert>
        </rule>
        <rule context="foo">
            <assert test="contains(., 'foo')">element foo does not contain the word "foo"</assert>
        </rule>
    </pattern>
</schema>

Note that there is a restriction in this approach. You can use Schematron as a second step only if you do pure restriction. You can't say that you want to wipe out a mandatory element or, in general, that you want anything that renders the XML invalid against the original schema. E.g. you can't say "give me no <foo> and two or more <bar>", since then the two schemata conflict. If that is what you need, then you need to go for redefine and import.

Author Comment

by:lyteck
ID: 37708753
Hi Gertone:

First, let me apologize for not providing enough color in my original question for you/others to provide detailed examples/answers. Second, thank you for taking the initiative to create a simple and clear example, which I should have done in the first place.

Your explanations are quite impressive actually. I do see the power and flexibility of Schematron as you explained it. It wasn't that obvious when I started delving into the specs.

Could you please expand on your last sentence "redefine and import"? I think 10% of the cases may fall in that category, i.e., foo is required per schema but it should not be included in the XML instances. The other 90% will use pure restriction.

Say <example> contains foo (mandatory, unbounded), bar (optional), as you have them, plus some other variable tux (mandatory). A valid instance should only contain <tux> inside <example>. Can your example be taken a step further to illustrate this?

In other words, can the overall solution be a mix of "redefine and import" and Schematron?

Many thanks and best regards,
Lyteck

Author Comment

by:lyteck
ID: 37708926
Hi Gertone, I need to scratch my last question completely (if you read it already). As a matter of fact, 100% of the banished elements are about pure restriction (for the simple reason that all the XML instances are compliant with the industry schema to begin with.)

I am going to explore Schematron further and probably end up using it per your expert recommendation. I have one last related question I am writing and will post shortly.

Author Comment

by:lyteck
ID: 37708945
Hi Gertone:

I have one last, quick question on this, as I will need to justify the Schematron choice against all the options considered. Can you briefly comment on the processing speed of the different approaches?

Beforehand, would a third option have been to write validation rules programmatically, e.g. in Java or C#, with code that parses the instance, builds a DOM tree and checks for the banished elements?

1) use schematron validation as second step (preferred approach, but second fastest?)
2) create a custom, restricted schema that strips the 100 fields (fastest processing I assume)
3) create validation rules (is it slowest?)

That is my last query on this. Many thanks for your expertise and best regards,
Lyteck

Assisted Solution

by:Geert Bormans
Geert Bormans earned 500 total points
ID: 37708972
My opinion on custom code for validation is that you should not do it. Unless speed is really, really important, you should value the flexibility and extensibility of using an ISO-standardized schema validation. Here is how Schematron mostly works: the Schematron schema gets compiled into an XSLT stylesheet. You can do that compilation beforehand, and you could then load the resulting XSLT with .NET's compiled transform (XslCompiledTransform). Little efficiency is lost compared to C# DOM coding, in my opinion.
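
To give an idea of the kind of XSLT involved, here is a hand-written XSLT 1.0 sketch that reports banished elements by name. It is not actual Schematron compiler output. The file banned.xml and its banned/name layout are assumptions for illustration, and matching is by local name only, so it is coarser than full XPath-based rules.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>

    <!-- Hypothetical configuration file next to this stylesheet:
         <banned><name>bar</name><name>baz</name></banned> -->
    <xsl:variable name="banned" select="document('banned.xml')/banned/name"/>

    <xsl:template match="/">
        <!-- A string compared with a node-set is true if any node's value equals it -->
        <xsl:for-each select="//*[local-name() = $banned]">
            <xsl:text>Banished element found: </xsl:text>
            <xsl:value-of select="name()"/>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

An empty output means no banished element was found. The list of names lives in the external file, so the stylesheet itself does not change when the list does.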

The fastest processing (given you have schema validation) is changing the existing schema... but what if, in due time, the original schema gets altered?
That is where redefinition and import come into play (well, xs:redefine is a special form of xs:include), but there are important restrictions: you can redefine named complexTypes, but you can't redefine element definitions...
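
A minimal sketch of xs:redefine, assuming a hypothetical base.xsd in which the content model of example is a named type called exampleType (the example schema posted earlier uses an anonymous type, so it could not be redefined this way). The redefined type restricts itself to exactly one foo and no bar. The new content model must still be a valid restriction of the base type.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
    <!-- Assumption: base.xsd declares complexType "exampleType" as a sequence of
         foo (1..unbounded) and bar (0..1), plus an element "example" of that type -->
    <xs:redefine schemaLocation="base.xsd">
        <xs:complexType name="exampleType">
            <xs:complexContent>
                <!-- Inside xs:redefine a type restricts (or extends) itself -->
                <xs:restriction base="exampleType">
                    <xs:sequence>
                        <!-- exactly one foo; the optional bar is left out -->
                        <xs:element name="foo"/>
                    </xs:sequence>
                </xs:restriction>
            </xs:complexContent>
        </xs:complexType>
    </xs:redefine>
</xs:schema>
```

Instances are then validated against this wrapper schema instead of base.xsd, and the original file stays untouched.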

It is hard to comment any further on processing speed, since I don't know the details.
I hope you understand that the following matter for sound advice: the complexity of the schema, the complexity of the restricting rules, the size of the XML documents, the number of documents to be processed per second...

My advice on speed optimisation, however, is: only optimise when you need to. That means you need a target processing time set up front, and you start tweaking the approach only if you don't hit the target.

Author Closing Comment

by:lyteck
ID: 37708994
Gertone,

Thank you for your expertise and educating me on this topic.

Best Regards,
Lyteck

Expert Comment

by:Geert Bormans
ID: 37709041
welcome, good luck with it