XML validation against a subset of a schema


I have XML documents that use a subset, e.g. 90%, of the elements available in an official industry schema (which cannot be modified). The business requirements for the project exclude about 100 fields from this schema. What is the best way to validate these XML documents to make sure they do not contain any of the 100 "banished" fields/XPaths?

One option (not preferable) would be to create a custom schema that is a subset of the original industry schema without the 100 fields and validate the XML documents against that trimmed down schema. Creating/maintaining a separate schema was ruled out, however, as it would be too complex and not maintainable.

I am then interested in the approach where these are given: 1) XML documents 2) schema 3) list of excluded fields... and a technical solution is needed to validate the XML documents against 2 + 3. Ideally the solution is efficient and maintainable.
[I am thinking some XSLT perhaps that would somehow loop through a list of 100 Xpath expressions (perhaps maintained in an external configuration file) and check for the presence of banished elements, if that makes any sense...]
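For what it's worth, the XPath-list idea can also be sketched outside XSLT. Below is a minimal illustration in Python (standard library only); the banished paths, element names, and sample document are hypothetical placeholders, and note that ElementTree supports only a limited XPath subset:

```python
# Hypothetical sketch of the idea above: load a list of banished element
# paths (in practice from an external configuration file) and report any
# that occur in a document.
import xml.etree.ElementTree as ET

# Placeholder paths -- the real project would have ~100 of these.
BANISHED_PATHS = [
    ".//AccountDetails/SSN",
    ".//Customer/DateOfBirth",
]

def find_violations(xml_text, banished_paths):
    """Return the banished paths that actually occur in the document."""
    root = ET.fromstring(xml_text)
    return [path for path in banished_paths if root.findall(path)]

doc = "<Report><Customer><DateOfBirth>1970-01-01</DateOfBirth></Customer></Report>"
print(find_violations(doc, BANISHED_PATHS))  # -> ['.//Customer/DateOfBirth']
```

A real implementation would presumably run normal XSD validation first, then this check as a second step.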

I would appreciate detailed guidance / insights from a few experts on this. A code snippet to illustrate would be great. Thanks for your expertise.

Best Regards,

Gertone (Geert Bormans), Information Architect, commented:
Hi Lyteck,

Allow me to say that your question is too vague to give detailed examples. I understand the requirement in general, however, and it is a common use case.
Given the schema and what you actually need, I see two approaches to this problem:
1. Import the schema and redefine some element definitions. This will work in some cases, but requires a somewhat deeper understanding of the inheritance mechanisms of XML Schema. The advantage is that you end up with one single schema.
2. Restrict using Schematron. Schematron is a rules-based schema language. You could validate against the existing schema and restrict your XML further in a Schematron validation. This is what I do in over 80% of the cases where I have your requirement. It adds a second validation step, but gives you full flexibility. It also comes close to your suggestion, where there is 1) XML, 2) a schema, and 3) a list of excluded elements (the Schematron schema is essentially such a list, just with a specific syntax). Highly efficient, highly flexible and highly maintainable.

Needless to say, learn Schematron (it takes less than two hours) and go down that route.
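To make that concrete, a banished-element list could look something like this in Schematron (a hypothetical sketch; the element names here are placeholders for your 100 banished fields):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch: flag any occurrence of a banished element,
     wherever it appears in the document. -->
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
    <pattern>
        <rule context="SSN | DateOfBirth | InternalRating">
            <report test="true()">Banished element <name/> is present</report>
        </rule>
    </pattern>
</schema>
```

The union in the rule context acts as the "list of excluded fields"; maintaining it means editing one line per banished element.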
Gertone (Geert Bormans), Information Architect, commented:
Here is an example to make the story less vague

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
    <xs:element name="example">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="foo" maxOccurs="unbounded"/>
                <xs:element name="bar" minOccurs="0"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>


The <example> element in this schema can have one or more <foo> elements followed by an optional <bar>. There are no restrictions on the content of the <foo> element.

Here is a valid XML document

<?xml version="1.0" encoding="UTF-8"?>
<example xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <foo>fo content</foo>
    <foo>foo2 content</foo>
    <bar>bar content</bar>
</example>


Now, basically without changing the schema, you want:
- a maximum of one foo
- no bar
- the foo element to contain a "foo" in the string content

Here is the schematron that gives you that. Very clear and flexible, no?

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
    <pattern>
        <rule context="example">
            <assert test="count(foo) = 1">More than one foo is not allowed in example</assert>
            <assert test="count(bar) = 0">Bar is not allowed in example</assert>
        </rule>
        <rule context="foo">
            <assert test="contains(., 'foo')">element foo does not contain the word "foo"</assert>
        </rule>
    </pattern>
</schema>


Note that there is a restriction in this approach. You can use Schematron as a second step only if you do pure restriction. You can't say that you want to wipe out a mandatory element, or in general require anything that renders the XML invalid against the original schema. E.g. you can't say "give me no <foo> and two or more <bar>", since then the two schemata conflict. If that is what you need, then you need to go for redefine and import.

lyteck (Author) commented:
Hi Gertone:

First, let me apologize for not providing enough color in my original question for you/others to provide detailed examples/answers. Second, thank you for taking the initiative to create a simple and clear example, which I should have done in the first place.

Your explanations are quite impressive actually. I do see the power and flexibility of Schematron as you explained it. It wasn't that obvious when I started delving into the specs.

Could you please expand on your last sentence about "redefine and import"? I think 10% of the cases may fall into that category, i.e., foo is required per the schema but should not be included in the XML instances. The other 90% will use pure restriction.

Say <example> contains foo (mandatory, unbounded) and bar (optional), as you have them, plus some other element, tux (mandatory). A valid instance should contain only <tux> inside <example>. Can your example be taken a step further to illustrate this?

In other words, can the overall solution be a mix of "redefine and import" and Schematron?

Many thanks and best regards,
lyteck (Author) commented:
Hi Gertone, I need to scratch my last question completely (if you read it already). As a matter of fact, 100% of the banished elements are about pure restriction (for the simple reason that all the XML instances are compliant with the industry schema to begin with.)

I am going to explore Schematron further and probably end up using it per your expert recommendation. I have one last related question I am writing and will post shortly.
lyteck (Author) commented:
Hi Gertone:

I have one last, quick question on this, as I will need to justify the Schematron choice vs. all the options considered. Can you briefly comment on the processing speed of the different approaches?

Beforehand: would a third option have been to programmatically create validation rules in, e.g., Java or C#, parsing the instance into a DOM tree and checking for the banished elements?

1) use schematron validation as second step (preferred approach, but second fastest?)
2) create a custom, restricted schema that strips the 100 fields (fastest processing I assume)
3) create validation rules (is it slowest?)

That is my last query on this. Many thanks for your expertise and best regards,
Gertone (Geert Bormans), Information Architect, commented:
My opinion on custom code for validation is that you should not do it. Unless speed is really, really important, you should value the flexibility and extensibility of an ISO-standardized schema validation. Here is how Schematron mostly works: the Schematron schema gets compiled into an XSLT, and you can do that compilation beforehand. You could then run the XSLT with a .NET compiled transform. Little efficiency is lost compared to C# DOM coding, in my opinion.

The fastest processing (given you have schema validation) is changing the existing schema... but what if in due time the changed schema gets altered?
That is where redefinition and import come into play (well, xs:redefine is a special form of xs:include), but there are important restrictions: you can redefine named complexTypes, but you can't redefine element definitions...
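A minimal sketch of what such a redefinition could look like, assuming (hypothetically) that the industry schema lives in industry.xsd and defines a named complexType "ExampleType" with a mandatory foo and an optional bar; the restriction drops bar:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch: industry.xsd and ExampleType are placeholder
     names. Inside xs:redefine, the type restricts itself by name. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
    <xs:redefine schemaLocation="industry.xsd">
        <xs:complexType name="ExampleType">
            <xs:complexContent>
                <xs:restriction base="ExampleType">
                    <xs:sequence>
                        <xs:element name="foo" maxOccurs="unbounded"/>
                        <!-- bar omitted: an optional particle may be
                             dropped in a valid restriction -->
                    </xs:sequence>
                </xs:restriction>
            </xs:complexContent>
        </xs:complexType>
    </xs:redefine>
</xs:schema>
```

Documents valid against this redefining schema remain valid against the original, since the change is a pure restriction of the named type.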

It is hard to comment any further on processing speed, since I don't know the details.
I hope you understand that the following are important for correct advice: complexity of the schema, complexity of the restricting rules, size of the XML documents, number of documents to be processed per second...

My advice on speed optimisation, however, is: only optimise when you need to. That means you should set a target processing time upfront, and only start tuning the approach if you don't hit the target.
lyteck (Author) commented:

Thank you for your expertise and educating me on this topic.

Best Regards,
Gertone (Geert Bormans), Information Architect, commented:
welcome, good luck with it