Solved

XML validation against a subset of a schema

Posted on 2012-03-12
Last Modified: 2013-12-13
Hi,

I have XML documents that use a subset (say, 90%) of the elements available in an official industry schema, which cannot be modified. The business requirements for the project exclude about 100 fields from this schema. What is the best way to validate these XML documents to make sure they do not contain any of the 100 "banished" fields/XPaths?

One option (not preferable) would be to create a custom schema that is a subset of the original industry schema without the 100 fields and validate the XML documents against that trimmed down schema. Creating/maintaining a separate schema was ruled out, however, as it would be too complex and not maintainable.

I am then interested in the approach where these are given: 1) XML documents 2) schema 3) list of excluded fields... and a technical solution is needed to validate the XML documents against 2 + 3. Ideally the solution is efficient and maintainable.
[I am thinking some XSLT perhaps that would somehow loop through a list of 100 Xpath expressions (perhaps maintained in an external configuration file) and check for the presence of banished elements, if that makes any sense...]

I would appreciate detailed guidance / insights from a few experts on this. A code snippet to illustrate would be great. Thanks for your expertise.

Best Regards,
Lyteck
Question by:lyteck
LVL 60

Expert Comment

by:Geert Bormans
Hi Lyteck,

Allow me to say that your question is too vague to give detailed examples. I understand the requirement in general however and that is a common use case.
Given the schema and what you actually need, I have two approaches to this problem:
1. Import the schema and redefine some of its definitions. This will work in some cases. It requires a somewhat deeper understanding of the inheritance mechanisms of XML Schema.
The advantage is one single schema.
2. Restrict using Schematron. Schematron is a rule-based schema language. You could validate against the existing schema and restrict your XML further in a Schematron validation. This is what I do in over 80% of the cases where I have your requirement. It adds a second validation step, but gives you full flexibility. It also comes close to your suggestion where there is 1) XML, 2) a schema, and 3) a list of excluded elements (the Schematron schema is such a list, albeit with a specific syntax). Highly efficient, highly flexible and highly maintainable.

Needless to say, learn schematron (in less than two hours) and go down that route:
http://www.schematron.com/
LVL 60

Accepted Solution

by:Geert Bormans (earned 500 total points)
Here is an example to make the story less vague

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
    <xs:element name="example">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="foo" maxOccurs="unbounded"/>
                <xs:element name="bar" minOccurs="0"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>

The <example> element in this schema can have one or more <foo> elements, followed by an optional <bar>. There are no restrictions on the content of the <foo> element.

Here is a valid XML document

<?xml version="1.0" encoding="UTF-8"?>
<example xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="example.xsd">
    <foo>fo content</foo>
    <foo>foo2 content</foo>
    <bar>bar content</bar>
</example>

Now, basically without changing the schema, you want:
- a maximum of one foo
- no bar
- the foo element to contain a "foo" in the string content

Here is the schematron that gives you that. Very clear and flexible, no?

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
    <pattern>
        <rule context="example">
            <assert test="count(foo) &lt;= 1">More than one foo is not allowed in example</assert>
            <assert test="count(bar) = 0">Bar is not allowed in example</assert>
        </rule>
        <rule context="foo">
            <assert test="contains(., 'foo')">element foo does not contain the word "foo"</assert>
        </rule>
    </pattern>
</schema>

Note that there is a restriction in this approach. You can use Schematron as a second step only if you do pure restriction. You can't say that you want to wipe out a mandatory element or, in general, that you want anything that renders the XML invalid against the original schema. E.g., you can't say "give me no <foo> and two or more <bar>", since then the two schemata conflict. If that is what you need, then you need to go for redefine and import.
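
For the original requirement (a list of about 100 banished fields), the same technique could look roughly like the sketch below: one <report> per excluded XPath, all under a single rule on the document element. A <report> fires when its test is true, so each one flags the presence of a banished node. The two paths used here (//bar and a baz child of foo) are illustrative assumptions based on the example schema above, not fields from any real industry schema. If the real documents use a namespace, the paths would also need prefixes declared with Schematron's <ns> element.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
    <pattern id="banished-fields">
        <!-- One rule on the document element; each report is one entry of the exclusion list -->
        <rule context="/*">
            <!-- A report fires when its test is true, i.e. when the banished node exists -->
            <report test="//bar">Banished element found: bar</report>
            <!-- Hypothetical path, for illustration only -->
            <report test="/example/foo/baz">Banished element found: /example/foo/baz</report>
        </rule>
    </pattern>
</schema>
```

With this layout, maintaining the exclusion list comes down to adding or removing one <report> line per field.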

Author Comment

by:lyteck
Hi Gertone:

First, let me apologize for not providing enough color in my original question for you/others to provide detailed examples/answers. Second, thank you for taking the initiative to create a simple and clear example, which I should have done in the first place.

Your explanations are quite impressive actually. I do see the power and flexibility of Schematron as you explained it. It wasn't that obvious when I started delving into the specs.

Could you please expand on your last sentence "redefine and import"? I think 10% of the cases may fall in that category, i.e., foo is required per schema but it should not be included in the XML instances. The other 90% will use pure restriction.

Say <example> contains foo (mandatory, unbounded), bar (optional), as you have them, plus some other variable tux (mandatory). A valid instance should only contain <tux> inside <example>. Can your example be taken a step further to illustrate this?

In other words, can the overall solution be a mix of "redefine and import" and Schematron?

Many thanks and best regards,
Lyteck

Author Comment

by:lyteck
Hi Gertone, I need to scratch my last question completely (if you read it already). As a matter of fact, 100% of the banished elements are about pure restriction (for the simple reason that all the XML instances are compliant with the industry schema to begin with.)

I am going to explore Schematron further and probably end up using it per your expert recommendation. I have one last related question I am writing and will post shortly.

Author Comment

by:lyteck
Hi Gertone:

I have one last, quick question on this, as I will need to justify the Schematron choice vs. all possible options considered. Can you briefly comment on the processing speed of the different approaches?

Beforehand, would a third option have been to programmatically create validation rules in, e.g., Java or C#, with code that parses the instance into a DOM tree and checks for the banished elements?

1) use schematron validation as second step (preferred approach, but second fastest?)
2) create a custom, restricted schema that strips the 100 fields (fastest processing I assume)
3) create validation rules (is it slowest?)

That is my last query on this. Many thanks for your expertise and best regards,
Lyteck
LVL 60

Assisted Solution

by:Geert Bormans (earned 500 total points)
My opinion on custom code for validation is that you should not do it. Unless speed is really, really important, you should value the flexibility and extensibility of using an ISO-standardized schema validation. Here is how Schematron mostly works: the Schematron schema gets compiled into an XSLT. You can do that compilation beforehand. You could then compile the XSLT with .NET's XslCompiledTransform. Little efficiency is lost compared to C# DOM coding, in my opinion.

The fastest processing (given you have schema validation) is changing the existing schema... but what if in due time the changed schema gets altered?
For that, redefinition and import come into play (well, xs:redefine is a special form of xs:include), but there are important restrictions: you can redefine named complexTypes, but you can't redefine element definitions...

It is hard to comment any further on speed of processing, since I don't know the details.
I hope you understand that the following are important for correct advice: complexity of the schema, complexity of the restricting rules, size of the XML documents, number of documents to be processed per second...

My advice on speed optimisation, however, is: only optimise when you need to. That means you need a target processing time set upfront, and you only start tinkering with the approach if you don't hit the target.
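
As a rough illustration of the xs:redefine route mentioned above: the example schema earlier in this thread declares the content of <example> as an anonymous type, so it could not be redefined as-is. The sketch below assumes a hypothetical variant of example.xsd in which that type is named exampleType (an assumption, not part of the original schema) and restricts it to exactly one foo and no bar.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
    <!-- Assumes example.xsd declares <xs:complexType name="exampleType"> (hypothetical name) -->
    <xs:redefine schemaLocation="example.xsd">
        <xs:complexType name="exampleType">
            <xs:complexContent>
                <!-- Inside xs:redefine, a type restricts its own previous definition -->
                <xs:restriction base="exampleType">
                    <xs:sequence>
                        <!-- foo narrowed from 1..unbounded to exactly one -->
                        <xs:element name="foo"/>
                        <!-- the optional bar is simply left out -->
                    </xs:sequence>
                </xs:restriction>
            </xs:complexContent>
        </xs:complexType>
    </xs:redefine>
</xs:schema>
```

Documents would then be validated against this wrapper schema instead of example.xsd directly, and the original file stays untouched.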

Author Closing Comment

by:lyteck
Gertone,

Thank you for your expertise and educating me on this topic.

Best Regards,
Lyteck
LVL 60

Expert Comment

by:Geert Bormans
welcome, good luck with it