Solved

XML validation against a subset of a schema

Posted on 2012-03-12
Last Modified: 2013-12-13
Hi,

I have XML documents that use a subset (say, 90%) of the elements available in an official industry schema, which cannot be modified. The business requirements for the project exclude about 100 fields from this schema. What is the best way to validate these XML documents to make sure they do not contain any of the 100 "banished" fields/XPaths?

One option (not preferable) would be to create a custom schema that is a subset of the original industry schema without the 100 fields and validate the XML documents against that trimmed down schema. Creating/maintaining a separate schema was ruled out, however, as it would be too complex and not maintainable.

I am then interested in the approach where these are given: 1) XML documents 2) schema 3) list of excluded fields... and a technical solution is needed to validate the XML documents against 2 + 3. Ideally the solution is efficient and maintainable.
[I am thinking some XSLT perhaps that would somehow loop through a list of 100 Xpath expressions (perhaps maintained in an external configuration file) and check for the presence of banished elements, if that makes any sense...]
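
The idea in the bracketed note can also be sketched outside XSLT. Below is a minimal Python sketch, assuming the exclusion list is kept as a list of path expressions. The element names (order, legacyCode, internalNote) and the function find_banished are hypothetical and not taken from any real schema. Python's xml.etree.ElementTree supports only a subset of XPath, and namespaced documents would need a prefix mapping, so treat this as an illustration rather than a finished validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical exclusion list; it could just as well be read from an
# external configuration file, one expression per line.
BANISHED_PATHS = [
    "./legacyCode",      # direct child of the root element
    ".//internalNote",   # anywhere below the root element
]

def find_banished(xml_text, banished_paths):
    """Return a (path, match count) pair for each banished path found."""
    root = ET.fromstring(xml_text)
    hits = []
    for path in banished_paths:
        matches = root.findall(path)
        if matches:
            hits.append((path, len(matches)))
    return hits

# Hypothetical instance document that contains one banished element.
SAMPLE = """<order>
    <id>42</id>
    <legacyCode>A1</legacyCode>
</order>"""

if __name__ == "__main__":
    for path, count in find_banished(SAMPLE, BANISHED_PATHS):
        print(f"banished path {path} occurs {count} time(s)")
```

An empty result means none of the listed paths occur in the document. Schema validation against the industry schema would still run as a separate step.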

I would appreciate detailed guidance / insights from a few experts on this. A code snippet to illustrate would be great. Thanks for your expertise.

Best Regards,
Lyteck
Question by:lyteck
8 Comments
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 37708631
Hi Lyteck,

Allow me to say that your question is too vague to give detailed examples. I do understand the requirement in general, however, and it is a common use case.
Given the schema and what you actually need, I see two approaches to this problem:
1. Import the schema and redefine some of its definitions. This will work in some cases, but requires a somewhat deeper understanding of the inheritance mechanisms of XML Schema.
The advantage is one single schema.
2. Restrict using Schematron. Schematron is a rules-based schema language. You validate against the existing schema and restrict your XML further in a Schematron validation. This is what I do in over 80% of the cases where I have your requirement. It adds a second validation step, but gives you full flexibility. It also comes close to your suggestion of 1) XML, 2) schema, and 3) a list of excluded elements (the Schematron schema is such a list, albeit with a specific syntax). Highly efficient, highly flexible and highly maintainable.

Needless to say, my advice is to learn Schematron (it takes less than two hours) and go down that route:
http://www.schematron.com/
 
LVL 60

Accepted Solution

by:
Geert Bormans earned 2000 total points
ID: 37708682
Here is an example to make the story less vague

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
    <xs:element name="example">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="foo" maxOccurs="unbounded"/>
                <xs:element name="bar" minOccurs="0"/>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>



The <example> element in this schema can have one or more <foo> and then an optional <bar>. There are no restrictions on the content of the <foo> element.

Here is a valid XML document

<?xml version="1.0" encoding="UTF-8"?>
<example xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="example.xsd">
    <foo>fo content</foo>
    <foo>foo2 content</foo>
    <bar>bar content</bar>
</example>



Now, basically without changing the schema, you want:
- a maximum of one foo
- no bar
- the foo element to contain a "foo" in the string content

Here is the schematron that gives you that. Very clear and flexible, no?

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
    <pattern>
        <rule context="example">
            <assert test="count(foo) = 1">More than one foo is not allowed in example</assert>
            <assert test="count(bar) = 0">Bar is not allowed in example</assert>
        </rule>
        <rule context="foo">
            <assert test="contains(., 'foo')">element foo does not contain the word "foo"</assert>
        </rule>
    </pattern>
</schema>



Note that there is a restriction in this approach. You can use Schematron as a second step only if you do pure restriction. You can't say that you want to wipe out a mandatory element or, in general, that you want anything that renders the XML invalid against the original schema. E.g., you can't say "give me no <foo> and two or more <bar>", since then the two schemata conflict. If that is what you need, you need to go for redefine and import.
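
For a long exclusion list, rules of the "count(bar) = 0" kind do not have to be typed by hand. Here is a minimal Python sketch, assuming the list is available as (context, element name) pairs. The function build_schematron, the list EXCLUDED and the names baz, order and legacyCode are hypothetical. It groups all asserts for one context into a single rule, because within one pattern a node is only checked against the first rule whose context matches it.

```python
import xml.etree.ElementTree as ET

SCH_NS = "http://purl.oclc.org/dsdl/schematron"

def build_schematron(banished):
    """Build a Schematron schema from (context, element name) pairs.

    Within one pattern a node is only checked against the first rule whose
    context matches it, so all asserts for a context go into a single rule.
    """
    by_context = {}
    for context, name in banished:
        by_context.setdefault(context, []).append(name)

    ET.register_namespace("", SCH_NS)  # serialize with a default namespace
    schema = ET.Element(f"{{{SCH_NS}}}schema")
    pattern = ET.SubElement(schema, f"{{{SCH_NS}}}pattern")
    for context, names in by_context.items():
        rule = ET.SubElement(pattern, f"{{{SCH_NS}}}rule", {"context": context})
        for name in names:
            check = ET.SubElement(
                rule, f"{{{SCH_NS}}}assert", {"test": f"count({name}) = 0"}
            )
            check.text = f"{name} is not allowed in {context}"
    return ET.tostring(schema, encoding="unicode")

# Hypothetical exclusion list: (rule context, banished child element).
EXCLUDED = [("example", "bar"), ("example", "baz"), ("order", "legacyCode")]

if __name__ == "__main__":
    print(build_schematron(EXCLUDED))
```

The generated schema has the same shape as the hand-written one above. Two different context expressions can still match the same node (for instance "example" and "*"), in which case only the first rule applies, so the contexts in the list should be chosen so that they do not overlap.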
 

Author Comment

by:lyteck
ID: 37708753
Hi Gertone:

First, let me apologize for not providing enough color in my original question for you/others to provide detailed examples/answers. Second, thank you for taking the initiative to create a simple and clear example, which I should have done in the first place.

Your explanations are quite impressive actually. I do see the power and flexibility of Schematron as you explained it. It wasn't that obvious when I started delving into the specs.

Could you please expand on your last sentence "redefine and import"? I think 10% of the cases may fall in that category, i.e., foo is required per schema but it should not be included in the XML instances. The other 90% will use pure restriction.

Say <example> contains foo (mandatory, unbounded), bar (optional), as you have them, plus some other variable tux (mandatory). A valid instance should only contain <tux> inside <example>. Can your example be taken a step further to illustrate this?

In other words, can the overall solution be a mix of "redefine and import" and Schematron?

Many thanks and best regards,
Lyteck

 

Author Comment

by:lyteck
ID: 37708926
Hi Gertone, I need to scratch my last question completely (if you read it already). As a matter of fact, 100% of the banished elements are about pure restriction (for the simple reason that all the XML instances are compliant with the industry schema to begin with.)

I am going to explore Schematron further and probably end up using it per your expert recommendation. I have one last related question I am writing and will post shortly.
 

Author Comment

by:lyteck
ID: 37708945
Hi Gertone:

I have one last, quick question on this, as I will need to justify the Schematron choice vs. all possible options considered. Can you briefly comment on the processing speed of the different approaches?

Beforehand, would a third option have been to create validation rules programmatically, e.g. in Java or C#, with code that parses the instance, builds a DOM tree and checks for the banished elements?

1) use schematron validation as second step (preferred approach, but second fastest?)
2) create a custom, restricted schema that strips the 100 fields (fastest processing I assume)
3) create validation rules (is it slowest?)
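
To make option 3 concrete, a hand-coded check could look roughly like the following Python sketch (the same idea carries over to Java or C#). It assumes documents without namespaces and banished elements written as absolute paths from the root. The function banished_in and the element note are hypothetical. The tree is walked once and each element costs one set lookup, whatever the length of the list.

```python
import xml.etree.ElementTree as ET

# Hypothetical banished elements, written as absolute paths from the root.
BANISHED = {"/example/bar", "/example/foo/note"}

def banished_in(xml_text, banished):
    """Walk the tree once and return the banished paths that occur, sorted."""
    found = set()

    def walk(element, parent_path):
        path = f"{parent_path}/{element.tag}"
        if path in banished:
            found.add(path)
        for child in element:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return sorted(found)

if __name__ == "__main__":
    print(banished_in("<example><foo>foo content</foo><bar/></example>", BANISHED))
```

Whether this is actually faster or slower than the other two options is something only a measurement on real documents can tell.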

That is my last query on this. Many thanks for your expertise and best regards,
Lyteck
 
LVL 60

Assisted Solution

by:Geert Bormans
Geert Bormans earned 2000 total points
ID: 37708972
My opinion on custom code for validation is that you should not do it. Unless speed is really, really important, you should value the flexibility and extensibility of using an ISO-standardized schema language. Here is how Schematron mostly works: the Schematron schema gets compiled into an XSLT stylesheet. You can do that compilation beforehand. You could then load the XSLT with the .NET compiled transform (XslCompiledTransform). Little efficiency is lost compared to C# DOM coding, in my opinion.

The fastest processing (given you have schema validation anyway) comes from changing the existing schema... but what if, in due time, the original schema itself gets altered?
That is where redefinition and import come into play (well, xs:redefine is a special form of xs:include), but there are important restrictions: you can redefine named complexTypes, but you can't redefine element definitions...

It is hard to comment any further on processing speed, since I don't know the details.
I hope you understand that the following matter for sound advice: complexity of the schema, complexity of the restricting rules, size of the XML documents, number of documents to be processed per second...

My advice on speed optimisation, however, is: only optimise when you need to. That means you need a target processing time set up front, and you start tinkering with the approach only if you don't hit the target.
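
Checking against such a target can be as simple as the following Python sketch. The function meets_target is hypothetical, and validate stands for whichever validation step is being measured (schema plus Schematron, a trimmed schema, or custom code). The lambda in the demo is only a placeholder that does no real work.

```python
import time

def meets_target(validate, documents, target_seconds):
    """Time validate() over all documents and compare with the target.

    Returns (target met?, elapsed seconds).
    """
    start = time.perf_counter()
    for document in documents:
        validate(document)
    elapsed = time.perf_counter() - start
    return elapsed <= target_seconds, elapsed

if __name__ == "__main__":
    # Placeholder validator and documents; replace with the real ones.
    ok, elapsed = meets_target(lambda doc: None, ["<example/>"] * 1000, 2.0)
    print(f"target met: {ok}, elapsed: {elapsed:.4f} s")
```

Running it on a representative batch of real documents gives the number to compare with the target before deciding whether any optimisation is needed.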
 

Author Closing Comment

by:lyteck
ID: 37708994
Gertone,

Thank you for your expertise and educating me on this topic.

Best Regards,
Lyteck
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 37709041
You're welcome, good luck with it.

Question has a verified solution.