BrentTemple

Inconsistent XML Validation Error in SAXParser

We process several thousand XML documents a day on a web server that receives the documents, parses them, and updates a database. A small percentage of them (an estimated 1 in 5,000) fail XML validation, but when resubmitted unchanged, they pass.

I've got some data from the logs. I think the error is coming from the Apache SAXParser. When the failure occurs, the parser appears to treat as required a tag that the schema does not define as required. (The schema defines more fields than we are currently receiving, but most are minOccurs=0.)

If I have 2 different failures, I can look in a log and see that the xml was in the same format for both failures, but the failure is on a different field.  

This is an excerpt of an error & matching xml:
cvc-complex-type.2.4.a: Invalid content starting with element 'TerminationDate'. The content must match '(("":EmployeeNumber),("":LastName){0-1},("":FirstName){0-1},("":Department){0-1},("":WelderSymbol){0-1},("":EmployeeRate),("":TerminationDate){0-1},("":TelephoneNumber),("":Sex){0-1},("":PhoneExtension){0-1},("":WorkTelephone){0-1},("":TimeReportGroup){0-1},("":EmpExemptInd){0-1},("":PayCycleType){0-1},
...
<EmployeeNumber>203777</EmployeeNumber>
<LastName>SMITH</LastName>
<FirstName>AL</FirstName>
<Department/>
<TerminationDate>        </TerminationDate>
<TelephoneNumber/>
<Sex>M</Sex>
<PhoneExtension/>
<WorkTelephone/>
<PayCycleType>WKLY</PayCycleType>
...
---------------
This is an excerpt of another error & matching xml:
cvc-complex-type.2.4.a: Invalid content starting with element 'PayCycleType'. The content must match ("":EmployeeNumber),("":LastName){0-1},("":FirstName){0-1},("":Department){0-1},("":WelderSymbol){0-1},("":EmployeeRate){0-1},("":TerminationDate){0-1},("":TelephoneNumber){0-1},("":Sex){0-1},("":PhoneExtension){0-1},("":WorkTelephone){0-1},("":TimeReportGroup),("":EmpExemptInd){0-1},("":PayCycleType){0-1}....
...
<EmployeeNumber>203688</EmployeeNumber>
<LastName>JONAS</LastName>
<FirstName>PAUL</FirstName>
<Department/>
<TerminationDate>        </TerminationDate>
<TelephoneNumber/>
<Sex>M</Sex>
<PhoneExtension/>
<WorkTelephone/>
<PayCycleType>WKLY</PayCycleType>
...
From the Schema
<xs:element ref="EmployeeNumber"/>
<xs:element ref="LastName" minOccurs="0"/>
<xs:element ref="FirstName" minOccurs="0"/>
<xs:element ref="Department" minOccurs="0"/>
<xs:element ref="WelderSymbol" minOccurs="0"/>
<xs:element ref="EmployeeRate" minOccurs="0"/>
<xs:element ref="TerminationDate" minOccurs="0"/>
<xs:element ref="TelephoneNumber" minOccurs="0"/>
<xs:element ref="Sex" minOccurs="0"/>
<xs:element ref="PhoneExtension" minOccurs="0"/>
<xs:element ref="WorkTelephone" minOccurs="0"/>
<xs:element ref="TimeReportGroup" minOccurs="0"/>
<xs:element ref="EmpExemptInd" minOccurs="0"/>
<xs:element ref="PayCycleType" minOccurs="0"/>

Neither XML document has a tag for EmployeeRate. The first error message lists it as required, ("":EmployeeRate), while the second doesn't: ("":EmployeeRate){0-1}. The second example failed because of TimeReportGroup, which the second error message shows as required but the first does not.

The xml docs were processed within minutes of each other, with several identical (in form) xml docs processing successfully before, after and in between.   Both were successfully resubmitted and did not get the error.  The schema has not been changed for several months.

The parser is called like this:
        try {
            // Instantiate a parser (the class name is passed as a string)
            XMLReader parser =
                XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");

            // Register the content handler
            parser.setContentHandler(contentHandler);

            // Register the error handler
            parser.setErrorHandler(errorHandler);
            // Turn on validation
            parser.setFeature("http://xml.org/sax/features/validation", true);
            // Turn on XML Schema validation
            parser.setFeature("http://apache.org/xml/features/validation/schema", true);
            // Parse the document; sr is a StringReader
            InputSource is = new InputSource(sr);
            is.setSystemId(systemId);
            parser.parse(is);

This seems completely random to me...   Does anyone know how to stop this error?  

Thanks
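
When failures look random, it can help to record which thread reported each validation error and where in the document it occurred, so failures can be correlated with concurrent requests. The following is a minimal sketch of such an error handler, not the handler used in this thread. The class name ThreadTaggingErrorHandler and the log format are assumptions made for illustration; only the standard org.xml.sax interfaces are used.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical error handler: records every problem together with the
// reporting thread and the document position. Use one instance per parse.
final class ThreadTaggingErrorHandler implements ErrorHandler {
    private final List<String> messages = new ArrayList<>();

    private String describe(String level, SAXParseException e) {
        return level + " [thread=" + Thread.currentThread().getName() + "] "
            + e.getSystemId() + ":" + e.getLineNumber() + ":" + e.getColumnNumber()
            + " " + e.getMessage();
    }

    @Override
    public void warning(SAXParseException e) {
        messages.add(describe("WARNING", e));
    }

    @Override
    public void error(SAXParseException e) throws SAXException {
        // Validation errors (such as cvc-complex-type.2.4.a) arrive here.
        messages.add(describe("ERROR", e));
        throw e;
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        messages.add(describe("FATAL", e));
        throw e;
    }

    List<String> messages() {
        return messages;
    }
}
```

An instance would be passed to parser.setErrorHandler(...) in place of errorHandler, and messages() written to the log after each parse.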
J_Mak

Regarding the elements declared with the 'ref' attribute: is their parent element the <xs:schema> element? I'm just curious. They're probably not, but I want to make sure, because element references cannot be direct children of the <xs:schema> element, like so:

<xs:schema>
    <xs:element ref="EmployeeNumber"/>
    <xs:element ref="LastName" minOccurs="0"/>
    <xs:element ref="FirstName" minOccurs="0"/>
    <xs:element ref="Department" minOccurs="0"/>
    <xs:element ref="WelderSymbol" minOccurs="0"/>
    .........
</xs:schema>

What does it mean by invalid content sharing? Also, how and where are the above elements defined elsewhere in the schema? I realise that they are only being referenced in the above example. Cheers.
BrentTemple (asker)

More detail on how the schema works:  
The header of the Employee Schema includes a 'dictionary' type of schema:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified">
      <xs:include schemaLocation="docs/TheDictionary.xsd"/>
      <xs:element name="EmployeeDoc">
                .... all the refs in my original post reside within this structure.  (I've omitted some of the structure within EmployeeDoc)
      </xs:element>
</xs:schema>

This is the layout of the dictionary schema:
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
...
      <xs:element name="EmployeeNumber">
            <xs:simpleType>
                  <xs:restriction base="xs:string">
                        <xs:maxLength value="12"/>
                  </xs:restriction>
            </xs:simpleType>
      </xs:element>
...
</xs:schema>

I don't see 'sharing', it's 'starting'.  (cvc-complex-type.2.4.a: Invalid content starting with element 'PayCycleType'. ) The error is saying that the tag PayCycleType is out of order, because I didn't provide TimeReportGroup first.   But TimeReportGroup isn't defined as required in the schema, nor is it required when validating 5000 other documents with the same tags.

Thanks
Can I ask what schema element your references are under? That is, what is their parent node? Is it <xs:choice> or <xs:sequence>?

I'm assuming that it is <xs:sequence>, in which case the elements that are present must appear in the declared order, regardless of whether they contain any content. Elements declared with minOccurs="0" may be omitted.

Thanks.
It is <xs:sequence>.   If I switched this for <xs:choice> would it have any other side-effects?  

Thanks
If you switched it to <xs:choice>, only one of the elements could be present. I was just asking for completeness.

I also noticed that you consistently use 'minOccurs=0'. I'm assuming that you can have only ONE FirstName element and ONE LastName element in an EmployeeDoc element, correct? Does that go for the other elements? I ask because, if that's the case, you can try using <xs:all> instead of <xs:sequence>. It allows the elements to be in any order, but each may appear at most once. Here is a link for more information:

http://www.w3schools.com/schema/el_all.asp

I'm not sure what effect that will have. Cheers.
Thanks.  I looked up all, choice and sequence on the site J Mak gave me a link to.  

http://www.w3schools.com/schema/el_all.asp
The all element specifies that the child elements can appear in any order and that each child element can occur zero or one time.  

http://www.w3schools.com/schema/el_choice.asp
The choice element allows only one of the elements contained in the <choice> declaration to be present within the containing element

http://www.w3schools.com/schema/el_sequence.asp
The sequence element specifies that the child elements must appear in a sequence. Each child element can occur from 0 to any number of times.

If <xs:all> has a max limit of one, as it sounds like it does in the definition, it won't work for all our xsds. In EmployeeDoc I think it would.   But we also get this same error in other documents that have more complex structures which include some 'zero to many' or 'one to many' elements/nodes.

<xs:choice> won't work if it limits to only one.
-------------------------------------
To answer J_Mak's question:

Most of the elements allow zero or one.  A few don't have the 'MinOccurs=0' and that makes them required during schema validation.  In the example I'm debugging with (Employee) we don't have any zero to many elements.  But in some of the other documents that get the same 'random' failure, there are elements defined with a "maxOccurs=unbounded" to allow them to exist multiple times.

For example, if I submitted an xml document that was missing the EmployeeNumber (which doesn't have a minOccurs in the schema), I would get the same error that we see 'randomly'.

Here is the error if I submit a document with no EmployeeNumber element:
cvc-complex-type.2.4.a: Invalid content starting with element 'LastName'. The content must match '((EmployeeNumber),("":LastName){0-1},("":FirstName){0-1},("...

And in the text of the error message itself, it has a {0-1} following the elements with a minOccurs, and not after the EmployeeNumber...   If you go back to the original example in my Post above, it skips the {0-1} for an element that IS defined as minOccurs=0, and throws the exception for the next tag present following the missing tag.
...<xs:sequence>
<xs:element ref="EmployeeNumber"/>
<xs:element ref="LastName" minOccurs="0"/>...

My theory based on the weird (dis)appearance of the {0-1} in the error messages, is that occasionally the validator neglects to notice the minOccurs=0 when it is validating the xml.
On each of the failures, I can find an element that:
-- isn't in the xml
-- is prior to the one that the error failed on, (error message says ...starting with element 'LastName')
-- is defined in the error message without {0-1}
-- is defined in the schema as minOccurs=0.  
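
The expected behaviour described above can be checked in isolation. The following is a minimal sketch, assuming a cut-down, hypothetical schema (one required EmployeeNumber followed by optional elements in an xs:sequence) and the JDK's javax.xml.validation API rather than the Xerces SAXParser setup from the original question. The class name SequenceMinOccursCheck is made up for illustration.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class SequenceMinOccursCheck {
    // Hypothetical, cut-down schema: only EmployeeNumber is required.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='EmployeeDoc'><xs:complexType><xs:sequence>"
      + "<xs:element name='EmployeeNumber' type='xs:string'/>"
      + "<xs:element name='LastName' type='xs:string' minOccurs='0'/>"
      + "<xs:element name='EmployeeRate' type='xs:string' minOccurs='0'/>"
      + "<xs:element name='PayCycleType' type='xs:string' minOccurs='0'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Returns null when the document is valid, otherwise the validator's message. */
    static String validate(String xml) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        try {
            // With no ErrorHandler set, the validator throws on the first error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        // Optional elements omitted, order preserved: should validate.
        System.out.println(validate(
            "<EmployeeDoc><EmployeeNumber>203777</EmployeeNumber>"
          + "<PayCycleType>WKLY</PayCycleType></EmployeeDoc>"));
        // Required EmployeeNumber missing: should be rejected.
        System.out.println(validate(
            "<EmployeeDoc><LastName>SMITH</LastName></EmployeeDoc>"));
    }
}
```

If a document that omits only minOccurs="0" elements is rejected intermittently in production but never by a single-threaded check like this one, that points away from the schema and towards the environment in which the parser runs.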

Thanks


J_Mak;

Thanks for trying to help. I'm still getting the error, but after quite a bit of searching through the Apache user forum, I've found a few users who claim that the SAXParser class isn't 100% thread safe. I'm going to try to synchronize the code that calls it and see if that solves the problem. I'm guessing that something in the SAXParser class crosses wires when more than one thread is using it at the same time.

-Brent
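
A thread-safety suspicion like this can be probed with a small stress test that validates the same document from many threads at once and counts the failures. The following is a minimal sketch under stated assumptions: it uses the JDK's JAXP parser with SAXParserFactory.setSchema rather than the org.apache.xerces.parsers.SAXParser class and feature flags from the original question, and the schema, class name, and method names are hypothetical. To reproduce the reported problem, the production parser setup and schema would have to be substituted.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class ConcurrentValidationCheck {
    // Hypothetical, cut-down schema standing in for the real EmployeeDoc schema.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='EmployeeDoc'><xs:complexType><xs:sequence>"
      + "<xs:element name='EmployeeNumber' type='xs:string'/>"
      + "<xs:element name='EmployeeRate' type='xs:string' minOccurs='0'/>"
      + "<xs:element name='PayCycleType' type='xs:string' minOccurs='0'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Validates xml (threads * parsesPerThread) times and returns the failure count. */
    static int countFailures(String xml, int threads, int parsesPerThread) throws Exception {
        // One compiled Schema is shared; JAXP documents Schema as thread-safe.
        final Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                Callable<Integer> task = () -> {
                    int failures = 0;
                    for (int i = 0; i < parsesPerThread; i++) {
                        if (!isValid(schema, xml)) {
                            failures++;
                        }
                    }
                    return failures;
                };
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    // A new factory and parser per document, mirroring the per-call parser
    // creation in the original question.
    static boolean isValid(Schema schema, String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setSchema(schema);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) throws SAXException {
                        throw e; // treat validation errors as failures
                    }
                });
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
```

Running countFailures with a document known to be valid should return zero. A non-zero count that disappears when the thread count is set to 1 would support the thread-safety theory.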
I've had the following mod in Production for 3 weeks and have not seen the error:

Added a new method:
   private static synchronized void parseIt(XMLReader parser, InputSource is) throws IOException, SAXException {
       parser.parse(is);
   }

Changed
parser.parse(is);
in the existing method, (see original question) to
parseIt(parser, is);
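
The wrapper above serializes only the parse call itself; the parser is still created and configured outside the lock. If the shared state turned out to be touched during parser creation as well (the thread does not establish this), a variant could hold one lock around creation, configuration, and parsing. The following is a minimal sketch of that variant, assuming the JDK's JAXP parser in place of the Xerces class name from the question. The class name SerializedParse, the method parseSerialized, and the dedicated lock object are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

final class SerializedParse {
    // Dedicated lock so unrelated synchronized methods on this class
    // do not contend with parsing.
    private static final Object PARSE_LOCK = new Object();

    /** Creates, configures, and runs a parser while holding a single global lock. */
    static void parseSerialized(String xml, ContentHandler handler)
            throws IOException, SAXException, ParserConfigurationException {
        synchronized (PARSE_LOCK) {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        }
    }
}
```

Either form means only one document is parsed at a time in the JVM. At a few thousand documents a day that is unlikely to matter, but it would become a bottleneck under heavier load.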

-Brent