The entity name must immediately follow the '&' in the entity reference.

Hi,

I am reading XML in Java using XPath.

When reading, I get an exception for some of the XML files that have an & in them.

An example is attached.

I can read all other XML files without problems, but some throw the exception below:

2009-02-23 14:24:57,609 [main] ERROR com.pbms.businessLogic.ReadXML  - Problem in reading file: C:\REGISTRATION-0000000410.xml
org.xml.sax.SAXParseException: The entity name must immediately follow the '&' in the entity reference.
      at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:239)
      at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:283)
      at com.pbms.businessLogic.ReadXML.getSingleRegistrationRequest(ReadXML.java:154)
      at com.pbms.businessLogic.BaseBL.processRegistrationRequestXMLs(BaseBL.java:78)
      at mccprocesswatcher.Main.readRegistrationXMLs(Main.java:53)
      at mccprocesswatcher.Main.main(Main.java:40)


Many thanks for all your cooperation.

Here is my code
 
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
factory.setIgnoringElementContentWhitespace(true);
 
DocumentBuilder builder = factory.newDocumentBuilder();
 
InputStream in = null;
in = new FileInputStream(sFilePath);        
 
Document document = builder.parse(new InputSource(in));
document.getDocumentElement().normalize();
 
XPath xpath = XPathFactory.newInstance().newXPath();
String strWebsite = (String) xpath.evaluate("//RegistrationRequest/OptInLink/text()",document,XPathConstants.STRING);
 
 
XML below:
 
<?xml version="1.0" encoding="UTF-8"?>
<RegistrationRequest>
	<ID>305</ID>
	<Timestamp>2008-05-05 05:57:05</Timestamp>
	<Website>B2b</Website>
	<Customer>
		<Title>0</Title>
		<LastName>Adrian</LastName>
		<FirstName>Gerd</FirstName>
		<CompanyName>ABC & Co. KG</CompanyName>
		<City>München</City>
		<Postcode>80335</Postcode>
		<Country>Germany</Country>
		<EmailAddress>abc.def@myemail.de</EmailAddress>
		<Address1>Hello. 33</Address1>
		<Address2></Address2>
		<MobileNumber></MobileNumber>
		<LandLineNumber></LandLineNumber>
		<Type>Institutional</Type>
		<RegistrationStatus>{STATUS}</RegistrationStatus>
		<Language></Language>
	</Customer>
	<FutureProducts>
		<Email>No</Email>
		<Post>No</Post>
		<Sms>No</Sms>
	</FutureProducts>
</RegistrationRequest>


tia_kamakshiAsked:
 
abelCommented:
You will find that opinions vary, but because XML is so well specified, it is easy to check the validity of opinions on this subject. Bottom line: there are only two characters that may not appear literally in text: the & (ampersand) and the < (less-than) sign. It is a common misunderstanding that > (greater-than) and " (quote) are not allowed (a quote only needs escaping inside an attribute value that is delimited by the same quote character).

Outside CDATA sections and comments, the less-than sign can only appear as the start of a tag. The ampersand can only appear as the beginning of a reference: &amp;, &#1234;, &#xAB; or &namedentity; (named entities must be declared in the DOCTYPE declaration). The only five predefined entities are &amp; &gt; &lt; &apos; and &quot;.
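For anyone producing the XML, the rule above comes down to a very small escaping routine. This is only a sketch: the class name XmlText and the method escape are made up for illustration, and it handles element text only, not attribute values.

```java
final class XmlText {
    private XmlText() {}

    /** Escapes the characters that may not appear literally in element text. */
    static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':
                    out.append("&amp;");
                    break;
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    // Not strictly required, but avoids an accidental "]]>" in text.
                    out.append("&gt;");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this, XmlText.escape("ABC & Co. KG") gives the text that belongs inside the CompanyName element.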

Now, there's more needed before you have legal XML. Another very common mistake is having illegal characters in the document. For instance, the control characters 01h, 02h etc. are not allowed (in XML 1.1 they are allowed as numeric character references, as in &#x01;). The NULL character is never allowed, not even in XML 1.1.
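If such characters do turn up in incoming data, one possible cleanup is to drop everything outside the character range that XML 1.0 permits. The sketch below assumes that silently dropping the characters is acceptable for your data; XmlChars, isLegalXml10 and stripIllegal are illustrative names, not part of any library.

```java
final class XmlChars {
    private XmlChars() {}

    /** True if the code point is allowed by the Char production of XML 1.0. */
    static boolean isLegalXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input without characters that XML 1.0 forbids. */
    static String stripIllegal(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isLegalXml10(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```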

Another very common mistake is having unescaped characters that do not belong to the specified encoding. For instance, you put an é (e with acute accent) inside an XML document that declares US-ASCII as its encoding in the header. Bottom line here: the easiest way is to always use one of the Unicode encodings, UTF-8 or UTF-16. Any XML-compliant parser/processor MUST understand these, so it is easy to ask for them. Then you can have any unusual character just as it is (in its UTF-8 byte sequence) in the document. The editor, parser etc. will take care of that for you.
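A quick way to see whether a piece of text fits the encoding a document declares is to ask the charset's encoder. This is a minimal sketch with a hypothetical helper name (fitsEncoding); it assumes the charset supports encoding, which is true for US-ASCII, UTF-8 and UTF-16.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingCheck {
    private EncodingCheck() {}

    /** True if every character of the text can be represented in the given charset. */
    static boolean fitsEncoding(String text, Charset declared) {
        return declared.newEncoder().canEncode(text);
    }

    /** Example from the question's data: the city name contains a u-umlaut. */
    static boolean cityFitsAscii() {
        return fitsEncoding("M\u00fcnchen", StandardCharsets.US_ASCII);
    }
}
```

A value that fails this check for the declared encoding has to be written as a numeric character reference, or the document has to declare an encoding such as UTF-8 instead.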

Hope this helps. Feel free to ask further if you need more assistance. If you want me (or someone from here) to contact your supplier of data, I'll be happy to do so on your behalf.

Regards,
-- Abel --
 
CEHJCommented:
Sounds like you have malformed xml. & should be &amp;
 
abelCommented:
Change

<CompanyName>ABC & Co. KG</CompanyName>

into

<CompanyName>ABC &amp; Co. KG</CompanyName>

If you got this XML from an external party, you can tell them that what they gave you is not well-formed XML, and being well-formed is mandatory for working with XML, period.
 
tia_kamakshiAuthor Commented:
Hi,

Thanks for your response

I am sorry I cannot change

<CompanyName>ABC & Co. KG</CompanyName>

into

<CompanyName>ABC &amp; Co. KG</CompanyName>

These XML files come from an external source that we have no control over.

So we need to fix them and extract the records on our own.

Please guide me on how to get these values.

Many thanks for your cooperation.
 
abelCommented:
I always hated it as a programmer when a company said "you can interface with us because we use official standards like XML" only to find out later that they have no idea what they talk about.... Sorry, couldn't resist this flaming ;)

Doing something about this is a bit tricky, but possible. The problem lies in the places where the entity reference (which starts with an ampersand) is actually correct and you do not want to escape it. What I would do, before you read the file in as XML, is to process it as plain text and replace with &amp; every ampersand that is not followed by a series of non-space characters and a semicolon.

This will remove 95% of your problem. But a value like the following will not be corrected:

<CompanyName>ABC &Co;. KG</CompanyName>

Working with XML that is not XML is an inexact science and has many traps. In XML newsgroups this discussion is often held, ending with something like "if XML does not conform, it is just a string and you don't know what you're dealing with; anything can happen".

The best advice from about a decade working with XML that I can give you is: make sure that you or your company arranges for the correct liability clauses about this so that the sending party knows that they are responsible for mistakes in your workarounds and that they know, in advance, that their XML is not XML.
 
abelCommented:
In a line-by-line loop, do something like this:

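// needs: import java.util.regex.Pattern;
// Matches a complete reference such as &amp; or &#169;
Pattern pattern = Pattern.compile("&#?[a-zA-Z0-9]+;");
String line = ....;
if (line.contains("&"))
{
    if (!pattern.matcher(line).find())
    {
        // no complete reference on this line: escape every ampersand
        line = line.replace("&", "&amp;");
    }
}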
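Note that the line-based check skips a line entirely as soon as it contains one complete reference, even if the same line also has a bare ampersand. A variant that decides per ampersand can be sketched with a negative lookahead. Treat this as a workaround under assumptions, not a real fix: the names AmpersandFixer, escapeBareAmpersands and parseLenient are made up for illustration, the whole file is assumed to fit in a String, and an ampersand inside a CDATA section or comment would be escaped as well, which may not be what you want. A value like &Co; still slips through, as described above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class AmpersandFixer {
    // An '&' that is NOT followed by "name;", "#digits;" or "#xhex;".
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
            "&(?!(?:[A-Za-z_:][A-Za-z0-9_.:-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    private AmpersandFixer() {}

    /** Replaces every ampersand that does not start a reference with &amp;. */
    static String escapeBareAmpersands(String xml) {
        return BARE_AMPERSAND.matcher(xml).replaceAll("&amp;");
    }

    /** Repairs the text first, then hands it to the normal DOM parser. */
    static Document parseLenient(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(escapeBareAmpersands(xml))));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Other well-formedness problems are not repaired by this sketch.
            throw new IllegalArgumentException("Still not well-formed after repair", e);
        }
    }
}
```

The resulting Document can be passed to xpath.evaluate exactly as in the original code.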
 
CEHJCommented:
You'll probably find this clears up most of the errors:
s = s.replaceAll("& ", "&amp; ");

 
CEHJCommented:
>>As these XMLs are comming from some external source, to which we cannot do anything

You'll have to, I'm afraid, or you won't be able to parse it. The best solution would probably be to use a FilterReader that plugs into your pipeline.
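A rough sketch of what such a FilterReader could look like follows. Everything here is an assumption for illustration: the class name AmpersandEscapingReader, the helper filterToString, and the 32-character limit on the lookahead are not from any library. It escapes an ampersand unless it starts something shaped like a reference, so an undeclared name such as &Co; is still passed through unchanged.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Pattern;

final class AmpersandEscapingReader extends FilterReader {
    private static final int MAX_NAME = 32; // assumed upper bound on a reference body
    private static final Pattern REFERENCE_BODY =
            Pattern.compile("[A-Za-z_:][A-Za-z0-9_.:-]*|#[0-9]+|#x[0-9A-Fa-f]+");

    private final PushbackReader source;
    private final StringBuilder pending = new StringBuilder(); // chars still to hand out
    private int pendingPos = 0;

    AmpersandEscapingReader(Reader in) {
        super(new PushbackReader(in, 1));
        this.source = (PushbackReader) this.in;
    }

    @Override
    public int read() throws IOException {
        if (pendingPos < pending.length()) {
            return pending.charAt(pendingPos++);
        }
        int c = source.read();
        if (c != '&') {
            return c; // ordinary character, or -1 at end of stream
        }
        pending.setLength(0);
        pendingPos = 0;
        // Collect the characters that could form the body of a reference.
        StringBuilder body = new StringBuilder();
        int next;
        while ((next = source.read()) != -1 && isBodyChar(next) && body.length() < MAX_NAME) {
            body.append((char) next);
        }
        if (next == ';' && REFERENCE_BODY.matcher(body).matches()) {
            pending.append(body).append(';'); // well-shaped reference: pass through
        } else {
            pending.append("amp;").append(body); // bare ampersand: emit &amp;
            if (next != -1) {
                source.unread(next); // re-examine the terminator on the next read
            }
        }
        return '&';
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            buf[off + n++] = (char) c;
        }
        return n == 0 ? -1 : n;
    }

    private static boolean isBodyChar(int c) {
        return Character.isLetterOrDigit(c) || c == '#' || c == '_' || c == '.' || c == ':' || c == '-';
    }

    /** Convenience for trying the filter on a String. */
    static String filterToString(String xml) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new AmpersandEscapingReader(new StringReader(xml))) {
            int c;
            while ((c = r.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

To plug it into the code from the question, wrap the file in an InputStreamReader with an explicit charset (UTF-8 for the sample, as its XML declaration says) and pass new InputSource(new AmpersandEscapingReader(reader)) to builder.parse. Be aware that once you hand the parser a Reader, it no longer detects the encoding from the XML declaration itself.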
 
objectsCommented:
They are sending you invalid XML. I'd suggest contacting them and letting them know so they can fix it at their end.

 
objectsCommented:
If they need assistance resolving the problem feel free to recommend my services :)

 
tia_kamakshiAuthor Commented:
Thanks all.

I will come back to you on this.

Can you please let me know which characters must not be present in an XML file and, if they exist, what we should replace them with?

Many thanks for all your cooperation.
 
tia_kamakshiAuthor Commented:
Many, many thanks for the great description.

Great help.

Thanks a lot.
 
abelCommented:
You're welcome :)
Question has a verified solution.
