tia_kamakshi asked:

The entity name must immediately follow the '&' in the entity reference.

Hi,

I am reading XML in Java using XPath.

When reading the XML, I get an exception for some of the files that have & in them.

An example is attached.

I can read all the other XML files perfectly, but some throw the exception below.

2009-02-23 14:24:57,609 [main] ERROR com.pbms.businessLogic.ReadXML  - Problem in reading file: C:\REGISTRATION-0000000410.xml
org.xml.sax.SAXParseException: The entity name must immediately follow the '&' in the entity reference.
      at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:239)
      at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:283)
      at com.pbms.businessLogic.ReadXML.getSingleRegistrationRequest(ReadXML.java:154)
      at com.pbms.businessLogic.BaseBL.processRegistrationRequestXMLs(BaseBL.java:78)
      at mccprocesswatcher.Main.readRegistrationXMLs(Main.java:53)
      at mccprocesswatcher.Main.main(Main.java:40)


Many thanks for all your co-operation.

Here is my code
 
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setValidating(false);
factory.setIgnoringElementContentWhitespace(true);
 
DocumentBuilder builder = factory.newDocumentBuilder();
 
InputStream in = null;
in = new FileInputStream(sFilePath);        
 
Document document = builder.parse(new InputSource(in));
document.getDocumentElement().normalize();
 
XPath xpath = XPathFactory.newInstance().newXPath();
String strWebsite = (String) xpath.evaluate("//RegistrationRequest/OptInLink/text()",document,XPathConstants.STRING);
 
 
XML below:
 
<?xml version="1.0" encoding="UTF-8"?>
<RegistrationRequest>
	<ID>305</ID>
	<Timestamp>2008-05-05 05:57:05</Timestamp>
	<Website>B2b</Website>
	<Customer>
		<Title>0</Title>
		<LastName>Adrian</LastName>
		<FirstName>Gerd</FirstName>
		<CompanyName>ABC & Co. KG</CompanyName>
		<City>München</City>
		<Postcode>80335</Postcode>
		<Country>Germany</Country>
		<EmailAddress>abc.def@myemail.de</EmailAddress>
		<Address1>Hello. 33</Address1>
		<Address2></Address2>
		<MobileNumber></MobileNumber>
		<LandLineNumber></LandLineNumber>
		<Type>Institutional</Type>
		<RegistrationStatus>{STATUS}</RegistrationStatus>
		<Language></Language>
	</Customer>
	<FutureProducts>
		<Email>No</Email>
		<Post>No</Post>
		<Sms>No</Sms>
	</FutureProducts>
</RegistrationRequest>


CEHJ

Sounds like you have malformed xml. & should be &amp;
Change

<CompanyName>ABC & Co. KG</CompanyName>

into

<CompanyName>ABC &amp; Co. KG</CompanyName>

If you got this XML from an external party, you can tell them that they did not give you well-formed XML, which is mandatory for working with XML, period.
tia_kamakshi (ASKER)

Hi,

Thanks for your response

I am sorry I cannot change

<CompanyName>ABC & Co. KG</CompanyName>

into

<CompanyName>ABC &amp; Co. KG</CompanyName>

As these XMLs are coming from some external source, to which we cannot do anything.

So we need to fix this and read the records on our own.

Please guide me to get these values

Many Thanks for your co-operation
I always hated it as a programmer when a company said "you can interface with us because we use official standards like XML", only to find out later that they had no idea what they were talking about... Sorry, couldn't resist this flaming ;)

To do something about this is a bit tricky, but possible. The problem lies in the places where an entity (which starts with an ampersand) is actually correct, and you do not want to escape those. What I would do, before you read the file in as XML, is to process it as plain text and replace with &amp; every ampersand that is not followed by a series of non-space characters and a semicolon.

This will remove 95% of your problem. But a value like the following will not be corrected:

<CompanyName>ABC &Co;. KG</CompanyName>
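
As a rough sketch of the plain-text replacement described above, the following Java uses a regular expression with a negative lookahead, so an ampersand is escaped only when no entity name and semicolon follow it. The class name BareAmpersandFixer and its method names are made up for illustration, and the pattern only approximates XML's name syntax. Note that it keeps the limitation just mentioned: "&Co;" looks like an entity reference and is left alone.

```java
import java.io.StringReader;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original thread.
public class BareAmpersandFixer {
    // Matches an '&' that does NOT start a named entity (&amp;)
    // or a character reference (&#38; or &#x26;).
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
            "&(?!(?:[A-Za-z_:][A-Za-z0-9_:.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Escapes every bare ampersand; existing entity references are kept as they are.
    public static String escapeBareAmpersands(String text) {
        return BARE_AMPERSAND.matcher(text).replaceAll("&amp;");
    }

    // Repairs the raw text first, then parses it and reads the company name.
    public static String companyName(String rawXml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document = builder.parse(
                new InputSource(new StringReader(escapeBareAmpersands(rawXml))));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("string(//CompanyName)", document);
    }

    public static void main(String[] args) throws Exception {
        String raw = "<Customer><CompanyName>ABC & Co. KG</CompanyName></Customer>";
        System.out.println(escapeBareAmpersands(raw));
        System.out.println(companyName(raw));
    }
}
```

For a real file you would read the whole content into a String using the encoding named in the XML declaration (UTF-8 in the sample), run it through escapeBareAmpersands, and only then hand it to the parser.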

Working with XML that is not XML is an inexact science and has many traps. In XML newsgroups this discussion comes up often, usually ending with something like "if the XML does not conform, it is just a string and you don't know where you stand, anything can happen".

The best advice from about a decade working with XML that I can give you is: make sure that you or your company arranges for the correct liability clauses about this so that the sending party knows that they are responsible for mistakes in your workarounds and that they know, in advance, that their XML is not XML.
In a line-by-line loop, do something like this:

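// needs: import java.util.regex.Pattern;
Pattern pattern = Pattern.compile("&[a-zA-Z0-9]+;");
String line = ....;
if (line.contains("&"))
{
    if (!pattern.matcher(line).find())
    {
        // No entity reference on this line, so every ampersand is a bare one.
        // (A line that mixes a real entity with a bare ampersand is left untouched.)
        line = line.replace("&", "&amp;");
    }
}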
You'll probably find this clears up most of the errors:
s = s.replaceAll("& ", "&amp; ");


>>As these XMLs are comming from some external source, to which we cannot do anything

You'll have to, I'm afraid, or you won't be able to parse it. The best solution would probably be to use a FilterReader that plugs into your pipeline.
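
A minimal sketch of what such a FilterReader could look like, assuming a bare ampersand is one that is not followed by a name and a semicolon. The class name AmpersandEscapingReader, the 32-character lookahead limit, and the readAll helper are assumptions made for illustration. It wraps the source in a PushbackReader so it can peek ahead after each '&' and put the characters back.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical class: escapes '&' characters that do not begin an entity reference.
public class AmpersandEscapingReader extends FilterReader {
    // Assumed upper bound on the length of an entity name such as "amp" or "#x26".
    private static final int MAX_NAME_LENGTH = 32;

    private final PushbackReader source;
    private String pending = "";   // text still to be emitted before reading on
    private int pendingPos = 0;

    public AmpersandEscapingReader(Reader in) {
        this(new PushbackReader(in, MAX_NAME_LENGTH + 1));
    }

    private AmpersandEscapingReader(PushbackReader in) {
        super(in);
        this.source = in;
    }

    @Override
    public int read() throws IOException {
        if (pendingPos < pending.length()) {
            return pending.charAt(pendingPos++);
        }
        int c = source.read();
        if (c == '&' && !entityReferenceFollows()) {
            pending = "amp;";      // the '&' is returned now, "amp;" on the next calls
            pendingPos = 0;
        }
        return c;
    }

    // Peeks ahead for "name;" and pushes everything it read back onto the stream.
    private boolean entityReferenceFollows() throws IOException {
        char[] seen = new char[MAX_NAME_LENGTH + 1];
        int n = 0;
        boolean found = false;
        while (n < seen.length) {
            int c = source.read();
            if (c < 0) break;
            seen[n++] = (char) c;
            if (c == ';') { found = n > 1; break; }
            if (!Character.isLetterOrDigit(c) && "#_:.-".indexOf(c) < 0) break;
        }
        source.unread(seen, 0, n);
        return found;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int count = 0;
        while (count < len) {
            int c = read();
            if (c < 0) break;
            buf[off + count++] = (char) c;
        }
        return (count == 0 && len > 0) ? -1 : count;
    }

    // Small helper for the demo: drains a reader into a String.
    public static String readAll(Reader reader) throws IOException {
        StringBuilder out = new StringBuilder();
        for (int c = reader.read(); c >= 0; c = reader.read()) out.append((char) c);
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        Reader fixed = new AmpersandEscapingReader(
                new StringReader("<CompanyName>ABC & Co. KG</CompanyName>"));
        System.out.println(readAll(fixed));
    }
}
```

In the asker's code this would sit between the file and the parser, for example by passing new InputSource(new AmpersandEscapingReader(new InputStreamReader(in, "UTF-8"))) to builder.parse. Because a Reader is used instead of an InputStream, the encoding has to be chosen explicitly; UTF-8 here matches the declaration in the sample file.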
They are sending you XML that is not well-formed. I'd suggest contacting them and letting them know so they can fix it at their end.

If they need assistance resolving the problem feel free to recommend my services :)

Thanks all.

I will come back to you on this.

Can you please let me know which characters should not be present in the XML file and, if they exist, what we should replace them with?

Many Thanks for all your co-operation
ASKER CERTIFIED SOLUTION
abel

This solution is only available to members of Experts Exchange.

Many, many thanks for the great description.

Great help.

Thanks a lot.
You're welcome :)