XML encoding in Java

Hi,
I am parsing an XML file, and the parse crashes when it encounters accented characters like Ü.
The code is:
      DocumentBuilder builder = dbf.newDocumentBuilder();
      InputSource is = new InputSource(xmlFile);
      String encoding = is.getEncoding(); //returns null
      is.setEncoding("UTF-8"); //does not work.
      Document doc = builder.parse(is);

      // Invoke the abstract method to parse
      parse(doc); //SAX Exception (org.xml.sax.SAXParseException: Invalid byte 1 of 1-byte UTF-8 sequence.)


The XML:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated by Oracle Reports version 10.1.2.3.0 -->
<Address>
    <Customer_ID>00985601</Customer_ID>
    <Customer_NAME>Gemütlichkeit LLC</Customer_NAME>
    <Customer_TYPE>Customer</Customer_TYPE>
    <TAX_ID></TAX_ID>
    <ADDRESS_TYPE>PRIMARY</ADDRESS_TYPE>
    <ADDRESS1>1104 ESPLANADE #107</ADDRESS1>
    <ADDRESS2></ADDRESS2>
    <ADDRESS3></ADDRESS3>
    <CITY>REDONDO BEACH</CITY>
    <STATE>CA</STATE>
    <ZIPCODE>90277</ZIPCODE>
    <FOREIGN_COUNTRY>N</FOREIGN_COUNTRY>
    <LAST_UPDATE_DATE>10-SEP-15</LAST_UPDATE_DATE>
    <NEW_UPDATE>NEW</NEW_UPDATE>
 </Address>

How do I fix this? I am limited to JDK 1.5.
VakilsDeveloperAsked:
mccarlIT Business Systems Analyst / Software DeveloperCommented:
In your code above, what is xmlFile (ie. what is its type and value)?
VakilsDeveloperAuthor Commented:
It is the fully qualified name of the XML file to be loaded. Its contents are shown above.
gurpsbassiCommented:
It would help if we can see the TYPE of xmlFile i.e. the point at which it is declared.
VakilsDeveloperAuthor Commented:
Type = String
value = //itcom33/xmlFiles/201509140600//ADDRESSES.xml  (path)
Constructor used:
InputSource(String systemId)           Create a new input source with a system identifier
If I try to load the file in Visual Studio as XML (instead of aspx), I get the following:
[image: Visual Studio error message]
VakilsDeveloperAuthor Commented:
@gurpsbassi: I am not sure what you mean. I am reading XML. Is the info above what you want?
gurpsbassiCommented:
Try creating an InputSource from an InputStream instead.
InputSource inputSource = new InputSource(myInputStream);

Then use
InputStreamReader(InputStream in, Charset cs), where you can specify UTF-8.
VakilsDeveloperAuthor Commented:
Are you suggesting something like this:

            InputStream input = new FileInputStream("V:\\ADDRESSES.xml");

            InputStreamReader reader = new InputStreamReader(input, "UTF-8"); // decode the bytes as UTF-8
            InputSource inputSource = new InputSource(reader);
            inputSource.setEncoding("UTF-8");

and then use the DocumentBuilder to parse?
gurpsbassiCommented:
Yes.
You may not need the line inputSource.setEncoding("UTF-8");
but try it with and without.
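
For anyone who wants the whole round trip in one place, here is a minimal sketch of parsing through a Reader. The class and method names (ReaderParseSketch, parse, rootText) are made up for illustration and the sample XML is built in memory rather than read from the poster's file. It assumes the caller already knows which charset the bytes are really in.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not taken from the thread.
class ReaderParseSketch {

    /** Parses XML from a byte stream, decoding it with the given charset name. */
    static Document parse(InputStream in, String charsetName) throws Exception {
        Reader reader = new InputStreamReader(in, charsetName);
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The parser receives characters, not bytes, so the Reader's charset
            // decides the decoding rather than the encoding named in the XML declaration.
            return builder.parse(new InputSource(reader));
        } finally {
            reader.close();
        }
    }

    /** Convenience wrapper that returns the text of the root element. */
    static String rootText(byte[] xml, String charsetName) {
        try {
            Document doc = parse(new ByteArrayInputStream(xml), charsetName);
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Declaration claims UTF-8, but the bytes are deliberately ISO-8859-1.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<Customer_NAME>Gem\u00FCtlichkeit LLC</Customer_NAME>";
        System.out.println(rootText(xml.getBytes("ISO-8859-1"), "ISO-8859-1"));
    }
}
```

With a real file, a FileInputStream would take the place of the ByteArrayInputStream.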
VakilsDeveloperAuthor Commented:
Thanks! That works! The code no longer crashes while reading the XML. However, as the XML contains a non-UTF-8 character, <Customer_NAME>Gemütlichkeit LLC</Customer_NAME> gets substituted as:
Gem�lichkeit LLC. That ends up in the database as Gem¿tlichkeit LLC.
It seems better, then, to send the file back to the vendor to fix. So it is OK for the code to crash (then we would know why) and report the exception. Is there a way to check while parsing and report the offending line?
Then we can report the error location.
gurpsbassiCommented:
http://www.fileformat.info/info/charset/UTF-8/list.htm

However, as the xml contains non UTF-8 character
I'm pretty sure Ü can be encoded in UTF-8.

<Customer_NAME>Gemütlichkeit LLC</Customer_NAME>, it gets substituted as :
Gem�lichkeit LLC. That ends up in database as Gem¿tlichkeit LLC.
It is then seems better to send it back to vendor to fix it. Hence it is OK code crash  ( then we would know why) and report the exception. Is there a way to check while parsing and report the offending line?
Then we can  report the error location.

No idea what you are saying here.

What is substituted, and where? I can only see your code doing a parse, nothing more.

Is your database set up as UTF-8? Whatever SQL editor you use to view the data should be enabled for UTF-8 too.
mccarlIT Business Systems Analyst / Software DeveloperCommented:
Either your input file is NOT in UTF-8 or it is corrupt somehow. Can you post the XML file as an attachment? When you copy and paste the contents into your question above, that resolves whatever issue there is, i.e. the above copy/paste of the file differs from the actual file contents even though it looks the same.

I copied and pasted your XML above and I can parse it fine without any errors at all. You can even try that yourself: take a copy of the above, paste it into a NEW xml file, and then set your code to parse that new file. You should not see any errors.

Finally, yes, we should be able to get some code that locates any encoding errors in the input file, but it would be easier to come up with something if you can post the offending file, then I can test it out.
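
As a rough idea of what such a locator could look like, here is a sketch that reports the first line of a file that is not valid UTF-8. Utf8LineChecker and firstInvalidLine are hypothetical names, and it assumes the file is small enough to read into memory. It decodes each line strictly with a CharsetDecoder set to CodingErrorAction.REPORT, because a plain InputStreamReader silently substitutes bad bytes instead of failing.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class, not taken from the thread.
class Utf8LineChecker {

    /** Returns the 1-based number of the first line that is not valid UTF-8, or -1 if every line decodes. */
    static int firstInvalidLine(byte[] data) {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        int lineNumber = 1;
        int lineStart = 0;
        for (int i = 0; i <= data.length; i++) {
            // The byte 0x0A never appears inside a multi-byte UTF-8 sequence,
            // so splitting the raw bytes on it cannot cut a character in half.
            if (i == data.length || data[i] == '\n') {
                try {
                    // decode(ByteBuffer) resets the decoder, so it can be reused per line.
                    decoder.decode(ByteBuffer.wrap(data, lineStart, i - lineStart));
                } catch (CharacterCodingException e) {
                    return lineNumber;
                }
                lineNumber++;
                lineStart = i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // The path comes from the command line; nothing here is specific to the poster's setup.
        InputStream in = new FileInputStream(args[0]);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } finally {
            in.close();
        }
        int line = firstInvalidLine(buffer.toByteArray());
        System.out.println(line == -1 ? "File is valid UTF-8" : "Invalid UTF-8 on line " + line);
    }
}
```

The line number can then go into the error report sent back to the vendor.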
mccarlIT Business Systems Analyst / Software DeveloperCommented:
Actually, thinking about it a bit more, what is likely is that even though the XML header declares a UTF-8 encoding, the file that you have is NOT actually UTF-8.

Change your code to this (just as a test) and see what result you get...

            InputStream input = new FileInputStream("V:\\ADDRESSES.xml");

            // StandardCharsets needs Java 7+; on JDK 1.5 pass the charset name "ISO-8859-1" instead
            InputStreamReader reader = new InputStreamReader(input, StandardCharsets.ISO_8859_1);
            InputSource inputSource = new InputSource(reader);

gurpsbassiCommented:
StandardCharsets.ISO_8859_1

Yes, this is a common problem, especially in editors such as Eclipse where the default character encoding is not UTF-8.
VakilsDeveloperAuthor Commented:
I will try that. I use Visual Studio to view/edit XML, and WebLogic for the Java code.
The file and a screenshot are attached.
Attachments: Capture.PNG (file loaded by Visual Studio), ADDRESSES.xml, ADDRESSES.zip
VakilsDeveloperAuthor Commented:
Good news!
With UTF-8 -> Gem�lichkeit LLC

With StandardCharsets.ISO_8859_1 -> Gemütlichkeit LLC (perfect)
Thanks! Is there a way to find the encoding and apply it at run time?
mccarlIT Business Systems Analyst / Software DeveloperCommented:
Is there a way to find encoding and apply at run time?
Unfortunately not that I know of, other than trying it as UTF-8 and then if you get an exception, try it as ISO_8859_1. It's not perfect but it might be ok for what you want.
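
A minimal sketch of that try-UTF-8-then-fall-back idea follows. EncodingFallback and decode are hypothetical names. Note that an InputStreamReader will not throw on bad UTF-8 (it substitutes the replacement character), so the sketch uses a strict CharsetDecoder to get an exception. The fallback is only a guess: ISO-8859-1 accepts every byte, so it cannot tell you the file was really meant to be, say, windows-1252.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class, not taken from the thread.
class EncodingFallback {

    /** Decodes the bytes as strict UTF-8; if they are not valid UTF-8, decodes them as ISO-8859-1. */
    static String decode(byte[] data) {
        try {
            return Charset.forName("UTF-8").newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            // Every byte value is a valid ISO-8859-1 character, so this cannot fail,
            // but it is an assumption about what the sender intended.
            return Charset.forName("ISO-8859-1").decode(ByteBuffer.wrap(data)).toString();
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {'G', 'e', 'm', (byte) 0xFC, 't'};              // "Gemüt" in ISO-8859-1
        byte[] utf8 = {'G', 'e', 'm', (byte) 0xC3, (byte) 0xBC, 't'};   // "Gemüt" in UTF-8
        // Both inputs should decode to the same five characters.
        System.out.println(decode(latin1).equals(decode(utf8)));
    }
}
```

The decoded text can then be handed to the parser through new InputSource(new StringReader(text)).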
VakilsDeveloperAuthor Commented:
Hi,
Unfortunately, one other file had a character that is probably some variant of o (I am guessing). Output with various encodings:
ISO_8859: [image: 1859]  UTF-8: [image: UTF8.PNG]
The code does not crash, and the character ends up as an inverted ? in the database.
Each editor shows me something different. Is there a way to find out exactly what it is?
Wouldn't it be wise to send the file back to the vendor for correction? I can't upload the file for security reasons. Sorry.
mccarlIT Business Systems Analyst / Software DeveloperCommented:
I can't upload file because of security reasons. Sorry.
No worries, I can work out exactly what is going on from the images posted.

What has happened is that an original file (which may have contained some variant of "o") that was NOT UTF-8 encoded (probably ISO_8859_1) was opened by an application (or something) in a mode where it expected UTF-8. When it reached the o character, which was encoded in 8859 rather than UTF-8, the bytes happened to form an invalid UTF-8 sequence. So the UTF-8 decoder replaced it with the standard replacement character that you have seen previously (the white ? inside the black diamond). The difference between this and your original question is that the result has been saved, and you are now looking at this subsequent file (not the original file).

How do I know this? Because the replacement character is Unicode code point U+FFFD, and its UTF-8 encoding is the three bytes EF BF BD. So your file now has those 3 bytes in it. When you load it using ISO_8859_1 encoding, those 3 bytes get decoded as the "ï¿½" characters that you see, and when decoding as UTF-8 you get the square box character, which is just another rendering of the standard replacement character.
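
The chain described above can be reproduced in a few lines. This is only an illustrative sketch: ReplacementCharDemo and its helpers are made-up names, and the starting bytes are an assumed stand-in for the vendor's original ISO-8859-1 data, which was never posted.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

// Hypothetical demo class, not taken from the thread.
class ReplacementCharDemo {
    static final Charset UTF_8 = Charset.forName("UTF-8");
    static final Charset LATIN_1 = Charset.forName("ISO-8859-1");

    /** Lenient decode: Charset.decode replaces malformed input with the charset's replacement string. */
    static String decode(byte[] bytes, Charset cs) {
        return cs.decode(ByteBuffer.wrap(bytes)).toString();
    }

    static byte[] encode(String s, Charset cs) {
        ByteBuffer buf = cs.encode(s);
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed original: a single ISO-8859-1 byte for an accented letter (0xFC is ü).
        byte[] original = {(byte) 0xFC};

        // Step 1: something opens the file as UTF-8; the invalid byte becomes U+FFFD.
        String lossy = decode(original, UTF_8);
        System.out.println(lossy.indexOf('\uFFFD') >= 0);

        // Step 2: the result is saved as UTF-8; U+FFFD is written as three bytes.
        byte[] resaved = encode("\uFFFD", UTF_8);
        System.out.println(hex(resaved)); // EF BF BD

        // Step 3: reading those three bytes as ISO-8859-1 yields U+00EF, U+00BF, U+00BD.
        String misread = decode(resaved, LATIN_1);
        System.out.println(misread.length());
    }
}
```

The middle character of step 3, U+00BF, is the inverted question mark that shows up in the database.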


Long story short, in this case, there is absolutely NO way of getting the original character back because whoever/whatever has opened the file and resaved it, has blown away that original information.


Wouldn't it be wise to send file back to vendor for correction?
Yes, the fact that you are now also getting files that ARE UTF-8 encoded but have already been incorrectly decoded means that trying to resolve this yourself will just snowball into a big headache. If there is some sort of agreement between yourself and the vendor, then the file encoding SHOULD be a part of that agreement (and probably SHOULD be UTF-8), and then they should be doing the right thing and sending the files correctly.
VakilsDeveloperAuthor Commented:
Hi,
I am impressed by your findings; they exactly explain my observations. The original file would crash, and while I was trying to examine the file, the offending character was somehow replaced, so it would no longer crash, but the character was lost in translation. Thanks for your due diligence.
I am coming up with a new XML format for a different project (the .xsd problem, if you remember), and I will specify UTF-8 there.
Thanks!
mccarlIT Business Systems Analyst / Software DeveloperCommented:
You're welcome!