Solved

Java and XML, xerces

Posted on 2002-05-28
Last Modified: 2013-11-23
How do I change the encoding of a document to UTF-16 when using Xerces (org.w3c.dom), please? We are currently using UTF-8.
Question by:adamgernon
 

Author Comment

by:adamgernon
ID: 7039476
final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
final DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.newDocument();
ProcessingInstruction procInst = doc.createProcessingInstruction("xml", "version='1.0' encoding='UTF-16'");
doc.appendChild(procInst);

I have tried this but it does not seem to make a difference.  Please help!!
 

Author Comment

by:adamgernon
ID: 7039485
The above code is what we used to do, except that I have added the two lines from ProcessingInstruction procInst onwards. I believed this would change the format used from UTF-8 to UTF-16 (it did with similar code using MSXML2.DOMDocument), but with Xerces and Java it doesn't seem to do the job at all.

This is urgent! I would appreciate a quick response
 
LVL 7

Accepted Solution

by:
yoren earned 200 total points
ID: 7039886
Your code is not guaranteed to work (and it won't with Xerces), because the XML declaration is technically not a processing instruction. Your example, using Xerces (and probably other Java DOM implementations), will create a document that is declared as UTF-16 but actually encoded as UTF-8.

Encoding is something you define in the serializer, not in the document. This is actually a good thing; you can write the same document in any encoding you want. Here's how to do it in Xerces 2:

[all your imports, plus:]
import org.apache.xml.serialize.XMLSerializer;
import org.apache.xml.serialize.OutputFormat;

final DocumentBuilderFactory dbf =
  DocumentBuilderFactory.newInstance();
final DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.newDocument();
doc.appendChild(doc.createElement("doc"));

OutputFormat docformat = new OutputFormat(doc);
docformat.setEncoding("UTF-16");
XMLSerializer serializer = new XMLSerializer(System.out,docformat);
serializer.serialize(doc);
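
For readers who cannot use the Xerces-specific org.apache.xml.serialize classes, here is a rough sketch of the same idea using the standard JAXP Transformer API instead. This is a swapped-in alternative, not the code from the answer above; the class and method names (SerializeWithEncoding, sampleDocument, toBytes) are made up for illustration. It shows that the encoding is chosen when the document is written out, not when it is built.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical helper class: serializes one DOM document in different encodings.
public class SerializeWithEncoding {

    // Builds a document with a single empty root element named "doc".
    public static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            doc.appendChild(doc.createElement("doc"));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Writes the document to a byte array using the requested output encoding.
    public static byte[] toBytes(Document doc, String encoding) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = sampleDocument();
        // Same in-memory document, two different byte representations.
        System.out.println("UTF-8 bytes:  " + toBytes(doc, "UTF-8").length);
        System.out.println("UTF-16 bytes: " + toBytes(doc, "UTF-16").length);
    }
}
```

The in-memory Document is identical in both calls; only the serializer setting differs.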

 

Author Comment

by:adamgernon
ID: 7041322
There are two places where I create a DOM document. One is a method that takes in an XML string and creates a DOM Document from it. The other is a method where we have to create a document from scratch, creating nodes on the fly and importing them into the new document.

Here is the code for the first case, where an XML string is passed in and a document is returned (including your code additions):

  public static
    Document toDocument (final String xml)
        throws ParserConfigurationException, SAXException, IOException
    {
        final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance ();
        final DocumentBuilder db = dbf.newDocumentBuilder ();
        Document doc = db.newDocument();
        doc.appendChild(doc.createElement("doc"));
        OutputFormat docformat = new OutputFormat(doc);
        docformat.setEncoding("UTF-16");
        XMLSerializer serializer = new XMLSerializer(System.out,docformat);
        serializer.serialize(doc);
       
        final StringReader src = new StringReader(xml);
        InputSource is = new InputSource(src);
        doc = db.parse (is);
        return doc;
    }

There is something wrong here, as the XML is still in UTF-8 encoding. Any idea why this is?

Also in the document I create on the fly I get an error

CODE AS FOLLOWS

final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance ();
final DocumentBuilder db = dbf.newDocumentBuilder ();
Document doc = db.newDocument();
doc.appendChild(doc.createElement("doc"));
OutputFormat docformat = new OutputFormat(doc);
docformat.setEncoding("UTF-16");
XMLSerializer serializer = new XMLSerializer(System.out,docformat);
serializer.serialize(doc);
final Element root1 = doc.createElement((String)param.getNodeName());
//param is a valid node passed in
doc.appendChild(root1);

When I call doc.appendChild(root1) I get the error
"DOM006 Hierarchy request error". Any idea why this is?
 

Author Comment

by:adamgernon
ID: 7043915
This is really urgent so any help from anyone is really appreciated.
 
LVL 7

Expert Comment

by:yoren
ID: 7043990
Ok, for converting an XML string to a Document, you don't need to do any serialization, and there's no encoding involved:

public static Document toDocument(final String xml)
        throws ParserConfigurationException, SAXException, IOException
{
    final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    final DocumentBuilder db = dbf.newDocumentBuilder();

    // A StringReader supplies characters, not bytes, so no encoding is involved.
    final InputSource is = new InputSource(new StringReader(xml));
    return db.parse(is);
}


Now, regarding the block of code you list for appending a document: what you're trying to do is illegal. An XML document can only have one root element. If you want to append "root1", you have to append it to another element. Maybe you meant to do this instead:

final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance ();
final DocumentBuilder db = dbf.newDocumentBuilder ();
Document doc = db.newDocument();
final Element root1 = doc.createElement((String)param.getNodeName());
//param is a valid node passed in
doc.appendChild(root1);
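
To make the single-root rule concrete, here is a small sketch that reproduces the error and shows one legal alternative. The class and method names are invented for illustration, and "root1" stands in for whatever param.getNodeName() returns in your code.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

// Hypothetical demo class: a Document accepts only one root element.
public class SingleRootDemo {

    private static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns true if appending a second root element is rejected
    // with a hierarchy request error.
    public static boolean secondRootRejected() {
        Document doc = newDocument();
        doc.appendChild(doc.createElement("doc"));        // first root: allowed
        try {
            doc.appendChild(doc.createElement("root1"));  // second root: illegal
            return false;
        } catch (DOMException e) {
            return e.code == DOMException.HIERARCHY_REQUEST_ERR;
        }
    }

    // Legal alternative: append the new element to the existing root instead.
    public static String nestedChildName() {
        Document doc = newDocument();
        doc.appendChild(doc.createElement("doc"));
        doc.getDocumentElement().appendChild(doc.createElement("root1"));
        return doc.getDocumentElement().getFirstChild().getNodeName();
    }

    public static void main(String[] args) {
        System.out.println("second root rejected: " + secondRootRejected());
        System.out.println("nested child: " + nestedChildName());
    }
}
```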
 

Author Comment

by:adamgernon
ID: 7044000
Yoren, that was the original code I had before I tried to change the encoding from UTF-8 to UTF-16, so if you have any ideas about this in particular, please let me know a.s.a.p.

cheers,
Adam
 
LVL 7

Expert Comment

by:yoren
ID: 7044015
I saw your other question regarding the Euro sign. I think your issue is not in the code you list but elsewhere.

An encoding, such as UTF-8 or UTF-16, is used to convert bytes to characters. A Java String is already composed of characters, so encoding doesn't come into the picture. Maybe your code that constructs the xml String is incorrect.
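
A quick sketch of that distinction, using the Euro sign as the test character. The class and method names here are only illustrative. The String holds one character no matter what; an encoding only matters at the moment you turn it into bytes or back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: one character, several byte representations.
public class EncodingBytesDemo {

    // Number of bytes the text occupies once encoded with the given charset.
    public static int byteCount(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    // Encodes with one charset and decodes with another, as a mismatched
    // client/server pair would.
    public static String recode(String text, Charset writeAs, Charset readAs) {
        return new String(text.getBytes(writeAs), readAs);
    }

    public static void main(String[] args) {
        String euro = "\u20AC"; // the Euro sign as a single Java char
        System.out.println("chars in String: " + euro.length());
        System.out.println("UTF-8 bytes:     " + byteCount(euro, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE bytes:  " + byteCount(euro, StandardCharsets.UTF_16BE));
        // Decoding with the wrong charset yields several unrelated characters.
        String garbled = recode(euro, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("garbled length:  " + garbled.length());
    }
}
```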

Where exactly are you getting an error, and what is the error?
 

Author Comment

by:adamgernon
ID: 7044223
Yes, but if we are using UTF-8 on the client and UTF-16 on the server, we will get an error because it cannot convert between the two encodings.
 

Author Comment

by:adamgernon
ID: 7044344
<?xml version="1.0" encoding="UTF-8"?>
<comment><commentId/><stageId/><userId>dev</userId><dateTime/><commentType>UI</commentType><commentText>G&#130;¼</commentText><commentSource>Resolve</commentSource><posted>F</posted><postedDate/></comment>

That is the XML that is being passed up to the server. As you can see, <commentText> has a lot of funny characters in it, i.e. G&#130;¼. All of that is supposed to be the € sign, so I am unsure what to do about it.

Anyway, it is definitely this that is causing the problem, because when I try to load that XML in my browser I get this error:

"An invalid character was found in text content"

Any ideas?

regards,
Adam

 
LVL 7

Expert Comment

by:yoren
ID: 7044957
Aha! The file you've posted is not legal UTF-8. It also doesn't appear in my browser as a euro symbol but as a 1/4 symbol.

My suggestion is to escape all those special characters with character entities (&#...;). That way you'll have ASCII which is also legal UTF-8.
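
Here is a minimal sketch of that escaping step, assuming you apply it to text content before it is placed into the XML. The class and method names are made up. It only replaces characters above hex 7F with numeric character references; it does not escape markup characters such as < or &, so those still need their usual handling.

```java
// Hypothetical helper: replaces every non-ASCII character with a
// numeric character reference such as &#8364; (the Euro sign).
public class NumericEscaper {

    public static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (codePoint > 0x7F) {
                // Decimal character reference, legal in any XML encoding.
                out.append("&#").append(codePoint).append(';');
            } else {
                out.appendCodePoint(codePoint);
            }
            // Advance by one or two chars so surrogate pairs stay intact.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("price: \u20AC5"));
    }
}
```

The result is pure ASCII, so it can be declared as US-ASCII, ISO-8859-1 or UTF-8 and will read the same in all three.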

If you want to leave the text as it is, you'll have to choose the correct encoding. I know ISO-8859-1 will work, but I'm not sure if it will read those characters correctly.
 

Author Comment

by:adamgernon
ID: 7046500
If I were to escape all the characters, is there a list of invalid characters and their corresponding escape sequences?
Or should I just use ISO-8859-1 to ensure that both server and client can handle these characters?
 
LVL 7

Expert Comment

by:yoren
ID: 7046561
I'd recommend against declaring the encoding as UTF-8, since that's not how you're encoding the data. UTF-8 is a very specific encoding that defines a way to encode characters outside the ASCII range using multiple bytes.

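ISO-8859-1 (Latin-1) is the standard 8-bit encoding for Western European languages and is commonly used in the US as well. It does not include the Euro symbol, though (that was added later in ISO-8859-15), so the Euro symbol may get changed into something else. Try it and see what happens.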

You can also declare the encoding as US-ASCII, which means that you'll have to escape anything with a value greater than hex 7F, including the Euro symbol.
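
One way to check this before choosing an encoding is to ask the charset directly whether it can represent the text. A small sketch follows; the class and method names are illustrative, and only charsets guaranteed by the Java platform are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: tests whether a charset can represent some text.
public class CanEncodeDemo {

    // True if every character of the text has a mapping in the charset.
    public static boolean canEncode(Charset charset, String text) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String euro = "\u20AC";
        System.out.println("US-ASCII:   " + canEncode(StandardCharsets.US_ASCII, euro));
        System.out.println("ISO-8859-1: " + canEncode(StandardCharsets.ISO_8859_1, euro));
        System.out.println("UTF-8:      " + canEncode(StandardCharsets.UTF_8, euro));
    }
}
```

Any character that fails this check has to be written as a character reference (&#...;) if you stay with that encoding.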
 

Author Comment

by:adamgernon
ID: 7046565
So if we are developing for the US market, do you recommend using US-ASCII or ISO-8859-1?
 
LVL 7

Expert Comment

by:yoren
ID: 7046568
I'd recommend going with ISO-8859-1. That way, your program won't break if you get some extended characters.
 

Author Comment

by:adamgernon
ID: 7053617
Yoren That is great but I am still unable to set the encoding on the server side to ISO-8859-1 or to anything else for that matter.  The code above does not work.  
Please advise.
 
LVL 7

Expert Comment

by:yoren
ID: 7054100
Encoding is something you set on the client side, in the document. The document your server receives should begin with this string:

<?xml version='1.0' encoding='ISO-8859-1'?>

From everything you've told me, your server code is okay. The problem is that your client (the one creating the document) is declaring an encoding of UTF-8 but then writing the document in a different encoding (probably ISO-8859-1).
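
The mismatch can be reproduced in a few lines. This sketch is only an illustration of the failure mode, not your client code: it writes the same ISO-8859-1 bytes twice, once with a UTF-8 declaration and once with a matching ISO-8859-1 declaration, and hands both to the parser. The class and method names, and the sample text "café", are assumptions for the demo.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical demo class: declared encoding versus actual byte encoding.
public class DeclaredEncodingDemo {

    // Always encodes the bytes as ISO-8859-1, whatever the declaration claims.
    public static byte[] latin1Document(String declaredEncoding) {
        String xml = "<?xml version='1.0' encoding='" + declaredEncoding + "'?>"
                + "<commentText>caf\u00E9</commentText>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // True if the parser accepts the bytes, false if it rejects them.
    public static boolean parses(byte[] xmlBytes) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return true;
        } catch (SAXException | IOException e) {
            // Invalid byte sequences for the declared encoding end up here.
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("declared UTF-8:      " + parses(latin1Document("UTF-8")));
        System.out.println("declared ISO-8859-1: " + parses(latin1Document("ISO-8859-1")));
    }
}
```

The bytes are identical in both cases; only the declaration changes, and that alone decides whether the server can read the document.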
 
LVL 35

Expert Comment

by:girionis
ID: 8798161
No comment has been added lately, so it's time to clean up this TA.

I will leave a recommendation in the Cleanup topic area that this question is:

- points to yoren

Please leave any comments here within the
next seven days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER !

girionis
Cleanup Volunteer