adamgernon
Java and XML, xerces

How do I change the encoding of a document to UTF-16 when using Xerces with org.w3c.dom, please? We are currently using UTF-8.
adamgernon

ASKER

        final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance ();
        final DocumentBuilder db = dbf.newDocumentBuilder ();
        Document doc = db.newDocument();
        ProcessingInstruction procInst = doc.createProcessingInstruction("xml", "version='1.0' encoding='UTF-16'");
        doc.appendChild(procInst);

I have tried this but it does not seem to make a difference. Please help!
The above code is what we used to do, except that I have added the two lines from "ProcessingInstruction procInst" onwards. I believed this would change the encoding from UTF-8 to UTF-16 (it did with similar code using MSXML2.DOMDocument), but with Xerces and Java it doesn't seem to do the job at all.

This is urgent! I would appreciate a quick response
ASKER CERTIFIED SOLUTION
yoren

(The text of this solution is not included in this copy of the thread.)

There are two places where I create a DOM document. One is a method that takes in an XML string and creates a DOM Document from it. The other is a method where we have to create a document from scratch, create nodes on the fly, and import them into the new document.

Here is the code for the method that takes the XML string and returns a document (including your code additions):

  public static
    Document toDocument (final String xml)
        throws ParserConfigurationException, SAXException, IOException
    {
        final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance ();
        final DocumentBuilder db = dbf.newDocumentBuilder ();
        Document doc = db.newDocument();
        doc.appendChild(doc.createElement("doc"));
        OutputFormat docformat = new OutputFormat(doc);
        docformat.setEncoding("UTF-16");
        XMLSerializer serializer = new XMLSerializer(System.out,docformat);
        serializer.serialize(doc);
       
        final StringReader src = new StringReader(xml);
        InputSource is = new InputSource(src);
        doc = db.parse (is);
        return doc;
    }

There is something wrong here, as the XML is still in UTF-8 encoding. Any idea why this is?

Also, in the document I create on the fly I get an error.

The code is as follows:

final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance ();
final DocumentBuilder db = dbf.newDocumentBuilder ();
Document doc = db.newDocument();
doc.appendChild(doc.createElement("doc"));
OutputFormat docformat = new OutputFormat(doc);
docformat.setEncoding("UTF-16");
XMLSerializer serializer = new XMLSerializer(System.out,docformat);
serializer.serialize(doc);
final Element root1 = doc.createElement((String)param.getNodeName());
//param is a valid node passed in
doc.appendChild(root1);

When I call doc.appendChild(root1) I get the error
"DOM006 Hierarchy request error". Any idea why this is?
This is really urgent, so any help from anyone is really appreciated.
Ok, for converting an XML string to a Document, you don't need to do any serialization, and there's no encoding involved:

public static
   Document toDocument (final String xml)
       throws ParserConfigurationException, SAXException, IOException
   {
       final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance ();
       final DocumentBuilder db = dbf.newDocumentBuilder ();

       // The input is a Reader over characters, so no byte encoding is involved
       final StringReader src = new StringReader(xml);
       final InputSource is = new InputSource(src);
       final Document doc = db.parse (is);
       return doc;
   }
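
For contrast, encoding does come into play when a Document is written out as bytes. The following is a minimal sketch of that step, assuming the standard javax.xml.transform API that ships with the JDK rather than the Xerces-specific OutputFormat/XMLSerializer classes used in the question. The class name Utf16WriteSketch and the helper names toBytes and roundTrip are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of the original thread.
public class Utf16WriteSketch {

    /** Serializes the document to bytes in the requested encoding. */
    static byte[] toBytes(Document doc, String encoding) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        // The encoding is chosen here, at serialization time, not on the DOM tree.
        t.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    /** Builds a one-element document, writes it in the given encoding, parses it back. */
    static String roundTrip(String text, String encoding) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = db.newDocument();
        Element root = doc.createElement("doc");
        root.setTextContent(text);
        doc.appendChild(root);

        byte[] bytes = toBytes(doc, encoding);
        // Parsing from bytes lets the parser pick up the encoding from the stream itself.
        Document parsed = db.parse(new ByteArrayInputStream(bytes));
        return parsed.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("price: \u20ac5", "UTF-16"));
    }
}
```

The DOM tree itself holds characters; only the bytes produced by the serializer have an encoding, which is why setting it on the output step is the place to look.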


Now, regarding the block of code you list for appending a document: what you're trying to do is illegal. An XML document can only have one root element. If you want to append "root1", you have to append it to another element. Maybe you meant to do this instead:

final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance ();
final DocumentBuilder db = dbf.newDocumentBuilder ();
Document doc = db.newDocument();
final Element root1 = doc.createElement((String)param.getNodeName());
//param is a valid node passed in
doc.appendChild(root1);
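
The hierarchy error described above can be reproduced in isolation. This is a small sketch using only the JDK's DOM implementation; the class name SingleRootDemo and its method are illustrative, and "root1" stands in for whatever param.getNodeName() returns.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

// Hypothetical demo class, not part of the original thread.
public class SingleRootDemo {

    /** Returns true if appending a second root element is rejected with a hierarchy error. */
    static boolean secondRootRejected() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        doc.appendChild(doc.createElement("doc"));        // first root element: allowed
        try {
            doc.appendChild(doc.createElement("root1"));  // second root element: not allowed
            return false;
        } catch (DOMException e) {
            return e.code == DOMException.HIERARCHY_REQUEST_ERR;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("second root rejected: " + secondRootRejected());
    }
}
```

If both elements are needed, the second one would be appended to the first (doc.getDocumentElement().appendChild(...)) rather than to the document.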
Yoren, that was the original code I had before I tried to change the encoding from UTF-8 to UTF-16, so if you have any ideas about this in particular please let me know a.s.a.p.

cheers,
Adam
I saw your other question regarding the Euro sign. I think your issue is not in the code you list but elsewhere.

An encoding, such as UTF-8 or UTF-16, is used to convert bytes to characters. A Java String is already composed of characters, so encoding doesn't come into the picture. Maybe your code that constructs the xml String is incorrect.

Where exactly are you getting an error, and what is the error?
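
To illustrate the distinction between characters and bytes made above, here is a short sketch. It assumes nothing beyond java.nio.charset; the class name EncodingDemo and its helper methods are invented for the example. The same one-character String produces different bytes under different encodings, and decoding those bytes with the wrong encoding yields different characters.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the original thread.
public class EncodingDemo {

    /** Encodes the string with the named charset and returns the bytes as hex. */
    static String hex(String s, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Encodes with one charset and decodes with another, as a mismatched reader would. */
    static String recode(String s, String writeAs, String readAs) {
        return new String(s.getBytes(Charset.forName(writeAs)), Charset.forName(readAs));
    }

    public static void main(String[] args) {
        String euro = "\u20ac";  // the Euro sign as a single Java char
        System.out.println("UTF-8:    " + hex(euro, "UTF-8"));
        System.out.println("UTF-16BE: " + hex(euro, "UTF-16BE"));
        // Written as UTF-8 but read as ISO-8859-1: three unrelated characters come out.
        System.out.println("misread length: " + recode(euro, "UTF-8", "ISO-8859-1").length());
    }
}
```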
Yes, but if we use UTF-8 on the client and UTF-16 on the server, we will get an error because it cannot convert between the two.
<?xml version="1.0" encoding="UTF-8"?>
<comment><commentId/><stageId/><userId>dev</userId><dateTime/><commentType>UI</commentType><commentText>G&#130;¼</commentText><commentSource>Resolve</commentSource><posted>F</posted><postedDate/></comment>

That is the XML that is being passed up to the server. As you can see, <commentText> has a lot of funny characters in it, i.e. G&#130;¼. All that is supposed to be is the € sign, so I am unsure what to do about it.

Anyway, it is definitely this that is causing the problem, because when I try to load that XML in my browser I get this error:

"An invalid character was found in text content"

Any ideas?

regards,
Adam

Aha! The file you've posted is not legal UTF-8. It also doesn't appear in my browser as a euro symbol but as a 1/4 symbol.

My suggestion is to escape all those special characters with numeric character references (&#...;). That way you'll have pure ASCII, which is also legal UTF-8.

If you want to leave the text as it is, you'll have to choose the correct encoding. I know ISO-8859-1 will work, but I'm not sure if it will read those characters correctly.
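
The escaping suggestion above could look roughly like the sketch below. No lookup table is needed, because a numeric character reference is just the character's Unicode code point in decimal. The class name XmlAsciiEscaper and the method escapeNonAscii are illustrative, and the sketch assumes the markup itself (tag names, attribute names) is already ASCII. It does not escape &, < or >, which need separate handling.

```java
// Hypothetical helper class, not part of the original thread.
public class XmlAsciiEscaper {

    /** Replaces every character above 0x7F with a decimal numeric character reference. */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);           // handles surrogate pairs correctly
            if (cp > 0x7F) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Euro sign is code point 8364 (hex 20AC).
        System.out.println(escapeNonAscii("<commentText>G\u20ac</commentText>"));
    }
}
```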
If I were to escape all the characters, is there a list of invalid characters and their corresponding escape sequences?
Or should I just use ISO-8859-1 to ensure that both server and client can handle these characters?
I'd recommend against declaring the encoding as UTF-8, since that's not how you're encoding the data. UTF-8 is a very specific encoding that defines a way to encode characters outside the ASCII range using multiple bytes.

ISO-8859-1 (Latin-1) is the standard 8-bit Western European encoding. However, it does not define a Euro symbol (ISO-8859-15 and windows-1252 do), so the Euro symbol may get changed into something else. Try it and see what happens.

You can also declare the encoding as US-ASCII, which means that you'll have to escape anything with a value greater than hex 7F, including the Euro symbol.
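
Whether a given encoding can represent the Euro symbol can be checked directly rather than by trial and error. A minimal sketch using java.nio.charset follows; the class name EuroCheck and its methods are made up for the example.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the original thread.
public class EuroCheck {

    /** Returns true if the named charset has a byte representation for the Euro sign. */
    static boolean canEncodeEuro(String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode('\u20ac');
    }

    /** Returns the first byte produced when the Euro sign is encoded with the named charset. */
    static int firstByteOfEuro(String charsetName) {
        return "\u20ac".getBytes(Charset.forName(charsetName))[0] & 0xff;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16"}) {
            System.out.println(name + " can encode Euro: " + canEncodeEuro(name));
        }
        // String.getBytes substitutes '?' (0x3F) for characters the charset cannot represent.
        System.out.println("ISO-8859-1 first byte: 0x"
                + Integer.toHexString(firstByteOfEuro("ISO-8859-1")));
    }
}
```

With an encoding that cannot represent the character, the silent '?' substitution is the "changed into something else" case; escaping the character as a numeric reference avoids it.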
So if we are developing for the US market, do you recommend using US-ASCII or ISO-8859-1?
I'd recommend going with ISO-8859-1. That way, your program won't break if you get some extended characters.
Yoren, that is great, but I am still unable to set the encoding on the server side to ISO-8859-1, or to anything else for that matter. The code above does not work.
Please advise.
Encoding is something you set on the client side, in the document. The document your server receives should begin with this string:

<?xml version='1.0' encoding='ISO-8859-1'?>

From everything you've told me, your server code is okay. The problem is that your client (the one creating the document) is declaring an encoding of UTF-8 but then writing the document in a different encoding (probably ISO-8859-1).
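
The mismatch described above, where the declaration says one encoding and the bytes are written in another, can be demonstrated with a short sketch. It uses the JDK's built-in parser; the class name DeclaredEncodingDemo, the method parseText, and the sample text are all illustrative and not taken from the poster's application.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the original thread.
public class DeclaredEncodingDemo {

    /**
     * Writes a tiny document that declares one encoding but is encoded as bytes
     * with another, then parses it. Returns the text content, or null if the
     * parser rejects the input.
     */
    static String parseText(String declared, String actual) throws Exception {
        String xml = "<?xml version='1.0' encoding='" + declared + "'?>"
                + "<t>caf\u00e9</t>";  // \u00e9 is e-acute, a single byte 0xE9 in ISO-8859-1
        byte[] bytes = xml.getBytes(Charset.forName(actual));
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (IOException | SAXException e) {
            return null;  // e.g. an invalid UTF-8 byte sequence was found
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("declared = actual:  " + parseText("ISO-8859-1", "ISO-8859-1"));
        System.out.println("declared != actual: " + parseText("UTF-8", "ISO-8859-1"));
    }
}
```

The second call fails because the lone byte 0xE9 is not a valid UTF-8 sequence, which matches the kind of "invalid character" error a browser reports for the same input.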
No comment has been added lately, so it's time to clean up this TA.

I will leave a recommendation in the Cleanup topic area that this question is:

- points to yoren

Please leave any comments here within the
next seven days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER !

girionis
Cleanup Volunteer