adamgernon
asked on
Java and XML, xerces
How do I change the encoding of a document to UTF-16 when using Xerces with org.w3c.dom, please? We are currently using UTF-8.
ASKER
The above code is what we used to do, except that I have added the two lines from ProcessingInstruction procInst onwards. I believed this would change the format from UTF-8 to UTF-16 (it did with similar code using MSXML2.DOMDocument), but with Xerces and Java it doesn't seem to do the job at all.
This is urgent! I would appreciate a quick response
ASKER
There are two places where I create a DOM document. One is a method that takes in an XML string and creates a DOM Document from it. The other is a method where we have to create a document from scratch and, on the fly, create nodes and import them into the new document.
Here is the code for the main method, where an XML string is passed in and a document is returned (including your code add-ons):
public static Document toDocument(final String xml)
        throws ParserConfigurationException, SAXException, IOException
{
    final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    final DocumentBuilder db = dbf.newDocumentBuilder();
    Document doc = db.newDocument();
    doc.appendChild(doc.createElement("doc"));
    OutputFormat docformat = new OutputFormat(doc);
    docformat.setEncoding("UTF-16");
    XMLSerializer serializer = new XMLSerializer(System.out, docformat);
    serializer.serialize(doc);
    final StringReader src = new StringReader(xml);
    InputSource is = new InputSource(src);
    doc = db.parse(is);
    return doc;
}
There is something wrong here, as the XML is still in UTF-8 encoding. Any idea why this is?
Also, in the document I create on the fly I get an error.
Code as follows:
final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
final DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.newDocument();
doc.appendChild(doc.createElement("doc"));
OutputFormat docformat = new OutputFormat(doc);
docformat.setEncoding("UTF-16");
XMLSerializer serializer = new XMLSerializer(System.out, docformat);
serializer.serialize(doc);
final Element root1 = doc.createElement((String) param.getNodeName());
// param is a valid node passed in
doc.appendChild(root1);
When I call doc.appendChild(root1) I get an error:
DOM006 Hierarchy request error. Any idea why this is?
ASKER
This is really urgent so any help from anyone is really appreciated.
Ok, for converting an XML string to a Document, you don't need to do any serialization, and there's no encoding involved:
public static Document toDocument(final String xml)
        throws ParserConfigurationException, SAXException, IOException
{
    final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    final DocumentBuilder db = dbf.newDocumentBuilder();
    // Parse from a character stream: the String is already decoded text
    final StringReader src = new StringReader(xml);
    final InputSource is = new InputSource(src);
    final Document doc = db.parse(is);
    return doc;
}
Now, regarding the block of code you list for appending a document: what you're trying to do is illegal. An XML document can only have one root element. If you want to append "root1", you have to append it to another element. Maybe you meant to do this instead:
final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
final DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.newDocument();
final Element root1 = doc.createElement((String) param.getNodeName());
// param is a valid node passed in
doc.appendChild(root1);
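To make the one-root rule concrete, here is a minimal sketch using only the JAXP classes bundled with the JDK. The class and method names (SingleRootDemo, secondRootIsRejected, nestedUnderRoot) are made up for illustration, and the element names are just the ones from this thread.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of Xerces or the asker's code base.
final class SingleRootDemo {

    private static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if appending a second top-level element is rejected by the DOM. */
    static boolean secondRootIsRejected() {
        Document doc = newDocument();
        doc.appendChild(doc.createElement("doc"));       // first and only legal root
        try {
            doc.appendChild(doc.createElement("root1")); // second root: not allowed
            return false;
        } catch (DOMException e) {
            return e.code == DOMException.HIERARCHY_REQUEST_ERR;
        }
    }

    /** Appends the new element under the existing root instead of next to it. */
    static Document nestedUnderRoot() {
        Document doc = newDocument();
        Element root = doc.createElement("doc");
        doc.appendChild(root);
        root.appendChild(doc.createElement("root1"));
        return doc;
    }

    public static void main(String[] args) {
        System.out.println("second root rejected: " + secondRootIsRejected());
        System.out.println("child of root: "
                + nestedUnderRoot().getDocumentElement().getFirstChild().getNodeName());
    }
}
```

Either drop the placeholder "doc" element so that root1 becomes the root, or keep it and append root1 beneath it.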
ASKER
Yoren, that was the original code I had before I tried to change the encoding from UTF-8 to UTF-16, so if you have any ideas about this in particular please let me know a.s.a.p.
cheers,
Adam
I saw your other question regarding the Euro sign. I think your issue is not in the code you list but elsewhere.
An encoding, such as UTF-8 or UTF-16, is used to convert bytes to characters. A Java String is already composed of characters, so encoding doesn't come into the picture. Maybe your code that constructs the xml String is incorrect.
Where exactly are you getting an error, and what is the error?
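A small sketch of what I mean, assuming nothing beyond the standard java.nio.charset classes (EncodingDemo and its methods are names I made up). The String holding the euro sign is the same whatever happens; only the bytes differ per encoding, and decoding with the wrong charset is what produces garbage.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: encodings only matter at the byte boundary.
final class EncodingDemo {

    static final String EURO = "\u20AC"; // one character in a Java String

    /** Characters to bytes: this is the only place an encoding is applied. */
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    /** Bytes to characters: must use the same charset that wrote the bytes. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] utf8 = encode(EURO, StandardCharsets.UTF_8);
        byte[] utf16 = encode(EURO, StandardCharsets.UTF_16BE);
        System.out.println("UTF-8 bytes:    " + utf8.length);
        System.out.println("UTF-16BE bytes: " + utf16.length);

        // Matching charsets give back the original character.
        System.out.println("round trip ok:  "
                + EURO.equals(decode(utf8, StandardCharsets.UTF_8)));

        // Mismatched charsets give a different string of several characters.
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);
        System.out.println("mismatch chars: " + wrong.length());
    }
}
```

If your String already contains the wrong characters, the damage happened earlier, wherever bytes were turned into that String.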
ASKER
Yes, but if we are using UTF-8 on the client and UTF-16 on the server, we will get an error because it cannot convert between the two.
ASKER
<?xml version="1.0" encoding="UTF-8"?>
<comment><commentId/><stageId/><userId>dev</userId><dateTime/><commentType>UI</commentType><commentText>G&#130;¼</commentText><commentSource>Resolve</commentSource><posted>F</posted><postedDate/></comment>
That is the XML that is being passed up to the server. As you can see, <commentText> has a lot of funny characters in it, i.e. G‚¼. All of that is supposed to be the € sign, so I am unsure what to do about it.
Anyway, it is definitely this that is causing the problem, because when I try to load that XML in my browser I get this error:
"An invalid character was found in text content"
Any ideas?
regards,
Adam
Aha! The file you've posted is not legal UTF-8. It also doesn't appear in my browser as a euro symbol but as a 1/4 symbol.
My suggestion is to escape all those special characters with character entities (&#...;). That way you'll have ASCII which is also legal UTF-8.
If you want to leave the text as it is, you'll have to choose the correct encoding. I know ISO-8859-1 will work, but I'm not sure if it will read those characters correctly.
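For the escaping route you don't need a lookup table: any character above hex 7F can be written as a numeric character reference built from its code point. A minimal sketch follows; XmlAsciiEscaper and escapeNonAscii are hypothetical names, and it assumes the input is text that has already had &, < and > handled.

```java
// Hypothetical helper: rewrites every non-ASCII character as &#NNNN;
final class XmlAsciiEscaper {

    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);  // handles surrogate pairs
            if (codePoint > 0x7F) {
                out.append("&#").append(codePoint).append(';');
            } else {
                out.appendCodePoint(codePoint);   // plain ASCII passes through
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The euro sign is U+20AC, which is 8364 in decimal.
        System.out.println(escapeNonAscii("Price: \u20AC5")); // prints "Price: &#8364;5"
    }
}
```

The result is pure ASCII, so it is valid whether the document is declared as US-ASCII, ISO-8859-1, or UTF-8.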
ASKER
If I were to escape all the characters, is there a list of invalid chars and their corresponding escape characters?
Or should I just use ISO-8859-1 to ensure that both server and client can handle these characters?
I'd recommend against declaring the encoding as UTF-8, since that's not how you're encoding the data. UTF-8 is a very specific encoding that defines a way to encode characters outside the ASCII range using multiple bytes.
ISO-8859-1 (Latin-1) is the standard 8-bit Western encoding. However, I don't believe it has a Euro symbol (that was only added in the later ISO-8859-15), so the Euro symbol may get changed into something else. Try it and see what happens.
You can also declare the encoding as US-ASCII, which means that you'll have to escape anything with a value greater than hex 7F, including the Euro symbol.
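If you want to check rather than guess, Java can tell you whether a charset is able to represent a given character. This is only a sketch using the standard CharsetEncoder.canEncode call; the class name CharsetCheck is made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: asks each charset whether it can represent some text.
final class CharsetCheck {

    static boolean canEncode(Charset charset, String text) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String euro = "\u20AC";
        Charset[] candidates = {
            StandardCharsets.US_ASCII,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16
        };
        for (Charset charset : candidates) {
            System.out.println(charset.name() + " can encode euro: "
                    + canEncode(charset, euro));
        }
    }
}
```

On a standard JDK this should report that US-ASCII and ISO-8859-1 cannot encode the euro sign while UTF-8 and UTF-16 can, so under either of the first two declarations the euro has to be escaped.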
ASKER
So if we are developing for the US market, do you recommend using US-ASCII or ISO-8859-1?
I'd recommend going with ISO-8859-1. That way, your program won't break if you get some extended characters.
ASKER
Yoren, that is great, but I am still unable to set the encoding on the server side to ISO-8859-1 or to anything else for that matter. The code above does not work.
Please advise.
Encoding is something you set on the client side, in the document. The document your server receives should begin with this string:
<?xml version='1.0' encoding='ISO-8859-1'?>
From everything you've told me, your server code is okay. The problem is that your client (the one creating the document) is declaring an encoding of UTF-8 but then writing the document in a different encoding (probably ISO-8859-1).
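If the client is written in Java, one way to keep the declaration and the actual bytes in agreement is to let the serializer do both. The sketch below uses the standard JAXP Transformer instead of the Xerces XMLSerializer shown earlier in this thread, so treat it as an alternative, not as what your code does today. EncodedXmlWriter, sampleComment, serialize and parse are made-up names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: writes a DOM document as bytes in a chosen encoding.
final class EncodedXmlWriter {

    /** Builds a tiny sample document whose text is the euro sign. */
    static Document sampleComment() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element comment = doc.createElement("comment");
            Element text = doc.createElement("commentText");
            text.setTextContent("\u20AC");
            comment.appendChild(text);
            doc.appendChild(comment);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Serializes to bytes; the same setting drives the declaration and the bytes. */
    static byte[] serialize(Document doc, String encoding) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Parses from bytes, so the parser picks the encoding up from the document. */
    static Document parse(byte[] xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(new ByteArrayInputStream(xml));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String encoding : new String[] {"UTF-8", "UTF-16", "ISO-8859-1"}) {
            byte[] bytes = serialize(sampleComment(), encoding);
            String text = parse(bytes).getDocumentElement().getTextContent();
            System.out.println(encoding + ": " + bytes.length
                    + " bytes, euro survived: " + "\u20AC".equals(text));
        }
    }
}
```

The point of going through an OutputStream (bytes) and not a String is that the receiving parser can then trust the encoding named in the XML declaration.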
No comment has been added lately, so it's time to clean up this TA.
I will leave a recommendation in the Cleanup topic area that this question is:
- points to yoren
Please leave any comments here within the
next seven days.
PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER !
girionis
Cleanup Volunteer
ASKER
final DocumentBuilder db = dbf.newDocumentBuilder ();
Document doc = db.newDocument();
ProcessingInstruction procInst = doc.createProcessingInstru
doc.appendChild(procInst);
I have tried this but it does not seem to make a difference. Please help!!