Solved

Java and XML, xerces

Posted on 2002-05-28
Last Modified: 2013-11-23
How do I change the encoding of a document to UTF-16 when using Xerces (org.w3c.dom)? We are currently using UTF-8.
Question by:adamgernon
 

Author Comment

by:adamgernon
ID: 7039476
final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance ();
final DocumentBuilder db = dbf.newDocumentBuilder ();
Document doc = db.newDocument();
ProcessingInstruction procInst = doc.createProcessingInstruction("xml", "version='1.0' encoding='UTF-16'");
doc.appendChild(procInst);

I have tried this but it does not seem to make a difference.  Please help!!
 

Author Comment

by:adamgernon
ID: 7039485
The above code is what we used to do, except that I have added the two lines from ProcessingInstruction procInst onwards. I believed this would change the format from UTF-8 to UTF-16 (it did with similar code using MSXML2.DOMDocument), but with Xerces and Java it doesn't seem to do the job at all.

This is urgent! I would appreciate a quick response
 
LVL 7

Accepted Solution

by:
yoren earned 200 total points
ID: 7039886
Your code is not guaranteed to work (and it won't with Xerces), because the XML declaration is technically not a processing instruction. Your example, using Xerces (and probably other Java DOM implementations), will create a document that is declared as UTF-16 but actually encoded as UTF-8.

Encoding is something you define in the serializer, not in the document. This is actually a good thing; you can write the same document in any encoding you want. Here's how to do it in Xerces 2:

[all your imports, plus:]
import org.apache.xml.serialize.XMLSerializer;
import org.apache.xml.serialize.OutputFormat;

final DocumentBuilderFactory dbf =
  DocumentBuilderFactory.newInstance();
final DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.newDocument();
doc.appendChild(doc.createElement("doc"));

OutputFormat docformat = new OutputFormat(doc);
docformat.setEncoding("UTF-16");
XMLSerializer serializer = new XMLSerializer(System.out,docformat);
serializer.serialize(doc);
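
The same idea (pick the encoding when you serialize, not when you build the document) can also be sketched with the standard JAXP transform API, which avoids the Xerces-specific classes. This is only an illustration: the class name Utf16SerializeSketch and its helper methods are made up for this example, and the exact bytes produced (byte order mark, declaration attributes) depend on the serializer shipped with your JDK.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of the original thread.
final class Utf16SerializeSketch {

    // Serializes a DOM document to bytes using the requested encoding.
    static byte[] serialize(Document doc, String encoding) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // The encoding is a property of the output, not of the DOM tree.
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    // Builds the same one-element document used in the answer above.
    static Document newSampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        doc.appendChild(doc.createElement("doc"));
        return doc;
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = serialize(newSampleDocument(), "UTF-16");
        System.out.println(bytes.length + " bytes written");
        // Decode again only to display the result on the console.
        System.out.println(new String(bytes, StandardCharsets.UTF_16));
    }
}
```

Calling serialize with "UTF-8" or "ISO-8859-1" instead writes the same document in that encoding, which is the point made above: one document, any encoding.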
 

Author Comment

by:adamgernon
ID: 7041322
There are two places where I create a DOM document. One is a method that takes an XML string and creates a DOM Document from it. The other is a method where we have to create a document from scratch and, on the fly, create nodes and import them into the new document.

Here is the code for the first case, where an XML string is passed in and a Document is returned (including your additions):

  public static
    Document toDocument (final String xml)
        throws ParserConfigurationException, SAXException, IOException
    {
        final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance ();
        final DocumentBuilder db = dbf.newDocumentBuilder ();
        Document doc = db.newDocument();
        doc.appendChild(doc.createElement("doc"));
        OutputFormat docformat = new OutputFormat(doc);
        docformat.setEncoding("UTF-16");
        XMLSerializer serializer = new XMLSerializer(System.out,docformat);
        serializer.serialize(doc);
       
        final StringReader src = new StringReader(xml);
        InputSource is = new InputSource(src);
        doc = db.parse (is);
        return doc;
    }

There is something wrong here, as the XML is still in UTF-8 encoding. Any idea why this is?

Also, in the document I create on the fly, I get an error. The code is as follows:

final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance ();
final DocumentBuilder db = dbf.newDocumentBuilder ();
Document doc = db.newDocument();
doc.appendChild(doc.createElement("doc"));
OutputFormat docformat = new OutputFormat(doc);
docformat.setEncoding("UTF-16");
XMLSerializer serializer = new XMLSerializer(System.out,docformat);
serializer.serialize(doc);
final Element root1 = doc.createElement((String)param.getNodeName());
//param is a valid node passed in
doc.appendChild(root1);

When I call doc.appendChild(root1), I get the error "DOM006 Hierarchy request error". Any idea why this is?
 

Author Comment

by:adamgernon
ID: 7043915
This is really urgent so any help from anyone is really appreciated.
 
LVL 7

Expert Comment

by:yoren
ID: 7043990
Ok, for converting an XML string to a Document, you don't need to do any serialization, and there's no encoding involved:

public static
   Document toDocument (final String xml)
       throws ParserConfigurationException, SAXException, IOException
   {
       final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance ();
       final DocumentBuilder db = dbf.newDocumentBuilder ();

       // Parse from a character stream; no byte-level encoding is involved.
       final StringReader src = new StringReader(xml);
       final InputSource is = new InputSource(src);
       final Document doc = db.parse (is);
       return doc;
   }


Now, regarding the block of code you list for appending a document: what you're trying to do is illegal. An XML document can only have one root element. If you want to append "root1", you have to append it to another element. Maybe you meant to do this instead:

final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance ();
final DocumentBuilder db = dbf.newDocumentBuilder ();
Document doc = db.newDocument();
final Element root1 = doc.createElement((String)param.getNodeName());
//param is a valid node passed in
doc.appendChild(root1);
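
To make the single-root rule concrete, here is a small sketch that reproduces the hierarchy request error and shows the nesting alternative. The class and method names are invented for illustration, and the element names "doc" and "root1" are just the ones used in this thread.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of the original thread.
final class SingleRootSketch {

    private static Document newDocument() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    }

    // Returns true if appending a second top-level element is rejected
    // with a hierarchy request error.
    static boolean secondRootRejected() throws Exception {
        Document doc = newDocument();
        doc.appendChild(doc.createElement("doc"));
        try {
            doc.appendChild(doc.createElement("root1")); // second root: illegal
            return false;
        } catch (DOMException e) {
            return e.code == DOMException.HIERARCHY_REQUEST_ERR;
        }
    }

    // Legal alternative: append the new element under the existing root.
    static Document nested() throws Exception {
        Document doc = newDocument();
        Element root = doc.createElement("doc");
        doc.appendChild(root);
        root.appendChild(doc.createElement("root1"));
        return doc;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("second root rejected: " + secondRootRejected());
        System.out.println("nested child: "
                + nested().getDocumentElement().getFirstChild().getNodeName());
    }
}
```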
 

Author Comment

by:adamgernon
ID: 7044000
Yoren, that was the original code I had before I tried to change the encoding from UTF-8 to UTF-16. If you have any ideas about the encoding in particular, please let me know a.s.a.p.

cheers,
Adam
 
LVL 7

Expert Comment

by:yoren
ID: 7044015
I saw your other question regarding the Euro sign. I think your issue is not in the code you list but elsewhere.

An encoding, such as UTF-8 or UTF-16, is used to convert bytes to characters. A Java String is already composed of characters, so encoding doesn't come into the picture. Maybe your code that constructs the xml String is incorrect.

Where exactly are you getting an error, and what is the error?
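
A quick way to see the difference between characters and bytes is to encode the same one-character String with several charsets. This is a minimal sketch with an invented class name; the Euro sign is written as the escape \u20AC so that the example does not depend on the encoding of the source file itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original thread.
final class EncodingBytesSketch {

    static final String EURO = "\u20AC"; // one character: the Euro sign

    // Encodes the text with the given charset and returns the raw bytes.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Formats bytes as space-separated hex for display.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println("chars in String: " + EURO.length());
        System.out.println("UTF-8:      " + hex(encode(EURO, StandardCharsets.UTF_8)));
        System.out.println("UTF-16BE:   " + hex(encode(EURO, StandardCharsets.UTF_16BE)));
        // ISO-8859-1 has no Euro sign, so getBytes substitutes '?' (0x3F).
        System.out.println("ISO-8859-1: " + hex(encode(EURO, StandardCharsets.ISO_8859_1)));
    }
}
```

The String is the same in every case; only the bytes differ, and an encoding that lacks the character silently loses it.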
 

Author Comment

by:adamgernon
ID: 7044223
Yes, but if we are using UTF-8 on the client and UTF-16 on the server, we will get an error because it cannot convert between the two.
 

Author Comment

by:adamgernon
ID: 7044344
<?xml version="1.0" encoding="UTF-8"?>
<comment><commentId/><stageId/><userId>dev</userId><dateTime/><commentType>UI</commentType><commentText>G&#130;¼</commentText><commentSource>Resolve</commentSource><posted>F</posted><postedDate/></comment>

That is the XML that is being passed up to the server. As you can see, <commentText> has a lot of funny characters in it, i.e. G&#130;¼. All that is supposed to be is the € sign, so I am unsure what to do about it.

Anyway, it is definitely this that is causing the problem, because when I try to load that XML in my browser I get this error:

"An invalid character was found in text content"

Any ideas?

regards,
Adam

 
LVL 7

Expert Comment

by:yoren
ID: 7044957
Aha! The file you've posted is not legal UTF-8. It also doesn't appear in my browser as a euro symbol but as a 1/4 symbol.

My suggestion is to escape all those special characters with character entities (&#...;). That way you'll have ASCII which is also legal UTF-8.

If you want to leave the text as it is, you'll have to choose the correct encoding. I know ISO-8859-1 will work, but I'm not sure if it will read those characters correctly.
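
As a rough sketch of the escaping approach: every character above the ASCII range can be written as a numeric character reference, &# followed by the decimal code point and a semicolon. The class and method names below are invented, and the sketch assumes markup characters such as < and & are escaped elsewhere.

```java
// Hypothetical helper class, not part of the original thread.
final class NumericCharRefSketch {

    // Replaces every code point above 0x7F with a decimal numeric character
    // reference. ASCII characters, including markup, are copied unchanged.
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp > 0x7F) {
                sb.append("&#").append(cp).append(';');
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u20AC is the Euro sign (decimal code point 8364).
        System.out.println(escapeNonAscii("Price: \u20AC5"));
    }
}
```

The result contains only ASCII bytes, so it is valid whether the document is declared as US-ASCII, ISO-8859-1 or UTF-8.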
 

Author Comment

by:adamgernon
ID: 7046500
If I were to escape all the characters, is there a list of invalid characters and their corresponding escape sequences?
Or should I just use ISO-8859-1 to ensure that both server and client can handle these characters?
 
LVL 7

Expert Comment

by:yoren
ID: 7046561
I'd recommend against declaring the encoding as UTF-8, since that's not how you're encoding the data. UTF-8 is a very specific encoding that defines a way to encode characters outside the ASCII range using multiple bytes.

ISO-8859-1 is the standard US 8-bit encoding. However, I'm not sure if it has a Euro symbol; the Euro symbol may get changed into something else. Try it and see what happens.

You can also declare the encoding as US-ASCII, which means that you'll have to escape anything with a value greater than hex 7F, including the Euro symbol.
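
Rather than guessing, you can ask the JDK whether a charset is able to represent the Euro sign at all. For reference, ISO-8859-1 (Latin-1) does not contain it, while ISO-8859-15 and windows-1252 do. The sketch below uses an invented class name, and the last two charsets are looked up only if the running JDK reports them as supported.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the original thread.
final class EuroSupportSketch {

    // Returns true if the named charset can encode the Euro sign (U+20AC).
    static boolean canEncodeEuro(String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode('\u20AC');
    }

    public static void main(String[] args) {
        String[] names = {
            "US-ASCII", "ISO-8859-1", "ISO-8859-15", "windows-1252", "UTF-8", "UTF-16"
        };
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                // Optional charsets may be missing on some JDK builds.
                System.out.println(name + ": not available on this JDK");
                continue;
            }
            System.out.println(name + ": " + canEncodeEuro(name));
        }
    }
}
```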
 

Author Comment

by:adamgernon
ID: 7046565
So if we are developing for the US market, do you recommend using US-ASCII or ISO-8859-1?
 
LVL 7

Expert Comment

by:yoren
ID: 7046568
I'd recommend going with ISO-8859-1. That way, your program won't break if you get some extended characters.
 

Author Comment

by:adamgernon
ID: 7053617
Yoren, that is great, but I am still unable to set the encoding on the server side to ISO-8859-1, or to anything else for that matter. The code above does not work.
Please advise.
 
LVL 7

Expert Comment

by:yoren
ID: 7054100
Encoding is something you set on the client side, in the document. The document your server receives should begin with this string:

<?xml version='1.0' encoding='ISO-8859-1'?>

From everything you've told me, your server code is okay. The problem is that your client (the one creating the document) is declaring an encoding of UTF-8 but then writing the document in a different encoding (probably ISO-8859-1).
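
The mismatch described here can be reproduced in a few lines. This is a sketch under assumptions: the class name, the sample element content ("caf" plus an accented e, written as \u00E9) and the helper methods are invented for illustration. It writes the same text with different combinations of declared and actual encoding and checks whether the parser accepts the bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original thread.
final class DeclaredEncodingSketch {

    // Builds a document whose XML declaration names declaredEncoding,
    // then turns it into bytes using actualCharset.
    static byte[] encode(String declaredEncoding, Charset actualCharset) {
        String xml = "<?xml version='1.0' encoding='" + declaredEncoding + "'?>"
                + "<commentText>caf\u00E9</commentText>";
        return xml.getBytes(actualCharset);
    }

    // Returns true if the parser accepts the bytes, false if it rejects them.
    static boolean parses(byte[] bytes) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        db.setErrorHandler(new DefaultHandler()); // keep parser errors off stderr
        try {
            db.parse(new ByteArrayInputStream(bytes));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("declared UTF-8, written as ISO-8859-1: "
                + parses(encode("UTF-8", StandardCharsets.ISO_8859_1)));
        System.out.println("declared ISO-8859-1, written as ISO-8859-1: "
                + parses(encode("ISO-8859-1", StandardCharsets.ISO_8859_1)));
        System.out.println("declared UTF-8, written as UTF-8: "
                + parses(encode("UTF-8", StandardCharsets.UTF_8)));
    }
}
```

Only the first combination should be rejected, which matches the "invalid character" error reported earlier in the thread: the declaration and the bytes have to agree.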
 
LVL 35

Expert Comment

by:girionis
ID: 8798161
No comment has been added lately, so it's time to clean up this TA.

I will leave a recommendation in the Cleanup topic area that this question is:

- points to yoren

Please leave any comments here within the
next seven days.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER !

girionis
Cleanup Volunteer