  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1542

Processing a UTF-16 encoded XML file

I am using MSXML 4 to load an XML file into a DOM tree. The problem is that the XML file is not well formed, as it:

- does not specify the encoding
- does not have a byte-order mark at the beginning of the document
- contains UTF-16 encoded data

As a result, I get an "invalid character found" error when building the DOM tree. As I have no control over the XML generation side to correct this problem, I need a workaround. One way I can think of is to dynamically insert the encoding into the XML file, but I am looking for a better way. Is there any option in MSXML to specify a "default encoding" in case no encoding is specified? I am looking for something like this:

document->load("myfile.xml", Encoding::UTF-16);

Any help is appreciated.
Asked:
onlygo
2 Solutions
 
dfiala13 commented:
You can try this:
Create a new XML document, then add a processing instruction that declares the proper encoding.

var pi = xmldoc.createProcessingInstruction("xml"," version='1.0' encoding='UTF-16'");
xmldoc.appendChild(pi);
xmldoc.save("newfile.xml");

Then load the suspect XML using loadXML:

xmldoc.loadXML(sXML);

Here's an interesting link on encoding and MSXML:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnxml/html/xmlencodings.asp

 
rdcpro commented:
That doesn't seem like it will work to me.  When you call loadXML() or load(), it blows away anything that was already in the document.  I would be surprised if the previous encoding persisted.

HOWEVER, the loadXML method presumes UTF-16, so using a string-based load rather than the IStream version will probably work by itself.  But if necessary, you can prepend the PI too:

xmldoc.loadXML("<?xml version='1.0' encoding='utf-16' ?>" + sXML)
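
One caveat with blindly prepending: an XML declaration is only legal at the very start of a document, so if sXML already carries one, you end up with two and the parse fails. A minimal sketch of a guard, assuming plain JScript/JavaScript strings (ensureXmlDeclaration is a hypothetical helper name, not part of MSXML):

```javascript
// Hypothetical helper: prepend a UTF-16 XML declaration only when the
// string does not already start with one.
function ensureXmlDeclaration(sXML) {
  // An existing declaration must be the first thing in the string.
  if (/^<\?xml\s/.test(sXML)) {
    return sXML;
  }
  return "<?xml version='1.0' encoding='utf-16'?>" + sXML;
}
```

You would then call xmldoc.loadXML(ensureXmlDeclaration(sXML)) instead of concatenating inline.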

With newer versions of MSXML, you can even force UTF-8 encoding by specifying utf-8 in the PI.  Note, however, that the character data passed to loadXML must still be UTF-16, because all strings are BSTRs, which are essentially UTF-16.  It's also worth noting that if you use the xml property, as in:

strXml = xmlDoc.xml

then the data is UTF-16 and there will be no byte order mark either!  But it doesn't matter, unless the byte order is odd anyway.

Summary:

Don't use the IStream-based load() method.  Use the string-based (i.e., UTF-16) loadXML method.
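
If all you have is the raw bytes of the file, you first need to turn them into a string before loadXML can take them. Here is a rough sketch of that step, assuming the bytes are already in an array of numbers (how you read them is up to your host environment) and that the document starts with '<'. Without a BOM, the byte order can be guessed from that first character: 3C 00 means little-endian, 00 3C means big-endian. decodeBomlessUtf16 is a made-up name for illustration, not an MSXML function.

```javascript
// Hypothetical helper: decode BOM-less UTF-16 bytes into a string,
// guessing the byte order from the leading '<' of the XML document.
function decodeBomlessUtf16(bytes) {
  if (bytes.length < 2 || bytes.length % 2 !== 0) {
    throw new Error("expected a non-empty, even number of bytes");
  }
  var littleEndian;
  if (bytes[0] === 0x3C && bytes[1] === 0x00) {
    littleEndian = true;   // '<' stored low byte first
  } else if (bytes[0] === 0x00 && bytes[1] === 0x3C) {
    littleEndian = false;  // '<' stored high byte first
  } else {
    throw new Error("input does not start with '<' in UTF-16");
  }
  var chars = [];
  for (var i = 0; i < bytes.length; i += 2) {
    var lo = littleEndian ? bytes[i] : bytes[i + 1];
    var hi = littleEndian ? bytes[i + 1] : bytes[i];
    // Each 16-bit code unit maps directly onto one string element.
    chars.push(String.fromCharCode((hi << 8) | lo));
  }
  return chars.join("");
}
```

The resulting string is what you would hand to xmldoc.loadXML(). Note that the guess fails if the file begins with whitespace or anything else before the first '<'.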

Regards,
Mike Sharp
