Solved

Processing UTF-16 encoded xml file

Posted on 2004-03-30
Last Modified: 2008-02-01
I am using MSXML 4 to load an XML file into a DOM tree. The problem is that the XML file is not well formed, because it:

- does not specify the encoding
- does not have a byte-order mark at the beginning of the document
- contains UTF-16 encoded data

As a result, I get an "invalid character found" error when building the DOM tree. As I have no control over the XML generation side to correct this problem, I need a workaround. One way I can think of is to dynamically insert the encoding into the XML file, but I am looking for a better way. Is there any option in MSXML to specify a "default encoding" to use when no encoding is specified? I am looking for something like this:

document->load("myfile.xml", Encoding::UTF-16);

Any help is appreciated.
Question by:onlygo
4 Comments
 
LVL 12

Accepted Solution

by:
dfiala13 earned 125 total points
ID: 10720861
You can try this:
Create a new XML document, then add a processing instruction that declares the proper character encoding.

var pi = xmldoc.createProcessingInstruction("xml"," version='1.0' encoding='UTF-16'");
xmldoc.appendChild(pi);
xmldoc.save("newfile.xml");

Then load the suspect XML using loadXML:

xmldoc.loadXML(sXML);

Here's an interesting link on encoding and MSXML:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnxml/html/xmlencodings.asp

LVL 26

Assisted Solution

by:rdcpro
rdcpro earned 125 total points
ID: 10721422
That doesn't seem like it will work to me. When you call loadXML() or load(), it blows away anything that was already in the document. I would be surprised if the previous encoding persisted.

HOWEVER, the loadXML method presumes UTF-16, so using a string-based load rather than the IStream version will probably work by itself.  But if necessary, you can prepend the PI too:

xmldoc.loadXML("<?xml version='1.0' encoding='utf-16' ?>" + sXML);
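If the incoming string sometimes already carries its own declaration, prepending a second one would leave a stray declaration in the middle of the document. A minimal sketch of a guard, in plain JavaScript; the helper name withUtf16Declaration is hypothetical and not part of MSXML:

```javascript
// Hypothetical helper (not an MSXML API): prepend a UTF-16 XML declaration
// only when the string does not already start with one.
function withUtf16Declaration(sXML) {
  var declaration = "<?xml version='1.0' encoding='utf-16' ?>";
  // "<?xml" followed by whitespace is an XML declaration; this deliberately
  // does not match other processing instructions such as "<?xml-stylesheet".
  if (/^<\?xml[ \t\r\n]/.test(sXML)) {
    return sXML;
  }
  return declaration + sXML;
}
```

The result would then be passed to the same call as above, for example xmldoc.loadXML(withUtf16Declaration(sXML)).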

With newer versions of MSXML, you can even force UTF-8 encoding by specifying utf-8 in the PI.  Note, however, that the character data passed to loadXML must still be UTF-16, because all strings are BSTRs, which are essentially UTF-16.  It's also worth noting that if you use the xml property, as in:

strXml = xmlDoc.xml

then the data is UTF-16 and there will be no byte order mark either!  But it doesn't matter, unless the byte order is odd anyway.
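If such a string is ever written out as raw bytes by hand, a consumer that relies on the byte order mark would need one added explicitly. A rough sketch of that step in plain JavaScript, assuming little-endian output is wanted; the function name encodeUtf16LeWithBom is hypothetical and nothing here is an MSXML call:

```javascript
// Hypothetical helper: encode a JavaScript string as UTF-16LE bytes,
// preceded by the little-endian byte order mark (FF FE).
function encodeUtf16LeWithBom(text) {
  var bytes = [0xff, 0xfe];
  for (var i = 0; i < text.length; i++) {
    // JavaScript strings are sequences of 16-bit code units, so each unit
    // maps directly onto two bytes, low byte first.
    var unit = text.charCodeAt(i);
    bytes.push(unit & 0xff, unit >> 8);
  }
  return bytes;
}
```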

Summary:

Don't use the IStream-based load() method.  Use the string-based (i.e. UTF-16) loadXML method.
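The string-based route assumes the file's bytes have already been turned into a string. A minimal sketch of that step in plain JavaScript, not tied to MSXML: it guesses the byte order from the first two bytes, on the assumption that the document starts directly with '<' and has no byte order mark, then decodes the code units. Both function names are hypothetical.

```javascript
// Guess the byte order of BOM-less UTF-16 XML from its first two bytes.
// Assumes the document starts with '<' (0x3C): UTF-16LE then begins 3C 00
// and UTF-16BE begins 00 3C. Returns null when neither pattern matches.
function guessUtf16ByteOrder(bytes) {
  if (bytes.length < 2) return null;
  if (bytes[0] === 0x3c && bytes[1] === 0x00) return "LE";
  if (bytes[0] === 0x00 && bytes[1] === 0x3c) return "BE";
  return null;
}

// Decode an array of byte values into a JavaScript string, one 16-bit
// code unit at a time. A trailing odd byte, if any, is ignored.
function decodeUtf16(bytes, byteOrder) {
  var units = [];
  for (var i = 0; i + 1 < bytes.length; i += 2) {
    var unit = byteOrder === "LE"
      ? bytes[i] | (bytes[i + 1] << 8)
      : (bytes[i] << 8) | bytes[i + 1];
    units.push(String.fromCharCode(unit));
  }
  return units.join("");
}
```

The decoded string is what would be handed to loadXML. If guessUtf16ByteOrder returns null, the input is probably not BOM-less UTF-16 and should not be decoded this way.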

Regards,
Mike Sharp
