Solved

Processing UTF-16 encoded xml file

Posted on 2004-03-30
4
1,510 Views
Last Modified: 2008-02-01
I am using msxml 4 to load a xml file into DOM tree. The problem is, the XML file is not well formed as it:

- does not specify the encoding
- does not have a byte-order mark at the beginning of the document
- contains UTF-16 encoded data

As a result, I got a "invalid character found" error when building the DOM tree. As I have no control on the XML generation side to correct this problem, I ened a workaround. One way I can think of is to dynamically insert the encoding into the xml file, but I am looking for a better way. Is there any option in msxml to specify a "default encoding" in case no encoding is specified? I am looking for something like that:

document->load("myfile.xml", Encoding::UTF-16);

Any helps is appreciated.
0
Comment
Question by:onlygo
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
4 Comments
 
LVL 12

Accepted Solution

by:
dfiala13 earned 125 total points
ID: 10720861
You can try this:
Create a new XML document, then add a processing instruction with the proper character type.

var pi = xmldoc.createProcessingInstruction("xml"," version='1.0' encoding='UTF-16'");
xmldoc.appendChild(pi);
xmldoc.save("newfile.xml")

then load in the suspect XML using LoadXML

xmldoc.LoadXML(sXML)

Here's ain interesting link on encoding and MSXML

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnxml/html/xmlencodings.asp

0
 
LVL 26

Assisted Solution

by:rdcpro
rdcpro earned 125 total points
ID: 10721422
That doesn't seem like it will work to me.  When you loadXML() or load(), it blows away anything that was there.  I would be surprised if the previous encoding persisted.

HOWEVER, the loadXML method presumes UTF-16, so using a string-based load rather than the IStream version will probably work by itself.  But if necessary, you can prepend the PI too:

xmldoc.LoadXML("<?xml version='1.0' encoding='utf-16' ?>" + sXML)

With newer versions of MSXML, you can even force UTF-8 encoding by specifying utf-8 in the PI.  Note, however, that the character data in the loadXML still must be UTF-16, because all strings are BStr, which is essentially UTF-16.  It's also worth noting that if you use the xml property, as in:

strXml = xmlDoc.xml

then the data is UTF-16 and there will be no byte order mark either!  But it doesn't matter, unless the byte order is odd anyway.

Summary:

don't use the IStream-based load() method.  Use the string-based (ie: UTF-16) loadXML method.

Regards,
Mike Sharp

0

Featured Post

The Orion Papers

Are you interested in becoming an AWS Certified Solutions Architect?

Discover a new interactive way of training for the exam.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Introduction In my previous article (http://www.experts-exchange.com/Microsoft/Development/MS-SQL-Server/SSIS/A_9150-Loading-XML-Using-SSIS.html) I showed you how the XML Source component can be used to load XML files into a SQL Server database, us…
Create a Windows 10 custom Image with custom task bar and custom start menu using XML for deployment.
NetCrunch network monitor is a highly extensive platform for network monitoring and alert generation. In this video you'll see a live demo of NetCrunch with most notable features explained in a walk-through manner. You'll also get to know the philos…
Do you want to know how to make a graph with Microsoft Access? First, create a query with the data for the chart. Then make a blank form and add a chart control. This video also shows how to change what data is displayed on the graph as well as form…

728 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question