Japanese breaks up XML (msxml2)

Posted on 2003-11-30
Last Modified: 2013-11-19
Hi there,

I have the following problem: I wrote a simple ASP script (see function below) that allows administrators of my website to update text strings in an XML file. This works perfectly for every other language I store as Unicode, but not for Japanese. The file becomes totally corrupted and unreadable; in fact, it is cut off at the exact point where it should have been updated.

I suppose the problem lies either in the fact that I use an old MSXML parser (ver. 2) or in the fact that the Japanese encoding is set to Shift-JIS instead of UTF-8.

Does anyone here have any experience with this problem, and a possible solution?


Function fncUpdateXML(strLanguage, strScriptName, strNode, strText)
      Set xmlDoc = Server.CreateObject("msxml2.DOMDocument")
      xmlDoc.async = False
      If NOT xmlDoc.Load("c:\testfile.xml") Then
            Response.Write "Page failed to load"
      Else
            ' Turn HTML line breaks from the edit form back into line feeds
            strText = Replace(strText, "<br>", vbLf)
            ' Update the text of the selected node, then write the file back
            xmlDoc.SelectSingleNode("/languages/language[@xml:lang='" & strLanguage & "']/pages/page[@xml:page='" & strScriptName & "']/" & strNode).text = strText
            xmlDoc.Save "c:\testfile.xml"
      End If
      Set xmlDoc = Nothing
End Function
Question by:vpikula

Expert Comment

ID: 9848729
No, you're using MSXML 3.  The ProgID "Msxml2.DOMDocument" doesn't mean MSXML version 2; it's more like version 2 of the API.  MSXML version 2 had a different API (this was before things were settled at the W3C).  MSXML 4 uses a similar ProgID: "Msxml2.DOMDocument.4.0"

Any time you have strings involved, your encoding is actually UTF-16.  This is because BSTRs are essentially UTF-16.  So somewhere along the road, your encoding is getting goofed up.  Now, if your document contains Unicode characters, and you were to try to insert Shift-JIS characters in it, you'd have a problem, as the document can only have one encoding.  If you can convert the submitted Shift-JIS to Unicode, that would be easiest.

Also, xml:page does not look like correct usage.  There isn't (to my knowledge) any such thing as an xml:page attribute.  "xml" is a reserved prefix, and you should avoid using it in your own semantic context.  Also, to reliably select nodes with a qualified name (such as foo:bar), you need to specify the SelectionNamespaces property in your DOMDocument object.  The same goes for using XPath: selectSingleNode() defaults to the old XSL patterns language for backwards compatibility reasons, so you need to set the SelectionLanguage property.

Here's how to set them both (in JScript...sorry):

var xmldoc = new ActiveXObject("Msxml2.DOMDocument");
xmldoc.setProperty("SelectionLanguage", "XPath");
xmldoc.setProperty("SelectionNamespaces", "xmlns:foo='' xmlns:bar=''");

This allows you to select a node using a qualified name, even if the actual prefix is different from the one in your SelectionNamespaces property.  For example, this XML:

<snafu:rootelement xmlns:snafu="">nice root element</snafu:rootelement>

can be selected by:

var xmldoc = new ActiveXObject("Msxml2.DOMDocument");
xmldoc.setProperty("SelectionLanguage", "XPath");
xmldoc.setProperty("SelectionNamespaces", "xmlns:foo=''");
var oNode = xmldoc.selectSingleNode("foo:rootelement")

even though the prefix in the XML is "snafu" and the prefix in the select is "foo".  It's only the namespace that counts.

Mike Sharp



Author Comment

ID: 9849027
Thanks a lot for that detailed info, rdcpro! I'll repair the xml with the syntax pointers you gave once this problem is solved.

Your suggestion is: " If you can convert the submitted Shift-JIS to unicode, that would be easiest."

How do I do this? Right now the pages I present to my users to edit the XML on are encoded in Shift-JIS.  I could easily set these to be UTF-8 (the doc is in UTF-8) so there is no problem. But at the output end, I *have* to display the same text in Shift-JIS; the devices accessing the site are mobile phones that can only accept this.

So in short, I see two solutions:
1) I use a separate XML doc, encoded in Shift-JIS
2) I let admins post in UTF-8, but convert the output to Shift-JIS

If you have a good solution for 2), I'll do that. Otherwise, I'll go for 1) and make a separate document for my Japanese texts.


Accepted Solution

rdcpro earned 500 total points
ID: 9850400
How do you serve the content for your site to the users?  By any chance, do you render the XML using XSLT?

You might have to use approach 1, but XSLT does have a nice method for producing different output encodings regardless of what the XML is encoded in.  The tag:

<xsl:output method="xml" encoding="shift-jis"/>

causes all output to be encoded in the desired encoding.  MSXML supports any encoding supported by Internet Explorer.  However, you can't dynamically specify the encoding at runtime (at least not elegantly).  You'd need at least a separate root XSLT for each encoding, and then use the appropriate one at runtime.  Each root XSLT would import or include all its templates, so you wouldn't really have any redundant code.  Something like:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="xml" version="1.0" encoding="shift-jis" indent="yes"/>
      <xsl:include href="myTemplates.xslt"/>
</xsl:stylesheet>
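
Picking the root stylesheet at runtime could look roughly like the sketch below (JScript-style JavaScript). The file names and the helpers rootStylesheetFor and renderTo are hypothetical, not part of MSXML. One detail that matters here: transformNode returns a string, which is always UTF-16, so the xsl:output encoding is lost; transformNodeToObject writing to a stream (such as the ASP Response object) should honor it.

```javascript
// Hypothetical mapping from output encoding to its root stylesheet file.
var ROOT_STYLESHEETS = {
    "shift-jis": "root-sjis.xslt",
    "utf-8": "root-utf8.xslt"
};

// Returns the root stylesheet file name for the requested encoding.
function rootStylesheetFor(encoding) {
    var key = String(encoding).toLowerCase();
    if (!Object.prototype.hasOwnProperty.call(ROOT_STYLESHEETS, key)) {
        throw new Error("No root stylesheet for encoding: " + encoding);
    }
    return ROOT_STYLESHEETS[key];
}

// Runs the transform straight into `output` (for example the ASP Response
// object) so the encoding declared in xsl:output is applied to the bytes.
// xmlDoc and xslDoc are already-loaded MSXML DOMDocument objects.
function renderTo(xmlDoc, xslDoc, output) {
    xmlDoc.transformNodeToObject(xslDoc, output);
}
```

In an ASP page you would load the file returned by rootStylesheetFor("shift-jis") into a second DOMDocument and pass it, together with Response, to renderTo.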

On the other hand, if you're not using XSLT, and don't want to, you can probably use the stream object.  I believe you can set various encodings on it.  I don't have a code sample, though, as I usually end up using XSLT.  
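
For what it's worth, a minimal sketch of the stream approach, assuming the stream object meant here is ADODB.Stream. The helper name encodeText is made up for illustration, and the stream is passed in rather than created inside the function so the helper has no hard dependency on COM.

```javascript
// ADODB.Stream type constants (StreamTypeEnum values).
var adTypeBinary = 1;
var adTypeText = 2;

// Encodes `text` in `charset` and returns the raw bytes.
// `stream` is a closed ADODB.Stream instance supplied by the caller.
function encodeText(stream, text, charset) {
    stream.Type = adTypeText;
    stream.Charset = charset;    // for example "shift_jis"
    stream.Open();
    stream.WriteText(text);      // the UTF-16 string is converted on write
    stream.Position = 0;         // rewind; Type can only change at position 0
    stream.Type = adTypeBinary;
    var bytes = stream.Read();   // the whole stream as a byte array
    stream.Close();
    return bytes;
}

// Hypothetical use in a classic ASP page (JScript):
//   var stream = Server.CreateObject("ADODB.Stream");
//   Response.Charset = "shift_jis";
//   Response.BinaryWrite(encodeText(stream, strText, "shift_jis"));
```

This keeps the XML file in UTF-8 and converts only at the output end, which is option 2) from the question.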

It looks to me like your XML file holds the content for all pages on the site, for all supported languages.  This sounds like a pretty big file, and parsing the entire thing isn't the best use of resources, I should think, considering a single site visitor will only use one locale.  You might think about using a different approach.  For example, localizable resources go in a separate XML file for each locale, stored in a separate folder:

    |_    en_US
    |_    fr_CA

When you discover the site visitor's locale or culture code, you modify the path to the XML resource, and cache the content in the user's session, or load it each time, depending on your needs.
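
As a rough illustration of that path lookup (JScript-style JavaScript): the ja_JP entry, the default locale, the file name strings.xml and the function name resourcePathFor are all assumptions made for the example.

```javascript
// Hypothetical list of locales that have their own resource folder.
var SUPPORTED_LOCALES = ["en_US", "fr_CA", "ja_JP"];
var DEFAULT_LOCALE = "en_US";

// Builds the path to the per-locale XML resource file.
// Unknown locales fall back to the default instead of producing a bad path.
function resourcePathFor(baseDir, locale) {
    var chosen = DEFAULT_LOCALE;
    for (var i = 0; i < SUPPORTED_LOCALES.length; i++) {
        if (SUPPORTED_LOCALES[i] === locale) {
            chosen = locale;
            break;
        }
    }
    return baseDir + "\\" + chosen + "\\strings.xml";
}
```

Checking the locale against a fixed list also keeps a visitor-supplied value from being pasted straight into a file path.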

.NET has a better way of dealing with localizable resources, too.

Mike Sharp

Author Comment

ID: 9851121
Thanks very much Mike -- that separate folder solution (or different filenames) will work perfectly for me. No, I do not use XSLT (since I don't really understand it, heh), so separate files work best.

Great job-500 points coming your way!
