Japanese breaks up XML (msxml2)

Hi there,

I have the following problem: I wrote a simple ASP script (see function below) that allows administrators of my website to update text strings in an XML file. This works perfectly for any Unicode language, but not for Japanese. The file becomes totally corrupted and unreadable; in fact, it is "cut" at the exact point where it would normally be updated.

I suppose the problem lies either in the fact that I use an old MSXML parser (ver. 2) or in the fact that the Japanese encoding is set to Shift-JIS instead of UTF-8.

Does anyone here have any experience with this problem, and a possible solution?


Function fncUpdateXML(strLanguage, strScriptName, strNode, strText)
      Dim xmlDoc
      Set xmlDoc = Server.CreateObject("msxml2.DOMDocument")
      xmlDoc.async = False
      If NOT xmlDoc.Load("c:\testfile.xml") Then
            ' Loading failed: report it and leave the file untouched
            Response.Write "Page failed to load"
      Else
            ' Convert submitted <br> tags to line feeds, update the node, then save
            strText = Replace(strText, "<br>", vbLf)
            xmlDoc.SelectSingleNode("/languages/language[@xml:lang='" & strLanguage & "']/pages/page[@xml:page='" & strScriptName & "']/" & strNode).text = strText
            xmlDoc.save "c:\testfile.xml"
      End If
      Set xmlDoc = Nothing
End Function

No, you're using MSXML 3.  The ProgID "Msxml2.DomDocument" doesn't mean MSXML version 2... it's more like version 2 of the API.  MSXML version 2 had a different API (this was before things were settled at the W3C).  MSXML 4 uses a similar ProgID: "Msxml2.DomDocument.4.0"

Any time you have strings involved, your encoding is actually UTF-16.  This is because BSTRs are essentially UTF-16.  So somewhere along the road, your encoding is getting goofed up.  Now, if your document contains Unicode characters, and you were to try to insert Shift-JIS characters in it, you'd have a problem, as the document can only have one encoding.  If you can convert the submitted Shift-JIS to Unicode, that would be easiest.
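To make the point about encodings concrete, here is a small sketch in plain JavaScript (not ASP) of how the same Japanese text looks at the byte level. It assumes a runtime where `TextDecoder` knows the `shift_jis` label, as current Node.js builds and browsers do. It is an illustration of the general mismatch, not a diagnosis of the script above.

```javascript
// Shift-JIS bytes for the three characters 日本語.
const sjisBytes = new Uint8Array([0x93, 0xfa, 0x96, 0x7b, 0x8c, 0xea]);

// Decoded with the right label, the bytes become an ordinary UTF-16 string.
const text = new TextDecoder("shift_jis").decode(sjisBytes);

// Re-encoding that string as UTF-8 gives a different, longer byte sequence.
const utf8Bytes = new TextEncoder().encode(text);

// Returns true only if the bytes form valid UTF-8.
function isValidUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(text.length, utf8Bytes.length);
console.log(isValidUtf8(sjisBytes), isValidUtf8(utf8Bytes));
```

The Shift-JIS bytes are not valid UTF-8, so a parser that expects UTF-8 has nothing sensible to do with them. Converting to a string first and letting the document be written in its own encoding avoids mixing the two.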

Also, xml:page does not look like correct usage.  There isn't (to my knowledge) any such thing as an xml:page attribute.  "xml" is a reserved prefix, and you should avoid using it in your own semantic context.  Also, to reliably select nodes with a qualified name (such as foo:bar), you need to specify the SelectionNamespaces property in your DomDocument object.  Same goes for using XPath: selectSingleNode() defaults to the old XSL patterns language for backwards compatibility reasons, so you need to set the SelectionLanguage property as well.

Here's how to set them both (in JScript...sorry):

var xmldoc = new ActiveXObject("Msxml2.DOMDocument");
xmldoc.setProperty("SelectionLanguage", "XPath");
xmldoc.setProperty("SelectionNamespaces", "xmlns:foo='http://myserver.com' xmlns:bar='http://yourserver.com'");

This allows you to select a node using a qualified name, even if the actual prefix is different from the one in your SelectionNamespaces property.  For example, this XML:

<snafu:rootelement xmlns:snafu="http://tempuri.org">nice root element</snafu:rootelement>

can be selected by:

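var xmldoc = new ActiveXObject("Msxml2.DOMDocument");
xmldoc.async = false;
// Load the sample document shown above
xmldoc.loadXML('<snafu:rootelement xmlns:snafu="http://tempuri.org">nice root element</snafu:rootelement>');
xmldoc.setProperty("SelectionLanguage", "XPath");
xmldoc.setProperty("SelectionNamespaces", "xmlns:foo='http://tempuri.org'");
var oNode = xmldoc.selectSingleNode("foo:rootelement");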

even though the prefix in the XML is "snafu" and the prefix in the select is "foo".  It's only the namespace that counts.

Mike Sharp


vpikulaAuthor Commented:
Thanks a lot for that detailed info, rdcpro! I'll repair the xml with the syntax pointers you gave once this problem is solved.

Your suggestion is: " If you can convert the submitted Shift-JIS to unicode, that would be easiest."

How do I do this? Right now the pages I present to my users to edit the XML on are encoded in Shift-JIS.  I could easily set these to be UTF-8 (the doc is in UTF-8) so there is no problem. But at the output end, I *have* to display the same text in Shift-JIS; the devices accessing the site are mobile phones that can only accept this.

So in short, I see two solutions:
1) I use a separate XML doc, encoded in Shift-JIS
2) I let admins post in UTF-8, but convert the output to Shift-JIS

If you have a good solution for 2), I'll do that. Otherwise, I'll go for 1) and make a separate document for my Japanese texts.

How do you serve the content for your site to the users?  By any chance, do you render the XML using XSLT?

You might have to use approach 1, but XSLT does have a nice method for producing different output encodings regardless of what the XML is encoded in.  The tag:

<xsl:output method="xml" encoding="shift-jis"/>

causes all output to be encoded in the desired encoding.  MSXML supports any encoding supported by Internet Explorer.  However, you can't dynamically specify the encoding at runtime (at least not elegantly).  You'd need at least a separate root XSLT for each encoding, and then use the appropriate one at runtime.  Each root XSLT would import or include all its templates, so you wouldn't really have any redundant code.  Something like:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="xml" version="1.0" encoding="shift-jis" indent="yes"/>
      <xsl:include href="myTemplates.xslt"/>
</xsl:stylesheet>

On the other hand, if you're not using XSLT, and don't want to, you can probably use the stream object.  I believe you can set various encodings on it.  I don't have a code sample, though, as I usually end up using XSLT.  
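For the stream route, a minimal sketch might look like the following. It assumes the stream object meant here is ADODB.Stream, running under classic ASP or Windows Script Host in JScript. The `createObject` parameter and the function name are hypothetical additions so the logic can be exercised without COM; on a real server you would pass `function (progId) { return new ActiveXObject(progId); }`.

```javascript
// Sketch only: writes a string to disk in a given charset via ADODB.Stream.
// createObject is a hypothetical hook that returns a COM object for a ProgID.
function saveTextWithCharset(createObject, text, path, charset) {
  var stream = createObject("ADODB.Stream");
  stream.Type = 2;              // adTypeText
  stream.Charset = charset;     // for example "shift_jis"
  stream.Open();
  try {
    stream.WriteText(text);     // the string is converted to the charset here
    stream.SaveToFile(path, 2); // adSaveCreateOverWrite
  } finally {
    stream.Close();
  }
}
```

The string passed in is ordinary Unicode text; the stream does the conversion to the target charset when it writes, so the XML itself can stay in UTF-8.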

It looks to me like your XML file holds the content for all pages on the site, for all supported languages.  This sounds like a pretty big file, and parsing the entire thing isn't the best use of resources, I should think, considering a single site visitor will only use one locale.  There are a variety of approaches for localization... you might think about using a different approach.  For example, localizable resources go in a separate XML file for each locale, stored in a separate folder:

    |_    en_US
    |_    fr_CA

When you discover the site visitor's locale or culture code, you modify the path to the XML resource, and cache the content in the user's session, or load it each time, depending on your needs.
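The path lookup could be sketched roughly as below, in plain JavaScript. The file name `strings.xml`, the `ja_JP` entry, the default locale, and the function name are all assumptions made for illustration; only the per-locale folder idea comes from the layout above.

```javascript
// Hypothetical layout: <root>/<locale>/strings.xml, e.g. resources/ja_JP/strings.xml
var SUPPORTED_LOCALES = ["en_US", "fr_CA", "ja_JP"];
var DEFAULT_LOCALE = "en_US";

// Returns the resource file path for a locale, falling back to the default.
function resourcePathFor(locale, root) {
  // Only accept known locale codes, so user input never reaches the path directly.
  var chosen = SUPPORTED_LOCALES.indexOf(locale) !== -1 ? locale : DEFAULT_LOCALE;
  return root + "/" + chosen + "/strings.xml";
}

console.log(resourcePathFor("fr_CA", "resources"));
```

With this layout, each locale's file can carry its own encoding declaration, so the Japanese file can be saved as Shift-JIS without affecting the others.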

.NET has a better way of dealing with localizable resources, too.

Mike Sharp

vpikulaAuthor Commented:
Thanks very much Mike -- that separate folder solution (or different filenames) will work perfectly for me. No, I do not use XSLT (since I don't really understand it, heh), so separate files work best.

Great job-500 points coming your way!