• Status: Solved

Incorrect Character encoding for Href... help!

I have an XML file that includes non-ASCII characters such as e-acute (é), which are used in file names.
If I take the address and type it into the browser, everything works fine.
If I type the address into an HTML file, everything works fine.
Here's the problem:
When the XML file is transformed by the XSLT file, IE percent-encodes the href in the wrong encoding, so what should be "C:\journ%E9e\FrenchJournal.doc" is output as "C:\journ%C3%A9e\FrenchJournal.doc".

I have included the encoding attribute in both the XML and XSL files, and I have used the xsl:output element and a META element in the HTML to make sure that the output would be UTF-8.  Nothing seems to do the trick except changing the output method from html to xml, which is not what I need.

Any help is appreciated!


Below is the source to both files:
-----------------------------------------
[FILE: results.xsl ]
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" version="4.0" encoding="UTF-8" omit-xml-declaration="yes" indent="yes" />

<xsl:template match="/">
      <html>
            <head>
                  <META http-equiv="Content-Type" content="text/html; charset=UTF-8" />
            </head>
            <body>
                  <xsl:apply-templates select="Results/SearchResults/SearchResult" />
            </body>
      </html>
</xsl:template>

<xsl:template match="SearchResult">
      This is the URL: <xsl:value-of select="SearchResultUrl" disable-output-escaping="yes" />
      <br />
      This is the literal URL: <a href="C:\journée\FrenchJournal.doc">Test</a>
      <br />
      This is the XSL URL:
      <a charset="UTF-8">
            <xsl:attribute name="href"><xsl:value-of select="SearchResultUrl" /></xsl:attribute>
            <xsl:value-of select="SearchResultUrl" />
      </a>
      <br />
      This is what internet explorer <b><u>incorrectly</u></b> outputs as the href:
      C:\journ%C3%A9e\FrenchJournal.doc
</xsl:template>
</xsl:stylesheet>
----------------------------------------------------------------
[FILE: results.xml ]
<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="results.xsl" ?>
<Results>
      <SearchResults>
            <SearchResult id="0">
                  <SearchResultUrl>file:C:\journée\FrenchJournal.doc</SearchResultUrl>
            </SearchResult>
      </SearchResults>
</Results>
Asked by: yleviel
 
rdcpro commented:
I think you have it backwards.  The bytes for é in UTF-8 are C3 A9, and in UTF-16 they are 00 E9.

IE URL-encodes each byte separately.  So assuming your desired encoding is UTF-8, then the correct URL encoded result is:

%C3%A9

and in UTF-16:

%00%E9

What you're probably looking for is to have the URL encoded with ISO-8859-1, which is how Netscape 7 does it.  This is %E9.  This is not strictly correct in my opinion, though, and IE will not do it.

In your case, IE is encoding the URL as UTF-8, which is exactly what you're telling it to do anyway.  It will do this (and it will encode form posts in UTF-8 as well) whenever Unicode is used.  The main reason for this is that relatively recently UTF-16 was extended with surrogates to express characters outside its 16-bit code point range.  A surrogate pair is two 16-bit code units that together encode a single character above U+FFFF, somewhat like the multibyte sequences UTF-8 uses.

The problem is that many web servers cannot handle UTF-16 surrogates, so IE always uses UTF-8.  In your case, the filesystem is what's processing the URL, and it doesn't understand the UTF-8 encoding.  I suspect that if you were serving the document from, say, IIS, it would decode the URL just fine.

This KB article talks about why IE uses UTF-8 for form posts, which is the same problem you're having (because the query string is in the URL):

http://support.microsoft.com/default.aspx?scid=kb;en-us;303612

I really hate to suggest this, because I think it's just going to get you in trouble elsewhere, but I was able to get this horrible hack to work:

<?xml version='1.0' encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" version="4.0" encoding="ISO-8859-1" omit-xml-declaration="yes" indent="yes"  />

<xsl:template match="/">
     <html>
          <head>
          </head>
          <body>
               <xsl:apply-templates select="Results/SearchResults/SearchResult" />
          </body>
     </html>
</xsl:template>

<xsl:template match="SearchResult">
     This is the URL: <xsl:value-of select="SearchResultUrl" disable-output-escaping="yes" />
     <br />
     This is the literal URL: <a href="C:\journée\FrenchJournal.doc">Test</a>
     <br />
     This is the XSL URL:
     <a charset="ISO-8859-1">
          <xsl:attribute name="href"><xsl:value-of select="SearchResultUrl" /></xsl:attribute>
          <xsl:value-of select="SearchResultUrl" />
     </a>
     <br />
     This is the XSL Hacked URL:
     <xsl:text disable-output-escaping="yes"><![CDATA[<a charset="ISO-8859-1" href="]]></xsl:text><xsl:value-of select="SearchResultUrl" /><xsl:text disable-output-escaping="yes"><![CDATA[">]]></xsl:text><xsl:value-of select="SearchResultUrl" /><xsl:text disable-output-escaping="yes"><![CDATA[</a>]]></xsl:text>
     <br />
</xsl:template>
</xsl:stylesheet>

At best, I would suggest you test the URL for the "file:" protocol and use this hack only in that case; if you send the URL over the internet, keep the UTF-8 encoding.

Good luck!

Regards,
Mike Sharp
 
yleviel (author) commented:
Mike,

Thank you so much for your help.  I am a bit confused though as to why the href has no encoding issues if I type the characters exactly.  I would expect that if the encoding is wrong, that the example url (C:\journée\FrenchJournal.doc) would be encoded in UTF-8 too.  But for a reason unknown to me, the "literal example" shown above:

<a href="C:\journée\FrenchJournal.doc">Test</a>

works fine.  Why does the href get encoded only when I obtain it from the XML/XSL?

-Yair
 
yleviel (author) commented:
Whoops, correction:

the "literal example" shown above:

<a href="C:\journée\FrenchJournal.doc">Test</a>

works fine... when I place the HTML code into a non-XML/XSL page, i.e. a plain HTML page.
 
rdcpro commented:
Because when you specify output method HTML in an XSLT, the parser tries to convert your XML output into legal HTML.  This includes URL-encoding characters in hrefs (the XSLT 1.0 spec says the html output method should escape non-ASCII characters in URI attribute values as UTF-8 %HH sequences), which is why my ugly hack works: it fools the parser by hiding the fact that there's an href there.  But it's a hack, because simply using the é character in the URL might not work in all cases over the internet.  Characters above 127 are supposed to be encoded...

But I think you'll find the UTF-8-encoded URL will work OK if the resource you're loading is on a web server.  It's the IE/filesystem combination that's not correctly interpreting the URL encoding.  IE/web server should do just fine.  This seems like a bug in Internet Explorer to me.  That is, it should recognize a UTF-8-encoded character (which is always possible, because in a multibyte sequence the first byte is a lead byte that marks the start and length of the sequence) and correctly turn %C3%A9 into é.

Yes, it is indeed confusing!  I'm not even sure I understand all of it.  According to that KB article I posted, you cannot change this behavior.  If you could somehow persuade the parser that the characters weren't Unicode, you might get it to work.  I tried for quite a while to get the encoding to be ISO-8859-1, hoping to override the behavior, but couldn't get it to work.

Do you have to load the file from the filesystem?  If this is going to be a web resource, then maybe it's not a problem at all?

Regards,
Mike Sharp
 
yleviel (author) commented:
Mike,

The issue I'm having is that the files being returned could be on the user's machine or online on some server.  Since I cannot guarantee that the end user will have IIS installed, I am forced to supply a direct link to the file or web URI.

Maybe there is some way I can trick IE into behaving the way HTML pages work.  Since I know that typing the characters literally into the HTML file works without a hitch, there should be some way to represent the information.  Either way, you've helped me tremendously, and the points will be awarded on reply.

Many Thanks!
-Yair
 
rdcpro commented:
Well, in the XSLT you can look at the URL, and figure out whether it's file:// or http://, right?  In the one case, use my ugly hack (which works for filesystem) and in the other, use it normally.  Do this with an xsl:choose...
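
A minimal sketch of that xsl:choose, assuming the same element names as results.xml (Results/SearchResults/SearchResult/SearchResultUrl) and the ISO-8859-1 output settings from the hack stylesheet. The starts-with() test on "file:" is only an illustration: it is case-sensitive, so adjust it if your URLs might be written differently. Treat this as a starting point rather than a tested fix for every IE version.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" version="4.0" encoding="ISO-8859-1" indent="yes" />

<xsl:template match="/">
     <html>
          <body>
               <xsl:apply-templates select="Results/SearchResults/SearchResult" />
          </body>
     </html>
</xsl:template>

<xsl:template match="SearchResult">
     <xsl:choose>
          <!-- Local file link: write the <a> tag as raw text so the html
               output method never sees an href attribute to percent-encode.
               Assumption: the URL contains no quotes or markup characters,
               since it is copied into hand-built markup. -->
          <xsl:when test="starts-with(SearchResultUrl, 'file:')">
               <xsl:text disable-output-escaping="yes"><![CDATA[<a href="]]></xsl:text>
               <xsl:value-of select="SearchResultUrl" />
               <xsl:text disable-output-escaping="yes"><![CDATA[">]]></xsl:text>
               <xsl:value-of select="SearchResultUrl" />
               <xsl:text disable-output-escaping="yes"><![CDATA[</a>]]></xsl:text>
          </xsl:when>
          <!-- Anything else (http:// and so on): build a normal href and let
               the processor apply its usual UTF-8 percent-encoding. -->
          <xsl:otherwise>
               <a href="{SearchResultUrl}">
                    <xsl:value-of select="SearchResultUrl" />
               </a>
          </xsl:otherwise>
     </xsl:choose>
     <br />
</xsl:template>
</xsl:stylesheet>
```

With the results.xml from the question, the SearchResultUrl starts with "file:", so the first branch is the one taken.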

Or just use that hack for both situations...

     <xsl:text disable-output-escaping="yes"><![CDATA[<a charset="ISO-8859-1" href="]]></xsl:text><xsl:value-of select="SearchResultUrl" /><xsl:text disable-output-escaping="yes"><![CDATA[">]]></xsl:text><xsl:value-of select="SearchResultUrl" /><xsl:text disable-output-escaping="yes"><![CDATA[</a>]]></xsl:text>



Regards,
Mike Sharp
