Solved

Incorrect Character encoding for Href... help!

Posted on 2004-04-14
Last Modified: 2008-03-17
I have an XML file that includes non-ASCII characters such as e-acute (é) that are used in file names.
If I take the address and type it into the browser, everything works fine.
If I type the address into an HTML file, everything works fine.
Here's the problem:
When the XML file is transformed by the XSLT file, IE writes the href in the wrong encoding, so what should be "C:\journ%E9e\FrenchJournal.doc" is output as "C:\journ%C3%A9e\FrenchJournal.doc".

I have included the encoding attribute in both the XML and XSL files, and used the xsl:output element and a META element in the HTML to make sure that the output would be UTF-8.  Nothing seems to do the trick except changing the output method from html to xml, which is not what I need.

Any help is appreciated!


Below is the source to both files:
-----------------------------------------
[FILE: results.xsl ]
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" version="4.0" encoding="UTF-8" omit-xml-declaration="yes" indent="yes" />

<xsl:template match="/">
      <html>
            <head>
                  <META http-equiv="Content-Type" content="text/html; charset=UTF-8" />
            </head>
            <body>
                  <xsl:apply-templates select="Results/SearchResults/SearchResult" />
            </body>
      </html>
</xsl:template>

<xsl:template match="SearchResult">
      This is the URL: <xsl:value-of select="SearchResultUrl" disable-output-escaping="yes" />
      <br />
      This is the literal URL: <a href="C:\journée\FrenchJournal.doc">Test</a>
      <br />
      This is the XSL URL:
      <a charset="UTF-8">
            <xsl:attribute name="href"><xsl:value-of select="SearchResultUrl" /></xsl:attribute>
            <xsl:value-of select="SearchResultUrl" />
      </a>
      <br />
      This is what internet explorer <b><u>incorrectly</u></b> outputs as the href:
      C:\journ%C3%A9e\FrenchJournal.doc
</xsl:template>
</xsl:stylesheet>
----------------------------------------------------------------
[FILE: results.xml ]
<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="results.xsl" ?>
<Results>
      <SearchResults>
            <SearchResult id="0">
                  <SearchResultUrl>file:C:\journée\FrenchJournal.doc</SearchResultUrl>
            </SearchResult>
      </SearchResults>
</Results>
Question by:yleviel
Accepted Solution

by:
rdcpro earned 350 total points
ID: 10836657
I think you have it backwards.  The byte codes for é in UTF-8 are: C3 A9 and in UTF-16 are: 00E9

IE URL-encodes each byte separately.  So assuming your desired encoding is UTF-8, then the correct URL encoded result is:

%C3%A9

and in UTF-16:

%00%E9

What you're probably looking for is to have the URL encoded with ISO-8859-1, which is how Netscape 7 does it.  This is %E9.  This is not strictly correct in my opinion, though, and IE will not do it.

In your case, IE is encoding the URL as UTF-8, which is exactly what you're telling it to do anyway.  It will do this (and it will encode form posts in UTF-8 as well) whenever Unicode is used.  The main reason for this is that relatively recently UTF-16 was extended to use surrogate pairs to express characters outside its 16-bit code point range.  A surrogate pair is two 16-bit code units that together encode a single character, much as UTF-8 uses a multibyte sequence.

The problem is that many web servers cannot handle UTF-16 surrogates, so IE always uses UTF-8.  In your case, the filesystem is what's processing the URL, and it doesn't understand the UTF-8 encoding.  I suspect that if you were serving the document from, say, IIS, it would decode the URL just fine.

This KB article talks about why IE uses UTF-8 for form posts, which is the same problem you're having (because the query string is in the URL):

http://support.microsoft.com/default.aspx?scid=kb;en-us;303612

I really hate to suggest this, because I think it's just going to get you in trouble elsewhere, but I was able to get this horrible hack to work:

<?xml version='1.0' encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" version="4.0" encoding="ISO-8859-1" omit-xml-declaration="yes" indent="yes"  />

<xsl:template match="/">
     <html>
          <head>
          </head>
          <body>
               <xsl:apply-templates select="Results/SearchResults/SearchResult" />
          </body>
     </html>
</xsl:template>

<xsl:template match="SearchResult">
     This is the URL: <xsl:value-of select="SearchResultUrl" disable-output-escaping="yes" />
     <br />
     This is the literal URL: <a href="C:\journée\FrenchJournal.doc">Test</a>
     <br />
     This is the XSL URL:
     <a charset="ISO-8859-1">
          <xsl:attribute name="href"><xsl:value-of select="SearchResultUrl" /></xsl:attribute>
          <xsl:value-of select="SearchResultUrl" />
     </a>
     <br />
     This is the XSL Hacked URL:
     <xsl:text disable-output-escaping="yes"><![CDATA[<a charset="ISO-8859-1" href="]]></xsl:text><xsl:value-of select="SearchResultUrl" /><xsl:text disable-output-escaping="yes"><![CDATA[">]]></xsl:text><xsl:value-of select="SearchResultUrl" /><xsl:text disable-output-escaping="yes"><![CDATA[</a>]]></xsl:text>
     <br />
</xsl:template>
</xsl:stylesheet>

At best, I would suggest you test the URL for the "file:" protocol and use this hack only in that case; if the link goes over the internet, keep the UTF-8 encoding.

Good luck!

Regards,
Mike Sharp

Author Comment

by:yleviel
ID: 10842884
Mike,

Thank you so much for your help.  I am a bit confused, though, as to why the href has no encoding issues if I type the characters exactly.  I would expect that if the encoding is wrong, the example URL (C:\journée\FrenchJournal.doc) would be encoded in UTF-8 too.  But for a reason unknown to me, the "literal example" shown above:

<a href="C:\journée\FrenchJournal.doc">Test</a>

works fine.  Why does the href get encoded only when I obtain it from the XML/XSL?

-Yair

Author Comment

by:yleviel
ID: 10842969
Whoops, correction:

the "literal example" shown above:

<a href="C:\journée\FrenchJournal.doc">Test</a>

works fine... when I place the HTML code into a non-XML/XSL page, i.e. a plain HTML page.

Expert Comment

by:rdcpro
ID: 10843110
Because when you specify output method HTML in an XSLT, the XSLT processor tries to convert your output into legal HTML.  This includes URL-encoding non-ASCII characters in hrefs, which is why my ugly hack works: it fools the processor by hiding the fact that there's an href there.  But it's a hack, because simply using the é character in the URL might not work in all cases over the internet.  Characters above 127 are supposed to be encoded...

But I think you'll find the UTF-8 encoded URL will work OK if the resource you're loading is on a web server.  It's the IE/filesystem combination that's not correctly interpreting the URL encoding; IE with a web server should do just fine.  This seems like a bug in Internet Explorer to me.  That is, it should recognize a UTF-8 encoded character (which is always possible, because the first byte of a multibyte sequence is a lead byte that announces the sequence) and correctly turn %C3%A9 into é.

Yes, it is indeed confusing!  I'm not even sure I understand all of it.  According to that KB article I posted, you cannot change this behavior.  If you could somehow persuade the processor that the characters weren't Unicode, you might get it to work.  I tried for quite a while to get the encoding to be ISO-8859-1, hoping to override the behavior, but couldn't get it to work.

Do you have to load the file from the filesystem?  If this is going to be a web resource, then maybe it's not a problem at all?

Regards,
Mike Sharp

Author Comment

by:yleviel
ID: 10845546
Mike,

The issue I'm having is that the files being returned could be on the user's machine or online on some server.  Since I cannot guarantee that the end user will have IIS installed, I am forced to supply a direct link to the file or web URI.

Maybe there is some way I can trick IE into behaving the way HTML pages work.  Since I know that typing the characters literally into the HTML file works without a hitch, there should be some way to represent the information.  Either way, you've helped me tremendously, and the points will be awarded on reply.

Many Thanks!
-Yair

Expert Comment

by:rdcpro
ID: 10846645
Well, in the XSLT you can look at the URL, and figure out whether it's file:// or http://, right?  In the one case, use my ugly hack (which works for filesystem) and in the other, use it normally.  Do this with an xsl:choose...

Or just use that hack for both situations...

     <xsl:text disable-output-escaping="yes"><![CDATA[<a charset="ISO-8859-1" href="]]></xsl:text><xsl:value-of select="SearchResultUrl" /><xsl:text disable-output-escaping="yes"><![CDATA[">]]></xsl:text><xsl:value-of select="SearchResultUrl" /><xsl:text disable-output-escaping="yes"><![CDATA[</a>]]></xsl:text>
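
A minimal sketch of the xsl:choose approach described above, assuming local links always begin with the literal prefix "file:" (as in the sample results.xml) and that everything else is a web URL. The first branch reuses the disable-output-escaping hack unchanged; the second builds the anchor normally, so the html output method escapes the href as UTF-8. Note that starts-with() is case-sensitive, and a bare path such as C:\... with no scheme would fall through to the second branch.

```xml
<xsl:template match="SearchResult">
     <xsl:choose>
          <!-- Local file: write the anchor as raw text so the href is not URL-escaped -->
          <xsl:when test="starts-with(SearchResultUrl, 'file:')">
               <xsl:text disable-output-escaping="yes"><![CDATA[<a charset="ISO-8859-1" href="]]></xsl:text>
               <xsl:value-of select="SearchResultUrl" />
               <xsl:text disable-output-escaping="yes"><![CDATA[">]]></xsl:text>
               <xsl:value-of select="SearchResultUrl" />
               <xsl:text disable-output-escaping="yes"><![CDATA[</a>]]></xsl:text>
          </xsl:when>
          <!-- Anything else (assumed to be a web URL): let the html output method escape the href -->
          <xsl:otherwise>
               <a href="{SearchResultUrl}">
                    <xsl:value-of select="SearchResultUrl" />
               </a>
          </xsl:otherwise>
     </xsl:choose>
     <br />
</xsl:template>
```

Because the first branch writes the tag as raw text, a URL containing a double quote would break the generated markup, so this is probably only safe when the file names are known not to contain one.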



Regards,
Mike Sharp
