Solved

Incorrect Character encoding for Href... help!

Posted on 2004-04-14
Last Modified: 2008-03-17
I have an XML file that includes non-ASCII characters such as e-acute (é) in file names.
If I take the address and type it into the browser, everything works fine.
If I type the address into an HTML file, everything works fine.
Here's the problem:
When the XML file is transformed by the XSLT file, IE percent-encodes the href using the wrong encoding, so what should be "C:\journ%E9e\FrenchJournal.doc" is output as "C:\journ%C3%A9e\FrenchJournal.doc".

I have included the encoding attribute in the XML and XSL files, and used both the xsl:output element and a META element in the HTML to make sure that the output would be UTF-8.  Nothing seems to do the trick except changing the output method from html to xml, which is not what I need.

Any help is appreciated!


Below is the source to both files:
-----------------------------------------
[FILE: results.xsl ]
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" version="4.0" encoding="UTF-8" omit-xml-declaration="yes" indent="yes" />

<xsl:template match="/">
      <html>
            <head>
                  <META http-equiv="Content-Type" content="text/html; charset=UTF-8" />
            </head>
            <body>
                  <xsl:apply-templates select="Results/SearchResults/SearchResult" />
            </body>
      </html>
</xsl:template>

<xsl:template match="SearchResult">
      This is the URL: <xsl:value-of select="SearchResultUrl" disable-output-escaping="yes" />
      <br />
      This is the literal URL: <a href="C:\journée\FrenchJournal.doc">Test</a>
      <br />
      This is the XSL URL:
      <a charset="UTF-8">
            <xsl:attribute name="href"><xsl:value-of select="SearchResultUrl" /></xsl:attribute>
            <xsl:value-of select="SearchResultUrl" />
      </a>
      <br />
      This is what internet explorer <b><u>incorrectly</u></b> outputs as the href:
      C:\journ%C3%A9e\FrenchJournal.doc
</xsl:template>
</xsl:stylesheet>
----------------------------------------------------------------
[FILE: results.xml ]
<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="results.xsl" ?>
<Results>
      <SearchResults>
            <SearchResult id="0">
                  <SearchResultUrl>file:C:\journée\FrenchJournal.doc</SearchResultUrl>
            </SearchResult>
      </SearchResults>
</Results>
Question by:yleviel

Accepted Solution

by:
rdcpro earned 350 total points
ID: 10836657
I think you have it backwards.  The byte codes for é in UTF-8 are: C3 A9 and in UTF-16 are: 00E9

IE URL-encodes each byte separately.  So assuming your desired encoding is UTF-8, then the correct URL encoded result is:

%C3%A9

and in UTF-16:

%00%E9

What you're probably looking for is to have the URL encoded with ISO-8859-1, which is how Netscape 7 does it.  This is %E9.  This is not strictly correct in my opinion, though, and IE will not do it.
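
The three byte sequences above can be checked with a short sketch. Python is used here purely for illustration and is not part of the original question. `percent_encode` is a hypothetical helper, and big-endian UTF-16 without a byte order mark is assumed so that the bytes match the 00E9 quoted above:

```python
from urllib.parse import quote

def percent_encode(text: str, encoding: str) -> str:
    """Encode text to bytes in the given encoding, then percent-escape every byte."""
    # safe="" means no byte is exempt from escaping except unreserved ASCII.
    return quote(text.encode(encoding), safe="")

utf8 = percent_encode("é", "utf-8")         # two bytes: C3 A9
utf16 = percent_encode("é", "utf-16-be")    # two bytes: 00 E9 (assumed big-endian, no BOM)
latin1 = percent_encode("é", "iso-8859-1")  # one byte: E9

print(utf8, utf16, latin1)
```

The `%E9` form the question expects is only what you get when the character is first encoded as ISO-8859-1 (or a compatible single-byte code page).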

In your case, IE is encoding the URL as UTF-8, which is exactly what you're telling it to do anyway.  It will do this (and it will encode form posts in UTF-8 as well) whenever Unicode is used.  The main reason for this is that relatively recently UTF-16 was extended with surrogates to express characters outside its original 16-bit code point range.  A surrogate pair is two 16-bit code units that together identify one such character (UTF-8 does something comparable with a lead byte followed by continuation bytes).

The problem is that many web servers cannot handle UTF-16 surrogates, so IE always uses UTF-8.  In your case, the filesystem is what's processing the URL, and it doesn't understand the UTF-8 encoding.  I suspect that if you were serving the document from, say, IIS, it would decode the URL just fine.

This KB article explains why IE uses UTF-8 for form posts, which is the same problem you're having (because the query string is in the URL):

http://support.microsoft.com/default.aspx?scid=kb;en-us;303612

I really hate to suggest this, because I think it's just going to get you in trouble elsewhere, but I was able to get this horrible hack to work:

<?xml version='1.0' encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" version="4.0" encoding="ISO-8859-1" omit-xml-declaration="yes" indent="yes"  />

<xsl:template match="/">
     <html>
          <head>
          </head>
          <body>
               <xsl:apply-templates select="Results/SearchResults/SearchResult" />
          </body>
     </html>
</xsl:template>

<xsl:template match="SearchResult">
     This is the URL: <xsl:value-of select="SearchResultUrl" disable-output-escaping="yes" />
     <br />
     This is the literal URL: <a href="C:\journée\FrenchJournal.doc">Test</a>
     <br />
     This is the XSL URL:
     <a charset="ISO-8859-1">
          <xsl:attribute name="href"><xsl:value-of select="SearchResultUrl" /></xsl:attribute>
          <xsl:value-of select="SearchResultUrl" />
     </a>
     <br />
     This is the XSL Hacked URL:
     <xsl:text disable-output-escaping="yes"><![CDATA[<a charset="ISO-8859-1" href="]]></xsl:text><xsl:value-of select="SearchResultUrl" /><xsl:text disable-output-escaping="yes"><![CDATA[">]]></xsl:text><xsl:value-of select="SearchResultUrl" /><xsl:text disable-output-escaping="yes"><![CDATA[</a>]]></xsl:text>
     <br />
</xsl:template>
</xsl:stylesheet>

At best, I would suggest you test the URL for the "file:" protocol and use this hack only in that case; if you send the URL over the internet, keep the UTF-8 encoding.

Good luck!

Regards,
Mike Sharp

Author Comment

by:yleviel
ID: 10842884
Mike,

Thank you so much for your help.  I am a bit confused though as to why the href has no encoding issues if I type the characters exactly.  I would expect that if the encoding is wrong, that the example url (C:\journée\FrenchJournal.doc) would be encoded in UTF-8 too.  But for a reason unknown to me, the "literal example" shown above:

<a href="C:\journée\FrenchJournal.doc">Test</a>

works fine.  Why does the href get encoded only when I obtain it from the XML/XSL?

-Yair

Author Comment

by:yleviel
ID: 10842969
Whoops, correction:

the "literal example" shown above:

<a href="C:\journée\FrenchJournal.doc">Test</a>

works fine... when I place the HTML code into a non-XML/XSL page, i.e. a plain HTML page.

Expert Comment

by:rdcpro
ID: 10843110
Because when you specify output method HTML in an XSLT, the parser tries to convert your XML output into legal HTML.  This includes URL-encoding characters in hrefs, which is why my ugly hack works: it fools the parser by hiding the fact that there's an href there.  But it's a hack, because simply using the é character in the URL might not work in all cases over the internet.  Characters above 127 are supposed to be encoded...

But I think you'll find the UTF-8 encoded URL will work OK if the resource you're loading is on a web server.  It's the IE/filesystem combination that's not correctly interpreting the URL encoding.  IE/web server should do just fine.  This seems like a bug in Internet Explorer to me.  That is, it should recognize a UTF-8 encoded character (which is always possible, because in a multibyte UTF-8 sequence the first byte is a lead byte that announces how many bytes follow) and correctly turn %C3%A9 into é.
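
To see why the same href behaves differently depending on who decodes it, here is a small illustrative sketch in Python (not part of the original thread). It assumes, as a simplification, that a web server decodes the escapes as UTF-8 while the local file handling treats each escaped byte as one ISO-8859-1 character. That second assumption is a guess at the observed behavior, not a description of IE's actual code:

```python
from urllib.parse import unquote

# The href that the XSLT html output method produces, as reported in the question.
href = "C:\\journ%C3%A9e\\FrenchJournal.doc"

# Decoding the escapes as UTF-8 recovers the intended folder name.
as_utf8 = unquote(href, encoding="utf-8")

# Decoding each escaped byte as a single ISO-8859-1 character does not:
# C3 becomes "Ã" and A9 becomes "©", so the path points at a folder that does not exist.
as_latin1 = unquote(href, encoding="iso-8859-1")

print(as_utf8)
print(as_latin1)
```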

Yes, it is indeed confusing!  I'm not even sure I understand all of it.  According to the KB article I posted, you cannot change this behavior.  If you could somehow persuade the parser that the characters weren't Unicode, you might get it to work.  I tried for quite a while to force the encoding to ISO-8859-1, hoping to override the behavior, but couldn't get it to work.

Do you have to load the file from the filesystem?  If this is going to be a web resource, then maybe it's not a problem at all?

Regards,
Mike Sharp

Author Comment

by:yleviel
ID: 10845546
Mike,

The issue I'm having is that the files being returned could be on the user's machine or online on some server.  Since I cannot guarantee that the end user will have IIS installed, I am forced to supply a direct link to the file or web URI.

Maybe there is some way I can trick IE into behaving the way HTML pages work.  Since I know that typing the characters literally into the HTML file works without a hitch, there should be some way to represent the information.  Either way, you've helped me tremendously, and the points will be awarded on reply.

Many Thanks!
-Yair

Expert Comment

by:rdcpro
ID: 10846645
Well, in the XSLT you can look at the URL and figure out whether it's file:// or http://, right?  In the first case, use my ugly hack (which works for the filesystem), and in the second, build the link normally.  Do this with an xsl:choose...

Or just use that hack for both situations...

     <xsl:text disable-output-escaping="yes"><![CDATA[<a charset="ISO-8859-1" href="]]></xsl:text><xsl:value-of select="SearchResultUrl" /><xsl:text disable-output-escaping="yes"><![CDATA[">]]></xsl:text><xsl:value-of select="SearchResultUrl" /><xsl:text disable-output-escaping="yes"><![CDATA[</a>]]></xsl:text>



Regards,
Mike Sharp
