dc197
Getting non-ascii chars in URLs passed correctly. Getting 404s due to "ascii-ifying" URLs
I have some files on my web server with non-ascii (in fact, non-Latin) chars in their filenames.
For example:
ről.txt
That is, letter r, then letter o with double acute accent, letter L, and the extension. The double acute accent is detailed here:
http://en.wikipedia.org/wiki/Double_acute_accent
It looks like this site is also unable to show the character correctly. Tsk tsk.
When I make a request for the page, I always get a 404. Looking in the logs I see that the server is seeking rol.txt, and of course not finding it.
This is true whether I request
http://server/ről.txt
or
http://server/r%c5%91l.txt
In the IIS logs I see:
21:45:38 127.0.0.1 127.0.0.1 GET /rol.txt - 404 3535 819 40 ...
My Web.sitemap contains the correct URL with the accented character.
It appears that the browser is making the correct request, but that the server is not seeing the right chars, URL-encoded or not.
Please advise how to make this work.
Note that many URL-encoding schemes follow the non-internationalised standard, i.e. they support only 7-bit ASCII or, if you are lucky, 8-bit characters. Neither of these supports Unicode directly.
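To make the difference concrete, here is a minimal Python sketch (Python is just a convenient way to show the bytes; the site in question is ASP.NET on IIS). It shows the UTF-8 percent-encoded form the browser sends, and an ASCII "folding" step that reproduces the rol.txt seen in the log. The helper name fold_to_ascii is hypothetical, and the folding is only an assumption about what the symptom looks like, not a statement of how IIS5 works internally.

```python
import unicodedata
from urllib.parse import quote, unquote

name = "ről.txt"  # r, o with double acute (U+0151), l

# Percent-encode the UTF-8 bytes of the name, as a browser would.
encoded = quote(name)

# Decoding is case-insensitive for the hex digits, so %c5%91 works too.
decoded = unquote("r%c5%91l.txt")


def fold_to_ascii(text: str) -> str:
    """Hypothetical helper: strip accents by decomposing and dropping non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


folded = fold_to_ascii(name)

print(encoded)  # r%C5%91l.txt
print(decoded)  # ről.txt
print(folded)   # rol.txt
```

If the server logs the folded form rather than the decoded one, the accent is being lost on the server side, not in the browser.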
ASKER
I developed the site using VS2005 and its internal webserver. This fully supports Unicode chars, and the site works just fine, whether one uses URL-encoded requests or not.
I copied the source over to my dev machine for testing, which is running IIS5 as the webserver. This is where the problem arose. Is IIS5 the culprit? Is IIS6 better?
ASKER CERTIFIED SOLUTION
ASKER
OK it looks like IIS6 has support for unicode URLs (http://msdn.microsoft.com/msdnmag/issues/02/03/IIS6/) as well as better logging. That's good, because the site I wish to host heavily features the Hungarian language, in which o is not the same as ó, ö or ő.
Cheers.
ASKER
It would appear that this site's ability to handle non-standard chars sucks, too.
Actually, they have recently opened an Experts Exchange bug on this issue, and I have been assisting!
It's a fairly standard issue. For example the PHP language doesn't support Unicode particularly well, and it certainly can make coding harder.
Nevertheless, in my view that is no excuse and I always make sure any forms I code support Unicode fully and do not assume Latin character sets unless appropriate!
So, if you are coding for Hungarian, I recommend that you encode your pages in UTF-8, use HTML entities for posting/returning form values, always set the charset in the HTTP headers (not just in the HTML page!), and TEST, TEST, TEST!
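As a rough illustration of the HTTP header point, here is a minimal Python WSGI sketch that declares UTF-8 in the Content-Type response header itself. The names app and PAGE are made up for the example, and the equivalent setting in ASP.NET/IIS will look different; this only shows what ends up on the wire.

```python
PAGE = "<p>o, ó, ö and ő are four different letters.</p>"


def app(environ, start_response):
    """Minimal WSGI app that sends the charset in the HTTP header."""
    # Encode the body explicitly so the bytes match the declared charset.
    body = PAGE.encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

For a quick local test this can be served with wsgiref.simple_server.make_server from the standard library and then checked in a browser or with a header dump.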
Good luck :)
Your best bet is to sanitise filenames before they are stored on your server.
Otherwise, you may find that the international encoding described above will help.
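A minimal sketch of the sanitising approach, in Python and under my own assumptions: the function name sanitise_filename and the set of allowed characters are illustrative choices, not anything prescribed above. Be aware that folding to ASCII makes distinct Hungarian names collide (ről.txt and ról.txt both become rol.txt), so you would need to check for clashes before saving.

```python
import re
import unicodedata


def sanitise_filename(name: str) -> str:
    """Reduce a filename to ASCII letters, digits, dot, underscore and hyphen."""
    # Split accented letters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace any run of remaining disallowed characters with one underscore.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", ascii_only)
    # Avoid names that are empty or made only of dots and underscores.
    return cleaned.strip("._") or "unnamed"


print(sanitise_filename("ről.txt"))      # rol.txt
print(sanitise_filename("my file.txt"))  # my_file.txt
```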