Solved

Getting non-ASCII chars in URLs passed correctly.  Getting 404s due to "ASCII-ifying" URLs

Posted on 2007-11-23
Last Modified: 2012-08-13
I have some files on my web server with non-ascii (in fact, non-Latin) chars in their filenames.
For example:
ről.txt

That is, letter r, then letter o with double acute accent, letter L, and the extension.  The double acute accent is detailed here:
http://en.wikipedia.org/wiki/Double_acute_accent

It looks like this site is also unable to show the character correctly.  Tsk tsk.

When I make a request for the page, I always get a 404.  Looking in the logs, I see that the server is seeking rol.txt and, of course, not finding it.
This is true using
http://server/ről.txt
or
http://server/r%c5%91l.txt
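
For reference, the `%c5%91` in the second URL is the percent-encoded UTF-8 form of ő (U+0151). A minimal Python sketch using the standard `urllib.parse` module shows the round trip. The filename is the one from the question, and nothing here is specific to IIS.

```python
from urllib.parse import quote, unquote

filename = "r\u0151l.txt"  # "ről.txt"; o with double acute is U+0151

# quote() encodes the string as UTF-8 by default, then percent-escapes
# every byte that is not an unreserved ASCII character.
encoded = quote(filename)
print(encoded)  # r%C5%91l.txt

# Hex digits in percent escapes are case-insensitive, so the lowercase
# form used in the question decodes to the same name.
decoded = unquote("r%c5%91l.txt")
print(decoded == filename)  # True
```

So the browser's encoded request names the right file; the accent is being lost somewhere after the request is sent.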

In the IIS logs I see:
21:45:38 127.0.0.1 127.0.0.1 GET /rol.txt - 404 3535 819 40  ...

My Web.sitemap contains the correct URL with the accented character.

It appears that the browser is making the correct request, but that the server is not seeing the right chars, URL-encoded or not.
Please advise how to make this work.
Question by:dc197
 
LVL 19

Expert Comment

by:SteveH_UK
URLs should be percent-encoded before they are sent.  At the moment, not all systems support the internationalised URL encoding described here: http://www.w3.org/International/O-URL-code.html

Your best bet is to sanitise filenames before they are stored on your server.

Otherwise, you may find that the international encoding described above will help.
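
The sanitising suggestion above could look something like the following Python sketch. `to_ascii_filename` is a hypothetical helper, not something from this thread, and it assumes that dropping accents is acceptable for the site in question.

```python
import unicodedata


def to_ascii_filename(name: str) -> str:
    """Hypothetical helper: strip accents so the name is plain ASCII."""
    # NFD splits an accented letter into its base letter plus combining
    # marks, e.g. U+0151 becomes "o" followed by U+030B.
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII has no such decomposition; use "_".
    return "".join(ch if ord(ch) < 128 else "_" for ch in stripped)


print(to_ascii_filename("r\u0151l.txt"))  # rol.txt
```

The trade-off is that o, ó, ö and ő all collapse to the same letter, so two distinct filenames can end up colliding after sanitising.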
 
LVL 19

Expert Comment

by:SteveH_UK
Note that many URL encoding schemes encode URLs according to the non-internationalised standard, i.e. they support only 7-bit ASCII or, if you are lucky, 8-bit.  Neither of these supports Unicode directly.
 
LVL 5

Author Comment

by:dc197
I developed the site using VS2005 and its internal webserver.  This fully supports Unicode chars, and the site works just fine, whether one uses URL-encoded requests or not.

I copied the source over to my dev machine for testing, which is running IIS5 as the webserver.  This is where the problem arose.   Is IIS5 the culprit?  Is IIS6 better?
 
LVL 19

Accepted Solution

by:
SteveH_UK earned 200 total points
See http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/23ec8be2-649a-47b7-8d75-ffd937f16fe8.mspx?mfr=true

Yes, UTF-8 URL support is new in IIS 6.

The paragraph at the bottom of the above page reads:

Because IIS 6.0 now supports UTF-8 URLs, you can now log those URL requests to an ASCII log file. UTF-8 is a double-byte character set standard. Because ASCII is a single-byte character set standard, logging UTF-8 information to an ASCII file presents a problem. In such a case, ? is logged for the characters that cannot be converted to the codepage of the server.
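
The "?" substitution that the quoted paragraph describes can be imitated in a small Python sketch. This only reproduces the effect with Python's own codecs and is not IIS's actual logging code. It also shows that ő occupies two bytes in UTF-8, which is a variable-width encoding (one to four bytes per character) rather than a fixed double-byte one.

```python
path = "/r\u0151l.txt"  # "/ről.txt"

utf8_bytes = path.encode("utf-8")
# 8 characters become 9 bytes: only the accented letter needs two.
print(len(path), len(utf8_bytes))  # 8 9

# Characters with no ASCII equivalent are replaced by "?".
logged = path.encode("ascii", errors="replace").decode("ascii")
print(logged)  # /r?l.txt
```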

 
LVL 5

Author Comment

by:dc197
OK, it looks like IIS6 has support for Unicode URLs (http://msdn.microsoft.com/msdnmag/issues/02/03/IIS6/) as well as better logging.  That's good, because the site I wish to host heavily features the Hungarian language, in which o is not the same as ó, ö or ő.

Cheers.
 
LVL 5

Author Comment

by:dc197
It would appear that this site's ability to handle non-standard chars sucks, too.
 
LVL 19

Expert Comment

by:SteveH_UK
Actually, they have recently opened an Experts Exchange bug on this issue, and I have been assisting!

It's a fairly standard issue.  For example, the PHP language doesn't support Unicode particularly well, which can certainly make coding harder.

Nevertheless, in my view that is no excuse and I always make sure any forms I code support Unicode fully and do not assume Latin character sets unless appropriate!
 
LVL 19

Expert Comment

by:SteveH_UK
So, if you are coding for Hungarian, I recommend that you encode your pages in UTF-8, that you use HTML entities for posting/returning form values, that you always set the HTTP headers (not just those in the HTML page!) and that you TEST, TEST, TEST!

Good luck :)
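
As an illustration of declaring UTF-8 both in the HTTP headers and in the page itself, here is a minimal Python WSGI sketch. The `app` function and its page content are made up for this example. The site in the question runs on IIS/ASP.NET, so this shows the idea rather than the configuration that site would need.

```python
def app(environ, start_response):
    """Minimal WSGI app that declares UTF-8 in the header and in the page."""
    html = (
        "<!DOCTYPE html>\n"
        '<html lang="hu"><head><meta charset="utf-8">'
        "<title>r\u0151l</title></head>\n"
        "<body><p>o, \u00f3, \u00f6, \u0151</p></body></html>\n"
    )
    # Encode the body with the same charset that the header announces.
    body = html.encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),  # length in bytes, not characters
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the function can be passed to `wsgiref.simple_server.make_server` from the standard library. The point to test is that the charset in the header, the charset in the `<meta>` tag and the actual byte encoding all agree.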