URL with Unicode characters

Last Modified: 2013-11-19
Hi,
I have an ASP.NET website running on Windows Server 2008.
Because I’m using URL rewriting, my URLs sometimes contain Unicode characters.
Since URLs should contain only ASCII, I encode the URL when I build it, and when I need to read data
from the URL I decode it.
I use “HttpUtility.UrlEncode(strURL)” while building the URL string
and then “HttpUtility.UrlDecode(strURL)” to convert it back and read the data.
This works about 95% of the time, but not always.
When I try it from my two machines with different browsers it works, but my log file shows that
for other visitors it doesn’t always work.

I’ve tried adding an encoding argument to the call, like “HttpUtility.UrlEncode(urlHome, Encoding.GetEncoding("ISO-8859-1"))”, but it still doesn’t work every time.
My main problem is with Polish and Scandinavian characters, for example:
Poznań
Brøndby

How can I solve this so that it works 100% of the time?

Thanks,
Assaf.

Top Expert 2007

Commented:
I solved this in a different way. I have URLs that use all kinds of characters: Polish, Turkish, even Chinese. I translate every character I know to a plain ISO-8859-1 equivalent. I haven't seen any problems with it, but I should probably take a closer look. So I translate ç to c, é to e, etc. For Chinese I use English translations. This keeps the URL as clean as possible, readable by anyone, and probably understandable. I think this is best for Google and for humans. I should mention that I program in PHP and we use Apache as the web server, so that might be a big difference. Characters like spaces and quotes are translated to dashes.

I use the URL as a parameter, so if you take a name such as Brøndby, we translate it to brondby (lowercase) to create the URL in the first place. When we read the parameter, we translate the original name to its URL version and compare that to the parameter.

If you know that the parameter can arrive as different values, say brondby and br%4Endby (I just made that up), and you know how it ends up with the different spelling, you can check for that too. So I wonder what strange parameters you see, and whether there is a consistent pattern you can use.
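
For readers on .NET, here is a minimal C# sketch of the translate-then-compare idea described above. It is not the PHP function from this comment: the class name, the `ToSlug` and `Matches` helpers, and the four-entry lookup table are hypothetical, and turning every unknown character into a dash is an assumption (the comment only mentions spaces and quotes).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SlugSketch
{
    // Hypothetical lookup table: extend it with every character you know.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { '\u00E7', "c" }, // ç
        { '\u00E9', "e" }, // é
        { '\u00F8', "o" }, // ø
        { '\u0144', "n" }  // ń
    };

    // Builds the URL version of a name: lowercase, known characters mapped,
    // everything else (spaces, quotes, unknown characters) collapsed to one dash.
    public static string ToSlug(string name)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in name.ToLowerInvariant())
        {
            string mapped;
            if (Map.TryGetValue(c, out mapped))
                sb.Append(mapped);
            else if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
                sb.Append(c);
            else if (sb.Length > 0 && sb[sb.Length - 1] != '-')
                sb.Append('-');
        }
        return sb.ToString().TrimEnd('-');
    }

    // Translates the original name to its URL version and compares it to the parameter.
    public static bool Matches(string originalName, string urlParameter)
    {
        return string.Equals(ToSlug(originalName), urlParameter, StringComparison.Ordinal);
    }
}
```

With this table, `SlugSketch.ToSlug("Brøndby")` yields `brondby`, so the URL never contains anything that needs percent-encoding.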

Author

Commented:
Hi R7AF,
Thanks for your answer.
Did you mean that you manually translated all the words containing Unicode characters, or did you
do it programmatically?

Thanks,
Assaf.
Dana Seaman (danaseaman)
CERTIFIED EXPERT

Commented:
Have you tried converting strURL to UTF-8 before calling UrlEncode?
Likewise, convert UTF-8 to UTF-16 before calling UrlDecode.
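
One way to read this suggestion in C#, where strings are already UTF-16 in memory, is to pass `Encoding.UTF8` explicitly on both sides so the percent-encoded bytes are always UTF-8. This is only a sketch under that assumption: the class and method names are hypothetical, and it does not control which encoding a visitor's browser uses when it sends the URL back.

```csharp
using System;
using System.Text;
using System.Web;

public static class Utf8UrlSketch
{
    // Percent-encodes the UTF-8 bytes of the value.
    public static string Encode(string value)
    {
        return HttpUtility.UrlEncode(value, Encoding.UTF8);
    }

    // Decodes percent-escapes and interprets the resulting bytes as UTF-8.
    public static string Decode(string value)
    {
        return HttpUtility.UrlDecode(value, Encoding.UTF8);
    }
}
```

Round-tripping `Poznań` or `Brøndby` through `Encode` and then `Decode` returns the original string, because both calls agree on the encoding.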
CERTIFIED EXPERT
Top Expert 2004

Commented:
>>> So I translate a ç to c, é to e, etc.

AFAIK, this is the only realistic solution. Unicode, UTF, et al. do not translate directly to ISO-8859-1, because they allow for many more characters than a purely Latin character set. There is really no way to encode a URL containing arbitrary Unicode or UTF-8 characters in the ISO set. You will always be left with untranslatable characters, artifacts, or embedded Unicode/UTF notation (such as &#FE00, or whatever).
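
To see which characters fall outside ISO-8859-1, a small check like the following can help. It is a sketch with a hypothetical helper name; it uses a strict encoder that throws instead of substituting, so nothing is silently replaced.

```csharp
using System;
using System.Text;

public static class Latin1CheckSketch
{
    // Returns true only if every character of the value exists in ISO-8859-1.
    public static bool FitsInLatin1(string value)
    {
        // Exception fallbacks make the encoder fail loudly on unmappable characters.
        Encoding strictLatin1 = Encoding.GetEncoding(
            "ISO-8859-1",
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());
        try
        {
            strictLatin1.GetBytes(value);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}
```

For the two names in the question, `Brøndby` fits (ø is U+00F8, inside Latin-1) while `Poznań` does not (ń is U+0144), which matches the point that the ISO set cannot hold every character.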

Author

Commented:
Thanks for your answers.
It looks like the problem lies in the client machine's settings: if the client (browser) supports
Danish characters then the conversion works, otherwise it doesn't. The same goes for other character sets.

I agree that the realistic solution would be to convert all special characters to regular ones, as you both said ("So I translate a ç to c, é to e, etc."). But my DB has 15k words and keeps growing. Manually going through every word and creating a new entry without the special characters is a huge headache.
Isn't there a way to convert ç to c, etc. programmatically?
Thanks!

Author

Commented:
In .NET I found a way to eliminate the special characters. The problem is that it just removes the character instead of converting it,
so Brøndby becomes Brndby,
which is not good enough.

    // Requires: using System.Text;
    // ASCII encoder whose fallback replaces every non-ASCII character with an empty string.
    Encoding encoder = Encoding.GetEncoding("us-ascii", new EncoderReplacementFallback(string.Empty), new DecoderExceptionFallback());

    // Non-ASCII characters such as ø are dropped here.
    byte[] asciiBytes = encoder.GetBytes(str);

    string cleanString = encoder.GetString(asciiBytes);
Dana Seaman (danaseaman)
CERTIFIED EXPERT
Commented:
[Comment text is hidden behind the site's preview wall and is not available here.]

Author

Commented:
Thanks danaseaman.
This gets me close to a solution,
but I don't know how to use this API.
I've looked on the web for something in .NET but haven't found a full solution yet.
As I understand it, there's no way to avoid keeping a DB of Unicode characters and the characters I want to convert them to.
 
Dana Seaman (danaseaman)
CERTIFIED EXPERT

Commented:
You should be able to wrap this API in a DLL and call it from ASP.NET using Server.CreateObject("myDLL.myClass").

See http://forums.devx.com/showthread.php?t=8257

Author

Commented:
I finally found a solution in .NET.
It doesn't cover 100% of the cases, but 99% is enough.
http://blogs.msdn.com/b/michkap/archive/2007/05/14/2629747.aspx

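    // Requires: using System.Globalization; using System.Text;
    static string RemoveDiacritics(string stIn)
    {
        // Decompose each accented letter into a base letter plus combining marks (form D).
        string stFormD = stIn.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder();

        for (int ich = 0; ich < stFormD.Length; ich++)
        {
            // Keep everything except the combining (non-spacing) marks.
            UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
            if (uc != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(stFormD[ich]);
            }
        }

        // Recompose whatever is left (form C).
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }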

Thanks for all your help.
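
A likely reason the normalization approach stops short of 100% is that some letters, such as ø and ł, have no canonical decomposition, so there is no combining mark to strip and they pass through unchanged. The sketch below combines the same normalization step with a small lookup table for those letters. The class name, method name, and table contents are hypothetical and far from complete.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class FoldSketch
{
    // Hypothetical table for letters that form D does not split into base + mark.
    private static readonly Dictionary<char, string> NoDecomposition = new Dictionary<char, string>
    {
        { '\u00F8', "o" },  { '\u00D8', "O" },  // ø Ø
        { '\u0142', "l" },  { '\u0141', "L" },  // ł Ł
        { '\u00E6', "ae" }, { '\u00C6', "AE" }, // æ Æ
        { '\u00DF', "ss" }                      // ß
    };

    public static string FoldToBaseLetters(string input)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            string replacement;
            if (NoDecomposition.TryGetValue(c, out replacement))
                sb.Append(replacement);   // letter handled by the table
            else if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);             // base letter or ordinary character
            // combining marks are skipped
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

With this table, `Brøndby` becomes `Brondby` and `Poznań` becomes `Poznan`. Characters that are neither decomposable nor in the table are still passed through as-is.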
Top Expert 2007

Commented:
Sorry, couldn't reply earlier. We use a (self created) function that translates é into e, etc. For each character that we know, we do this. This poses some problems though. For instance, in German, the character ä should be translated to ae, while in Dutch, it could be translated to a or -a as well. So depending on the language and context, translations can differ.
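
The language dependence described above can be handled by checking a per-language override table before a default table. The following C# sketch is not the function from this comment: all names and table entries are hypothetical, and plain `a` for ä is an assumed default for languages without overrides.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LanguageAwareSketch
{
    // Assumed default translations, used when a language has no override.
    private static readonly Dictionary<char, string> Default = new Dictionary<char, string>
    {
        { '\u00E4', "a" }, { '\u00F6', "o" }, { '\u00FC', "u" }, { '\u00E9', "e" } // ä ö ü é
    };

    // Per-language overrides, keyed by a language code chosen for this sketch.
    private static readonly Dictionary<string, Dictionary<char, string>> PerLanguage =
        new Dictionary<string, Dictionary<char, string>>
        {
            { "de", new Dictionary<char, string>
                {
                    { '\u00E4', "ae" }, { '\u00F6', "oe" }, { '\u00FC', "ue" }
                }
            }
        };

    public static string Translate(string input, string language)
    {
        Dictionary<char, string> overrides;
        PerLanguage.TryGetValue(language, out overrides); // stays null if none exist

        StringBuilder sb = new StringBuilder();
        foreach (char c in input)
        {
            string mapped;
            if (overrides != null && overrides.TryGetValue(c, out mapped))
                sb.Append(mapped);
            else if (Default.TryGetValue(c, out mapped))
                sb.Append(mapped);
            else
                sb.Append(c); // unknown characters are left unchanged
        }
        return sb.ToString();
    }
}
```

Here `Translate("Mädchen", "de")` gives `Maedchen`, while the same word with `"nl"` gives `Madchen`.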