URL with Unicode characters

Hi,
I have an ASP.NET website running on Windows Server 2008.
Because I use URL rewriting, my URLs sometimes contain Unicode characters.
Since URLs should contain only ASCII, I encode the URL when I build it and decode it when I need to read data from it.
I use “HttpUtility.UrlEncode(strURL)” while building the URL string and “HttpUtility.UrlDecode(strURL)” to convert it back and read the data.
This works about 95% of the time, but not always.
When I test from my two machines with different browsers it works, but my log file shows that for other visitors it doesn’t always work.

I’ve tried passing an explicit encoding to the function, like “HttpUtility.UrlEncode(urlHome, Encoding.GetEncoding("ISO-8859-1"))”, but it still doesn’t work every time.
My main problem is with Polish and Scandinavian characters, for example:
Poznań
Brøndby

How can I solve this issue so that it works 100% of the time?

Thanks,
Assaf.
Asked by: AssafST
 
danaseaman commented:
The Win32 API FoldString can strip diacritics and replace each accented character with its base letter.
See attached Screenshot.
Source code attached in Zip file.

Attachments: Strip Diacritics Demo (screenshot), StripDiacritics.zip
 
R7AF commented:
I solved this in a different way. I have URLs that use all kinds of characters: Polish, Turkish, even Chinese. I translate every character I know to a plain ISO-8859-1 equivalent, so ç becomes c, é becomes e, and so on. For Chinese I use English translations. Characters like spaces and quotes are translated to dashes. This keeps the URL as clean as possible, readable by anyone, and probably understandable, which I think is best for both Google and humans. I haven't seen any problems with it, but I should probably take a closer look. I should mention that I program in PHP and we use Apache as the web server, so that might be a big difference.

I use the URL as a parameter, so for a name like Brøndby we translate it to brondby (lowercase) to create the URL in the first place. When we read the parameter, we translate the original name to its URL version and compare that to the parameter.

If you know that the parameter can arrive as different values, say brondby and br%4Endby (I just made that up), and you know how the different spelling comes about, you can check for that too. So I wonder which strange parameters you are seeing, and whether there is a consistent pattern you can use.
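For readers on .NET rather than PHP, the approach described above could be sketched roughly as follows. This is not R7AF's actual function: the class name `UrlSlug` and the methods `ToSlug` and `Matches` are made up for illustration, and the only hand-mapped letter is ø. Any other character that is not an ASCII letter or digit after accent removal simply becomes a dash.

```csharp
using System;
using System.Globalization;
using System.Text;

static class UrlSlug
{
    // Hypothetical helper: lowercase the name, drop combining accents,
    // and collapse every other non-alphanumeric character into one dash.
    public static string ToSlug(string name)
    {
        string decomposed = name.ToLowerInvariant().Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder();
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue; // drop the accent, the base letter was already kept

            if (c == 'ø')
                sb.Append('o'); // ø has no decomposition, so it is mapped by hand
            else if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
                sb.Append(c);
            else if (sb.Length > 0 && sb[sb.Length - 1] != '-')
                sb.Append('-'); // spaces, quotes and unknown characters
        }
        return sb.ToString().Trim('-');
    }

    // Compare a stored name with the slug that arrived in the URL.
    public static bool Matches(string storedName, string urlParameter)
    {
        return ToSlug(storedName) == urlParameter;
    }
}
```

With this sketch, `UrlSlug.Matches("Brøndby", "brondby")` is true, so the URL never needs to carry a non-ASCII character in the first place.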
 
AssafST (author) commented:
Hi R7AF,
Thanks for your answer.
Did you mean that you manually translated all the words containing Unicode characters, or did you do something programmatically?

Thanks,
Assaf.
 
danaseaman commented:
Have you tried converting strURL to UTF-8 before calling UrlEncode?
Likewise, convert from UTF-8 to UTF-16 before calling UrlDecode.
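As a rough illustration of that suggestion, the sketch below passes the encoding explicitly to both calls, so encoding and decoding cannot disagree on the server side. The class `UrlRoundTrip` and its method names are invented for this example. It does not control what encoding a visitor's browser uses when it sends the URL, so it may not fix every case from the logs.

```csharp
using System;
using System.Text;
using System.Web;

static class UrlRoundTrip
{
    // Encode with an explicit UTF-8 encoding.
    public static string Encode(string value)
    {
        return HttpUtility.UrlEncode(value, Encoding.UTF8);
    }

    // Decode with the same encoding that was used to encode.
    public static string Decode(string value)
    {
        return HttpUtility.UrlDecode(value, Encoding.UTF8);
    }

    // True if the value survives encode + decode with the given encoding.
    public static bool RoundTrips(string value, Encoding encoding)
    {
        string encoded = HttpUtility.UrlEncode(value, encoding);
        return HttpUtility.UrlDecode(encoded, encoding) == value;
    }
}
```

`RoundTrips("Poznań", Encoding.UTF8)` holds, while the same call with `Encoding.GetEncoding("ISO-8859-1")` does not, because ń is not part of ISO-8859-1. That is one reason the ISO-8859-1 attempt in the question could not work for Polish names.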
 
Steve Bink commented:
>>> So I translate a ç to c, é to e, etc.

AFAIK, this is the only realistic solution. Unicode and its UTF encodings do not translate directly to ISO-8859-1, because they cover many more characters than a purely Latin character set. There is really no way to encode a URL containing Unicode or UTF-8 characters into the ISO set. You will always be left with untranslatable characters, artifacts, or embedded Unicode notation (such as &#FE00, or whatever).
 
AssafST (author) commented:
Thanks for your answers.
It looks like the problem lies in the client machine's settings: if the client (browser) supports Danish characters, the conversion works, otherwise it doesn't. The same goes for other character sets.

I agree that the realistic solution is to convert all special characters to regular ones, as you both said ("So I translate a ç to c, é to e, etc."). But my DB has 15k words and keeps growing. Manually going through every word and creating a new entry without the special characters is a huge headache.
Isn't there a way to convert ç to c and so on programmatically?
Thanks!
 
AssafST (author) commented:
In .NET I found a way to eliminate the special characters. The problem is that it just removes the character instead of converting it,
so Brøndby is converted to Brndby,
which is not good enough.

// Encoder that silently drops every character it cannot represent in ASCII.
Encoding encoder = Encoding.GetEncoding("us-ascii", new EncoderReplacementFallback(string.Empty), new DecoderExceptionFallback());

byte[] bAsciiString = encoder.GetBytes(str);

// ø has no ASCII equivalent, so it is removed rather than replaced with o.
string cleanString = Encoding.ASCII.GetString(bAsciiString);
 
AssafST (author) commented:
Thanks, danaseaman.
This brings me closer to a solution,
but I don't know how to use this API.
I've looked on the web for something in .NET and haven't found a full solution yet.
As I understand it, there's no way to avoid keeping a table of Unicode characters and the characters I want to convert them to.
 
 
danaseaman commented:
You should be able to wrap this API in a DLL and call it from ASP.NET using Server.CreateObject("myDLL.myClass").

See http://forums.devx.com/showthread.php?t=8257
 
AssafST (author) commented:
I finally found a solution in .NET.
It doesn't cover 100% of the cases, but 99% is enough.
http://blogs.msdn.com/b/michkap/archive/2007/05/14/2629747.aspx

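    // Requires: using System.Globalization; using System.Text;
    // Decomposes each character (FormD), drops the combining marks, then recomposes (FormC).
    static string RemoveDiacritics(string stIn)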
    {
        string stFormD = stIn.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder();

        for (int ich = 0; ich < stFormD.Length; ich++)
        {
            UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
            if (uc != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(stFormD[ich]);
            }
        }

        return (sb.ToString().Normalize(NormalizationForm.FormC));
    }
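One likely source of the missing 1%: letters such as ø, ł, æ and ß have no Unicode decomposition, so the normalization step leaves them untouched and Brøndby comes back unchanged. A possible extension, sketched here under the assumption that a small hand-made table is acceptable, adds a fallback lookup for those letters. The class name `AsciiFolding` and the contents of the table are illustrative and far from complete.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class AsciiFolding
{
    // Hand-picked, incomplete table of letters that FormD does not decompose.
    static readonly Dictionary<char, string> Fallback = new Dictionary<char, string>
    {
        { 'ø', "o" }, { 'Ø', "O" },
        { 'ł', "l" }, { 'Ł', "L" },
        { 'æ', "ae" }, { 'Æ', "AE" },
        { 'ß', "ss" }
    };

    public static string Fold(string input)
    {
        var sb = new StringBuilder();
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            // Same idea as RemoveDiacritics: skip the combining marks.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            string replacement;
            if (Fallback.TryGetValue(c, out replacement))
                sb.Append(replacement); // letter with no decomposition
            else
                sb.Append(c);
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

With this table, `AsciiFolding.Fold("Brøndby")` gives `Brondby`, and names that already worked, such as Poznań, are still reduced to `Poznan`.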

Thanks for all your help.
 
R7AF commented:
Sorry, I couldn't reply earlier. We use a self-written function that translates é into e, and so on, for each character that we know. This poses some problems, though. For instance, in German the character ä should be translated to ae, while in Dutch it could be translated to a, or to -a as well. So translations can differ depending on the language and context.
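To make the language dependence concrete, here is a minimal C# sketch (not R7AF's PHP function) in which a per-language override table is consulted before the generic accent stripping. The class name `LanguageAwareFolding`, the two-letter language keys and the table contents are all assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

static class LanguageAwareFolding
{
    // Hypothetical overrides, keyed by two-letter language code.
    static readonly Dictionary<string, Dictionary<char, string>> Overrides =
        new Dictionary<string, Dictionary<char, string>>
        {
            { "de", new Dictionary<char, string>
                { { 'ä', "ae" }, { 'ö', "oe" }, { 'ü', "ue" }, { 'ß', "ss" } } }
        };

    public static string Fold(string input, string language)
    {
        Dictionary<char, string> map;
        Overrides.TryGetValue(language, out map); // map stays null if the language is unknown

        var sb = new StringBuilder();
        // FormC first, so that ä is one character and can be found in the table.
        foreach (char c in input.Normalize(NormalizationForm.FormC))
        {
            string replacement;
            if (map != null && map.TryGetValue(c, out replacement))
            {
                sb.Append(replacement); // language-specific rule wins
                continue;
            }
            // Generic rule: decompose and keep everything except the accents.
            foreach (char d in c.ToString().Normalize(NormalizationForm.FormD))
            {
                if (CharUnicodeInfo.GetUnicodeCategory(d) != UnicodeCategory.NonSpacingMark)
                    sb.Append(d);
            }
        }
        return sb.ToString();
    }
}
```

Here `Fold("Jäger", "de")` gives `Jaeger`, while `Fold("Jäger", "nl")` falls through to the generic rule and gives `Jager`. The caller has to know the language of each name, which is the context problem described above.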