Solved

URL with Unicode characters

Posted on 2011-04-25
Last Modified: 2013-11-19
Hi,
I have an ASP.NET website running on Windows Server 2008.
Because I use URL rewriting, my URLs sometimes contain Unicode characters.
Since URLs should contain only ASCII, I encode the URL when I build it, and decode it when I need to read data from it.
I use “HttpUtility.UrlEncode(strURL)” while building the URL string
and then “HttpUtility.UrlDecode(strURL)” to convert it back and read the data.
This works about 95% of the time, but not always.
When I try from my two machines with different browsers it works, but my log file shows that
for other visitors it doesn’t always work.

I’ve tried passing an explicit encoding, like “HttpUtility.UrlEncode(urlHome, Encoding.GetEncoding("ISO-8859-1"))”, but it still doesn’t work every time.
My main problem is with Polish and Scandinavian characters, for example:
Poznan
Brøndby

How can I solve this issue so that it works 100% of the time?

Thanks,
Assaf.
Question by:AssafST
11 Comments
 
LVL 13

Expert Comment

by:R7AF
ID: 35470902
I solved this in a different way. I have URLs that use all kinds of characters: Polish, Turkish, even Chinese. I translate every character I know to ISO-8859-1. I haven't seen any problems with it, though I should probably take a closer look. So I translate ç to c, é to e, etc. For Chinese I use English translations. This keeps the URL as clean as possible, readable by anyone, and probably understandable. I think this is best for Google and for humans. I should mention that I program in PHP and we use Apache as the web server, so that might be a big difference. Characters like spaces and quotes are translated to dashes.

I use the URL as a parameter, so if you take a name such as Brøndby, we translate it to brondby (lowercase) to create the URL in the first place. When we read the parameter, we translate the original name to its URL version and compare that to the parameter.

If you know that the parameter can arrive as different values, say brondby and br%4Endby (just made something up), and you know how it ends up with that different spelling, you can check for that too. So I wonder what strange parameters you see, and whether there is a consistency in them that you can use.
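
The approach described above (build an ASCII-only URL version of each name, then compare it with the incoming parameter) could look roughly like this in C#. This is only a sketch, not R7AF's actual PHP code: the class name, method names and the tiny mapping table are made up for illustration, and a real table would need an entry for every character the site expects.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SlugSketch
{
    // Hypothetical, deliberately tiny mapping table (an assumption for this sketch).
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { '\u00E7', "c" }, // ç
        { '\u00E9', "e" }, // é
        { '\u00F8', "o" }, // ø
        { '\u0144', "n" }, // ń
        { '\u0142', "l" }  // ł
    };

    // Lowercases the name, maps known characters to ASCII,
    // and turns spaces and quotes into dashes.
    public static string ToSlug(string name)
    {
        var sb = new StringBuilder(name.Length);
        foreach (char c in name.ToLowerInvariant())
        {
            string mapped;
            if (Map.TryGetValue(c, out mapped))
                sb.Append(mapped);
            else if (c == ' ' || c == '\'' || c == '"')
                sb.Append('-');
            else
                sb.Append(c); // unknown characters pass through unchanged
        }
        return sb.ToString();
    }

    // Compares the URL version of the stored name with the parameter from the URL.
    public static bool Matches(string originalName, string urlParameter)
    {
        return string.Equals(ToSlug(originalName), urlParameter, StringComparison.Ordinal);
    }
}
```

With this table, `SlugSketch.ToSlug("Brøndby")` gives `brondby`, and the lookup side only ever has to compare ASCII strings, so the visitor's browser encoding no longer matters.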
 

Author Comment

by:AssafST
ID: 35471041
Hi R7AF,
Thanks for your answer.
Do you mean that you manually translated all the words containing Unicode characters, or did you
do something programmatically?

Thanks,
Assaf.
 
LVL 22

Expert Comment

by:danaseaman
ID: 35471250
Have you tried converting strURL to UTF-8 before calling UrlEncode?
Likewise, convert UTF-8 to UTF-16 before calling UrlDecode.
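
A minimal sketch of pinning both directions to UTF-8, using the `HttpUtility.UrlEncode(string, Encoding)` and `HttpUtility.UrlDecode(string, Encoding)` overloads. The wrapper class and method names are invented for this example, and on .NET Framework the project needs a reference to System.Web. As far as I know the single-argument `UrlEncode` already defaults to UTF-8, so this mainly makes the choice explicit; it does not by itself fix requests where the browser sent the path in some other encoding.

```csharp
using System;
using System.Text;
using System.Web;

public static class UrlSegmentCodec
{
    // Percent-encodes the UTF-8 bytes of the value, so the result is ASCII only.
    public static string Encode(string value)
    {
        return HttpUtility.UrlEncode(value, Encoding.UTF8);
    }

    // Reverses Encode: percent escapes are read back as UTF-8 bytes.
    public static string Decode(string encoded)
    {
        return HttpUtility.UrlDecode(encoded, Encoding.UTF8);
    }
}
```

For `Brøndby`, the ø (U+00F8) becomes the two escapes for the UTF-8 bytes C3 B8, and `Decode` turns them back into ø.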
 
LVL 51

Expert Comment

by:Steve Bink
ID: 35472602
>>> So I translate a ç to c, é to e, etc.

AFAIK, this is the only realistic solution.  Unicode, UTF, et al, do not translate directly to ISO-8859-1 because they allow for many more characters than a purely Latin character set.  Really, there is no way to encode a URL with unicode or UTF8 characters into the ISO set.  You will always be left with untranslatable characters, artifacts, or embedded Uni/UTF notation (such as &#FE00, or whatever).
 

Author Comment

by:AssafST
ID: 35473620
Thanks for your answers.
It looks like the problem lies in the client machine's settings: if the client (browser) supports
Danish characters then the conversion works, otherwise it doesn't. The same goes for other character sets.

I agree that the realistic solution would be to convert all special characters to regular ones, as you both said: "So I translate a ç to c, é to e, etc". But my DB has 15k words and keeps growing. Manually going through every word to create a new entry without the special characters is a huge headache.
Isn't there a way to convert ç to c, etc. programmatically?
Thanks!
 

Author Comment

by:AssafST
ID: 35473665
In .NET I found a way to eliminate the special characters. The problem is that it just removes the character instead of converting it,
so Brøndby is converted to Brndby,
which is not good enough.

    // Requires: using System.Text;
    // Encode to US-ASCII; every non-ASCII character is replaced with an empty string.
    Encoding encoder = Encoding.GetEncoding("us-ascii", new EncoderReplacementFallback(string.Empty), new DecoderExceptionFallback());

    byte[] bAsciiString = encoder.GetBytes(str);

    string cleanString = Encoding.ASCII.GetString(bAsciiString);
 
LVL 22

Accepted Solution

by:
danaseaman earned 2000 total points
ID: 35473743
The FoldString API will strip diacritics and replace them with the base letter.
See the attached screenshot.
Source code is attached in the zip file.

Attachments: Strip Diacritics Demo (screenshot), StripDiacritics.zip
 

Author Comment

by:AssafST
ID: 35474022
Thanks, danaseaman.
This brings me close to a solution,
but I don't know how to use this API.
I've looked on the web for something in .NET and haven't found a full solution yet.
As I understand it, there's no way to avoid a DB table that maps each Unicode character to the character I want to convert it to.
 
 
LVL 22

Expert Comment

by:danaseaman
ID: 35474044
You should be able to wrap this API in a DLL and call it from ASP.NET using Server.CreateObject("myDLL.myClass")

See http://forums.devx.com/showthread.php?t=8257
 

Author Comment

by:AssafST
ID: 35474946
I finally found a solution in .NET.
It doesn't cover 100% of the cases, but 99% is enough.
http://blogs.msdn.com/b/michkap/archive/2007/05/14/2629747.aspx

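    // Requires: using System.Globalization; using System.Text;
    static string RemoveDiacritics(string stIn)
    {
        // Decompose each accented letter into a base letter plus combining marks.
        string stFormD = stIn.Normalize(NormalizationForm.FormD);
        StringBuilder sb = new StringBuilder();

        for (int ich = 0; ich < stFormD.Length; ich++)
        {
            // Keep every character except the combining (non-spacing) marks.
            UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
            if (uc != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(stFormD[ich]);
            }
        }

        // Recompose whatever is left.
        return (sb.ToString().Normalize(NormalizationForm.FormC));
    }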

thanks for all your help.
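
One gap worth knowing about in the normalization approach: letters such as ø and ł have no canonical decomposition in Unicode, so FormD leaves them untouched and Brøndby comes out unchanged, while ń splits into n plus a combining accent and is handled fine. Below is a rough sketch of one way to patch that, combining the same FormD pass with a small fallback table. The class name, method name and table contents are assumptions for illustration, not part of the linked blog post.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class AsciiFolding
{
    // Hypothetical fallback table for letters that FormD cannot decompose.
    private static readonly Dictionary<char, string> NoDecomposition = new Dictionary<char, string>
    {
        { '\u00F8', "o" },  // ø
        { '\u00D8', "O" },  // Ø
        { '\u0142', "l" },  // ł
        { '\u0141', "L" },  // Ł
        { '\u0111', "d" },  // đ
        { '\u00E6', "ae" }, // æ
        { '\u00DF', "ss" }  // ß
    };

    public static string Fold(string input)
    {
        // Split accented letters into base letter + combining marks.
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            // Drop the combining marks themselves.
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark)
                continue;

            // Replace letters from the fallback table; keep everything else.
            string replacement;
            if (NoDecomposition.TryGetValue(c, out replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }

        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Under these assumptions `AsciiFolding.Fold("Brøndby")` gives `Brondby` and `AsciiFolding.Fold("Poznań")` gives `Poznan`. Characters missing from the table still pass through as-is, so the table has to grow with the data.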
 
LVL 13

Expert Comment

by:R7AF
ID: 35486509
Sorry, couldn't reply earlier. We use a (self created) function that translates é into e, etc. For each character that we know, we do this. This poses some problems though. For instance, in German, the character ä should be translated to ae, while in Dutch, it could be translated to a or -a as well. So depending on the language and context, translations can differ.
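
The language dependence mentioned above could be handled with per-language override tables that run before any generic diacritic stripping. This is a sketch only: the language codes, the class and method names, and the German entries for ö and ü are assumptions added for illustration, and only the ä examples come from this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LanguageAwareFolding
{
    // Hypothetical per-language overrides, keyed by a two-letter language code.
    private static readonly Dictionary<string, Dictionary<char, string>> Overrides =
        new Dictionary<string, Dictionary<char, string>>
        {
            { "de", new Dictionary<char, string>
                {
                    { '\u00E4', "ae" }, // ä
                    { '\u00F6', "oe" }, // ö (assumed, not from the thread)
                    { '\u00FC', "ue" }  // ü (assumed, not from the thread)
                }
            },
            { "nl", new Dictionary<char, string>
                {
                    { '\u00E4', "a" }   // ä
                }
            }
        };

    // Applies the overrides for one language; other characters are left
    // for a later, language-neutral step.
    public static string ApplyOverrides(string input, string languageCode)
    {
        Dictionary<char, string> table;
        if (!Overrides.TryGetValue(languageCode, out table))
            return input; // no special rules known for this language

        var sb = new StringBuilder(input.Length);
        // Compose first so that "a" + combining diaeresis is seen as a single ä.
        foreach (char c in input.Normalize(NormalizationForm.FormC))
        {
            string replacement;
            if (table.TryGetValue(c, out replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

For example, `ApplyOverrides("Jäger", "de")` gives `Jaeger`, while the same input with `"nl"` gives `Jager`.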