Solved

Want to know more about urlencode and urldecode

Posted on 2004-11-01
Last Modified: 2013-12-13
I know that the urlencode function can convert some special characters, and also Chinese characters, into %XX form.

However, what I'd like to know more about is the encoding mechanism of this function.

For example, how is it known that "%2C" stands for the "," character? And for Chinese characters, it would become even more complicated! For instance, the encoded word for the Chinese character "我" is "%A7%DA". How could this be done?!

Also, is there any mapping table for the conversion that I could refer to?

Many Thanks!!!
Question by:hellohelloworld
9 Comments
 
LVL 48

Accepted Solution

by:
hernst42 earned 500 total points
ID: 12465581

Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).
See
http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.2
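
As a rough illustration of the mechanism quoted above, the sketch below escapes every byte of a string as %HH. The helper name percentEncodeAllBytes is made up for this example and is not part of PHP. PHP's own urlencode leaves letters, digits and a few other characters unescaped, so the two agree only on characters that urlencode actually escapes.

```php
<?php
// Hypothetical helper (not a PHP built-in): escape every byte as %HH.
function percentEncodeAllBytes(string $bytes): string
{
    $out = '';
    for ($i = 0; $i < strlen($bytes); $i++) {
        // ord() gives the byte value (0-255); %02X prints it as two uppercase hex digits
        $out .= sprintf('%%%02X', ord($bytes[$i]));
    }
    return $out;
}

echo percentEncodeAllBytes(','), "\n";            // the comma is byte 0x2C, so this prints %2C
echo percentEncodeAllBytes("\xE6\x88\x91"), "\n"; // three bytes in, three %HH groups out
echo urlencode(','), "\n";                        // the built-in agrees for the comma
```

A multi-byte character is handled one byte at a time, which is why a single Chinese character turns into two or three %HH groups.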
 

Author Comment

by:hellohelloworld
ID: 12466184
Is there a way that I could find the mapping table in my PC?
 
LVL 48

Expert Comment

by:hernst42
ID: 12466393
The mapping is easy to build:

// Map every single-byte value (0-255) to its URL-encoded form
$table = array();
for ($i = 0; $i <= 255; ++$i) {
   $table[chr($i)] = urlencode(chr($i));
}

As said, each character is simply encoded as the hex value of its byte. To get the numeric value of a character, use ord; see
http://de3.php.net/manual/en/function.ord.php
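
To make the "%2C stands for the comma" case concrete, here is a small sketch that reads the escape by hand in both directions. The variable names are only illustrative; the functions used (substr, hexdec, chr, ord, dechex, sprintf) are standard PHP.

```php
<?php
// Decoding by hand: drop the percent sign and read the rest as a hex number.
$hex  = substr('%2C', 1);  // "2C"
$code = hexdec($hex);      // 44, the byte value
$char = chr($code);        // the character with that byte value: ","
echo "$hex (hex) = $code (decimal) = '$char'\n";

// Encoding by hand: byte value of the character, written as two hex digits.
$escaped = '%' . strtoupper(dechex(ord(',')));
echo $escaped, "\n";

// dechex() does not pad to two digits, so sprintf is safer for bytes below 0x10.
$tab = sprintf('%%%02X', ord("\t"));
echo $tab, "\n";
```

So no separate table is strictly needed for single-byte characters: the two hex digits are the byte value itself.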

 

Author Comment

by:hellohelloworld
ID: 12469630
yes, but what about the Chinese characters?
Is there anything I could trace so that if I see "%xyz%A23", I know that represents a certain Chinese character without using urldecode?
 
LVL 48

Expert Comment

by:hernst42
ID: 12516662
No, you can't guess it from the format of %xx%yy. Because Chinese characters are stored as multi-byte characters, you would need a very long list (covering all Chinese characters) to tell whether a given %xx%yy is a Chinese character.
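
One more reason a single lookup list is awkward: the escaped form depends on the character encoding the bytes were in before urlencode saw them. The sketch below hard-codes the bytes for "我" in two encodings. That %A7%DA comes from Big5 is an assumption on my part (the question does not say which charset the page used); %E6%88%91 is the UTF-8 form.

```php
<?php
// Assumption: the page in the question was Big5-encoded, where this character is the bytes A7 DA.
$big5Bytes = "\xA7\xDA";
// The same character in UTF-8 is the three bytes E6 88 91.
$utf8Bytes = "\xE6\x88\x91";

// urlencode() only sees bytes; it knows nothing about which charset they came from.
echo urlencode($big5Bytes), "\n"; // %A7%DA
echo urlencode($utf8Bytes), "\n"; // %E6%88%91
```

Going the other way, urldecode simply returns the raw bytes, and it is up to the page or program to know which encoding they are in.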
 
LVL 1

Expert Comment

by:hallvors
ID: 12637972
What you want is possible, but it is considerably more work than is practical to put in. Just say decode and let PHP do the calculations :)

Anyway, thanks for an interesting question. Researching it taught me about both how UTF-8 works and about URL encoding in general.

First, a link to an explanation of URL encoding:
http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
(disclosure: it's written by someone I know ;)

Secondly, here is how to find the character from a URL encoding - manually!

Your character above - "&#25105;" (according to babelfish.altavista.com it means "I" in Chinese, if you can't see it in your browser try to copy this and paste in your address bar: javascript:'<html>&#25105;</html>' ) is actually encoded as %E6%88%91.

The first tool we use is the Windows calculator: open it and change to Scientific mode in the View menu. Then choose the "Hex" format and type the hex value from above (simply strip out the % signs): e68891.

Now click the "Bin" option to get the binary value of this hexadecimal number. Copy it and paste it in Notepad.

111001101000100010010001

This is the binary, UTF-8 encoded string. We want to un-UTF-8 it to find the Unicode value. Here is the technical documentation for UTF-8:
ftp://ftp.isi.edu/in-notes/rfc2279.txt

First, starting at the end of the string, add a line break after every 8 digits.

11100110
10001000
10010001

From the first line, remove all the leading 1 digits. From each of the following lines, remove the initial "10". It will now look like this:

00110
001000
010001

Remove the line breaks and put it all on one line again:

00110001000010001

Copy that whole string and go back to the calculator. It should still be on "Binary" format, so just paste this new string.

If you now click "Dec" (for decimal, or "normal", format), you get 25105. This is the exact number shown in your first post, because your browser translated a character not supported in the POST encoding into an HTML entity.

Next, click "Hex". The calculator will say "6211". Now open the Windows "character map" utility. Activate "Advanced view" if it doesn't show the "Go to Unicode" box. Then, in the "Go to Unicode" box type 6211. Voila, it shows the character you are looking for.

I'm sure you agree it is simpler to just type <?php echo urldecode('%E6%88%91'); ?> :-)
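
For anyone who wants the calculator steps above in code, here is a minimal sketch of the same bit manipulation. The function name percentUtf8ToCodePoint is hypothetical, and the code assumes the input is one well-formed percent-encoded UTF-8 character; it does no validation.

```php
<?php
// Hypothetical helper: turn one percent-encoded UTF-8 character into its Unicode code point.
// Assumes well-formed input; malformed or truncated sequences are not checked.
function percentUtf8ToCodePoint(string $escaped): int
{
    // urldecode() gives the raw bytes; unpack('C*') turns them into integers 0-255
    $bytes = array_values(unpack('C*', urldecode($escaped)));
    $first = $bytes[0];
    if ($first < 0x80) {
        return $first; // plain ASCII: the byte is the code point
    }

    // Count the leading 1 bits of the first byte; that is the sequence length
    $length = 0;
    for ($mask = 0x80; ($first & $mask) !== 0; $mask >>= 1) {
        $length++;
    }

    // "Remove all the leading 1 digits" from the first byte
    $codePoint = $first & (0xFF >> ($length + 1));

    // "Remove the initial 10" from each following byte and append its 6 remaining bits
    for ($i = 1; $i < $length; $i++) {
        $codePoint = ($codePoint << 6) | ($bytes[$i] & 0x3F);
    }
    return $codePoint;
}

$cp = percentUtf8ToCodePoint('%E6%88%91');
echo $cp, "\n";                       // 25105, the number from the HTML entity
echo strtoupper(dechex($cp)), "\n";   // 6211, the value to type into "Go to Unicode"
```

The result can be checked against the walkthrough: decimal 25105 and hex 6211 are the same values the calculator showed.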
 
LVL 1

Expert Comment

by:hallvord
ID: 12828899
I spent a long time on that reply though :(
and posted it with my wrong and now deleted profile :((
Oh well. It made me wiser and I also posted the mini-tutorial on my website..


Question has a verified solution.
