encoding a query string

maltomeal8
I looked at http://www.experts-exchange.com/Web/Web_Languages/JavaScript/Q_20335762.html when I was trying to find a way to encode a query string with #'s and +'s in it.  

I used the hexcode function posted there by dirge, and it worked great until I tried to get it to work with Korean characters.  Some Korean characters use 2 bytes, and the hexcode function only converts to 1 byte in hex.  I tried, just to see, changing the hexcode function to:

function hexnib(d) {
  if(d<10) return d; else return String.fromCharCode(65+d-10);
}

function hexcode(url) {
     var result="";
     for(var i=0;i<url.length;i++) {
        var cc=url.charCodeAt(i);
        var hex= "00" + hexnib((cc&240)>>4)+""+hexnib(cc&15);
        result+="%"+hex;
     }
     return result;
}

The only change I made was I added the "00" + in the line: var hex= "00" + hexnib((cc&240)>>4)+""+hexnib(cc&15);  to make it 2 bytes.

I used this to see if it would work for English characters (all of which would have zeros for the first two digits in a 4-digit hex number), but it didn't work.  When it gets to the server, it is not decoded correctly.  It gets converted on the server to empty string (presumably it was only seeing the 0's?)  Does this mean query strings cannot be encoded to the form %A492%B61A%AE53 etc. ?
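A minimal sketch of why the four-digit form fails, assuming the server follows the usual rule that a % escape is always exactly two hex digits (one byte). JavaScript's built-in decodeURIComponent applies the same rule, so it stands in for the server here:

```javascript
// "A" is character code 65 (hex 41); the modified hexcode() emits "%0041".
var encoded = "%0041";

// A decoder reads "%00" as one byte (NUL), then the literal text "41".
var decoded = decodeURIComponent(encoded);

console.log(decoded.length);         // 3
console.log(decoded.charCodeAt(0));  // 0
console.log(decoded.slice(1));       // 41
```

A server that treats NUL as an end-of-string marker would then see an empty value, which would be consistent with the result described above.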

If not, then how can Korean characters be passed in a query string?

thanks for the help!
I had the same problem with double-byte chars and encoding HTML entities for the query string... I was not able to find a solution and ended up replacing the '#' char with something else before adding it to the qs, then doing another replace on the receiving end... not an elegant solution by any means, but it worked for me... :p

Commented:
Have you tried using the escape() method?


Author

Commented:
The fact that escape() does not handle + correctly was why I used hexcode in the first place.

I just noticed something interesting.  On Google, they seem to take what the user types in and put it into a query string.  So, I tried searching for the word français and I noticed it puts this string in the address bar:

http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=fran%C3%A7ais&btnG=Google+Search

It looks like the ç was converted to %C3%A7, but how is that possible?  When I use JavaScript's charCodeAt function on ç, it gives me 231, which is %00%E7 in hex.
Also, they are passing ie=UTF-8, which looks like a flag telling the server to decode Unicode characters?
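The two numbers describe the same character in different ways: charCodeAt returns the Unicode value, while the URL carries the UTF-8 bytes of that value, each written as %XX. A short sketch of the arithmetic, assuming an environment that provides the built-in encodeURIComponent (which percent-encodes as UTF-8):

```javascript
var ch = "\u00E7";                    // the letter c with cedilla
var code = ch.charCodeAt(0);          // 231 (hex E7), the Unicode value

// Values 128-2047 become two UTF-8 bytes: 110xxxxx 10xxxxxx.
var byte1 = (code >> 6) | 192;        // 195 (hex C3)
var byte2 = (code & 63) | 128;        // 167 (hex A7)

console.log(byte1.toString(16), byte2.toString(16));  // c3 a7
console.log(encodeURIComponent("fran\u00E7ais"));     // fran%C3%A7ais
console.log(encodeURIComponent("a+b#c"));             // a%2Bb%23c
```

The last line shows that the same built-in also escapes the + and # characters from the original question.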

Author

Commented:
I think I have answered my own question (so I guess I'll keep my points).  Apparently a query string can only handle single byte characters.

I found on http://www.w3.org/TR/html4/interact/forms.html#h-17.13.1

that:
Note. The "get" method restricts form data set values to ASCII characters. Only the "post" method (with enctype="multipart/form-data") is specified to cover the entire [ISO10646] character set.
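That note restricts which characters may appear in the URL, not which characters the data can represent. A percent-encoded value consists only of ASCII characters, so it stays within the restriction. A quick check, assuming encodeURIComponent is available and using an arbitrary Korean syllable as sample input:

```javascript
var value = "\uD55C";                     // one Korean syllable (U+D55C)
var encoded = encodeURIComponent(value);  // its UTF-8 bytes, percent-encoded

// Every character in the encoded form is plain ASCII.
var asciiOnly = /^[\x00-\x7F]*$/.test(encoded);

console.log(encoded);                                // %ED%95%9C
console.log(asciiOnly);                              // true
console.log(decodeURIComponent(encoded) === value);  // true
```

Whether the server decodes those bytes as UTF-8 depends on its configuration, which this sketch does not cover.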
Commented:
The following is an update on my old script. It works fine with ??? for instance, when compared to what Google generates.

You may want to check out http://www1.tip.nl/~t876506/utf8tbl.html and http://selfaktuell.teamone.de/artikel/javascript/utf8b64/utf8.htm (German)

<html>
<head>
<script language="javascript">
<!--

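// Convert a 4-bit value (0 to 15) to one uppercase hex digit.
function hexnib(d) {
   return d < 10 ? String(d) : String.fromCharCode(65 + d - 10);
}

// Format one byte (0 to 255) as "%XX".
function hexbyte(d) {
   return "%" + hexnib((d & 240) >> 4) + hexnib(d & 15);
}

// Percent-encode every character of url as its UTF-8 bytes.
function hexcode(url) {
   var result = "";
   for (var i = 0; i < url.length; i++) {
      var cc = url.charCodeAt(i);
      // Combine a surrogate pair into one value above 0xFFFF.
      if (cc >= 0xD800 && cc <= 0xDBFF && i + 1 < url.length) {
         var lo = url.charCodeAt(i + 1);
         if (lo >= 0xDC00 && lo <= 0xDFFF) {
            cc = 0x10000 + ((cc - 0xD800) << 10) + (lo - 0xDC00);
            i++;
         }
      }
      if (cc < 128) {                  // 1 byte
         result += hexbyte(cc);
      } else if (cc < 2048) {          // 2 bytes
         result += hexbyte((cc >> 6) | 192)
                 + hexbyte((cc & 63) | 128);
      } else if (cc < 65536) {         // 3 bytes
         result += hexbyte((cc >> 12) | 224)
                 + hexbyte(((cc >> 6) & 63) | 128)
                 + hexbyte((cc & 63) | 128);
      } else {                         // 4 bytes
         result += hexbyte((cc >> 18) | 240)
                 + hexbyte(((cc >> 12) & 63) | 128)
                 + hexbyte(((cc >> 6) & 63) | 128)
                 + hexbyte((cc & 63) | 128);
      }
   }
   return result;
}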

function encoder() {
   document.forms.test.r.value=hexcode(document.forms.test.s.value);
}

// -->
</script>
</head>
<body>
   <form name="test">
      URL (without http://) <input type="text" name="s"><br>
      Result: <input type="text" name="r"><br>
      <input type="button" value="Encode" onClick="encoder()">
      <input type="reset" value="Clear">
   </form>
</body>
</html>
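One way to sanity-check the script is a round trip through the built-in decodeURIComponent. The sketch below restates the one-, two- and three-byte branches of the script above in compact form; utf8Percent is a hypothetical name used only here, and the sample strings are arbitrary. Characters outside the 16-bit range are not covered by this sketch.

```javascript
// Compact restatement of the 1-, 2- and 3-byte branches.
// utf8Percent is a hypothetical name used only for this sketch.
function utf8Percent(s) {
  function hexbyte(b) {
    return "%" + (b < 16 ? "0" : "") + b.toString(16).toUpperCase();
  }
  var out = "";
  for (var i = 0; i < s.length; i++) {
    var cc = s.charCodeAt(i);
    if (cc < 128) {
      out += hexbyte(cc);
    } else if (cc < 2048) {
      out += hexbyte((cc >> 6) | 192) + hexbyte((cc & 63) | 128);
    } else {
      out += hexbyte((cc >> 12) | 224)
           + hexbyte(((cc >> 6) & 63) | 128)
           + hexbyte((cc & 63) | 128);
    }
  }
  return out;
}

var korea = "\uD55C\uAD6D";  // two Korean syllables, U+D55C and U+AD6D

console.log(utf8Percent(korea));                                // %ED%95%9C%EA%B5%AD
console.log(decodeURIComponent(utf8Percent(korea)) === korea);  // true
console.log(utf8Percent("a+b#c"));                              // %61%2B%62%23%63
```

Unlike encodeURIComponent, this approach escapes every character, including plain letters. That is still valid in a query string, just longer.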

Commented:
That's 'fine with "Korea" (in Korean)' -- not sure if you see it in your browser, but I don't. I just copied the characters from http://kr.yahoo.com/ 

Commented:
And..... ;-D it's not Google which generates the codes -- it's the browser, once you press Submit.

'Nuff said. Good luck.

Author

Commented:
Thank you dirge!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Commented:
(un)escape is NOT the same as url-encode/decode in IE and Opera.

url-encode/decode = each character is translated into 1 to 4 "%xx" strings, which represent the UTF-8 bytes of the character.
The algorithm of url-encoding works like this:

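                // the_char (a String) and buffer (a StringBuffer) are declared elsewhere.
                byte[] bytes = the_char.getBytes("UTF-8");
                for (int j = 0; j < bytes.length; j++)
                {
                    buffer.append("%");
                    // Mask to 0..255 so negative bytes print as two hex digits.
                    String hex = Integer.toHexString(255 & bytes[j]);
                    // Left-pad with "0" when the value is a single hex digit.
                    buffer.append("00".substring(hex.length()));
                    buffer.append(hex);
                }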

In JavaScript, I don't know how to do this (I don't know how to find the Unicode index for a char in JavaScript), but for sure, the browser does it when you submit a form that contains "international" input (like Chinese). That's what happens when you look for the euro sign in Google.

Netscape's (un)escape IS url-encode/decode, while IE's and Opera's (un)escape is NOT: in those browsers, escape translates "simple accented chars" to one single "%xx" expression, probably by using a table, because there is no relation between the hex code and the Unicode value for the char. For more complex characters, escape returns a "%uxxxx" where xxxx = the hex Unicode value of the character.
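For comparison, a short sketch of how the two built-ins behave in a current ECMAScript engine. This is an assumption about the reader's environment; browsers from the time of this thread may have behaved differently.

```javascript
// escape() leaves "+" alone, so a server may read it back as a space.
console.log(escape("a+b"));                 // a+b

// In current engines, characters below 256 become a single %XX
// where XX is the character code in hex.
console.log(escape("\u00E7"));              // %E7

// Characters from 256 up become %uXXXX, which is not standard URL encoding.
console.log(escape("\u20AC"));              // %u20AC

// encodeURIComponent() emits the UTF-8 bytes instead.
console.log(encodeURIComponent("\u00E7"));  // %C3%A7
console.log(encodeURIComponent("\u20AC"));  // %E2%82%AC
console.log(encodeURIComponent("a+b"));     // a%2Bb
```

The \u00E7 and \u20AC escapes stand for ç and the euro sign, written this way so the sketch does not depend on the file's encoding.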
