• Status: Solved

encoding a query string

I looked at http://www.experts-exchange.com/Web/Web_Languages/JavaScript/Q_20335762.html when I was trying to find a way to encode a query string with #'s and +'s in it.  

I used the hexcode function posted there by dirge, and it worked great until I tried to get it to work with Korean characters.  Some Korean characters use 2 bytes, and the hexcode function only converts to 1 byte in hex.  I tried, just to see, changing the hexcode function to:

function hexnib(d) {
  if(d<10) return d; else return String.fromCharCode(65+d-10);
}

function hexcode(url) {
     var result="";
     for(var i=0;i<url.length;i++) {
        var cc=url.charCodeAt(i);
        var hex= "00" + hexnib((cc&240)>>4)+""+hexnib(cc&15);
        result+="%"+hex;
     }
     return result;
}

The only change I made was adding the "00" + to the line: var hex= "00" + hexnib((cc&240)>>4)+""+hexnib(cc&15);  to make it 2 bytes.

I used this to see if it would work for English characters (all of which would have zeros for the first two digits in a 4-digit hex number), but it didn't work.  When it gets to the server, it is not decoded correctly.  It gets converted on the server to empty string (presumably it was only seeing the 0's?)  Does this mean query strings cannot be encoded to the form %A492%B61A%AE53 etc. ?

If not, then how can Korean characters be passed in a query string?

thanks for the help!
Asked by: maltomeal8
 
SquareHead Commented:
I had the same problem with double-byte chars and encoding HTML entities for the query string... I was not able to find a solution and ended up replacing the '#' char with something else before adding it to the qs, then doing another replace on the receiving end... not an elegant solution by any means, but it worked for me... :p
 
avner Commented:
Have you tried using the escape() method?
 
maltomeal8 (Author) Commented:
The fact that escape() does not handle + correctly was why I used hexcode in the first place.

I just noticed something interesting.  On Google, they seem to take what the user types in and put it into a query string.  So, I tried searching for the word français and I noticed it puts this string in the address bar:

http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=fran%C3%A7ais&btnG=Google+Search

it looks like the ç was converted to %C3%A7 but how is that possible?  When I use javascript's charCodeAt function on ç, it gives me 231, which is %00%E7 in hex.
Also, they are passing ie=UTF-8 which looks like a flag to say to decode unicode characters?
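
For what it's worth, %C3%A7 is what you get when ç (code 231, hex E7) is written out as UTF-8 bytes rather than as its raw character code. A minimal sketch of the arithmetic, assuming a browser or Node.js with the standard encodeURIComponent function (which is not part of the hexcode script discussed in this thread):

```javascript
// charCodeAt gives the character code, not the bytes that go into the URL.
var code = "ç".charCodeAt(0);            // 231, i.e. 0xE7

// UTF-8 spreads any code from 128 to 2047 over two bytes:
//   110xxxxx 10xxxxxx
var byte1 = 0xC0 | (code >> 6);          // 0xC3
var byte2 = 0x80 | (code & 0x3F);        // 0xA7

var manual = "%" + byte1.toString(16).toUpperCase() +
             "%" + byte2.toString(16).toUpperCase();

// encodeURIComponent performs the same UTF-8 step and also escapes + and #.
var builtin = encodeURIComponent("ç");
var query = "q=" + encodeURIComponent("français") +
            "&x=" + encodeURIComponent("a+b#c");
```

Here manual and builtin both come out as %C3%A7, and the + and # in the second parameter become %2B and %23.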

 
maltomeal8 (Author) Commented:
I think I have answered my own question (so I guess I'll keep my points).  Apparently a query string can only handle single byte characters.

I found on http://www.w3.org/TR/html4/interact/forms.html#h-17.13.1

that:
Note. The "get" method restricts form data set values to ASCII characters. Only the "post" method (with enctype="multipart/form-data") is specified to cover the entire [ISO10646] character set.
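
In practice the usual way around that restriction seems to be to keep the URL itself ASCII and percent-encode each UTF-8 byte of the value, as in the Google URL quoted earlier. A rough sketch, assuming an environment that provides the standard URLSearchParams class (modern browsers and Node.js); the parameter names ie and q are only illustrative:

```javascript
var params = new URLSearchParams();
params.append("ie", "UTF-8");
params.append("q", "한국 a+b#c");   // Korean text plus the troublesome + and #

// Serialized form: non-ASCII characters become %XX per UTF-8 byte,
// a space becomes +, and a literal + becomes %2B.
var queryString = params.toString();

// The result contains only ASCII characters.
var isAscii = /^[\x00-\x7F]*$/.test(queryString);

// Parsing the string again restores the original value.
var roundTrip = new URLSearchParams(queryString).get("q");
```

Whether the server decodes this correctly still depends on it treating the bytes as UTF-8.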
 
dirge Commented:
The following is an update on my old script. It works fine with ??? for instance, when compared to what Google generates.

You may want to check out http://www1.tip.nl/~t876506/utf8tbl.html and http://selfaktuell.teamone.de/artikel/javascript/utf8b64/utf8.htm (German)

<html>
<head>
<script language="javascript">
<!--

function hexnib(d) {
   if(d<10) return d; else return String.fromCharCode(65+d-10);
}

function hexbyte(d) {
        return "%"+hexnib((d&240)>>4)+""+hexnib(d&15);
}

function hexcode(url) {
     var result="";
     for(var i=0;i<url.length; i++) {
             var cc=url.charCodeAt(i);   // 16-bit character code, 0..65535
             if (cc<128) {
                 // 1 byte: 0xxxxxxx
                 result+=hexbyte(cc);
             } else if(cc<2048) {
                 // 2 bytes: 110xxxxx 10xxxxxx
                 result+=  hexbyte((cc>>6)|192)
                         + hexbyte((cc&63)|128);
             } else {
                 // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                 result+=  hexbyte((cc>>12)|224)
                         + hexbyte(((cc>>6)&63)|128)
                         + hexbyte((cc&63)|128);
             }
     }
     return result;
}

function encoder() {
   document.forms.test.r.value=hexcode(document.forms.test.s.value);
}

// -->
</script>
</head>
<body>
   <form name="test">
      URL (without http://) <input type="text" name="s"><br>
      Result: <input type="text" name="r"><br>
      <input type="button" value="Encode" onClick="encoder()">
      <input type="reset" value="Clear">
   </form>
</body>
</html>
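
One caveat with the script above: charCodeAt returns 16-bit code units, so a character beyond U+FFFF (which JavaScript stores as two surrogates) would come out as two 3-byte sequences instead of one 4-byte sequence. A possible variation, assuming an engine where String.prototype.codePointAt is available (ES2015 or later); pct and utf8Percent are hypothetical names, not part of the original script:

```javascript
// Format one byte as %XX (uppercase hex, zero-padded).
function pct(b) {
  return "%" + (b < 16 ? "0" : "") + b.toString(16).toUpperCase();
}

// Percent-encode every character of s as UTF-8, including 4-byte sequences.
// Note: unpaired surrogates are not validated in this sketch.
function utf8Percent(s) {
  var out = "";
  for (var i = 0; i < s.length; i++) {
    var cp = s.codePointAt(i);
    if (cp > 0xFFFF) i++;                 // skip the low surrogate
    if (cp < 0x80) {
      out += pct(cp);
    } else if (cp < 0x800) {
      out += pct(0xC0 | (cp >> 6)) + pct(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      out += pct(0xE0 | (cp >> 12)) +
             pct(0x80 | ((cp >> 6) & 0x3F)) +
             pct(0x80 | (cp & 0x3F));
    } else {
      out += pct(0xF0 | (cp >> 18)) +
             pct(0x80 | ((cp >> 12) & 0x3F)) +
             pct(0x80 | ((cp >> 6) & 0x3F)) +
             pct(0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```

Like hexcode, this encodes every character, ASCII included, so the output is longer than strictly necessary but should survive + and # intact.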

 
dirge Commented:
That should read 'works fine with "Korea" (in Korean)'. I'm not sure whether you can see the characters in your browser, but I can't. I just copied them from http://kr.yahoo.com/
 
dirge Commented:
And..... ;-D it's not Google which generates the codes -- it's the browser, once you press Submit.

'Nuff said. Good luck.

 
maltomeal8Author Commented:
Thank you dirge!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
0
 
justkeys Commented:
(un)escape is NOT the same as URL-encode/decode in IE and Opera.

URL-encoding translates each character into 1 to 4 "%xx" strings, one for each byte of its UTF-8 encoding.
The algorithm works like this (in Java):

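                // the_char is assumed to be a String holding the character,
                // and buffer a StringBuffer collecting the output
                byte[] bytes = the_char.getBytes("UTF-8");
                for (int j = 0; j < bytes.length; j++)
                {
                    buffer.append("%");
                    // mask with 255 so negative byte values map to 0..255
                    String hex = Integer.toHexString(255 & bytes[j]);
                    // left-pad to two hex digits
                    buffer.append("00".substring(hex.length()));
                    buffer.append(hex);
                }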

In JavaScript, I don't know how to do this (I don't know how to find the Unicode index for a char in JavaScript), but for sure the browser does it when you submit a form that contains "international" input (like Chinese). That's what happens when you look for the euro sign in Google.

Netscape's (un)escape IS URL-encode/decode, while IE's and Opera's is NOT: in those browsers, escape translates "simple accented chars" to a single "%xx" expression, probably by using a table, because there is no relation between the hex code and the Unicode value for the char. For more complex characters, escape returns "%uxxxx", where xxxx is the hex Unicode value of the character.
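
The difference is easy to see side by side. A short sketch, assuming an engine that still ships the legacy global escape function (current browsers and Node.js do, although it is deprecated); the values noted in the comments are what current engines produce and may not match the older browsers described above:

```javascript
// Legacy escape(): works on character codes, not UTF-8 bytes.
var legacyAccent = escape("ç");              // single %XX from the character code (E7)
var legacyKorean = escape("한");             // non-standard %uXXXX form
var legacyPlus   = escape("a+b");            // + is left untouched

// encodeURIComponent(): percent-encodes the UTF-8 bytes.
var utf8Accent = encodeURIComponent("ç");    // two bytes
var utf8Korean = encodeURIComponent("한");   // three bytes
var utf8Plus   = encodeURIComponent("a+b");  // + becomes %2B
```

In current engines the %XX that escape produces for ç is simply its character code in hex, which suggests a direct mapping rather than a lookup table, at least today.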