maltomeal8

encoding a query string

I looked at https://www.experts-exchange.com/questions/20335762/URL-encoding.html when I was trying to find a way to encode a query string with #'s and +'s in it.  

I used the hexcode function posted there by dirge, and it worked great until I tried to get it to work with Korean characters.  Some Korean characters use 2 bytes, and the hexcode function only converts to 1 byte in hex.  I tried, just to see, changing the hexcode function to:

function hexnib(d) {
  if(d<10) return d; else return String.fromCharCode(65+d-10);
}

function hexcode(url) {
     var result="";
     for(var i=0;i<url.length;i++) {
        var cc=url.charCodeAt(i);
        var hex= "00" + hexnib((cc&240)>>4)+""+hexnib(cc&15);
        result+="%"+hex;
     }
     return result;
}

The only change I made was adding "00" + to the line: var hex= "00" + hexnib((cc&240)>>4)+""+hexnib(cc&15);  to make it 2 bytes.

I used this to see if it would work for English characters (all of which would have zeros for the first two digits in a 4-digit hex number), but it didn't work.  When it gets to the server, it is not decoded correctly.  It gets converted on the server to empty string (presumably it was only seeing the 0's?)  Does this mean query strings cannot be encoded to the form %A492%B61A%AE53 etc. ?
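For illustration, here is a minimal sketch of what the modified function actually emits and how a standard percent-decoder reads it. The two functions are copied from the post above; using decodeURIComponent as a stand-in for the server's decoder is an assumption, since the server is not described in this thread. A percent escape is always "%" plus exactly two hex digits, so "%0041" is read as the byte 00 (NUL) followed by the literal text "41", not as one 16-bit value.

```javascript
// Copied from the post above (the "00" + variant).
function hexnib(d) {
  if (d < 10) return d; else return String.fromCharCode(65 + d - 10);
}

function hexcode(url) {
  var result = "";
  for (var i = 0; i < url.length; i++) {
    var cc = url.charCodeAt(i);
    var hex = "00" + hexnib((cc & 240) >> 4) + "" + hexnib(cc & 15);
    result += "%" + hex;
  }
  return result;
}

// "A" has char code 65 (0x41), so the output is "%" + "00" + "41".
const encodedA = hexcode("A");
console.log(encodedA); // %0041

// Assumption: the server decodes like a standard percent-decoder.
// "%00" becomes a NUL character and "41" stays as plain text.
const decodedA = decodeURIComponent(encodedA);
console.log(decodedA.length);        // 3
console.log(decodedA.charCodeAt(0)); // 0
console.log(decodedA.slice(1));      // 41
```

One plausible reason for the empty string on the server (again an assumption) is that the leading NUL is treated as a string terminator.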

If not, then how can Korean characters be passed in a query string?

thanks for the help!
SquareHead

I had the same problem with double-byte chars and encoding HTML entities for the query string... I was not able to find a solution and ended up replacing the '#' char with something else before adding it to the query string, then doing another replace on the receiving end... not an elegant solution by any means, but it worked for me... :p
Have you tried using the escape() method?


maltomeal8

ASKER

The fact that escape() does not handle + correctly is why I used hexcode in the first place.

I just noticed something interesting.  On Google, they seem to take what the user types in and put it into a query string.  So, I tried searching for the word français and I noticed it puts this string in the address bar:

http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=fran%C3%A7ais&btnG=Google+Search

It looks like the ç was converted to %C3%A7, but how is that possible?  When I use JavaScript's charCodeAt function on ç, it gives me 231, which is %00%E7 in hex.
Also, they are passing ie=UTF-8, which looks like a flag telling the server to decode the input as Unicode?
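The %C3%A7 is the UTF-8 encoding of code point 231, with each of its two bytes percent-encoded separately. A minimal sketch of the arithmetic for a two-byte UTF-8 sequence (the variable names are only for illustration), checked against the built-in encodeURIComponent:

```javascript
// "\u00E7" is ç; charCodeAt gives its code point, 231 (0xE7).
const cp = "\u00E7".charCodeAt(0);

// Code points 0x80..0x7FF take two UTF-8 bytes:
//   110xxxxx 10xxxxxx
const byte1 = 0xC0 | (cp >> 6);   // top 5 bits    -> 0xC3
const byte2 = 0x80 | (cp & 0x3F); // bottom 6 bits -> 0xA7

const manual = "%" + byte1.toString(16).toUpperCase() +
               "%" + byte2.toString(16).toUpperCase();
console.log(manual); // %C3%A7

// The built-in function performs the same UTF-8 step.
const builtIn = encodeURIComponent("fran\u00E7ais");
console.log(builtIn); // fran%C3%A7ais
```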
I think I have answered my own question (so I guess I'll keep my points).  Apparently a query string can only handle single byte characters.

I found on http://www.w3.org/TR/html4/interact/forms.html#h-17.13.1

that:
Note. The "get" method restricts form data set values to ASCII characters. Only the "post" method (with enctype="multipart/form-data") is specified to cover the entire [ISO10646] character set.
ASKER CERTIFIED SOLUTION
dirge

That's 'fine with "Korea" (in Korean)' -- not sure if you see it in your browser, but I don't. I just copied the characters from http://kr.yahoo.com/ 
And..... ;-D it's not Google which generates the codes -- it's the browser, once you press Submit.

'Nuff said. Good luck.

Thank you dirge!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
(un)escape is NOT the same as URL-encode/decode in IE and Opera.

URL-encoding translates each character into 1 to 4 "%xx" strings, one for each byte of the character's UTF-8 encoding.
The algorithm works like this (in Java):

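                // urlEncode is an illustrative name; it encodes every character, including ASCII
                static String urlEncode(String s) {
                    StringBuilder buffer = new StringBuilder();
                    // convert the string to its UTF-8 bytes, then emit each byte as %xx
                    byte[] bytes = s.getBytes(java.nio.charset.StandardCharsets.UTF_8);
                    for (int j = 0; j < bytes.length; j++)
                    {
                        buffer.append("%");
                        String hex = Integer.toHexString(255 & bytes[j]);
                        buffer.append("00".substring(hex.length())); // left-pad to two hex digits
                        buffer.append(hex);
                    }
                    return buffer.toString();
                }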

In JavaScript, I don't know how to do this (I don't know how to get the Unicode value of a char in JavaScript), but for sure, the browser does it when you submit a form that contains "international" input (like Chinese). That's what happens when you search for the euro sign on Google.
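For what it's worth, the same algorithm can be sketched in JavaScript. This is only an illustration: utf8PercentEncode is a made-up name, and it assumes an engine with codePointAt, for...of and padStart (ES2015/ES2017). Like the Java loop above, it encodes every character, including plain ASCII, and it does not handle lone surrogates specially.

```javascript
// Hypothetical helper: percent-encode each character as its UTF-8 bytes.
function utf8PercentEncode(str) {
  let out = "";
  for (const ch of str) {           // iterates by code point, keeping surrogate pairs together
    const cp = ch.codePointAt(0);
    let bytes;
    if (cp < 0x80) {
      bytes = [cp];                                   // 1 byte
    } else if (cp < 0x800) {
      bytes = [0xC0 | (cp >> 6),                      // 2 bytes
               0x80 | (cp & 0x3F)];
    } else if (cp < 0x10000) {
      bytes = [0xE0 | (cp >> 12),                     // 3 bytes
               0x80 | ((cp >> 6) & 0x3F),
               0x80 | (cp & 0x3F)];
    } else {
      bytes = [0xF0 | (cp >> 18),                     // 4 bytes
               0x80 | ((cp >> 12) & 0x3F),
               0x80 | ((cp >> 6) & 0x3F),
               0x80 | (cp & 0x3F)];
    }
    for (const b of bytes) {
      out += "%" + b.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

console.log(utf8PercentEncode("\u00E7")); // %C3%A7     (ç)
console.log(utf8PercentEncode("\uD55C")); // %ED%95%9C  (Korean syllable U+D55C)
console.log(utf8PercentEncode("+#"));     // %2B%23
```

In practice the built-in encodeURIComponent does the same UTF-8 step while leaving unreserved ASCII characters readable, so a hand-written version is mainly useful for seeing how the bytes are derived.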

Netscape's (un)escape IS URL-encode/decode, while IE's and Opera's (un)escape is NOT. In those browsers, escape translates "simple accented chars" to one single "%xx" expression, where xx is the character's code point in hex (for example ç, code 231, becomes %E7) rather than its UTF-8 bytes. For characters with code points above 255, escape returns "%uxxxx", where xxxx is the hex Unicode value of the character.
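A short comparison of the two behaviours, as a sketch that assumes an engine which still provides the legacy escape() (it is deprecated but present in current browsers and Node.js) alongside encodeURIComponent():

```javascript
// Legacy escape(): code points below 256 become %XX, others become %uXXXX.
const escLatin  = escape("\u00E7");   // ç
const escKorean = escape("\uD55C");   // Korean syllable U+D55C
const escPlus   = escape("+");        // "+" is left untouched by escape()

// encodeURIComponent(): always percent-encodes the UTF-8 bytes.
const encLatin  = encodeURIComponent("\u00E7");
const encKorean = encodeURIComponent("\uD55C");
const encPlus   = encodeURIComponent("+");

console.log(escLatin, escKorean, escPlus); // %E7 %uD55C +
console.log(encLatin, encKorean, encPlus); // %C3%A7 %ED%95%9C %2B
```

The "%uXXXX" form is not part of standard URL encoding, so whether a given server decodes it is not something to rely on.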