emsttam
asked on
Unicode / UTF-8 Weirdness with JSP / Servlets on Tomcat
My code actually works, but I do not understand how it is working, and I need to in order to fix another part of the system.
I have a JSP page which contains a contenteditable div that is used (IE only) as a simple HTML editor. The charset of the page is UTF-8. For the purposes of this example I copy and paste a Greek beta character into the div, and nothing else, and submit the contents to the server. I then intercept the passed parameter on the server thus:
String html = request.getParameter("doch tml");
Here is what I don't understand. The beta character is not represented, as I would expect, as U+03B2, but instead as two characters, U+00CE and U+00B2. These are 'Latin capital letter I with circumflex' and 'superscript two' (http://www.unicode.org/charts/PDF/U0080.pdf and see the attached image).
So what is going on? Why am I seeing these characters?
To confuse matters even more, this code is working: if I save the HTML in a MySQL database (using utf8) and then pull it back out into the browser, sure enough there is the Greek beta character again. How is this happening?
Thanks in advance for any pointers.
debug.png
What is happening is that the String html has been decoded using the default OS encoding, which I think is not UTF-8 but most probably cp1252. That is a single-byte encoding, so the two-byte UTF-8 sequence for the beta character comes out as two separate single-byte characters. When the page is loaded into the browser, because you have set the encoding of the content to UTF-8, the browser decodes those bytes as UTF-8 and the character is shown properly.
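The effect described above can be reproduced outside the servlet container. The following is only a minimal sketch, not the container's actual code path: the class and method names (`BetaMisdecode`, `misdecode`) are made up for illustration, and it uses ISO-8859-1 as the single-byte charset. For the two bytes involved here (0xCE and 0xB2), cp1252 and ISO-8859-1 map to the same characters, so the result is the same whichever of the two was really applied.

```java
import java.nio.charset.StandardCharsets;

public class BetaMisdecode {
    // Hypothetical helper: encode the text as UTF-8, then decode those bytes
    // as if every byte were one ISO-8859-1 character.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Greek small letter beta, U+03B2, is the two bytes CE B2 in UTF-8.
        String seen = misdecode("\u03B2");
        for (char c : seen.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Running `main` prints `U+00CE` and `U+00B2`, the two characters reported in the question.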
ASKER
gibu george,
Where and when do you think this 'conversion' is happening? Java uses Unicode internally, so why would there be any OS-related conversion?
You mean cp1252? It is the normal Windows charset.
ASKER CERTIFIED SOLUTION
ASKER
Thanks, you got me pretty much there. I think that, rather than the default platform charset, it may be using ISO-8859-1 (see the second entry here: http://www.jguru.com/faq/view.jsp?EID=137049), but that's splitting hairs.
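If the parameter really was decoded as ISO-8859-1, the original text is still recoverable, because ISO-8859-1 maps every byte to exactly one character and back. The sketch below assumes that is what happened; `RepairParameter` and `repair` are illustrative names, not part of any API. In a servlet, the usual alternative is to call `request.setCharacterEncoding("UTF-8")` before the first `getParameter` call, which affects parameters sent in the request body.

```java
import java.nio.charset.StandardCharsets;

public class RepairParameter {
    // Hypothetical helper: undo an ISO-8859-1 decode of bytes that were UTF-8.
    // Assumes every char in the input is in the range U+0000..U+00FF.
    static String repair(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The two characters seen in the debugger: U+00CE followed by U+00B2.
        String fixed = repair("\u00CE\u00B2");
        System.out.printf("U+%04X%n", (int) fixed.charAt(0));
    }
}
```

Here `main` prints `U+03B2`, the beta character that was originally pasted into the editor.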