Trysten
asked on
UTF-16 in XML
I'm trying to parse some XML with ElementTree, but it contains smileys in what seems to be UTF-16 decimal.
It has this `��` in it, but the <?xml?> declaration says the document is UTF-8.
How do I decode UTF-16? Is that even the right question to ask?
ASKER
Yes, it appears to be a UTF-16 surrogate code. Unescaping it seemed to attempt UTF-8 decoding, which didn't work. This site lists it as the UTF-16 decimal encoding: http://www.iemoji.com/view/emoji/3/smileys-people/smiling-face-with-open-mouth
ASKER
I eventually figured it out, and posted some example code for future experts.
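So, after you extract that string from your XML, you need to unescape it. Since Python 3.4, you can use the `unescape` function from the `html` module; see https://docs.python.org/3/library/html.html#html.unescape. Be aware that `html.unescape` follows the HTML5 rules and turns a reference to a surrogate code point (55357 is one) into U+FFFD, so a surrogate pair has to be recombined separately.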
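One way to recombine the pairs before handing the text to ElementTree is sketched below. This is only a sketch under assumptions: the function name `combine_surrogate_refs` and the sample document are made up for illustration, and the second value, 56835, is an assumption (it is the low surrogate that pairs with 55357 to give the emoji on the linked page). Surrogates are not legal XML characters, so a conforming parser is expected to reject references like these, which is why the fix is applied to the raw text first.

```python
import re
import xml.etree.ElementTree as ET

# Matches one decimal character reference, e.g. "&#55357;".
_DECIMAL_REF = re.compile(r"&#(\d+);")


def combine_surrogate_refs(xml_text):
    """Turn surrogate-range references into real characters (hypothetical helper)."""
    def to_code_unit(match):
        value = int(match.group(1))
        if 0xD800 <= value <= 0xDFFF:
            return chr(value)      # raw surrogate code unit, paired up below
        return match.group(0)      # any other reference is left untouched

    with_units = _DECIMAL_REF.sub(to_code_unit, xml_text)
    # Round-trip through UTF-16: "surrogatepass" lets the lone code units be
    # encoded, and decoding merges each high/low pair into one character.
    # An unpaired surrogate raises UnicodeDecodeError here.
    return with_units.encode("utf-16", "surrogatepass").decode("utf-16")


# Hypothetical sample document; the real input was not posted in full.
raw = "<msg>Hi &#55357;&#56835;</msg>"
root = ET.fromstring(combine_surrogate_refs(raw))
text = root.text
```

After this step `text` holds the actual emoji character, so no further unescaping is needed for it.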
Another problem is that your environment may not be able to display those characters (if your font does not have glyphs for the Unicode characters). In that case, you can write them to a text file and inspect them later.
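The code originally attached to this post is not available, so here is a minimal sketch of the idea, assuming the decoded string is already in a variable. The sample string and the file name `smileys.txt` are hypothetical.

```python
import tempfile
from pathlib import Path

# Hypothetical decoded string; U+1F603 is the emoji from the linked page.
text = "Hi \U0001F603"

# Hypothetical output location in the system temp directory.
out_path = Path(tempfile.gettempdir()) / "smileys.txt"

# Write with an explicit encoding so the result does not depend on the
# platform default, which may not be able to represent the character.
out_path.write_text(text, encoding="utf-8")

# ascii() gives an escaped form that is safe to print on any console.
escaped = ascii(text)
print(escaped)
```

Opening the file in an editor with a suitable font should show the smiley even if the console cannot.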
Anyway, you should check what 55357 means in Unicode. (I did not check.) These values need not be printable characters at all. They may be special "escape codes" for what follows.
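A quick way to run that check in Python is sketched below. The value 55357 comes from the question; the second value, 56835, is an assumption (the low surrogate that would complete the pair for the emoji on the linked page), and the variable names are only illustrative.

```python
import unicodedata

high = 55357   # from the question
low = 56835    # assumed second half of the pair

# Category "Cs" means "surrogate": not a character on its own.
category = unicodedata.category(chr(high))
printable = chr(high).isprintable()

# UTF-16 high surrogates are 0xD800-0xDBFF, low surrogates 0xDC00-0xDFFF.
is_high = 0xD800 <= high <= 0xDBFF
is_low = 0xDC00 <= low <= 0xDFFF

# Standard UTF-16 arithmetic for combining a surrogate pair.
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
name = unicodedata.name(chr(code_point))

print(hex(high), category, printable)   # 0xd83d Cs False
print(hex(code_point), name)            # 0x1f603 SMILING FACE WITH OPEN MOUTH
```

So 55357 is indeed not printable by itself: it is the first half of a two-part UTF-16 sequence, and only the combined code point names a character.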