
UTF-16 in XML

So I'm trying to parse some XML with ElementTree, but it's got smileys in what seems to be UTF-16 decimal.
It's got this `&#55357;&#56835;` in it, but the `<?xml?>` declaration says it's UTF-8.

How do I decode UTF-16? Is that the right question to ask?

pepr

The UTF-8 information is probably correct. The `&#55357;` and the like are really sequences of 8 plain characters called character references. In other words, the target character in Unicode has the number that you can see, but what you see is the textual representation of that number, written as a special kind of escape sequence.
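To illustrate that a numeric character reference is independent of the encoding declared in the XML header, here is a small sketch. The sample document and the `&#233;` reference are made up for this example; they are not taken from the asker's file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a document declared as UTF-8 that contains only ASCII bytes.
# "&#233;" is the decimal character reference for U+00E9.
doc = b'<?xml version="1.0" encoding="UTF-8"?><msg>caf&#233;</msg>'

root = ET.fromstring(doc)
text = root.text

# The parser replaces the reference with the code point that has that number.
print(ord(text[-1]))                 # the number from the reference
print(len("&#233;"), len(text[-1]))  # markup length vs. resulting length
```

The declared encoding only says how the bytes of the file map to characters; the reference itself is resolved by number afterwards.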

So, after you extract that string from your XML, you need to unescape it. Since Python 3.4, you can use the unescape function from the html module -- see https://docs.python.org/3/library/html.html#html.unescape

Another problem is that your environment may not be able to display those characters (if your font does not have glyphs for them). Anyway, you can write them to a text file and inspect them later, like this:

```python
import html

s = '&#62;'

with open('out.txt', 'w', encoding='utf-8') as f:
    f.write(html.unescape(s))
```


Anyway, you should check what 55357 means in Unicode. (I did not check.) These numbers need not be printable characters at all. They may be special "escape codes" for what follows.
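One way to run that check, as a sketch: Unicode reserves U+D800 to U+DBFF for high (lead) surrogates and U+DC00 to U+DFFF for low (trail) surrogates. The helper name `describe` is made up for this example.

```python
import unicodedata

def describe(n):
    """Classify a code point number (hypothetical helper for illustration)."""
    if 0xD800 <= n <= 0xDBFF:
        kind = "high surrogate"
    elif 0xDC00 <= n <= 0xDFFF:
        kind = "low surrogate"
    else:
        kind = "ordinary code point"
    # General category "Cs" means surrogate: not a character on its own.
    return "U+{:04X}".format(n), kind, unicodedata.category(chr(n))

print(describe(55357))  # first number from the question
print(describe(56835))  # second number from the question
```

Both numbers from the question fall in the surrogate ranges, so neither stands for a character by itself.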

Trysten

ASKER

Yes, apparently it's a surrogate code used by UTF-16. Unescaping seemed to try to use UTF-8 encoding, but it didn't work. This site claims it's "UTF-16 Dec" encoding: http://www.iemoji.com/view/emoji/3/smileys-people/smiling-face-with-open-mouth
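A minimal sketch of one way to deal with this, assuming the references always arrive as decimal high/low pairs like the one in the question. `html.unescape` turns each surrogate reference into U+FFFD (the replacement character), which is probably why unescaping appeared not to work. The function name `merge_surrogate_refs` and the sample `<msg>` element are hypothetical, and this is not necessarily the code posted in the accepted solution.

```python
import html
import re
import xml.etree.ElementTree as ET

REF = re.compile(r"&#(\d+);")  # decimal character references only

def merge_surrogate_refs(text):
    """Turn surrogate references into the real characters they encode as pairs."""
    def repl(m):
        n = int(m.group(1))
        if 0xD800 <= n <= 0xDFFF:
            return chr(n)      # keep the UTF-16 code unit as a lone surrogate
        return m.group(0)      # leave ordinary references (e.g. &#60;) alone
    with_units = REF.sub(repl, text)
    # Round-trip through UTF-16 so each high/low pair becomes one code point:
    # 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00).
    # An unpaired surrogate makes decode() raise UnicodeDecodeError.
    return with_units.encode("utf-16-le", "surrogatepass").decode("utf-16-le")

raw = "<msg>hi &#55357;&#56835;</msg>"      # hypothetical sample input
lost = html.unescape("&#55357;&#56835;")    # two U+FFFD characters
fixed = merge_surrogate_refs(raw)           # run this before parsing
root = ET.fromstring(fixed)
print(hex(ord(root.text[-1])))              # code point of the smiley
```

The merge is done on the raw text before it reaches ElementTree, since the XML specification does not allow surrogate numbers as character references.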

SOLUTION

pepr


ASKER CERTIFIED SOLUTION

Trysten


Trysten

ASKER

I eventually figured it out, and posted some example code for future experts.