Trysten

UTF-16 in XML

I'm trying to parse some XML with ElementTree, but it contains smileys in what seems to be UTF-16 decimal.
It has `&#55357;&#56835;` in it, but the <?xml?> declaration says the encoding is UTF-8.


How do I decode UTF-16? Is that even the right question to ask?
pepr

The UTF-8 declaration is probably correct. `&#55357;` and the like are really sequences of 8 plain characters called numeric character references. In other words, the target character has the Unicode number that you can see, but what you see is a textual representation of that number, a special kind of escape sequence.

So, after you extract that string from your XML, you need to unescape it. Since Python 3.4, you can use the unescape function from the html module; see https://docs.python.org/3/library/html.html#html.unescape

Another problem is that your environment may not be able to display those characters (for example, if your font has no glyphs for them). You can still write them to a text file and inspect them later, like this:
import html

# '&#62;' is the character reference for '>' (code point 62)
s = '&#62;'

# Write the unescaped text as UTF-8 so that any Unicode character can be stored
with open('out.txt', 'w', encoding='utf-8') as f:
    f.write(html.unescape(s))


Anyway, you should check what 55357 means in Unicode. (I did not check.) These code points need not be printable characters at all; they may be special "escape codes" for what follows.
Trysten

ASKER

Yes, apparently it's a surrogate code used by UTF-16. Unescaping seemed to try UTF-8 and didn't work. This site claims it's "UTF-16 Dec" encoding: http://www.iemoji.com/view/emoji/3/smileys-people/smiling-face-with-open-mouth
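
To make the surrogate idea concrete, here is a minimal sketch of the arithmetic that turns the two numbers from the question into a single code point. The helper name `combine_surrogates` is made up for this illustration. The last lines show what `html.unescape` does with the same input: it does not pair surrogates and returns a replacement character for each reference, which would explain why unescaping alone did not help.

```python
import html
import unicodedata

def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one Unicode code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    # Each surrogate carries 10 bits of the value (code point - 0x10000).
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

code_point = combine_surrogates(55357, 56835)  # the numbers from the question
print(hex(code_point))                         # 0x1f603
print(unicodedata.name(chr(code_point)))       # SMILING FACE WITH OPEN MOUTH

# html.unescape does not do this pairing; each surrogate reference on its own
# comes back as U+FFFD (the replacement character).
print(ascii(html.unescape('&#55357;&#56835;')))  # '\ufffd\ufffd'
```

So 55357 (0xD83D) and 56835 (0xDE03) together stand for U+1F603, the smiley the linked page describes.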
Trysten

ASKER

I eventually figured it out, and posted some example code for future experts.
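
The posted code is not shown in this thread, so what follows is only a sketch of one possible approach, not the asker's actual solution. It assumes the XML is available as a Python str and that the surrogate references always come in valid high/low pairs. Because XML does not allow character references to surrogate code points, the text is repaired before it is handed to ElementTree. The names `REF_RE` and `fix_surrogate_refs` are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# One decimal character reference, e.g. "&#55357;".
REF_RE = re.compile(r"&#([0-9]+);")

def _surrogate_ref_to_char(match):
    code = int(match.group(1))
    if 0xD800 <= code <= 0xDFFF:
        return chr(code)       # lone surrogate for now, paired up below
    return match.group(0)      # ordinary reference: leave it for the XML parser

def fix_surrogate_refs(xml_text):
    """Replace surrogate-pair character references with the real character."""
    with_surrogates = REF_RE.sub(_surrogate_ref_to_char, xml_text)
    # Round-trip through UTF-16 so that each high/low pair collapses into one
    # character. An unpaired surrogate makes the decode step raise
    # UnicodeDecodeError instead of passing through silently.
    return with_surrogates.encode("utf-16", "surrogatepass").decode("utf-16")

raw = "<msg>&#55357;&#56835;</msg>"
root = ET.fromstring(fix_surrogate_refs(raw))
print(hex(ord(root.text)))     # 0x1f603
```

Only references in the surrogate range are touched; everything else, such as `&#62;`, is left for the parser to resolve as usual.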