SithHax asked on

Help with strings

Hi,

I have an XML file with some strings.
The strings are a mixture of plain Latin letters, Chinese characters, and Eastern European character sets (UTF-8).

After I extract the strings I want to get their length.
I'm expecting 'í' to return a length of 1 instead of 2.
Can somebody help?
Python

Last Comment
SithHax

8/22/2022 - Mon
pepr

You have to convert the "plain old string" to a unicode string. A plain old string is just a sequence of bytes, and UTF-8 is one way of encoding a unicode string into such a sequence. A unicode string, on the other hand, is a sequence of characters rather than bytes. It has (almost) no encoding of its own: every character is stored the same way, whatever the language.
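The difference between the two kinds of string can be sketched as follows. The thread's terminology suggests Python 2; this example assumes Python 3, where the "plain old string" corresponds to `bytes` and the unicode string to `str`. The variable names are illustrative only.

```python
# One character, 'í' (U+00ED), written as an escape so the source file's
# own encoding does not matter.
text = "\u00ed"

# The same character encoded as UTF-8: a sequence of bytes.
raw = text.encode("utf-8")

print(len(raw))    # counts bytes
print(len(text))   # counts characters (code points)

# Decoding the bytes gives back a string whose length is counted in characters.
decoded = raw.decode("utf-8")
print(len(decoded))
```

Here `len(raw)` is 2 because UTF-8 needs two bytes for this character, while `len(text)` and `len(decoded)` are both 1.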
ASKER CERTIFIED SOLUTION
pepr

ASKER
SithHax

I'm using the minidom module. When I read the XML, it stores the strings as unicode (i.e. u'\x??\x??').
When I call len(), it gives me the number of bytes stored instead of what I'm expecting, which is the number of displayed characters.
I'm kind of confused about how to get around this problem. The strings in the XML will cover 20+ languages, from Chinese and Japanese to Russian.

I'll try the links you provided and see if they help.
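For reference, a minimal sketch of reading mixed-language text with minidom and measuring it, assuming Python 3. The element names (`items`, `item`) and the sample text are made up for illustration and do not come from the asker's file.

```python
import unicodedata
from xml.dom import minidom

# Hypothetical document: "café" followed by two Chinese characters.
xml_text = ('<?xml version="1.0" encoding="utf-8"?>'
            '<items><item>caf\u00e9 \u4e2d\u6587</item></items>')
xml_bytes = xml_text.encode("utf-8")

doc = minidom.parseString(xml_bytes)
item = doc.getElementsByTagName("item")[0]
value = item.firstChild.data          # text content as a Unicode string

char_count = len(value)                   # code points: 7
byte_count = len(value.encode("utf-8"))   # UTF-8 bytes: 12

# len() counts code points, not what is drawn on screen. An accented letter
# stored as a base letter plus a combining mark counts as 2 until normalized.
decomposed = "i\u0301"                    # 'i' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
print(char_count, byte_count, len(decomposed), len(composed))
```

If the counts still look like byte counts after parsing, the usual suspects are the declared encoding in the XML header or a string that was never decoded.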
ASKER
SithHax

I found my mistake: I had set the XML encoding incorrectly, to ascii instead of utf-8. Anyway, thanks for the reply.
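The effect of a wrong encoding declaration can be sketched like this, assuming Python 3. Note the substitution: the asker's declaration was ascii, but this example declares ISO-8859-1 instead, because it accepts any byte value and so shows the mismatch as a wrong length. With an ascii declaration the parser may reject the non-ASCII bytes outright. The helper `text_length` is hypothetical.

```python
from xml.dom import minidom

# 'í' (U+00ED) stored in the file as two UTF-8 bytes.
payload = "\u00ed".encode("utf-8")

def text_length(declared_encoding):
    """Parse the same bytes under a given declared encoding and
    return the length of the resulting text node."""
    header = '<?xml version="1.0" encoding="%s"?>' % declared_encoding
    document = header.encode("ascii") + b"<s>" + payload + b"</s>"
    doc = minidom.parseString(document)
    return len(doc.documentElement.firstChild.data)

print(text_length("utf-8"))        # declaration matches the bytes
print(text_length("iso-8859-1"))   # each byte is read as its own character
```

With the matching declaration the length is 1. With the mismatched one, each of the two bytes is decoded as a separate character, so the length is 2.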