Encoding issues in Python

OK, so I have a string called title that holds movie titles in many languages (English, Russian, Japanese, etc.).

My script keeps erroring out because it cannot encode the title and save it to a MySQL database.

So help me out: how do I encode it (unicode or something) so that it works for all languages and character sets? Right now my code is:
    title = result.group(1).strip().replace("'", "")[0:40] + '...'
    title = unicode(title, "utf-8")


GVNPublic123 Asked:

GVNPublic123 (Author) Commented:
Sample errors:
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 246-254: ordinal not in range(256)


UnicodeDecodeError: 'utf8' codec can't decode bytes in position 39-40: invalid data
      args = ('utf8', '\xd0\x92\xd0\xbb\xd0\xb0\xd0\xb4\xd0\xb8\xd0\xbc\xd0\xb8\xd1\x80 \xd0\x92\xd1\x8b\xd1\x81\xd0\xbe\xd1\x86\xd0\xba\xd0\xb8\xd0\xb9 \xd0\xb2 \xd1\x81\xd0...', 39, 41, 'invalid data')


GVNPublic123 (Author) Commented:
Also, my table collation is utf8_general_ci.
GVNPublic123 (Author) Commented:
It looks like the MySQL Python driver I used took its collation from the database, not the table, so changing the database collation to UTF-8 fixed all the latin-1 errors. Now I'm stuck with the utf8 ones, like:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 39-40: invalid data
      args = ('utf8', '\xd0\x92\xd0\xbb\xd0\xb0\xd0\xb4\xd0\xb8\xd0\xbc\xd0\xb8\xd1\x80 \xd0\x92\xd1\x8b\xd1\x81\xd0\xbe\xd1\x86\xd0\xba\xd0\xb8\xd0\xb9 \xd0\xb2 \xd1\x81\xd0...', 39, 41, 'invalid data') 



How should I sanitize strings to only allow utf-8 encodable characters?
pepr Commented:
Is result.group(1) the product of a regular expression? If you read a string from MySQL into a Python variable s, can you print type(s)? Is it really a unicode string, or is it a stream of bytes in UTF-8 encoding?
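
A minimal sketch of that check, written so that it runs on both Python 2 and Python 3. The helper name describe is hypothetical, not something from the question:

```python
def describe(value):
    """Report whether a value is decoded text or raw bytes."""
    text_type = type(u"")  # unicode on Python 2, str on Python 3
    if isinstance(value, text_type):
        return "text"
    if isinstance(value, bytes):  # bytes is an alias of str on Python 2
        return "bytes"
    return type(value).__name__


print(describe(u"abc"))  # a decoded (unicode) string
print(describe(b"abc"))  # raw bytes, e.g. UTF-8 data straight from a file or socket
```

If the value read from the database or matched by the regular expression reports as bytes, it still has to be decoded before any slicing.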

It seems to me that result.group(1) was obtained via a regular expression. What does the regular expression look like? Is the pattern written as a unicode string? If so, and if the original string is also unicode, then the matched group should be a unicode string too, and the arguments to .replace() should be unicode as well.

If you have a UTF-8 encoded string, it is actually a stream of bytes. The encoding is not recorded in the string object (in Python 2.x). UTF-8 uses byte sequences of variable length for a single character. So when you slice with [0:40], sooner or later you will cut one character's byte sequence in the middle, and that causes an error when you try to convert to a unicode string. Your traceback shows exactly this: the byte at position 39 is a lone \xd0 lead byte, followed by the '...' you appended. You have to convert to unicode first and slice afterwards.
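
Here is a small sketch of the decode-first order, assuming the captured group arrives as UTF-8 bytes. The helper shorten_title and the sample title are made up for illustration; the cleanup steps mirror the code in the question:

```python
# -*- coding: utf-8 -*-

def shorten_title(raw_bytes, limit=40):
    """Decode UTF-8 bytes to text first, then clean up and cut by characters."""
    text = raw_bytes.decode("utf-8")        # bytes -> unicode text
    text = text.strip().replace(u"'", u"")  # same cleanup as in the question
    return text[:limit] + u"..."            # slicing text never splits a character


# Hypothetical sample: a 22-character Russian title that is 41 bytes in UTF-8.
raw = u"Москва слезам не верит".encode("utf-8")

# Slicing the bytes first cuts the last two-byte character in half,
# so the decode step fails the same way as in the traceback.
try:
    raw[:40].decode("utf-8")
    bytes_first_ok = True
except UnicodeDecodeError:
    bytes_first_ok = False

# Decoding first and slicing afterwards works for any limit.
short = shorten_title(raw, limit=10)
```

With this order the limit counts characters rather than bytes, so a 40-character Russian or Japanese title can take more than 40 bytes when stored; the column size has to allow for that.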