• Status: Solved

Encoding issues in Python

OK, so I have a string called title that holds movie titles in all sorts of languages (English, Russian, Japanese, etc.).

My script keeps erroring out because it cannot encode the title and save it to the MySQL database.

So help me out: how do I encode it (unicode or something) so that it works for all languages and character sets? Right now my code is:
  title = result.group(1).strip().replace("'", "")[0:40]+'...'
  title = unicode(title, "utf-8")


Asked by: GVNPublic123
 
GVNPublic123 (Author) commented:
Sample errors:
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 246-254: ordinal not in range(256)


UnicodeDecodeError: 'utf8' codec can't decode bytes in position 39-40: invalid data
      args = ('utf8', '\xd0\x92\xd0\xbb\xd0\xb0\xd0\xb4\xd0\xb8\xd0\xbc\xd0\xb8\xd1\x80 \xd0\x92\xd1\x8b\xd1\x81\xd0\xbe\xd1\x86\xd0\xba\xd0\xb8\xd0\xb9 \xd0\xb2 \xd1\x81\xd0...', 39, 41, 'invalid data')


 
GVNPublic123 (Author) commented:
Also, my table collation is utf8_general_ci.
 
GVNPublic123 (Author) commented:
It looks like the MySQL Python connection I used picked up the collation from the database, not the table, so changing the database collation to UTF-8 fixed all the latin-1 errors. Now I'm stuck with the utf8 ones, like:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 39-40: invalid data
      args = ('utf8', '\xd0\x92\xd0\xbb\xd0\xb0\xd0\xb4\xd0\xb8\xd0\xbc\xd0\xb8\xd1\x80 \xd0\x92\xd1\x8b\xd1\x81\xd0\xbe\xd1\x86\xd0\xba\xd0\xb8\xd0\xb9 \xd0\xb2 \xd1\x81\xd0...', 39, 41, 'invalid data') 



How should I sanitize strings to only allow utf-8 encodable characters?
 
pepr commented:
Is result.group(1) the product of a regular expression?  If you read a string from MySQL into a Python variable s, can you try print type(s)?  Is it really a unicode string, or is it a stream of bytes in UTF-8 encoding?

It seems to me that result.group(1) was obtained via a regular expression.  What does the regular expression look like?  Is the pattern written as a unicode string?  If it is, and if the original string is also unicode, then 'result' should be a unicode string too, and the pattern passed to .replace() should be unicode as well.
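To illustrate the point about keeping everything unicode, here is a minimal sketch. The input bytes, the `<title>` pattern, and the variable names are all made up, since the real regular expression has not been posted. It decodes the bytes once, then uses unicode literals for both the pattern and the .replace() arguments, so group(1) comes back as text rather than bytes.

```python
import re

# Hypothetical input: UTF-8 bytes as they might arrive from a web page.
page_bytes = (u'<title> \u0412\u043b\u0430\u0434\u0438\u043c\u0438\u0440 </title>'
              .encode('utf-8'))

page_text = page_bytes.decode('utf-8')          # decode once, at the boundary
pattern = re.compile(u'<title>(.*?)</title>')   # hypothetical pattern, written as text
result = pattern.search(page_text)

# Unicode arguments to .replace() keep the whole chain in text.
title = result.group(1).strip().replace(u"'", u"")

print(type(title))   # unicode on Python 2.x, str on Python 3
```

The u'' prefixes are only there so the sketch runs unchanged on both Python 2.x and Python 3.3+.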

If you have a UTF-8 encoded string, it is actually a stream of bytes.  The encoding is not recorded in the string object (in Python 2.x).  UTF-8 uses byte sequences of variable length for a single character.  So when you slice with [:40], sooner or later you will cut the sequence for one character in the middle, and that causes an error when you try to convert to a unicode string.  You have to convert to unicode first and slice afterwards.
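A rough sketch of "decode first, slice after" follows. The first 21 characters of the sample are the ones visible in the error message above; the rest of the title is invented because the original is truncated, and shorten_title is a hypothetical helper name. With this sample, byte 39 is the first byte of a two-byte character, which seems to match the "position 39-40" in the traceback.

```python
# First 21 characters come from the error message; the tail is invented.
SAMPLE = (u'\u0412\u043b\u0430\u0434\u0438\u043c\u0438\u0440 '
          u'\u0412\u044b\u0441\u043e\u0446\u043a\u0438\u0439 \u0432 \u0441'
          u'\u043e\u0432\u0435\u0442\u0441\u043a\u043e\u043c \u043a\u0438\u043d\u043e')
raw = SAMPLE.encode('utf-8')   # a UTF-8 byte string, as in the question

# Slicing the bytes first cuts a two-byte character in half.
try:
    (raw[0:40] + b'...').decode('utf-8')
except UnicodeDecodeError as exc:
    print('bytes sliced first: %s' % exc)


def shorten_title(raw_bytes, max_chars=40, suffix=u'...'):
    """Decode UTF-8 bytes to text, clean up, then truncate by characters."""
    text = raw_bytes.decode('utf-8')            # bytes -> unicode text first
    text = text.strip().replace(u"'", u"")      # same clean-up as the question
    return text[:max_chars] + suffix            # slicing text cannot split a character


title = shorten_title(raw, max_chars=22)
print(len(title))
```

Like the original code, the helper always appends the '...' suffix, even when the title is shorter than the limit.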
Question has a verified solution.