Encoding issues in Python

Ok, so I have a string called title that holds movie titles in all sorts of languages (English, Russian, Japanese, etc.).

My script keeps erroring out because it can't encode the title and save it to the MySQL database.

So help me out: how do I encode it (Unicode or something) so it works for all languages and character sets? Right now my code is:

  title = result.group(1).strip().replace("&#39;", "")[0:40]+'...'
  title = unicode(title, "utf-8")

GVNPublic123

ASKER

Sample errors:
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 246-254: ordinal not in range(256)

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 39-40: invalid data
args = ('utf8', '\xd0\x92\xd0\xbb\xd0\xb0\xd0\xb4\xd0\xb8\xd0\xbc\xd0\xb8\xd1\x80 \xd0\x92\xd1\x8b\xd1\x81\xd0\xbe\xd1\x86\xd0\xba\xd0\xb8\xd0\xb9 \xd0\xb2 \xd1\x81\xd0...', 39, 41, 'invalid data')

GVNPublic123

ASKER

Also, my table collation is utf8_general_ci.

GVNPublic123

ASKER

Looks like MySQL Python used the collation it picked up from the database, not the table, so changing the database collation to UTF-8 fixed all the latin-1 errors. Now I'm stuck with the utf8 ones, like:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 39-40: invalid data
      args = ('utf8', '\xd0\x92\xd0\xbb\xd0\xb0\xd0\xb4\xd0\xb8\xd0\xbc\xd0\xb8\xd1\x80 \xd0\x92\xd1\x8b\xd1\x81\xd0\xbe\xd1\x86\xd0\xba\xd0\xb8\xd0\xb9 \xd0\xb2 \xd1\x81\xd0...', 39, 41, 'invalid data')

How should I sanitize strings to only allow utf-8 encodable characters?
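
One likely cause, judging only from the traceback: the title is sliced to 40 bytes before it is decoded, so a two-byte Cyrillic character gets cut in half at byte 39 and the '...' lands right after the orphaned \xd0. A minimal sketch of decoding first and truncating by characters afterwards is below. It is written in Python 3 syntax (in Python 2 the equivalent step would be calling unicode(raw, "utf-8") before slicing). The helper name shorten_title and the sample data are made up for illustration; only the first 37 bytes are taken from the traceback.

```python
def shorten_title(raw_bytes, max_chars=40, suffix="..."):
    """Decode UTF-8 bytes first, then truncate by characters, not bytes."""
    # errors="replace" turns any genuinely invalid bytes into U+FFFD
    # instead of raising UnicodeDecodeError.
    text = raw_bytes.decode("utf-8", errors="replace")
    text = text.strip().replace("&#39;", "")
    if len(text) > max_chars:
        text = text[:max_chars] + suffix
    return text


# First 37 bytes of the string shown in the traceback (20 characters).
sample = (b"\xd0\x92\xd0\xbb\xd0\xb0\xd0\xb4\xd0\xb8\xd0\xbc\xd0\xb8\xd1\x80 "
          b"\xd0\x92\xd1\x8b\xd1\x81\xd0\xbe\xd1\x86\xd0\xba\xd0\xb8\xd0\xb9 "
          b"\xd0\xb2 ")
# Hypothetical longer title: the sample repeated so it exceeds 40 bytes.
raw = sample * 3

# Slicing the bytes reproduces the problem: byte 39 is a lone lead byte.
broken = raw[:40] + b"..."
try:
    broken.decode("utf-8")
    byte_slice_ok = True
except UnicodeDecodeError:
    byte_slice_ok = False

# Decoding first and slicing characters gives a valid string.
title = shorten_title(raw)
```

With this ordering the result can always be re-encoded as UTF-8, because the cut falls between whole characters. Whether the database connection itself is also set to UTF-8 is a separate question that this sketch does not cover.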

ASKER CERTIFIED SOLUTION

pepr
