GVNPublic123

Encoding issues in Python

I have a string called title that holds movie titles in many languages (English, Russian, Japanese, and so on).

My script keeps erroring out because it cannot encode the title and save it to a MySQL database.

So how do I encode the string (as Unicode or something similar) so that it works for all languages and character sets? Right now my code is:
    title = result.group(1).strip().replace("'", "")[0:40]+'...'
    title = unicode(title, "utf-8")


GVNPublic123

ASKER

Sample errors:
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 246-254: ordinal not in range(256)


UnicodeDecodeError: 'utf8' codec can't decode bytes in position 39-40: invalid data
      args = ('utf8', '\xd0\x92\xd0\xbb\xd0\xb0\xd0\xb4\xd0\xb8\xd0\xbc\xd0\xb8\xd1\x80 \xd0\x92\xd1\x8b\xd1\x81\xd0\xbe\xd1\x86\xd0\xba\xd0\xb8\xd0\xb9 \xd0\xb2 \xd1\x81\xd0...', 39, 41, 'invalid data')


Also, my table collation is utf8_general_ci.
It looks like the MySQL Python handler I use picked up the collation from the database, not from the table, so changing the database collation to UTF-8 fixed all the latin-1 errors. Now I'm stuck with the utf8 ones, like:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 39-40: invalid data
      args = ('utf8', '\xd0\x92\xd0\xbb\xd0\xb0\xd0\xb4\xd0\xb8\xd0\xbc\xd0\xb8\xd1\x80 \xd0\x92\xd1\x8b\xd1\x81\xd0\xbe\xd1\x86\xd0\xba\xd0\xb8\xd0\xb9 \xd0\xb2 \xd1\x81\xd0...', 39, 41, 'invalid data') 



How should I sanitize strings to only allow utf-8 encodable characters?
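
One likely cause, judging only from the traceback above: the byte string in args is exactly 40 bytes followed by the literal '...', and byte 39 is \xd0, the first half of a two-byte Cyrillic letter. That suggests the [0:40] slice is applied to raw UTF-8 bytes and cuts a character in half before the string is decoded. The sketch below decodes first and truncates afterwards. The helper name make_title, its max_chars parameter, and the sample title are hypothetical and not from the original script. It uses raw_bytes.decode("utf-8"), which does the same job as unicode(title, "utf-8") in Python 2.

```python
# -*- coding: utf-8 -*-

def make_title(raw_bytes, max_chars=40):
    """Hypothetical helper: decode UTF-8 bytes, then truncate by characters."""
    text = raw_bytes.decode("utf-8")        # bytes -> Unicode text first
    text = text.strip().replace(u"'", u"")  # same clean-up as the original code
    return text[:max_chars] + u"..."        # slicing text cannot split a character


# Hypothetical sample title, encoded as UTF-8 bytes (an assumption about the input).
raw = u"Владимир Высоцкий и его песни".encode("utf-8")

# Slicing the raw bytes first can cut a two-byte Cyrillic letter in half:
try:
    (raw[:40] + b"...").decode("utf-8")
except UnicodeDecodeError as exc:
    print("byte slicing failed: %s" % exc)

# Decoding first and slicing the text afterwards is safe:
print(make_title(raw, max_chars=20))
```

With this order of operations no sanitizing step should be needed for valid UTF-8 input, since the limit counts characters rather than bytes.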
ASKER CERTIFIED SOLUTION
pepr
