bschwarting

'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I'm getting this error:
Exception has occurred: UnicodeDecodeError
'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Using this code:
json_obj = urllib.request.urlopen(url).read() 

response = urllib.request.urlopen(url).read()

json_obj = str(response, 'utf-8')

data = json.loads(json_obj)

gelonida

It seems the response is not encoded as UTF-8.

You might try:

json_obj = str(response, 'cp1252') 

This is probably the second most popular encoding.

If the HTTP response is 'clean', then its headers should tell you which encoding was used, so you can use that encoding instead of guessing.
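
A minimal sketch of reading the encoding from the headers instead of guessing. It assumes the server declares a charset in its Content-Type header; the helper name decode_body and the UTF-8 fallback are illustrative choices, not something taken from the original code.

```python
from email.message import Message


def decode_body(raw: bytes, headers: Message, default: str = "utf-8") -> str:
    """Decode a response body using the charset declared in Content-Type.

    Falls back to `default` when the header carries no charset parameter.
    """
    charset = headers.get_content_charset() or default
    return raw.decode(charset)
```

With urllib, response.headers is an email.message.Message subclass, so the call would look like decode_body(response.read(), response.headers). This only helps when the body really is text; it will not fix a body that is compressed or otherwise binary.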
bschwarting

ASKER

This is the response I get now.

Exception has occurred: UnicodeDecodeError
'charmap' codec can't decode byte 0x9d in position 246: character maps to <undefined>
Perhaps I am missing something, but it appears that json_obj is loaded and then almost immediately overwritten:

json_obj = urllib.request.urlopen(url).read()
response = urllib.request.urlopen(url).read()
json_obj = str(response, 'utf-8')
I changed it to this, just to make sure, and got the same error:

json_obj = urllib.request.urlopen(url).read()
response = urllib.request.urlopen(url).read()
json_obj2 = str(response, 'utf-8')
data = json.loads(json_obj2)
What's the proper syntax for specifying the encoding when opening the URL? I found the example below:

with open('unicode.txt', encoding='utf-8') as f:
    for line in f:
        print(repr(line))

Any thoughts?
Try the following code for your URL to learn what is actually read.
import binascii
import urllib.request

url = 'http://python.org/'
response = urllib.request.urlopen(url)  # it returns HTTPResponse object open for reading
buf = response.read(50)                 # read 50 bytes of the response
print(binascii.hexlify(buf))
print(repr(buf))

This one prints
d:\__Python\ee29119512>py a.py
b'3c21646f63747970652068746d6c3e0a3c212d2d5b6966206c7420494520375d3e2020203c68746d6c20636c6173733d226e'
b'<!doctype html>\n<!--[if lt IE 7]>   <html class="n'

Here is the result.  What is this?

b'1f8b0800000000000000ed9c7f6fdb389ac7ff9f57c10db0e9dd229445fd568b62e03869e39bb4c9d5c9748bbb43415194ad'
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00\xed\x9c\x7fo\xdb8\x9a\xc7\xff\x9fW\xc1\r\xb0\xe9\xdd"\x94E\xfdV\x8bb\xe08i\xe3\x9b\xb4\xc9\xd5\xc9t\x8b\xbbCAQ\x94\xad'
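
For what it's worth, the leading bytes 1f 8b 08 are the gzip magic number followed by the deflate method byte, so the body looks gzip-compressed rather than text in some other encoding. That would also explain why decoding fails on byte 0x8b at position 1. A minimal sketch, assuming the server is sending gzip-compressed UTF-8 JSON (the helper name decode_json_body is hypothetical):

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_json_body(raw: bytes, encoding: str = "utf-8"):
    """Parse a JSON response body, decompressing it first if it is gzip data."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    # Assumption: the decompressed payload is text in `encoding`.
    return json.loads(raw.decode(encoding))
```

With the code from the question this would be used as data = decode_json_body(urllib.request.urlopen(url).read()). Checking the magic bytes rather than always decompressing keeps the helper working if the server sometimes answers uncompressed.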
ASKER CERTIFIED SOLUTION
pepr