Solved

Python - Internet access

Posted on 2014-09-30
Last Modified: 2014-09-30
I am adapting the following code from https://docs.python.org/3/tutorial/ section 10.7 Internet access:

from urllib.request import urlopen

# for line in urlopen('http://apps.rhs.org.uk/horticulturaldatabase/orchidregister/orchiddetails.asp?ID=171586'):
for line in urlopen('http://plantilus.com/plantdb/RlczLepr/index.html'):
    line = line.decode('utf-8')
    if 'Genus' in line or 'Seed' in line:
        print(line)

Got expected results without any error.

When I changed to a different URL, http://apps.rhs.org.uk/horticulturaldatabase/orchidregister/orchiddetails.asp?ID=171586, I got a UnicodeDecodeError (invalid start byte).

Note: Similar code runs successfully in Perl. I hope to be able to do this in Python.

Hope someone here could explain what the problem is.
pax
Question by:cpeters5
Accepted Solution

by:
clockwatcher earned 500 total points
ID: 40353854
I can't get to ID=171586, but I checked another page that I can get to (ID=114368). The problem appears to be that even though the page claims (based on its meta charset tag) to be UTF-8 encoded, the Google Tag Manager comment contains a non-UTF-8 character:

  <!-- Google Tag Manager ? Carat -->
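If you want to confirm which byte is at fault, the UnicodeDecodeError itself carries the offset and the reason. Here is a minimal sketch; the helper name find_bad_bytes and the sample bytes are made up for illustration (I'm assuming a 0x96 byte, and the real page may contain something different):

```python
def find_bad_bytes(data, encoding="utf-8"):
    """Return (offset, offending bytes, reason) for the first decode error, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start / exc.end delimit the bytes the codec could not decode
        return exc.start, data[exc.start:exc.end], exc.reason
    return None


# Hypothetical sample line; 0x96 is not a valid UTF-8 start byte.
sample = b"<!-- Google Tag Manager \x96 Carat -->"
result = find_bad_bytes(sample)
print(result)
```

Running it against each raw line from urlopen should point you at the exact position of the bad character.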

from urllib.request import urlopen

# for line in urlopen('http://apps.rhs.org.uk/horticulturaldatabase/orchidregister/orchiddetails.asp?ID=171586'):
# for line in urlopen('http://plantilus.com/plantdb/RlczLepr/index.html'):

for line in urlopen('http://apps.rhs.org.uk/horticulturaldatabase/orchidregister/orchiddetails.asp?ID=114368'):
    try:
        line = line.decode('utf-8')
    except UnicodeDecodeError:
        # Fall back to a lenient decode; 'ignore' would skip the bad character instead
        line = line.decode('utf-8', 'replace')
        print("Non UTF8 found in: {0}".format(line))

    if 'Genus' in line or 'Seed' in line:
        print(line)


In other words, change your decode to either:
line = line.decode('utf-8', 'ignore')

or
line = line.decode('utf-8', 'replace')


Either option handles pages that aren't truly UTF-8 (like your orchid detail page).
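To see how the two error handlers differ, here is a small sketch on a made-up byte string (again assuming a stray 0x96 byte; the wrapper decode_line is just for illustration). 'ignore' drops the bad byte silently, while 'replace' substitutes U+FFFD, the Unicode replacement character, so you can still see where something went wrong:

```python
def decode_line(data, errors="replace"):
    """Decode bytes as UTF-8, handling invalid bytes according to `errors`."""
    return data.decode("utf-8", errors)


# Hypothetical sample line with one byte that is not valid UTF-8.
raw = b"<!-- Google Tag Manager \x96 Carat -->"

ignored = decode_line(raw, "ignore")    # bad byte is dropped
replaced = decode_line(raw, "replace")  # bad byte becomes U+FFFD

print(repr(ignored))
print(repr(replaced))
```

Since you are only searching for 'Genus' and 'Seed', either handler should work; 'replace' is probably the safer default because it leaves a visible marker in the output.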

Author Closing Comment

by:cpeters5
ID: 40353860
Clockwatcher, Thank you!
pax