Solved

Python - Internet access

Posted on 2014-09-30
Last Modified: 2014-09-30
I am adapting the following code from https://docs.python.org/3/tutorial/ section 10.7 Internet access:

from urllib.request import urlopen

# for line in urlopen('http://apps.rhs.org.uk/horticulturaldatabase/orchidregister/orchiddetails.asp?ID=171586'):
for line in urlopen('http://plantilus.com/plantdb/RlczLepr/index.html'):
    line = line.decode('utf-8')
    if 'Genus' in line or 'Seed' in line:
        print(line)

Got expected results without any error.

When I changed to a different URL, http://apps.rhs.org.uk/horticulturaldatabase/orchidregister/orchiddetails.asp?ID=171586, I got a UnicodeDecodeError (invalid start byte).

Note: Similar code runs successfully in Perl. I hope to be able to do this in Python.

Hope someone here could explain what the problem is.
pax
Question by:cpeters5

Accepted Solution

by:
clockwatcher earned 500 total points
ID: 40353854
I can't get to ID=171586, but I checked another page that I can reach (ID=114368). The problem appears to be that, even though the page claims (based on its meta charset tag) to be UTF-8 encoded, the Google Tag Manager comment contains a byte that is not valid UTF-8:

  <!-- Google Tag Manager ? Carat -->
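
To reproduce the failure without touching the network, here is a minimal sketch. It assumes the stray character is a Windows-1252 en dash (byte 0x96); that is only a guess, since the character shows up as "?" above. The sample bytes are made up for illustration.

```python
# Hypothetical sample: the 0x96 byte is an ASSUMED Windows-1252 en dash,
# not something confirmed from the real page.
raw = b'<!-- Google Tag Manager \x96 Carat -->'

try:
    raw.decode('utf-8')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The exception records why decoding failed and where.
print(error.reason)                # why the decoder gave up
print(error.start)                 # index of the offending byte
print(raw[error.start:error.end])  # the offending byte itself
```

The start and end attributes make it easy to find the exact byte that a page got wrong.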

from urllib.request import urlopen

# for line in urlopen('http://apps.rhs.org.uk/horticulturaldatabase/orchidregister/orchiddetails.asp?ID=171586'):
# for line in urlopen('http://plantilus.com/plantdb/RlczLepr/index.html'):

for line in urlopen('http://apps.rhs.org.uk/horticulturaldatabase/orchidregister/orchiddetails.asp?ID=114368'):
    try:
        line = line.decode('utf-8')
    except UnicodeDecodeError:
        # 'replace' substitutes U+FFFD for each bad byte; 'ignore' would skip it instead
        line = line.decode('utf-8', 'replace')
        print("Non UTF8 found in: {0}".format(line))

    if 'Genus' in line or 'Seed' in line:
        print(line)



In other words, change your decode to either:
line = line.decode('utf-8', 'ignore')


or
line = line.decode('utf-8', 'replace')



Either option handles pages that aren't truly UTF-8 (like your orchid detail page).
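
A quick way to compare the two handlers, again on a made-up byte string with an assumed 0x96 byte rather than the real page. The last line is only relevant if the page really is Windows-1252, which I have not verified:

```python
# Hypothetical input: 0x96 is an ASSUMED stray Windows-1252 byte.
raw = b'Genus \x96 Cattleya'

ignored = raw.decode('utf-8', 'ignore')    # silently drops the bad byte
replaced = raw.decode('utf-8', 'replace')  # substitutes U+FFFD for it

# If the true encoding were known to be Windows-1252 (an assumption here),
# decoding with it would recover the intended character instead.
as_cp1252 = raw.decode('cp1252')

print(len(raw), len(ignored), len(replaced), len(as_cp1252))
```

'ignore' loses the character without a trace, while 'replace' leaves a visible marker, which is usually easier to debug.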

Author Closing Comment

by:cpeters5
ID: 40353860
Clockwatcher, Thank you!
pax