Solved

Python - Internet access

Posted on 2014-09-30
Last Modified: 2014-09-30
I am adapting the following code from https://docs.python.org/3/tutorial/ section 10.7 Internet access:

from urllib.request import urlopen

# for line in urlopen('http://apps.rhs.org.uk/horticulturaldatabase/orchidregister/orchiddetails.asp?ID=171586'):
for line in urlopen('http://plantilus.com/plantdb/RlczLepr/index.html'):
    line = line.decode('utf-8')  # each line arrives as bytes
    if 'Genus' in line or 'Seed' in line:
        print(line)

Got expected results without any error.

When I changed to a different URL, http://apps.rhs.org.uk/horticulturaldatabase/orchidregister/orchiddetails.asp?ID=171586, I got a UnicodeDecodeError (invalid start byte).

Note: Similar code runs successfully in Perl. I hope to be able to do this in Python.

I hope someone here can explain what the problem is.
pax
Question by:cpeters5

Accepted Solution

by:
clockwatcher earned 500 total points
ID: 40353854
I can't get to ID=171586, but I checked another page that I can reach (ID=114368). The problem appears to be that even though the page claims (based on its meta charset tag) to be UTF-8 encoded, the Google Tag Manager comment contains a character that is not valid UTF-8:

  <!-- Google Tag Manager ? Carat -->
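
If you want to see exactly which byte is tripping the decoder, the exception itself carries that information. A minimal sketch, assuming the stray byte is 0x96 (what Windows-1252 uses for an en dash); I haven't confirmed that this is the byte on the real page, and the sample line is made up to mirror the comment above:

```python
# Hypothetical sample line; the 0x96 byte is an assumption standing in
# for whatever non-UTF-8 byte the real page contains.
raw = b"<!-- Google Tag Manager \x96 Carat -->"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    # The exception records where decoding failed and why.
    offset = err.start
    bad_byte = raw[err.start:err.end]
    reason = err.reason
    print("offset:", offset)
    print("byte:", bad_byte)
    print("reason:", reason)
```

UnicodeDecodeError exposes start, end, and reason, so the same pattern should work on a line read from urlopen.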

from urllib.request import urlopen

# for line in urlopen('http://apps.rhs.org.uk/horticulturaldatabase/orchidregister/orchiddetails.asp?ID=171586'):
# for line in urlopen('http://plantilus.com/plantdb/RlczLepr/index.html'):

for line in urlopen('http://apps.rhs.org.uk/horticulturaldatabase/orchidregister/orchiddetails.asp?ID=114368'):
    try:
        line = line.decode('utf-8')
    except UnicodeDecodeError:
        # Fall back to a lenient decode; 'ignore' would skip the bad character instead
        line = line.decode('utf-8', 'replace')
        print("Non UTF8 found in: {0}".format(line))

    if 'Genus' in line or 'Seed' in line:
        print(line)


In other words, change your decode to either:
line = line.decode('utf-8', 'ignore')

or
line = line.decode('utf-8', 'replace')


Either option handles pages that aren't truly UTF-8 (like your orchid detail page).
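
For reference, here is how the two error handlers differ on the same input. This is only a sketch on a made-up byte string (the 0x96 byte is an assumption, not taken from the page):

```python
# Hypothetical bytes; 0x96 on its own is not a valid UTF-8 sequence.
raw = b"Genus \x96 Seed"

ignored = raw.decode("utf-8", "ignore")    # drops the undecodable byte
replaced = raw.decode("utf-8", "replace")  # substitutes U+FFFD for it

print(repr(ignored))
print(repr(replaced))
```

'ignore' silently drops the byte, while 'replace' leaves the U+FFFD replacement character behind so you can still see where something was lost.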

Author Closing Comment

by:cpeters5
ID: 40353860
Clockwatcher, Thank you!
pax