Solved

Python, HTML parse error due to malformed code. Help!

Posted on 2009-04-25
Last Modified: 2012-05-06
As the code below shows, the page I am parsing contains an error in the line:
 '<option selected">Conference</option>'
which should read:
 '<option selected="selected">Conference</option>'

The problem is that when I parse the webpage using the HTMLParser module, it falls over saying "HTMLParser.HTMLParseError: malformed start tag, at line 107, column 17" and I agree!

Has anyone any suggestions as to how I overcome the error?

Many many thanks,

Andrew
Actual extract from:
http://www.intute.ac.uk/artsandhumanities/cgi-bin/conferences.pl?type=Conference&subject=artifact1|Visual&term=&submit=Show+matching+events
 
Amusingly, their motto is "best of the web".
 
<select name="type">
<option selected">Conference</option>
<option>All events</option>
<option>Conference</option>
</select>
 
Firefox's correction, as per the HTML standard:
 
<select name="type">
<option selected="selected">Conference</option>
<option>All events</option>
<option>Conference</option>
</select>

Question by:ScriberUK
7 Comments
 
LVL 29

Expert Comment

by:pepr
ID: 24233606
It may also depend on your intention. If you just want to extract some information from the page, then the BeautifulSoup parser may be a better choice for you (see http://www.crummy.com/software/BeautifulSoup/). It was also designed to cope with malformed pages, and it has very nice features for searching the extracted information.

If this is the case, reformulate your wish here.
 
LVL 29

Expert Comment

by:pepr
ID: 24233701
Taking back my promises ;) I have just tried to store the page to b.html and to use BeautifulSoup. The truth is that it internally uses HTMLParser and reports the same error when trying to parse it. The snippet below produces:

C:\tmp\_Python\ScriberUK>b.py
Traceback (most recent call last):
  File "C:\tmp\_Python\ScriberUK\b.py", line 9, in <module>
    soup = BeautifulSoup(page)
  File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "C:\Python26\lib\site-packages\BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "C:\Python26\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python26\lib\HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "C:\Python26\lib\HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "C:\Python26\lib\HTMLParser.py", line 301, in check_for_whole_start_tag
    self.error("malformed start tag")
  File "C:\Python26\lib\HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: malformed start tag, at line 113, column 17

from BeautifulSoup import BeautifulSoup
 
# Get the content of your document (somehow) into one string.
f = open('b.html')
page = f.read()
f.close()
 
# Parse the string.
soup = BeautifulSoup(page)
 
for opt in soup.findAll('option', ['selected']):
    print opt.string

 
LVL 29

Accepted Solution

by:
pepr earned 1200 total points
ID: 24233721
However, you may be interested in the "HTML Tidy" application (http://tidy.sourceforge.net/), which is capable of fixing the page content. There is even a Python wrapper (http://utidylib.berlios.de/), but I have no experience with it.

(By the way, my code above fails at line 11 -- there is a bug in the second argument of .findAll(), which did not manifest itself because of the earlier problem at line 9. The second argument should be a dictionary such as {'selected': True}, not a list.)
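If the stray quote in '<option selected">' is the only defect on the page, a lighter alternative to running the whole page through Tidy might be to patch that one pattern before parsing. The following is only a rough sketch under that assumption. It is written for Python 3 (where the standard parser lives in html.parser), not for the Python 2.6 setup used in this thread, and the names repair_selected, SelectedOptions and selected_options are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Matches the one known defect: <option selected"> with a stray quote.
STRAY_QUOTE = re.compile(r'(<option\s+selected)"(?=\s*>)', re.IGNORECASE)


def repair_selected(markup):
    """Rewrite <option selected"> as <option selected="selected">."""
    return STRAY_QUOTE.sub(r'\1="selected"', markup)


class SelectedOptions(HTMLParser):
    """Collect the text of every <option> that has a selected attribute."""

    def __init__(self):
        super().__init__()
        self.selected = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        if tag == "option" and "selected" in dict(attrs):
            self._capturing = True

    def handle_endtag(self, tag):
        if tag == "option":
            self._capturing = False

    def handle_data(self, data):
        if self._capturing and data.strip():
            self.selected.append(data.strip())


def selected_options(markup):
    """Repair the known defect, parse, and return the selected option texts."""
    parser = SelectedOptions()
    parser.feed(repair_selected(markup))
    parser.close()
    return parser.selected


page = """<select name="type">
<option selected">Conference</option>
<option>All events</option>
<option>Conference</option>
</select>"""

print(selected_options(page))  # ['Conference']
```

This only handles the single defect shown in the question. A page with other kinds of broken markup would still need a real repair tool such as Tidy.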
LVL 8

Assisted Solution

by:LunarNRG
LunarNRG earned 300 total points
ID: 24234488
BeautifulSoup is an excellent tool and I highly recommend it. The reason for the failure is explained in great detail here:

http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

To summarize, the 3.1.x series of BeautifulSoup was released for compatibility with Python 3, and as such uses HTMLParser instead of SGMLParser, which was removed from the Python standard library starting with Python 3.0. Unfortunately, HTMLParser is not very good at handling malformed HTML.

So, as suggested in the article, if you're still using Python 2.6 or earlier you can continue with the 3.0.x series of BeautifulSoup (3.0.7a). Otherwise, you can try one of the other options listed in the article, of which the front runner seems to be html5lib.

Good luck!
 

Author Comment

by:ScriberUK
ID: 24236051
Thank you for your answers but I appear to be having a nightmare here...

BeautifulSoup 3.0.x and 3.1.x both fall over, and I cannot get html5lib to install under Windows. Has anyone installed html5lib?
 
LVL 8

Expert Comment

by:LunarNRG
ID: 24236085
I can imagine how 3.1.x fails, but what error messages do you receive with 3.0.x?

I have not installed html5lib, but could your problems on Windows be related to this issue? http://code.google.com/p/html5lib/issues/detail?id=72

The html5lib page recommends using the 0.12 version from Subversion; I am not sure whether you tried that.
 

Author Comment

by:ScriberUK
ID: 24237157
Thank you all. I've still not had any luck with BeautifulSoup or html5lib; however, µTidylib (http://utidylib.berlios.de/) does appear to fix the problem!

However, now I have another... how do I pass the document object result back into my script? Please see new question: http://www.experts-exchange.com/Programming/Languages/Scripting/Python/Q_24356467.html