ScriberUK asked:
Python HTMLParser, formatting issue
Quite simply, I have some Python 2.6 code (below) that parses an HTML file (also below).
The Python output displays as:
STAREAST Software Testing Analysis
&
Review
On three lines with odd spacing! Why doesn't it display on one line? How do I get it to display as "STAREAST Software Testing Analysis & Review"?
Many thanks!
== filename: test.py ==
from HTMLParser import HTMLParser
import urllib2
import formatter

class Parser(HTMLParser):
    inHeading = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.inHeading = True

    def handle_data(self, data):
        if self.inHeading:
            print data

    def handle_endtag(self, tag):
        if tag == "a":
            self.inHeading = False

hParser = Parser()
hParser.feed(open("test.htm", "r").read())
hParser.close()
== filename: test.htm ==
<a href=/seeconf.mv?q=ca1xi06x>STAREAST Software Testing Analysis & Review</a>
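The accepted answer is not reproduced in this thread. Judging from the follow-up comments, it gathered the text pieces in a list called self.data and joined them with ''.join(self.data). What follows is only a minimal sketch of that idea, not cxr's actual code. It assumes Python 3, where the module is html.parser (on Python 2.6 the import would be from HTMLParser import HTMLParser, as in the question). The class name LinkTextParser and the links attribute are made up for illustration.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects the text of each <a> element as one string (hypothetical name)."""

    def __init__(self):
        # convert_charrefs=False keeps the older behaviour in which a bare '&'
        # and character references are reported separately from the text.
        super().__init__(convert_charrefs=False)
        self.in_link = False
        self.data = []   # pieces of text seen inside the current <a>
        self.links = []  # finished link texts (illustrative attribute)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.data = []

    def handle_data(self, data):
        # May be called several times for one element, e.g. around a bare '&',
        # so store the piece instead of printing it straight away.
        if self.in_link:
            self.data.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
            # Glue the pieces back together with no separator.
            self.links.append("".join(self.data))


html_text = "<a href=/seeconf.mv?q=ca1xi06x>STAREAST Software Testing Analysis & Review</a>"
parser = LinkTextParser()
parser.feed(html_text)
parser.close()
for text in parser.links:
    print(text)
```

Printing only once the closing tag is seen means each link text comes out on a single line, however many pieces the parser delivered.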
ASKER
Thank you both very much!
cxr, your code works perfectly, but could you explain what ''.join(self.data) does and why 'anything'.join(self.data) does exactly the same thing despite the obvious difference?
Cheers,
'anything'.join(self.data) does not do the same thing as ''.join(self.data).
String objects have a join method that takes a sequence (for example a list) of strings as its parameter. The string itself is used as the separator when the items in the sequence are joined. For example:
'-'.join(['a','b','c'])
output: a-b-c
With the empty string '' the items in the sequence are joined without any separator:
''.join(['a','b','c'])
output: abc
ASKER
cxr thank you.
I probably didn't make myself clear enough... in my specific example, going back to your original answer, it doesn't matter if you make it 'anything'.join(self.data); in that example, why is that?
I would have expected:
"anythingSTAREAST Software Testing Analysis & Review"?
In your original question, your data is fetched in three parts:
STAREAST Software Testing Analysis
&
Review
...so with 'anything' as separator I would expect this:
"STAREAST Software Testing Analysis anything&anything Review"
When your data does not contain any & characters, it arrives in one piece, so the list has a single item. In that case the separator is never used:
'anything'.join(['This is a link text'])
output: This is a link text
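To make the difference concrete, here is a small sketch that uses the three pieces from the question as a plain list. The variable names are only for illustration, and the exact spaces at the edges of the pieces are assumed from the expected output quoted above.

```python
# The three pieces handle_data received for the link text in the question
# (leading/trailing spaces assumed).
pieces = ["STAREAST Software Testing Analysis ", "&", " Review"]

joined_plain = "".join(pieces)        # no separator between the pieces
joined_sep = "anything".join(pieces)  # separator goes between the pieces only

# A link text without '&' arrives as one piece, so there is nothing to separate.
single = "anything".join(["This is a link text"])

print(joined_plain)  # STAREAST Software Testing Analysis & Review
print(joined_sep)    # STAREAST Software Testing Analysis anything&anything Review
print(single)        # This is a link text
```

The separator is only ever placed between items, never before the first or after the last, which is why a one-item list comes back unchanged.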
As cxr wrote, the '&' character is special in HTML. If you want a literal '&' to appear in the output, you have to write it as an escape sequence in the HTML source. You can use a numeric character reference such as '&#38;' (http://www.w3.org/TR/html401/intro/sgmltut.html#h-3.2.3), or (better) the entity reference '&amp;' (http://www.w3.org/TR/html401/sgml/entities.html).
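Here is a short sketch of how the escaped forms reach the parser. It assumes Python 3's html.parser with convert_charrefs=False, which, like the Python 2 HTMLParser, reports references through handle_entityref and handle_charref. The class name EntityAwareParser and the sample markup are invented for the example.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class EntityAwareParser(HTMLParser):
    """Rebuilds element text, decoding references itself (hypothetical name)."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_entityref(self, name):
        # Called for named references such as &amp; (name is "amp").
        codepoint = name2codepoint.get(name)
        if codepoint is not None:
            self.pieces.append(chr(codepoint))
        else:
            # Unknown names are put back as they were written.
            self.pieces.append("&%s;" % name)

    def handle_charref(self, name):
        # Called for numeric references such as &#38; (decimal) or &#x26; (hex).
        if name.lower().startswith("x"):
            self.pieces.append(chr(int(name[1:], 16)))
        else:
            self.pieces.append(chr(int(name)))


parser = EntityAwareParser()
parser.feed("<a href=x>Analysis &amp; Review, Q&#38;A</a>")
parser.close()
text = "".join(parser.pieces)
print(text)  # Analysis & Review, Q&A
```

Either way the text still arrives in several pieces, so joining them is needed whether or not the source is escaped properly.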