
Python HTMLParser, formatting issue

Quite simply, I have some Python 2.6 code (below) that parses an HTML file (also below)...

The python output displays as:

STAREAST Software Testing Analysis
&
 Review

On three lines with odd spacing! Why doesn't it display on one line? How do I get it to display as "STAREAST Software Testing Analysis & Review"?

Many thanks!


== filename: test.py ==
 
from HTMLParser import HTMLParser
import urllib2
import formatter
 
class Parser(HTMLParser):
  inHeading = False
 
  def handle_starttag(self, tag, attrs):
    if tag == "a":
      self.inHeading = True
      
  def handle_data(self, data):
    if self.inHeading:
      print data
 
  def handle_endtag(self, tag):
    if tag =="a":
      self.inHeading = False
 
hParser = Parser()
hParser.feed(open("test.htm", "r").read())
hParser.close()
 
== filename: test.htm ==
 
<a href=/seeconf.mv?q=ca1xi06x>STAREAST Software Testing Analysis & Review</a>


Asked by: ScriberUK
 
Roger Baklund commented:
The & is a special character in HTML, so the parser splits the content at it and passes the & to handle_data as a separate chunk. Try this:
class Parser(HTMLParser):
  inHeading = False
 
  def handle_starttag(self, tag, attrs):
    if tag == "a":
      self.inHeading = True
      self.data = []
      
  def handle_data(self, data):
    if self.inHeading:
      self.data.append(data)
 
  def handle_endtag(self, tag):
    if tag =="a":
      self.inHeading = False
      print ''.join(self.data)
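
For anyone reading this on Python 3, here is a rough sketch of the same idea that returns the link texts instead of printing them. It rests on a few assumptions that are not from the thread: the module is html.parser rather than HTMLParser, convert_charrefs=False is passed so that a bare & still arrives as its own chunk as in 2.6, and the names LinkTextParser and link_texts are made up for illustration.

```python
from html.parser import HTMLParser  # Python 2.6 in the thread: from HTMLParser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects the text of each <a> element (hypothetical helper name)."""

    def __init__(self):
        # Assumption: convert_charrefs=False keeps the older chunked delivery,
        # where a bare "&" reaches handle_data() as a separate piece.
        super().__init__(convert_charrefs=False)
        self.in_link = False
        self.chunks = []  # pieces of the link text currently being read
        self.links = []   # completed link texts

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True
            self.chunks = []

    def handle_data(self, data):
        if self.in_link:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link = False
            # Join the pieces with no separator to rebuild the original text.
            self.links.append("".join(self.chunks))


def link_texts(html_source):
    """Return the text of every link found in html_source."""
    parser = LinkTextParser()
    parser.feed(html_source)
    parser.close()
    return parser.links


sample = '<a href="/seeconf.mv?q=ca1xi06x">STAREAST Software Testing Analysis & Review</a>'
print(link_texts(sample))
```

Collecting the results in a list rather than printing inside handle_endtag makes the parser easier to reuse and to check.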

 
pepr commented:
Firstly, it is always a good idea to quote attribute values -- here href="/seeconf.mv?q=ca1xi06x"

As cxr wrote, the '&' character is a special one. If you want to see it in the output, you have to replace it with an escape sequence. You can use a numeric character reference (http://www.w3.org/TR/html401/intro/sgmltut.html#h-3.2.3) or, better, the entity reference '&amp;' (http://www.w3.org/TR/html401/sgml/entities.html).
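
To make the two replacements concrete, here is a small sketch. It assumes Python 3, whose standard html module provides escape and unescape; the thread itself uses Python 2.6, where this module is not available.

```python
import html  # Python 3 standard library (assumption: not the 2.6 used in the thread)

title = "STAREAST Software Testing Analysis & Review"

# Escape the text before writing it into an HTML file.
escaped = html.escape(title)
print(escaped)  # STAREAST Software Testing Analysis &amp; Review

# The entity reference and the numeric reference both decode back to "&".
print(html.unescape(escaped))
print(html.unescape("Analysis &#38; Review"))
```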

 
ScriberUK (author) commented:
Thank you both very much!

cxr, your code works perfectly, but could you explain what ''.join(self.data) does and why 'anything'.join(self.data) does exactly the same thing despite the obvious difference?

Cheers,

 
Roger Baklund commented:
'anything'.join(self.data) does not do the same thing as ''.join(self.data).

String objects have a join method which takes a sequence (such as a list) as a parameter. The string is used as a separator when the items in the sequence are joined. For example:

'-'.join(['a','b','c'])

output: a-b-c

Using the empty string '', the items in the sequence are joined together without any separator:

''.join(['a','b','c'])

output: abc
 
ScriberUK (author) commented:
cxr thank you.

I probably didn't make myself clear enough... in my specific example, going back to your original answer, it doesn't matter if you make it 'anything'.join(self.data); in that example, why is that?

I would have expected:

"anythingSTAREAST Software Testing Analysis & Review"?
 
Roger Baklund commented:
In your original question, your data is fetched in three parts:

STAREAST Software Testing Analysis
&
 Review

...so with 'anything' as separator I would expect this:

"STAREAST Software Testing Analysis anything&anything Review"

When your data does not contain any & characters, it is fetched as a list with a single item. In that case the separator is not used:

'anything'.join(['This is a link text'])

output: This is a link text
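
Both cases can be checked directly, without the parser. In this sketch the list parts stands in for what handle_data collected; the exact whitespace in each chunk (the trailing space after "Analysis", the leading space before "Review") is an assumption based on the output shown above.

```python
# The three chunks assumed to have been collected for the link in the question.
parts = ["STAREAST Software Testing Analysis ", "&", " Review"]

print("".join(parts))          # STAREAST Software Testing Analysis & Review
print("anything".join(parts))  # STAREAST Software Testing Analysis anything&anything Review

# The separator only goes between items, so with one item it never appears.
single = ["This is a link text"]
print("anything".join(single))  # This is a link text
```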