cpeters5
asked on
Python regular expression
I need to convert the following Perl regex to Python.
Input:
my $string = ' <span class="name"><i class="genus">Corysanthes</i> <i class="species">grumula</i> <span class="authorship">D.L.Jones</span></span>';
I want to extract the value of genus (in this case, "Corysanthes") into a variable.
In Perl, I would write something like:
$string =~ /class=\"genus\">([^<]+)<\/i>/;
my $genus = $1;
How do you write this in Python?
Thanks.
Python
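A minimal sketch of a direct translation using the standard `re` module, assuming the input string is the fragment from the question. The variable names `s`, `match` and `genus` are only illustrative. `re.search` returns `None` when nothing matches, so the result is checked before the group is read.

```python
import re

# The fragment from the question (assumed to be already loaded into a string).
s = ' <span class="name"><i class="genus">Corysanthes</i> <i class="species">grumula</i> <span class="authorship">D.L.Jones</span></span>'

# Same pattern as the Perl version: capture everything up to the next '<'.
# A raw string keeps the pattern readable; '/' needs no escaping in Python.
match = re.search(r'class="genus">([^<]+)</i>', s)

# Perl's $1 corresponds to match.group(1); guard against a failed match.
genus = match.group(1) if match else None
print(genus)
```

This works for fragments that look exactly like the sample, but a regex like this is sensitive to attribute order, extra whitespace and nested tags.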
It may be better to use an XML parser that is part of the Python distribution.
ASKER
Thanks pepr. I will take a look. (Still very green, haven't gotten to the XML section yet.) The files I am parsing are just HTML, and they are not well formed. Would this be a problem?
Take a look at BeautifulSoup. It deals with sloppy HTML well.
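For illustration, a short sketch of how that could look. It assumes the third-party `beautifulsoup4` package is installed (it is not part of the standard library), and the sample string `sloppy` is a made-up variant of the asker's fragment with the closing `</span>` missing.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical sloppy input: the outer <span> is never closed.
sloppy = '<span class="name"><i class="genus">Corysanthes</i> <i class="species">grumula</i> <span class="authorship">D.L.Jones</span>'

# "html.parser" selects Python's built-in tolerant HTML parser as the backend.
soup = BeautifulSoup(sloppy, "html.parser")

# class_ (with trailing underscore) filters on the HTML class attribute.
genus_tag = soup.find("i", class_="genus")
species_tag = soup.find("i", class_="species")

# find() returns None when no element matches, so check before reading text.
genus = genus_tag.get_text(strip=True) if genus_tag else None
species = species_tag.get_text(strip=True) if species_tag else None
print(genus, species)
```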
+1 for BeautifulSoup. Anyway, if you can isolate a well-formed HTML fragment, you can use the standard xml.etree.ElementTree:
#!python3
import xml.etree.ElementTree as ET
s = ' <span class="name"><i class="genus">Corysanthes</i> <i class="species">grumula</i> <span class="authorship">D.L.Jones</span></span>'
element = ET.fromstring(s)
ET.dump(element)
print('------------')
# To find the specific <i > element wherever it is.
genus = element.find('.//i[@class="genus"]')
print(genus.text)
# Similarly for the species.
species = element.find('./i[@class="species"]')
print(species.text)
print('------------')
# Looping through the structure. The `.attrib` is a dictionary
# of the element attributes; the `element` behaves as the list
# of children
print(element.tag, element.attrib)
for e in element:
    print(e.tag, e.attrib['class'], e.text)
It prints on console:
c:\__Python\cpeters5\Q_28532295>a.py
<span class="name"><i class="genus">Corysanthes</i> <i class="species">grumula</i> <span class="authorship">D.L.Jones</span></span>
------------
Corysanthes
grumula
------------
span {'class': 'name'}
i genus Corysanthes
i species grumula
span authorship D.L.Jones
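Regarding the question about HTML that is not well formed: `ET.fromstring` raises a `ParseError` on such input. If you want to stay within the standard library, a tolerant alternative is `html.parser.HTMLParser`. The following is only a rough sketch; the class name `ClassTextCollector` and the sample string `sloppy` (the fragment with its final `</span>` missing) are made up for illustration.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text inside each <i class="..."> element, keyed by class."""

    def __init__(self):
        super().__init__()
        self.found = {}       # class name -> collected text
        self._current = None  # class of the <i> element we are inside, if any

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "i":
            self._current = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag == "i":
            self._current = None

    def handle_data(self, data):
        # Only keep text that appears inside an <i> with a class attribute.
        if self._current is not None:
            self.found[self._current] = self.found.get(self._current, "") + data


# Hypothetical input that is not well formed: the outer <span> is never closed.
sloppy = '<span class="name"><i class="genus">Corysanthes</i> <i class="species">grumula</i> <span class="authorship">D.L.Jones</span>'

parser = ClassTextCollector()
parser.feed(sloppy)
parser.close()
print(parser.found.get("genus"), parser.found.get("species"))
```

The parser does not complain about the missing closing tag; it simply reports the tags and text it sees in order.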