Link to home
Start Free TrialLog in
Avatar of cpeters5
cpeters5

asked on

Python regular expression

I need to convert the following regex in Perl to Python

input:
my $string = "      <span class="name"><i class="genus">Corysanthes</i> <i class="species">grumula</i> <span class="authorship">D.L.Jones</span></span>"

I want to extract the value of genus (in this case, "Corysanthes") into a variable.
In Perl, I would write something like:

           $string =~ /class=\"genus\">([^<]+)<\/i>/;  
            my $genus = $1;


How do you write this in Python?
Thanks:


Python
ASKER CERTIFIED SOLUTION
Avatar of kaufmed
kaufmed
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of pepr
pepr

It may be better to use a XML parser that a part of Python distribution.
Avatar of cpeters5

ASKER

Thanks pepr.  I will take a look.  (Still very green, havn't gotten to the XML section yet.)    The files I am parsing are just HTML, they are not well formed.  Would this be a problem?
Take a look at BeautifulSoup. It deals with sloppy HTML well.
+1 for BeautifulSoup. Anyway, if you separate good HTML fragment, you can use the standard xml.etree.ElementTree:
#!python3

import xml.etree.ElementTree as ET

s = '      <span class="name"><i class="genus">Corysanthes</i> <i class="species">grumula</i> <span class="authorship">D.L.Jones</span></span>'

element = ET.fromstring(s)
ET.dump(element)

print('------------')

# To find the specific <i > element wherever it is.
genus = element.find('.//i[@class="genus"]')
print(genus.text)

# Similarly for the species.
species = element.find('./i[@class="species"]')
print(species.text)

print('------------')

# Looping through the structure. The `.attrib` is a dictionary
# of the element attributes; the `element` behaves as the list 
# of children
print(element.tag, element.attrib)
for e in element:
    print(e.tag, e.attrib['class'], e.text)

Open in new window

It prints on console:
c:\__Python\cpeters5\Q_28532295>a.py
<span class="name"><i class="genus">Corysanthes</i> <i class="species">grumula</
i> <span class="authorship">D.L.Jones</span></span>
------------
Corysanthes
grumula
------------
span {'class': 'name'}
i genus Corysanthes
i species grumula
span authorship D.L.Jones

Open in new window