Python regular expression

I need to convert the following Perl regex to Python.

input:
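my $string = '<span class="name"><i class="genus">Corysanthes</i> <i class="species">grumula</i> <span class="authorship">D.L.Jones</span></span>';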

I want to extract the value of genus (in this case, "Corysanthes") into a variable.
In Perl, I would write something like:

$string =~ /class=\"genus\">([^<]+)<\/i>/;
my $genus = $1;

How do you write this in Python?
Thanks.
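
A minimal sketch of a direct translation with Python's standard `re` module. It assumes the input is the HTML fragment quoted in pepr's answer further down; the variable names `s`, `match`, and `genus` are illustrative only.

```python
import re

# Assumed input: the markup quoted later in this thread.
s = '<span class="name"><i class="genus">Corysanthes</i> <i class="species">grumula</i> <span class="authorship">D.L.Jones</span></span>'

# Same pattern as the Perl version. A raw string keeps backslashes literal,
# and neither the double quotes nor the slash need escaping here.
match = re.search(r'class="genus">([^<]+)</i>', s)

# Unlike Perl's $1, the captured group lives on the match object,
# and re.search returns None when nothing matches.
genus = match.group(1) if match else None
print(genus)
```

Checking for `None` before calling `.group(1)` avoids an `AttributeError` on lines that do not contain a genus.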

Python

pepr

It may be better to use an XML parser that is part of the Python distribution.

cpeters5

ASKER

Thanks pepr. I will take a look. (Still very green, haven't gotten to the XML section yet.) The files I am parsing are just HTML, and they are not well formed. Would this be a problem?

kaufmed

Take a look at BeautifulSoup. It deals with sloppy HTML well.
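
If installing a third-party package is not an option, the standard library's `html.parser` also tolerates markup that is not well formed. The following is a minimal sketch of that alternative, not BeautifulSoup itself; the class name `ClassTextParser` and the sample string are assumptions made for illustration.

```python
from html.parser import HTMLParser


class ClassTextParser(HTMLParser):
    """Collect the text of every <i> element, keyed by its class attribute."""

    def __init__(self):
        super().__init__()
        self.found = {}        # class name -> collected text
        self._current = None   # class of the <i> element being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'i':
            self._current = dict(attrs).get('class')

    def handle_data(self, data):
        if self._current is not None:
            self.found[self._current] = self.found.get(self._current, '') + data

    def handle_endtag(self, tag):
        if tag == 'i':
            self._current = None


# Deliberately malformed: the outer <span> is never closed.
html = '<span class="name"><i class="genus">Corysanthes</i> <i class="species">grumula</i><br>'

parser = ClassTextParser()
parser.feed(html)
parser.close()
print(parser.found.get('genus'))
print(parser.found.get('species'))
```

The parser is event driven and never builds a tree, so an unclosed tag does not raise an error the way a strict XML parser would.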

pepr

+1 for BeautifulSoup. Anyway, if you can isolate a well-formed HTML fragment, you can use the standard xml.etree.ElementTree:

#!python3

import xml.etree.ElementTree as ET

s = '      <span class="name"><i class="genus">Corysanthes</i> <i class="species">grumula</i> <span class="authorship">D.L.Jones</span></span>'

element = ET.fromstring(s)
ET.dump(element)

print('------------')

# To find the specific <i > element wherever it is.
genus = element.find('.//i[@class="genus"]')
print(genus.text)

# Similarly for the species.
species = element.find('./i[@class="species"]')
print(species.text)

print('------------')

# Looping through the structure. `.attrib` is a dictionary
# of the element's attributes; iterating over `element`
# yields its direct children.
print(element.tag, element.attrib)
for e in element:
    print(e.tag, e.attrib['class'], e.text)


It prints to the console:

c:\__Python\cpeters5\Q_28532295>a.py
<span class="name"><i class="genus">Corysanthes</i> <i class="species">grumula</i> <span class="authorship">D.L.Jones</span></span>
------------
Corysanthes
grumula
------------
span {'class': 'name'}
i genus Corysanthes
i species grumula
span authorship D.L.Jones
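
The ElementTree approach above depends on the fragment being well formed. A minimal sketch of guarding against fragments that are not, assuming it is acceptable to skip them; `extract_genus` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def extract_genus(fragment):
    """Return the genus text, or None if the fragment is not well-formed
    XML or contains no <i class="genus"> element."""
    try:
        element = ET.fromstring(fragment)
    except ET.ParseError:
        return None
    genus = element.find('.//i[@class="genus"]')
    return genus.text if genus is not None else None


good = '<span class="name"><i class="genus">Corysanthes</i></span>'
bad = '<span class="name"><i class="genus">Corysanthes</i>'  # <span> never closed

print(extract_genus(good))
print(extract_genus(bad))
```

Fragments that come back as `None` can then be handed to a more lenient HTML parser.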
