Python regular expression

Posted on 2014-10-06
Medium Priority
Last Modified: 2014-10-06
I need to convert the following regex in Perl to Python

my $string = "      <span class="name"><i class="genus">Corysanthes</i> <i class="species">grumula</i> <span class="authorship">D.L.Jones</span></span>"

I want to extract the value of genus (in this case, "Corysanthes") into a variable.
In Perl, I would write something like:

           $string =~ /class=\"genus\">([^<]+)<\/i>/;  
            my $genus = $1;

How do you write this in Python?

Question by:cpeters5
  • 2
  • 2
LVL 75

Accepted Solution

käµfm³d   👽 earned 2000 total points
ID: 40364537
One approach:

import re

regex = re.compile('class=\"genus\">([^<]+)</i>')
string = '      <span class="name"><i class="genus">Corysanthes</i> <i class="species">grumula</i> <span class="authorship">D.L.Jones</span></span>'
match = regex.search(string)
genus = match.group(1)

Open in new window

LVL 29

Expert Comment

ID: 40364734
It may be better to use a XML parser that a part of Python distribution.

Author Comment

ID: 40364813
Thanks pepr.  I will take a look.  (Still very green, havn't gotten to the XML section yet.)    The files I am parsing are just HTML, they are not well formed.  Would this be a problem?
LVL 75

Expert Comment

by:käµfm³d 👽
ID: 40365080
Take a look at BeautifulSoup. It deals with sloppy HTML well.
LVL 29

Expert Comment

ID: 40365316
+1 for BeautifulSoup. Anyway, if you separate good HTML fragment, you can use the standard xml.etree.ElementTree:

import xml.etree.ElementTree as ET

s = '      <span class="name"><i class="genus">Corysanthes</i> <i class="species">grumula</i> <span class="authorship">D.L.Jones</span></span>'

element = ET.fromstring(s)


# To find the specific <i > element wherever it is.
genus = element.find('.//i[@class="genus"]')

# Similarly for the species.
species = element.find('./i[@class="species"]')


# Looping through the structure. The `.attrib` is a dictionary
# of the element attributes; the `element` behaves as the list 
# of children
print(element.tag, element.attrib)
for e in element:
    print(e.tag, e.attrib['class'], e.text)

Open in new window

It prints on console:
<span class="name"><i class="genus">Corysanthes</i> <i class="species">grumula</
i> <span class="authorship">D.L.Jones</span></span>
span {'class': 'name'}
i genus Corysanthes
i species grumula
span authorship D.L.Jones

Open in new window


Featured Post

Free Tool: ZipGrep

ZipGrep is a utility that can list and search zip (.war, .ear, .jar, etc) archives for text patterns, without the need to extract the archive's contents.

One of a set of tools we're offering as a way to say thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Flask is a microframework for Python based on Werkzeug and Jinja 2. This requires you to have a good understanding of Python 2.7. Lets install Flask! To install Flask you can use a python repository for libraries tool called pip. Download this f…
Sequence is something that used to store data in it in very simple words. Let us just create a list first. To create a list first of all we need to give a name to our list which I have taken as “COURSE” followed by equals sign and finally enclosed …
Learn the basics of if, else, and elif statements in Python 2.7. Use "if" statements to test a specified condition.: The structure of an if statement is as follows: (CODE) Use "else" statements to allow the execution of an alternative, if the …
Learn the basics of modules and packages in Python. Every Python file is a module, ending in the suffix: .py: Modules are a collection of functions and variables.: Packages are a collection of modules.: Module functions and variables are accessed us…
Suggested Courses

621 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question