Solved

Regular expression matching in Python

Posted on 2014-10-25
Last Modified: 2014-10-31
I am writing a Python script to fetch information from a list of web pages.  The script loops over the URLs and fetches the content of one URL in each iteration.  Each page contains a line from which I want to extract data.  These lines differ slightly: some contain the hard space (&nbsp;) character and the × sign, which cause problems for the match.  Here are two examples of the input lines:

1)  <i class="genus">Cypripedium</i> <i class="species">singchii</i> <span class="authorship">Z.J.Liu & L.J.Chen</span>  
2) <i class="genus">Cypripedium</i> <i class="specieshybrid">×</i>&nbsp;<i class="species">smithii</i> <span class="authorship">B.S.Williams</span>  

I want to extract the species and, if it exists, the × sign into the species and genushybrid variables.

The following code works partially if I ignore the × sign:
import re

w = re.search('(?<=genus">)([^<]+)', line)
genushybrid = ""
genus = w.group(0)

But I am not able to check whether the space character and the × character exist.

Thanks
Question by:cpeters5
7 Comments
 
Expert Comment

by:pepr
ID: 40404308
You may eventually get it to work, but regular expressions are not well suited to parsing HTML or XML once things get more complex. I suggest using the standard xml.etree.ElementTree module. Because your input is an HTML fragment and ElementTree uses an XML parser, it does not recognize &nbsp; and similar HTML entities. See the comments in the code:
#!python3

import xml.etree.ElementTree as ET

# The test data.
lst = [
    '<i class="genus">Cypripedium</i> <i class="species">singchii</i> <span class="authorship">Z.J.Liu & L.J.Chen</span>',
    '<i class="genus">Cypripedium</i> <i class="specieshybrid">×</i>&nbsp;<i class="species">smithii</i> <span class="authorship">B.S.Williams</span>'
]

for s in lst:

    # The first line is malformed. The & should be &amp;
    s = s.replace(' & ', ' &amp; ')

    # The XML parser is used here, and it knows only the XML entities,
    # not the HTML ones. Therefore, the &nbsp; is replaced
    # by a normal space.
    s = s.replace('&nbsp;', ' ')
    
    # The ElementTree parser returns a single element object.
    # This requires wrapping the string in a pair of tags. The name
    # 'fragment' may be replaced by whatever you like better.
    s = '<fragment>' + s + '</fragment>'
    root = ET.fromstring(s)
    ##ET.dump(root)

    # ElementTree returns the parsed document as an element object
    # that can be used like a list of its child elements. Here the root
    # contains the 'i' and 'span' elements.
    #    Each element has an .attrib member, which behaves as a dictionary
    # of the element attributes. It also has a .text attribute that
    # contains the text wrapped by the tag.
    
    # I assume you always have the 'i' with the genus class as the first one.
    assert root[0].attrib['class'] == 'genus'
    genus = root[0].text
    genushybrid = '-'   # init
    
    # Find the species. Here an XPath expression is used. The dot means
    # from this element, the two slashes mean anywhere below it, then find
    # the 'i' element with the class equal to 'species'.
    spec_el = root.find('.//i[@class="species"]')
    assert spec_el is not None
    species = spec_el.text
        
    # Test if the hybrid element exists and if it contains the mark.
    # Here the genushybrid is simply copied.
    hybrid_element = root.find('.//i[@class="specieshybrid"]')
    if hybrid_element is not None and hybrid_element.text == '×':
        genushybrid = genus
        
    # Print everything. You may want to change the logic.
    print('genus:       ', genus)
    print('genus hybrid:', genushybrid)
    print('species:     ', species)
    print('-------------------')


The code prints:
genus:        Cypripedium
genus hybrid: -
species:      singchii
-------------------
genus:        Cypripedium
genus hybrid: Cypripedium
species:      smithii
-------------------
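
As an alternative to pre-processing the string for the XML parser, the standard html.parser module accepts HTML fragments directly, including &nbsp; and a bare &. The following is only a minimal sketch of that swapped-in technique, not part of the code above; the names TaxonParser, parse_fragment and samples are made up for illustration, and it assumes each value sits directly inside an element that carries a class attribute.

```python
from html.parser import HTMLParser


class TaxonParser(HTMLParser):
    """Collect the text of each element, keyed by its class attribute."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.fields = {}
        self._current = None  # class of the element currently being read

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        self._current = dict(attrs).get('class')

    def handle_endtag(self, tag):
        self._current = None

    def handle_data(self, data):
        # Text between elements (spaces, the hard space) is ignored.
        if self._current is not None:
            self.fields[self._current] = self.fields.get(self._current, '') + data


def parse_fragment(fragment):
    """Hypothetical helper: return a dict mapping class name to text."""
    parser = TaxonParser()
    parser.feed(fragment)
    parser.close()
    return parser.fields


samples = [
    '<i class="genus">Cypripedium</i> <i class="species">singchii</i> '
    '<span class="authorship">Z.J.Liu & L.J.Chen</span>',
    '<i class="genus">Cypripedium</i> <i class="specieshybrid">×</i>&nbsp;'
    '<i class="species">smithii</i> <span class="authorship">B.S.Williams</span>',
]

for s in samples:
    fields = parse_fragment(s)
    # A missing 'specieshybrid' key means the line has no hybrid mark.
    print(fields.get('genus'), fields.get('specieshybrid', '-'), fields.get('species'))
```

With this approach neither the & in the authorship nor the &nbsp; needs to be replaced beforehand, at the cost of writing a small parser class.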

Accepted Solution

by:aikimark
ID: 40404449
It isn't really an "x".  It is the multiplication sign, character code 215 (U+00D7), which lies outside the ASCII range.
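
To see which characters are actually involved, they can be inspected from Python. This is a small sketch using only the standard library; the variable names are illustrative.

```python
import html
import unicodedata

times = '×'                       # the sign used in the hybrid lines
print(ord(times))                 # numeric code point of the sign
print(unicodedata.name(times))    # official Unicode name of the character
print(times == 'x')               # compare with the plain letter x

# The &nbsp; entity stands for the no-break space character.
nbsp = html.unescape('&nbsp;')
print(ord(nbsp), unicodedata.name(nbsp))
```

The comparison with the letter x is the practical point: a pattern that looks for "x" will not find the × sign.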

You should try this pattern:
class="genus">([^<]+)</i>(?:.*?class="specieshybrid">([^<]*)</i>){0,1}.*?class="species">([^<]+)</i>.*?class="authorship">([^<]+)</span>
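
A pattern of this kind could be applied with re.search roughly as follows. This is a sketch under assumptions, not the thread's tested code: the helper name parse_line and the dictionary keys are made up for illustration, and the lazy .*? is placed inside the optional hybrid group so that the group gets a chance to match before the species part is searched for.

```python
import re

# Same building blocks as the pattern, split over several lines for readability.
PATTERN = re.compile(
    r'class="genus">([^<]+)</i>'
    r'(?:.*?class="specieshybrid">([^<]*)</i>)?'   # optional hybrid mark
    r'.*?class="species">([^<]+)</i>'
    r'.*?class="authorship">([^<]+)</span>'
)


def parse_line(line):
    """Hypothetical helper: return the captured fields, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    genus, hybrid, species, authorship = m.groups()
    return {
        'genus': genus,
        'genushybrid': hybrid or '',   # the group is None when the mark is absent
        'species': species,
        'authorship': authorship,
    }


lines = [
    '<i class="genus">Cypripedium</i> <i class="species">singchii</i> '
    '<span class="authorship">Z.J.Liu & L.J.Chen</span>',
    '<i class="genus">Cypripedium</i> <i class="specieshybrid">×</i>&nbsp;'
    '<i class="species">smithii</i> <span class="authorship">B.S.Williams</span>',
]

for line in lines:
    print(parse_line(line))
```

Because the hard space sits between two tags, the lazy .*? skips over it and it never has to be matched explicitly.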

Author Closing Comment

by:cpeters5
ID: 40416062
aikimark,
Thanks for the solution.  I have a few more questions in separate posts.

Author Comment

by:cpeters5
ID: 40416069
pepr,
Thank you for your response.  Unfortunately, I cannot get it to work cleanly. Besides, this problem is part of a much bigger script, and switching to an XML parser would require a lot of changes at short notice.
pax

Expert Comment

by:aikimark
ID: 40416075
"I have a few more questions in separate posts"
Please post links to your new questions in this thread.

Author Comment

by:cpeters5
ID: 40416763

Author Comment

by:cpeters5
ID: 40416890