Solved

Regular expression matching in Python

Posted on 2014-10-25
Last Modified: 2014-10-31
I am writing a Python script to fetch information from a list of web pages. The script loops over the URLs and fetches the content of one URL in each iteration. Each page contains a line from which I want to extract data. These lines differ slightly: some contain the hard space (&nbsp;) character and the × sign, which cause problems for the match. Here are two examples of the input lines:

1)  <i class="genus">Cypripedium</i> <i class="species">singchii</i> <span class="authorship">Z.J.Liu & L.J.Chen</span>  
2) <i class="genus">Cypripedium</i> <i class="specieshybrid">×</i>&nbsp;<i class="species">smithii</i> <span class="authorship">B.S.Williams</span>  

I want to extract the species and, if it exists, the × sign, into the species and genushybrid variables.

The following code works partially if I ignore the × sign:
import re

# 'line' is the current input line inside the loop.
w = re.search('(?<=genus">)([^<]+)', line)
genushybrid = ""
genus = w.group(0)

But I am not able to check whether the hard space and the × character exist.

Thanks
Question by:cpeters5
LVL 28

Expert Comment

by:pepr
ID: 40404308
You may eventually make it work, but regular expressions are not well suited to parsing HTML or XML once things get more complex. I suggest using the standard xml.etree.ElementTree module. Because you have an HTML fragment and ElementTree uses an XML parser, it does not know &nbsp; and similar HTML entities. See the comments in the code:
#!python3

import xml.etree.ElementTree as ET

# The test data.
lst = [
    '<i class="genus">Cypripedium</i> <i class="species">singchii</i> <span class="authorship">Z.J.Liu & L.J.Chen</span>',
    '<i class="genus">Cypripedium</i> <i class="specieshybrid">×</i>&nbsp;<i class="species">smithii</i> <span class="authorship">B.S.Williams</span>'
]

for s in lst:

    # The first line is malformed. The & should be &amp;
    s = s.replace(' & ', ' &amp; ')

    # ET.fromstring() uses an XML parser, which knows only the XML
    # entities, not the HTML ones. For that reason, the &nbsp; is
    # replaced by a normal space.
    s = s.replace('&nbsp;', ' ')
    
    # The ElementTree parser returns a single element object.
    # This requires wrapping the string in a pair of tags. The 'fragment'
    # may be replaced by whatever you like better.
    s = '<fragment>' + s + '</fragment>'
    root = ET.fromstring(s)
    ##ET.dump(root)

    # The ElementTree returns the parsed document as the element object
    # that can be used as a list of child elements. Here the root contains
    # the 'i' elements. 
    #    Each element has the .attrib member, which behaves as a dictionary
    # of the element attributes. It also has the .text attribute that
    # contains the text wrapped by the tag.
    
    # I assume you always have the 'i' with the genus class as the first one.
    assert root[0].attrib['class'] == 'genus'
    genus = root[0].text
    genushybrid = '-'   # init
    
    # Find the species. Here an XPath expression is used. The dot means
    # "from this element", the two slashes mean "at any depth below it",
    # then find the 'i' element with the class equal to 'species'.
    spec_el = root.find('.//i[@class="species"]')
    assert spec_el is not None
    species = spec_el.text
        
    # Test if the hybrid element exists and if it contains the mark.
    # Here the genushybrid is simply copied.
    hybrid_element = root.find('.//i[@class="specieshybrid"]')
    if hybrid_element is not None and hybrid_element.text == '×':
        genushybrid = genus
        
    # Print everything. You may want to change the logic.
    print('genus:       ', genus)
    print('genus hybrid:', genushybrid)
    print('species:     ', species)
    print('-------------------')


The code prints:
genus:        Cypripedium
genus hybrid: -
species:      singchii
-------------------
genus:        Cypripedium
genus hybrid: Cypripedium
species:      smithii
-------------------
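
The entity problem mentioned above can be seen in isolation. The following is only a minimal sketch: the helper name parse_fragment and the shortened sample string are made up for illustration. It shows that the XML parser rejects &nbsp;, and that the numeric reference &#160; is one way to keep the hard space instead of turning it into a normal space.

```python
import xml.etree.ElementTree as ET


def parse_fragment(text):
    """Wrap the text in a dummy <fragment> root and parse it as XML."""
    return ET.fromstring('<fragment>' + text + '</fragment>')


# Shortened sample (assumption: only the part around the hard space matters).
raw = '<i class="specieshybrid">×</i>&nbsp;<i class="species">smithii</i>'

# &nbsp; is an HTML entity, not one of the predefined XML entities,
# so the XML parser refuses the fragment.
try:
    parse_fragment(raw)
    entity_error = None
except ET.ParseError as exc:
    entity_error = exc
print('parse error:', entity_error)

# A numeric character reference is valid XML and preserves the hard space.
root = parse_fragment(raw.replace('&nbsp;', '&#160;'))
hybrid = root.find('.//i[@class="specieshybrid"]')

# Text that follows an element's closing tag is stored in its .tail.
print(repr(hybrid.text), repr(hybrid.tail))
```

Whether to keep the hard space or replace it with a normal space depends on what the rest of the script expects.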

 
LVL 45

Accepted Solution

by:
aikimark earned 500 total points
ID: 40404449
It isn't really an "x"; it is the multiplication sign (×), character code 215 (U+00D7). That value is outside the 7-bit ASCII range, so it has to be handled as a non-ASCII character.
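
A quick way to check what the two troublesome characters really are, as a small sketch using only the standard library (the variable names are just for illustration):

```python
import html
import unicodedata

times = '\u00d7'                  # the sign inside the specieshybrid element
nbsp = html.unescape('&nbsp;')    # the character the &nbsp; entity stands for

# Code point and official Unicode name of each character.
times_code = ord(times)
nbsp_code = ord(nbsp)
print(times_code, unicodedata.name(times))
print(nbsp_code, unicodedata.name(nbsp))

# Both values are above 127, so neither character is 7-bit ASCII.
both_non_ascii = times_code > 127 and nbsp_code > 127
print(both_non_ascii)
```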

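You should try this pattern (the lazy .*? sits inside the optional group; if it is placed in front of the group instead, the engine can satisfy the match by skipping the group, and the specieshybrid capture stays empty):
class="genus">([^<]+)</i>(?:.*?class="specieshybrid">([^<]*)</i>)?.*?class="species">([^<]+)</i>.*?class="authorship">([^<]+)</span>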
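
A minimal sketch of how such a pattern might be used from Python on the two sample lines from the question. The names ENTRY and parse_line are hypothetical. The optional specieshybrid group is written here with the lazy .*? inside it, so that the group is actually tried before the engine moves on to the species.

```python
import re

# Compiled once; groups are genus, hybrid mark (optional), species, authorship.
ENTRY = re.compile(
    r'class="genus">([^<]+)</i>'
    r'(?:.*?class="specieshybrid">([^<]*)</i>)?'
    r'.*?class="species">([^<]+)</i>'
    r'.*?class="authorship">([^<]+)</span>'
)


def parse_line(line):
    """Return (genus, hybrid_mark, species, authorship), or None if no match."""
    m = ENTRY.search(line)
    if m is None:
        return None
    genus, hybrid_mark, species, authorship = m.groups()
    # The optional group yields None when the hybrid element is absent.
    return genus, hybrid_mark or '', species, authorship


lines = [
    '<i class="genus">Cypripedium</i> <i class="species">singchii</i> '
    '<span class="authorship">Z.J.Liu & L.J.Chen</span>',
    '<i class="genus">Cypripedium</i> <i class="specieshybrid">×</i>&nbsp;'
    '<i class="species">smithii</i> <span class="authorship">B.S.Williams</span>',
]

for line in lines:
    print(parse_line(line))
```

The &nbsp; between the elements needs no special handling here, because the lazy .*? simply skips over it.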

 

Author Closing Comment

by:cpeters5
ID: 40416062
aikimark,
Thanks for the solution. I have a few more questions in separate posts.

 

Author Comment

by:cpeters5
ID: 40416069
pepr,
Thank you for your response. Unfortunately, I cannot get it to work cleanly. Besides, this problem is part of a much bigger script, and switching to an XML parser would require a lot of changes in a short time.
pax
 
LVL 45

Expert Comment

by:aikimark
ID: 40416075
"I have a few more questions in separate posts"
Please post links to your new questions in this thread.
 

Author Comment

by:cpeters5
ID: 40416763
 

Author Comment

by:cpeters5
ID: 40416890
