Solved

Regular expression matching in Python

Posted on 2014-10-25
7
178 Views
Last Modified: 2014-10-31
I am writing a Python script to fetch information from a list of web pages.  The script contains a loop, fetching the content of urls in each iteration.  Each URL contains a line which I want to extract data from.  These lines are slightly different, some may contain the hard space ( ) character and × sign which cause problem for the match.   Here are two examples if the input lines:

1)  <i class="genus">Cypripedium</i> <i class="species">singchii</i> <span class="authorship">Z.J.Liu & L.J.Chen</span>  
2) <i class="genus">Cypripedium</i> <i class="specieshybrid">×</i>&nbsp;<i class="species">smithii</i> <span class="authorship">B.S.Williams</span>  

I want to extract species and, if exists the × sign, into species and genushybrid variables.

The following code works partially if I ignore the × sign:
w = re.search('(?<=genus">)([^<]+)',line)
                  genushybrid = ""
                  genus = w.group(0)

But I am not able to check if the space character and the × character exists.

Thanks
0
Comment
Question by:cpeters5
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 4
  • 2
7 Comments
 
LVL 29

Expert Comment

by:pepr
ID: 40404308
You may finally make it work, but regular expressions are not that good for parsing HTML or XML when the things get more complex. I suggest to use the standard xml.etree.ElementTree module. As you use HTML fragment and the ElementTree uses XML parser, it does not know &nbsp; and the like entities. See the comments in the code:
#!python3

import xml.etree.ElementTree as ET

# The test data.
lst = [
    '<i class="genus">Cypripedium</i> <i class="species">singchii</i> <span class="authorship">Z.J.Liu & L.J.Chen</span>',
    '<i class="genus">Cypripedium</i> <i class="specieshybrid">×</i>&nbsp;<i class="species">smithii</i> <span class="authorship">B.S.Williams</span>'
]

for s in lst:

    # The first line is malformed. The & should be &amp;
    s = s.replace(' & ', ' &amp; ')

    # The XMLParser is to be used, and it knows only XML entities,
    # not the HTML ones. This way, I am going to replace the &nbsp;
    # by a normal space.
    s = s.replace('&nbsp;', ' ')
    
    # The ElementTree parser returns a single element object. 
    # This requires to wrap the string using tags. The 'fragment'
    # may be replace by whatever you like better.
    s = '<fragment>' + s + '</fragment>'
    root = ET.fromstring(s)
    ##ET.dump(root)

    # The ElementTree returns the parsed document as the element object
    # that can be used as a list of child elements. Here the root contains
    # the 'i' elements. 
    #    Each element has .attrib member, which behaves as a dictionary
    # of the element attributes. It also has the .text element that
    # contains the text wrapped by the tag.
    
    # I assume you always have the 'i' with the genus class as the first one.
    assert root[0].attrib['class'] == 'genus'
    genus = root[0].text
    genushybrid = '-'   # init
    
    # Find the species. Here XPath expression is used. The dot means
    # from this element, the two slashes means elsewhere, then find 
    # the 'i' element with the class equal to 'species'.
    spec_el = root.find('.//i[@class="species"]')
    assert spec_el is not None
    species = spec_el.text
        
    # Test if the hybrid element exists and if it contains the mark.
    # Here the genushybrid is simply copied.
    hybrid_element = root.find('.//i[@class="specieshybrid"]')
    if hybrid_element is not None and hybrid_element.text == '×':
        genushybrid = genus
        
    # Print everything. You may want to change the logic.
    print('genus:       ', genus)
    print('genus hybrid:', genushybrid)
    print('species:     ', species)
    print('-------------------')

Open in new window

The code prints:
genus:        Cypripedium
genus hybrid: -
species:      singchii
-------------------
genus:        Cypripedium
genus hybrid: Cypripedium
species:      smithii
-------------------

Open in new window

0
 
LVL 45

Accepted Solution

by:
aikimark earned 500 total points
ID: 40404449
It isn't really an "x" as much as it is a place holder for the browser render engine.  It is a a 215 ASCII value.

You should try this pattern:
class="genus">([^<]+)</i>.*?(?:class="specieshybrid">([^<]*)</i>){0,1}.*?class="species">([^<]+)</i>.*?class="authorship">([^<]+)</span>

Open in new window

0
 

Author Closing Comment

by:cpeters5
ID: 40416062
aikimark,
Thanks for the solution.  I have a few more questions in separate posts
0
Free Tool: ZipGrep

ZipGrep is a utility that can list and search zip (.war, .ear, .jar, etc) archives for text patterns, without the need to extract the archive's contents.

One of a set of tools we're offering as a way to say thank you for being a part of the community.

 

Author Comment

by:cpeters5
ID: 40416069
pepr,
Thank you for your response.  Unfortunately, I cannot get it to work cleanly. Besides, this problem is a part of a much bigger scripts.  It will require a lot of changes in the script to switch to XML parser in a short time.
pax
0
 
LVL 45

Expert Comment

by:aikimark
ID: 40416075
I have a few more questions in separate posts
Please post links to your new questions in this thread
0

Featured Post

Enroll in June's Course of the Month

June’s Course of the Month is now available! Experts Exchange’s Premium Members, Team Accounts, and Qualified Experts have access to a complimentary course each month as part of their membership—an extra way to sharpen your skills and increase training.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Less strange, but still introduction This introduction was added (1st August, 2011) to reflect some reactions.  Firstly, the term basics in the title of the article...  As any other word, it is a symbol with meaning attached to the word by some a…
This article will show the steps for installing Python on Ubuntu Operating System. I have created a virtual machine with Ubuntu Operating system 8.10 and this installing process also works with upgraded version of Ubuntu OS. For installing Py…
Learn the basics of lists in Python. Lists, as their name suggests, are a means for ordering and storing values. : Lists are declared using brackets; for example: t = [1, 2, 3]: Lists may contain a mix of data types; for example: t = ['string', 1, T…
Learn the basics of while and for loops in Python.  while loops are used for testing while, or until, a condition is met: The structure of a while loop is as follows:     while <condition>:         do something         repeate: The break statement m…

734 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question