Solved

Regular expression matching in Python

Posted on 2014-10-25
Last Modified: 2014-10-31
I am writing a Python script to fetch information from a list of web pages. The script loops over the URLs and fetches the content of one URL per iteration. Each page contains a line that I want to extract data from. These lines differ slightly: some contain the hard-space entity (&nbsp;) and the × sign, which cause problems for the match. Here are two examples of the input lines:

1)  <i class="genus">Cypripedium</i> <i class="species">singchii</i> <span class="authorship">Z.J.Liu & L.J.Chen</span>  
2) <i class="genus">Cypripedium</i> <i class="specieshybrid">×</i>&nbsp;<i class="species">smithii</i> <span class="authorship">B.S.Williams</span>  

I want to extract the species and, if the × sign is present, the sign itself, into the species and genushybrid variables.

The following code works partially if I ignore the × sign:
import re

# 'line' is the current input line inside the fetch loop.
w = re.search(r'(?<=genus">)([^<]+)', line)
genushybrid = ""
genus = w.group(0)

But I cannot work out how to check whether the hard space and the × character are present.

Thanks
Question by:cpeters5
 
LVL 29

Expert Comment

by:pepr
ID: 40404308
You may eventually get it to work, but regular expressions are not well suited to parsing HTML or XML once things get more complex. I suggest using the standard xml.etree.ElementTree module. Because you have an HTML fragment and ElementTree uses an XML parser, it does not know &nbsp; and similar HTML entities. See the comments in the code:
#!python3

import xml.etree.ElementTree as ET

# The test data.
lst = [
    '<i class="genus">Cypripedium</i> <i class="species">singchii</i> <span class="authorship">Z.J.Liu & L.J.Chen</span>',
    '<i class="genus">Cypripedium</i> <i class="specieshybrid">×</i>&nbsp;<i class="species">smithii</i> <span class="authorship">B.S.Williams</span>'
]

for s in lst:

    # The first line is malformed. The & should be &amp;
    s = s.replace(' & ', ' &amp; ')

    # ElementTree uses an XML parser, which knows only the XML entities,
    # not the HTML ones. For that reason the &nbsp; is replaced
    # by a normal space.
    s = s.replace('&nbsp;', ' ')

    # The ElementTree parser returns a single element object.
    # This requires wrapping the string in a pair of tags. The name
    # 'fragment' may be replaced with whatever you like better.
    s = '<fragment>' + s + '</fragment>'
    root = ET.fromstring(s)
    ##ET.dump(root)

    # ElementTree returns the parsed document as an element object
    # that can be used as a list of its child elements. Here the root
    # contains the 'i' and 'span' elements.
    #    Each element has an .attrib member, which behaves as a dictionary
    # of the element attributes. It also has a .text attribute that
    # contains the text wrapped by the tag.
    
    # I assume you always have the 'i' with the genus class as the first one.
    assert root[0].attrib['class'] == 'genus'
    genus = root[0].text
    genushybrid = '-'   # init
    
    # Find the species. Here an XPath expression is used. The dot means
    # "from this element", the two slashes mean "anywhere below it",
    # then find the 'i' element with the class equal to 'species'.
    spec_el = root.find('.//i[@class="species"]')
    assert spec_el is not None
    species = spec_el.text
        
    # Test if the hybrid element exists and if it contains the mark.
    # Here the genushybrid is simply copied.
    hybrid_element = root.find('.//i[@class="specieshybrid"]')
    if hybrid_element is not None and hybrid_element.text == '×':
        genushybrid = genus
        
    # Print everything. You may want to change the logic.
    print('genus:       ', genus)
    print('genus hybrid:', genushybrid)
    print('species:     ', species)
    print('-------------------')


The code prints:
genus:        Cypripedium
genus hybrid: -
species:      singchii
-------------------
genus:        Cypripedium
genus hybrid: Cypripedium
species:      smithii
-------------------
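
A possible variation, not part of the answer above: the standard library's html.parser module is an HTML parser rather than an XML one, so it tolerates the bare & and decodes &nbsp; without any pre-processing. The sketch below assumes the markup is flat (no nested tags) and that each class appears at most once per line; the names TaxonParser and parse_line are made up for illustration.

```python
from html.parser import HTMLParser


class TaxonParser(HTMLParser):
    """Collect the text of each element, keyed by its class attribute."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; before
        # the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.fields = {}
        self._current = None  # class of the element being read, if any

    def handle_starttag(self, tag, attrs):
        self._current = dict(attrs).get('class')

    def handle_endtag(self, tag):
        # Assumption: elements are not nested, so any end tag closes
        # the element that is currently being read.
        self._current = None

    def handle_data(self, data):
        # Text between elements (spaces, the hard space) is ignored.
        if self._current is not None:
            self.fields[self._current] = self.fields.get(self._current, '') + data


def parse_line(line):
    """Return a dict such as {'genus': ..., 'species': ...} for one line."""
    parser = TaxonParser()
    parser.feed(line)
    parser.close()
    return parser.fields


if __name__ == '__main__':
    line = ('<i class="genus">Cypripedium</i> <i class="specieshybrid">×</i>'
            '&nbsp;<i class="species">smithii</i> '
            '<span class="authorship">B.S.Williams</span>')
    fields = parse_line(line)
    print(fields['genus'], fields.get('specieshybrid', '-'), fields['species'])
```

With this approach the hybrid marker is simply fields.get('specieshybrid'), which is missing from the dict when the line has no such element.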

 
LVL 46

Accepted Solution

by:aikimark (earned 2000 total points)
ID: 40404449
It isn't really the letter "x": it is the multiplication sign (×), character code 215 (U+00D7). That value lies outside ASCII, which stops at 127; it belongs to Latin-1 and Unicode.
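
A quick way to check the character code mentioned above, and to see what the &nbsp; entity decodes to, is sketched below. It assumes Python 3; the constant names TIMES and NBSP are only illustrative.

```python
import html

TIMES = '\u00d7'  # the multiplication sign used as the hybrid marker
NBSP = '\u00a0'   # the no-break ("hard") space

print(ord('×'))                            # 215
print('×' == TIMES)                        # True
print('×' == 'x')                          # False: not the letter x
print(html.unescape('&times;') == TIMES)   # True
print(html.unescape('&nbsp;') == NBSP)     # True
```

If the page source contains the literal entity text &nbsp;, a pattern has to match those six characters; if the text was decoded first, it has to match the single character U+00A0 instead.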

You should try this pattern:
class="genus">([^<]+)</i>(?:.*?class="specieshybrid">([^<]*)</i>)?.*?class="species">([^<]+)</i>.*?class="authorship">([^<]+)</span>
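
A minimal sketch of how a pattern of this shape might be used with re.search. In this version the lazy .*? sits inside the optional group; if it sits in front of the group instead, the group can match zero times and the skip before class="species" swallows the hybrid element, so the × is never captured. The function name extract and the choice of an empty string for a missing marker are assumptions made for illustration.

```python
import re

PATTERN = re.compile(
    r'class="genus">([^<]+)</i>'
    r'(?:.*?class="specieshybrid">([^<]*)</i>)?'   # optional hybrid marker
    r'.*?class="species">([^<]+)</i>'
    r'.*?class="authorship">([^<]+)</span>'
)


def extract(line):
    """Return (genus, genushybrid, species, authorship), or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    genus, genushybrid, species, authorship = m.groups()
    # The optional group yields None when the line has no hybrid element.
    return genus, genushybrid or '', species, authorship


if __name__ == '__main__':
    lines = [
        '<i class="genus">Cypripedium</i> <i class="species">singchii</i> '
        '<span class="authorship">Z.J.Liu & L.J.Chen</span>',
        '<i class="genus">Cypripedium</i> <i class="specieshybrid">×</i>&nbsp;'
        '<i class="species">smithii</i> <span class="authorship">B.S.Williams</span>',
    ]
    for line in lines:
        print(extract(line))
```

Because everything between the elements is skipped with .*?, the hard space needs no special handling in the pattern.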

 

Author Closing Comment

by:cpeters5
ID: 40416062
aikimark,
Thanks for the solution.  I have a few more questions in separate posts

 

Author Comment

by:cpeters5
ID: 40416069
pepr,
Thank you for your response. Unfortunately, I cannot get it to work cleanly. Besides, this problem is part of a much bigger script, and switching to an XML parser would require more changes than I can make in a short time.
pax
 
LVL 46

Expert Comment

by:aikimark
ID: 40416075
"I have a few more questions in separate posts"
Please post links to your new questions in this thread.