cpeters5


Regular expression matching in Python

I am writing a Python script to fetch information from a list of web pages. The script contains a loop that fetches the content of one URL in each iteration. Each page contains a line from which I want to extract data. These lines differ slightly: some contain the hard space (&nbsp;) entity and the × sign, which cause problems for the match. Here are two examples of the input lines:

1)  <i class="genus">Cypripedium</i> <i class="species">singchii</i> <span class="authorship">Z.J.Liu & L.J.Chen</span>  
2) <i class="genus">Cypripedium</i> <i class="specieshybrid">×</i>&nbsp;<i class="species">smithii</i> <span class="authorship">B.S.Williams</span>  

I want to extract the species into a species variable and, if the × sign exists, record it in a genushybrid variable.

The following code partially works if I ignore the × sign:
import re

w = re.search('(?<=genus">)([^<]+)', line)
genushybrid = ""
genus = w.group(0)

But I am not able to check whether the hard space and the × character exist.

Thanks
pepr

You may eventually make it work, but regular expressions are not well suited to parsing HTML or XML once things get more complex. I suggest using the standard xml.etree.ElementTree module. Because you have an HTML fragment and ElementTree uses an XML parser, it does not recognize &nbsp; and similar HTML entities. See the comments in the code:
#!python3

import xml.etree.ElementTree as ET

# The test data.
lst = [
    '<i class="genus">Cypripedium</i> <i class="species">singchii</i> <span class="authorship">Z.J.Liu & L.J.Chen</span>',
    '<i class="genus">Cypripedium</i> <i class="specieshybrid">×</i>&nbsp;<i class="species">smithii</i> <span class="authorship">B.S.Williams</span>'
]

for s in lst:

    # The first line is malformed. The & should be &amp;
    s = s.replace(' & ', ' &amp; ')

    # The XML parser is used here, and it knows only the XML entities,
    # not the HTML ones. Therefore, I replace the &nbsp; with
    # a normal space.
    s = s.replace('&nbsp;', ' ')
    
    # The ElementTree parser returns a single element object.
    # This requires wrapping the string in a pair of tags. The 'fragment'
    # name may be replaced with whatever you like better.
    s = '<fragment>' + s + '</fragment>'
    root = ET.fromstring(s)
    ##ET.dump(root)

    # ElementTree returns the parsed document as an element object
    # that can be indexed like a list of its child elements. Here the root
    # contains the 'i' elements.
    #    Each element has an .attrib member, which behaves as a dictionary
    # of the element's attributes. It also has a .text attribute that
    # contains the text wrapped by the tag.
    
    # I assume you always have the 'i' with the genus class as the first one.
    assert root[0].attrib['class'] == 'genus'
    genus = root[0].text
    genushybrid = '-'   # init
    
    # Find the species. Here an XPath expression is used. The dot means
    # "from this element", the two slashes mean "anywhere below it", and
    # the rest selects the 'i' element with the class equal to 'species'.
    spec_el = root.find('.//i[@class="species"]')
    assert spec_el is not None
    species = spec_el.text
        
    # Test if the hybrid element exists and if it contains the mark.
    # Here the genushybrid is simply copied.
    hybrid_element = root.find('.//i[@class="specieshybrid"]')
    if hybrid_element is not None and hybrid_element.text == '×':
        genushybrid = genus
        
    # Print everything. You may want to change the logic.
    print('genus:       ', genus)
    print('genus hybrid:', genushybrid)
    print('species:     ', species)
    print('-------------------')


The code prints:
genus:        Cypripedium
genus hybrid: -
species:      singchii
-------------------
genus:        Cypripedium
genus hybrid: Cypripedium
species:      smithii
-------------------
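
If replacing the entities by hand is a concern, the standard html.parser module may be an alternative, since it decodes HTML entities such as &nbsp; and tolerates a bare &. The following is only a sketch: the NameParser class and the parse_fields function are hypothetical names, and it assumes each field sits in an 'i' or 'span' element whose class attribute names the field, as in the two sample lines.

```python
from html.parser import HTMLParser


class NameParser(HTMLParser):
    """Collect the text of <i>/<span> elements, keyed by their class."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; in text.
        super().__init__(convert_charrefs=True)
        self.fields = {}       # class name -> collected text
        self._current = None   # class of the element being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in ('i', 'span'):
            self._current = dict(attrs).get('class')

    def handle_endtag(self, tag):
        # Text between elements (spaces, the hard space) is ignored.
        self._current = None

    def handle_data(self, data):
        if self._current:
            previous = self.fields.get(self._current, '')
            self.fields[self._current] = previous + data


def parse_fields(line):
    """Return a dict such as {'genus': ..., 'species': ...} for one line."""
    parser = NameParser()
    parser.feed(line)
    parser.close()
    return parser.fields
```

With this sketch, the hybrid test becomes a dictionary lookup, for example parse_fields(s).get('specieshybrid') == '×'.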


ASKER CERTIFIED SOLUTION
aikimark

ASKER

aikimark,
Thanks for the solution.  I have a few more questions in separate posts
pepr,
Thank you for your response. Unfortunately, I cannot get it to work cleanly. Besides, this problem is part of a much bigger script, and switching to an XML parser would require a lot of changes to that script in a short time.
pax
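
For readers who need to stay with regular expressions, one possible approach is a single pattern in which the hybrid element is optional and the separator may be whitespace, the literal &nbsp; entity, or an already decoded hard space. This is only a sketch under the assumption that every line has the same layout as the two samples in the question; LINE_RE and parse_name are hypothetical names, not part of the original script.

```python
import re

# Assumed layout: genus, optional specieshybrid, then species, separated
# by whitespace, the &nbsp; entity, or a decoded hard space (\xa0).
SEP = r'(?:\s|&nbsp;|\xa0)*'
LINE_RE = re.compile(
    r'<i class="genus">(?P<genus>[^<]+)</i>' + SEP +
    r'(?:<i class="specieshybrid">(?P<hybrid>[^<]*)</i>' + SEP + r')?'
    r'<i class="species">(?P<species>[^<]+)</i>'
)


def parse_name(line):
    """Return (genus, hybrid_mark, species), or None if nothing matches.

    hybrid_mark is '' when the line has no specieshybrid element.
    """
    m = LINE_RE.search(line)
    if m is None:
        return None
    hybrid = (m.group('hybrid') or '').strip()
    return m.group('genus'), hybrid, m.group('species')
```

Inside the existing loop this could be used as result = parse_name(line), followed by a check for None before unpacking the three values. Whether the × mark should then be stored as-is or turned into a flag depends on the rest of the script.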
I have a few more questions in separate posts
Please post links to your new questions in this thread