Regular expression matching in Python
Posted on 2014-10-25
I am writing a Python script to fetch information from a list of web pages. The script loops over the URLs and fetches the content of one page per iteration. Each page contains a line from which I want to extract data. These lines differ slightly: some contain the hard (non-breaking) space character and the × sign, which cause problems for the match. Here are two examples of the input lines:
1) <i class="genus">Cypripedium</i> <i class="species">singchii</i> <span class="authorship">Z.J.Liu & L.J.Chen</span>
2) <i class="genus">Cypripedium</i> <i class="specieshybrid">×</i> <i class="species">smithii</i> <span class="authorship">B.S.Williams</span>
I want to extract the species and, if it is present, the × sign into the species and genushybrid variables.
The following code works partially if I ignore the × sign:
import re
# Capture the text that follows genus"> up to the next tag
w = re.search(r'(?<=genus">)([^<]+)', line)
genushybrid = ""
genus = w.group(0)
But I am not able to check whether the space character and the × character exist.
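One possible direction, given only as a rough sketch: make the specieshybrid element optional in a single pattern and allow any whitespace between the tags. It assumes Python 3, where \s in a str pattern also matches the hard space (U+00A0), and it assumes the hard space may also arrive as the literal entity &nbsp;. The names PATTERN, parse_name and hybrid are hypothetical; hybrid plays the role of the genushybrid variable above.

```python
import re

# Hypothetical pattern: the specieshybrid element is optional.
# SEP allows ordinary whitespace, U+00A0 (matched by \s for str patterns
# in Python 3), or a literal "&nbsp;" entity between the tags.
SEP = r'(?:\s|&nbsp;)*'
PATTERN = re.compile(
    r'<i class="genus">(?P<genus>[^<]+)</i>' + SEP +
    r'(?:<i class="specieshybrid">(?P<hybrid>[^<]+)</i>' + SEP + r')?'
    r'<i class="species">(?P<species>[^<]+)</i>'
)

def parse_name(line):
    """Return (genus, hybrid, species), or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    # hybrid is None when the optional element is absent; use "" instead
    return m.group("genus"), m.group("hybrid") or "", m.group("species")
```

Called on the two example lines above, parse_name would be expected to return an empty hybrid string for the first and the × sign for the second. If the pages are decoded as bytes or processed under Python 2, the hard space would likely need separate handling.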