cpeters5
asked on
Regular expression matching in Python
I am writing a Python script to fetch information from a list of web pages. The script loops over the URLs and fetches the content of one page in each iteration. Each page contains a line from which I want to extract data. These lines differ slightly: some contain the hard space (&nbsp;) character and the × sign, which cause problems for the match. Here are two examples of the input lines:
1) <i class="genus">Cypripedium</i> <i class="species">singchii</i> <span class="authorship">Z.J.Liu & L.J.Chen</span>
2) <i class="genus">Cypripedium</i> <i class="specieshybrid">×</i> <i class="species">smithii</i> <span class="authorship">B.S.Williams</span>
I want to extract the species and, if it is present, the × sign into the species and genushybrid variables.
The following code works partially if I ignore the × sign:
w = re.search(r'(?<=genus">)([^ <]+)', line)
genushybrid = ""
genus = w.group(0)
But I am not able to check whether the hard space character and the × character exist.
Thanks
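One possible approach, shown only as a minimal sketch and not as the accepted solution: a single pattern that makes the specieshybrid element optional and tolerates ordinary whitespace, &nbsp; entities, or literal non-breaking spaces between the tags. The names SEP, PATTERN, and parse_name are hypothetical, and the sketch assumes the class names and tag layout from the two example lines above.

```python
import re

# Separator between tags: whitespace, the &nbsp; entity, or a literal U+00A0.
SEP = r'(?:\s|&nbsp;|\xa0)*'

# The specieshybrid element is wrapped in an optional non-capturing group,
# so lines without it still match.
PATTERN = re.compile(
    r'<i class="genus">(?P<genus>[^<]+)</i>' + SEP +
    r'(?:<i class="specieshybrid">(?P<hybrid>[^<]*)</i>' + SEP + r')?'
    r'<i class="species">(?P<species>[^<]+)</i>'
)

def parse_name(line):
    """Return (genus, genushybrid, species), or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    genus = m.group('genus').strip()
    # group('hybrid') is None when the optional element is absent.
    genushybrid = (m.group('hybrid') or '').strip()
    species = m.group('species').strip()
    return genus, genushybrid, species
```

With this sketch, genushybrid stays an empty string when the hybrid element is missing, which mirrors the default in the code above. If the page writes the sign as an entity such as &times;, the captured value will be that entity text rather than the × character.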
ASKER
aikimark,
Thanks for the solution. I have a few more questions in separate posts
ASKER
pepr,
Thank you for your response. Unfortunately, I cannot get it to work cleanly. Besides, this problem is part of a much bigger script. Switching to an XML parser would require a lot of changes to the script, which I cannot make in a short time.
pax
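For readers weighing the parser route mentioned above, here is a minimal sketch of what it could look like for a single line, using only the standard library html.parser module. This is not pepr's code, which is not shown in this thread; the names NameParser and parse_name_with_parser are hypothetical, and the class names are assumed from the example lines in the question.

```python
from html.parser import HTMLParser

class NameParser(HTMLParser):
    """Collects the text of <i> elements, keyed by their class attribute."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &times; into characters.
        super().__init__(convert_charrefs=True)
        self.fields = {}
        self._current = None  # class of the <i> element currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'i':
            self._current = dict(attrs).get('class')

    def handle_endtag(self, tag):
        if tag == 'i':
            self._current = None

    def handle_data(self, data):
        # Text outside <i> elements (separators, the authorship span) is ignored.
        if self._current:
            previous = self.fields.get(self._current, '')
            self.fields[self._current] = previous + data.strip()

def parse_name_with_parser(line):
    """Return (genus, genushybrid, species); missing parts are empty strings."""
    parser = NameParser()
    parser.feed(line)
    parser.close()  # flush any buffered text
    f = parser.fields
    return f.get('genus', ''), f.get('specieshybrid', ''), f.get('species', '')
```

Because the separators between tags never reach the collected fields, the hard space needs no special handling here. Whether that is worth the change in a larger script is a judgment call.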
"I have a few more questions in separate posts"
Please post links to your new questions in this thread.
ASKER
Here is my follow-up question:
https://www.experts-exchange.com/questions/28548801/Python-TypeError.html