mmalik15
asked on
Concatenate and separate group values from a regex using Python
My regular expression returns two groups. I need to concatenate these group values and separate each value with a full stop and a space. For example, my string is shown below and the regex is in the code part. How can we get something like this?
"687/11. Emergency treatment of allergic reactions (anaphylaxis) to insect stings or bites, foods, drugs and other allergens as well as idiopathic or exercise induced anaphylaxis. "
<tr>
<th scope="row">Drug Name: </th>
<td>adrenaline tartrate 150/300 micrograms solution for injection in pre-filled pen (Jext)</td>
</tr>
<tr>
<th width="25%" scope="row">SMC Drug ID: </th>
<td >687/11</td>
</tr>
<tr>
<th scope="row">Manufacturer:</th>
<td>ALK-Abello Ltd</td>
</tr>
<tr>
<th scope="row">Indication:</th>
<td>
Emergency treatment of allergic reactions (anaphylaxis) to insect stings or bites, foods, drugs and other allergens as well as idiopathic or exercise induced anaphylaxis
</td>
</tr>
(?s)(?i)<th.*?>SMC Drug ID: </th>.*?<td ?>(.*?)</td>.*?<th scope="row">Indication:</th>.*?<td>(.*?)</td>
ASKER
Thanks for the comment.
I'm a beginner in Python; I can do it easily in .NET. Could you give me a simple example of string concatenation and group extraction using regex in Python? Thanks.
Try the following:
a.py
import xml.etree.ElementTree as ET
content = '''<tr>
<th scope="row">Drug Name: </th>
<td>adrenaline tartrate 150/300 micrograms solution for injection in pre-filled pen (Jext)</td>
</tr>
<tr>
<th width="25%" scope="row">SMC Drug ID: </th>
<td >687/11</td>
</tr>
<tr>
<th scope="row">Manufacturer:</th>
<td>ALK-Abello Ltd</td>
</tr>
<tr>
<th scope="row">Indication:</th>
<td>
Emergency treatment of allergic reactions (anaphylaxis) to insect stings or bites, foods, drugs and other allergens as well as idiopathic or exercise induced anaphylaxis
</td>
</tr>'''
# If you know it is a fragment, you want to add the dummy element tags
# around to get a single element fragment.
e = ET.fromstring('<dummy>' + content + '</dummy>')
# Now, the fragment was parsed, and the element represents a structure
# with the content.
print('Element tag:', e.tag)   # we know it is the dummy
print('Children of the element are the tr elements with th and td elements')
for child in e:
    print('-' * 70)       # just a separator line
    print(child.tag)      # this is expected to be 'tr'
    print(child.attrib)   # no tr attributes => empty dictionary structure

    # The following children contain the information...
    for x in child:
        print(x.tag)      # 'th' or 'td'
        print(x.text)     # text from inside the element (but not the nested subelements)
It prints on my console (captured from the screen, thus lines wrapped):
c:\tmp\_Python\mmalik15\Q_27511432>python a.py
Element tag: dummy
Children of the element are the tr elements with th and td elements
----------------------------------------------------------------------
tr
{}
th
Drug Name:
td
adrenaline tartrate 150/300 micrograms solution for injection in pre-filled pen
(Jext)
----------------------------------------------------------------------
tr
{}
th
SMC Drug ID:
td
687/11
----------------------------------------------------------------------
tr
{}
th
Manufacturer:
td
ALK-Abello Ltd
----------------------------------------------------------------------
tr
{}
th
Indication:
td
Emergency treatment of allergic reactions (anaphylaxis) to insect st
ings or bites, foods, drugs and other allergens as well as idiopathic or exercis
e induced anaphylaxis
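Going one step further than the traversal above, the same parsed structure can be used to build the string asked for in the question. This is only a sketch, written for Python 3: it assumes each tr holds exactly one th label and one td value, and the fields dictionary and the shortened Indication text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened fragment in the same layout as the question (text abbreviated).
fragment = '''<tr>
<th width="25%" scope="row">SMC Drug ID: </th>
<td >687/11</td>
</tr>
<tr>
<th scope="row">Indication:</th>
<td>
Emergency treatment of allergic reactions
</td>
</tr>'''

# Wrap the fragment in a dummy element so that it has a single root.
root = ET.fromstring('<dummy>' + fragment + '</dummy>')

# Map each th label (without the trailing colon) to its td text.
fields = {}
for tr in root.iter('tr'):
    th = tr.find('th')
    td = tr.find('td')
    if th is not None and td is not None:
        label = (th.text or '').strip().rstrip(':')
        fields[label] = (td.text or '').strip()

# Join the two values, each followed by a full stop and a space.
result = '{0}. {1}. '.format(fields['SMC Drug ID'], fields['Indication'])
print(result)
```

Looking the values up by label avoids depending on the order of the rows.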
If you want to use the regular expression pattern intensively, it may be a good idea to compile it. You then get a regular expression object that holds the compiled pattern. Repeated searching can be faster in such cases. The (?i) and (?s) options can then be passed to the compiler as flags instead (see below).
Another good idea may be to name the regular pattern groups. See the following example:
c.py
import re
content = '''<tr>
<th scope="row">Drug Name: </th>
<td>adrenaline tartrate 150/300 micrograms solution for injection in pre-filled pen (Jext)</td>
</tr>
<tr>
<th width="25%" scope="row">SMC Drug ID: </th>
<td >687/11</td>
</tr>
<tr>
<th scope="row">Manufacturer:</th>
<td>ALK-Abello Ltd</td>
</tr>
<tr>
<th scope="row">Indication:</th>
<td>
Emergency treatment of allergic reactions (anaphylaxis) to insect stings or bites, foods, drugs and other allergens as well as idiopathic or exercise induced anaphylaxis
</td>
</tr>'''
# Notice that (?i)(?s) was removed. The groups were given names
# via the (?P<name>...) syntax.
pattern = r'<th.*?>SMC Drug ID: </th>.*?<td ?>(?P<id>.*?)</td>.*?<th scope="row">Indication:</th>.*?<td>(?P<des>.*?)</td>'
regex = re.compile(pattern, re.I | re.S)  # flags passed explicitly, not embedded in the pattern
m = regex.search(content)                 # no pattern argument here
if m is not None:
    drugId = m.group('id').strip()         # now string identifiers can be used...
    description = m.group('des').strip()   # ... instead of numbers.
    print('{0}. {1}'.format(drugId, description))  # C#-like formatting
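If the input holds several such records, the compiled object can be reused with finditer. The following is a sketch for Python 3 under assumptions: the two records and their texts are invented, and every record is assumed to follow the same layout as the fragment in the question. Each value is followed by a full stop and a space, as the question requested.

```python
import re

# Two hypothetical records in the same layout as the question's fragment.
records = '''<tr><th width="25%" scope="row">SMC Drug ID: </th><td >687/11</td></tr>
<tr><th scope="row">Indication:</th><td>
First indication text
</td></tr>
<tr><th width="25%" scope="row">SMC Drug ID: </th><td>123/45</td></tr>
<tr><th scope="row">Indication:</th><td>Second indication text</td></tr>'''

# Same pattern as in c.py, compiled once and reused for every match.
regex = re.compile(
    r'<th.*?>SMC Drug ID: </th>.*?<td ?>(?P<id>.*?)</td>'
    r'.*?<th scope="row">Indication:</th>.*?<td>(?P<des>.*?)</td>',
    re.I | re.S)

# finditer yields one match object per record, in order of appearance.
lines = ['{0}. {1}. '.format(m.group('id').strip(), m.group('des').strip())
         for m in regex.finditer(records)]

for line in lines:
    print(line)
```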
ASKER
Awesome, buddy, thanks!
You are welcome. Thanks ;)
If the fragment comes from well-formed XML, then I suggest using the standard Python xml.etree.ElementTree capabilities. If the fragment comes from not-so-well-formed HTML, then BeautifulSoup may be the right tool for the task.
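BeautifulSoup is a third-party package. As a rough stand-in that needs only the standard library, the sketch below (Python 3) uses html.parser.HTMLParser, which does not insist on well-formed markup. The RowCollector class is a hypothetical helper, not part of any library, and the sample markup (a missing closing tr tag, upper-case tags, shortened text) is invented to show the tolerance.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect {th text: td text} pairs (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.rows = {}
        self._cell = None    # 'th' or 'td' while inside such a cell
        self._label = None   # text of the most recent th cell
        self._buf = []       # text pieces of the current cell

    def handle_starttag(self, tag, attrs):
        if tag in ('th', 'td'):
            self._cell = tag
            self._buf = []

    def handle_data(self, data):
        if self._cell is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._cell:
            return
        text = ''.join(self._buf).strip()
        if tag == 'th':
            self._label = text.rstrip(':').strip()
        elif self._label is not None:
            self.rows[self._label] = text
        self._cell = None


# Sloppy markup: the first row is never closed and one th is upper-case.
html = ('<tr><th scope="row">SMC Drug ID: </th><td >687/11</td>\n'
        '<tr><TH>Indication:</TH><td>\nEmergency treatment\n</td></tr>')

parser = RowCollector()
parser.feed(html)
parser.close()

result = '{0}. {1}. '.format(parser.rows['SMC Drug ID'], parser.rows['Indication'])
print(result)
```

For real-world pages, BeautifulSoup handles many more irregularities than this small collector does.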