Concatenate and separate group values from a regex using Python

My regular expression returns two groups. I need to concatenate the group values, following each value with a full stop and a space. The input and my regex are shown further below. How can I get something like this:

"687/11. Emergency treatment of allergic reactions (anaphylaxis) to insect stings or bites, foods, drugs and other allergens as well as idiopathic or exercise induced anaphylaxis. "

<tr>
<th scope="row">Drug Name: </th>
<td>adrenaline tartrate 150/300 micrograms solution for injection in pre-filled pen (Jext)</td>
</tr>
<tr>
<th width="25%" scope="row">SMC Drug ID: </th>

<td >687/11</td>
</tr>
<tr>
<th scope="row">Manufacturer:</th>
<td>ALK-Abello Ltd</td>
</tr>
<tr>

<th scope="row">Indication:</th>
<td>
Emergency treatment of allergic reactions (anaphylaxis) to insect stings or bites, foods, drugs and other allergens as well as idiopathic or exercise induced anaphylaxis
</td>
</tr>

(?s)(?i)<th.*?>SMC Drug ID: </th>.*?<td ?>(.*?)</td>.*?<th scope="row">Indication:</th>.*?<td>(.*?)</td>

pepr

My suggestion is not to focus on regular expressions in this case, because they are not powerful enough to do the task reliably. Regular expressions cannot describe patterns with arbitrarily nested pairs of delimiters (such as <tr>...</tr> and the like). You need a more powerful tool, a real parser, for that kind of input.
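To make the nesting problem concrete, here is a minimal sketch. The two sample strings are made up for illustration and are not taken from the question. Neither the lazy nor the greedy quantifier can pair opening and closing tags correctly once elements nest or repeat:

```python
import re

# Hypothetical markup: one <div> nested inside another.
nested = '<div>outer <div>inner</div> tail</div>'

# The lazy match stops at the FIRST closing tag, cutting the outer element short.
lazy = re.search(r'<div>(.*?)</div>', nested).group(1)

# Hypothetical markup: two sibling elements.
siblings = '<div>one</div><div>two</div>'

# The greedy match runs to the LAST closing tag and swallows both siblings.
greedy = re.search(r'<div>(.*)</div>', siblings).group(1)

print(repr(lazy))    # 'outer <div>inner'
print(repr(greedy))  # 'one</div><div>two'
```

A parser keeps track of which closing tag belongs to which opening tag, which a plain pattern like these cannot do.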

If the fragment comes from well-formed XML, I suggest using the standard Python xml.etree.ElementTree module. If it comes from not-so-well-formed HTML, BeautifulSoup may be the right tool for the task.
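BeautifulSoup is a third-party package, so as a rough stand-in the sketch below uses the standard library's html.parser instead, which also tolerates missing closing tags. The CellCollector class and the sloppy sample string are hypothetical names made up for this illustration. It assumes header and value cells strictly alternate:

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <th> and <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ('th', 'td'):
            # A new cell starts, even if the previous one was never closed.
            self.cells.append('')
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ('th', 'td'):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


# Hypothetical sloppy markup: most closing tags are missing.
sloppy = '<tr><th>SMC Drug ID: <td >687/11<tr><th>Indication:<td>Emergency treatment'

parser = CellCollector()
parser.feed(sloppy)
parser.close()

# Pair each header cell with the value cell that follows it.
cells = [cell.strip() for cell in parser.cells]
fields = dict(zip(cells[0::2], cells[1::2]))
print(fields)
```

Feeding the same string to xml.etree.ElementTree would raise a parse error, which is why a tolerant HTML parser is the safer choice for markup scraped from real pages.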

mmalik15

ASKER

Thanks for the comment.

I'm a beginner in Python; I can do this easily in .NET. Could you give me a simple example of string concatenation and group extraction using regex in Python? Thanks.

pepr

Try the following:

a.py

import xml.etree.ElementTree as ET

content = '''<tr>
          <th scope="row">Drug Name: </th>
          <td>adrenaline tartrate 150/300 micrograms solution for injection in pre-filled pen (Jext)</td>
        </tr>
                <tr>
          <th width="25%" scope="row">SMC Drug ID: </th>

          <td >687/11</td>
        </tr>
               <tr>
        <th scope="row">Manufacturer:</th>
        <td>ALK-Abello Ltd</td>
      </tr>
          <tr>

        <th scope="row">Indication:</th>
        <td>
            Emergency treatment of allergic reactions (anaphylaxis) to insect stings or bites, foods, drugs and other allergens as well as idiopathic or exercise induced anaphylaxis
        </td>
      </tr>'''


# The content is a fragment with several top-level elements, so wrap it
# in a dummy root element to get a single well-formed XML document.
e = ET.fromstring('<dummy>' + content + '</dummy>')

# Now, the fragment was parsed, and the element represents a structure 
# with the content.
print('Element tag:', e.tag)          # we know it is the dummy
print('Children of the element are the tr elements with th and td elements')
for child in e:
    print('-' * 70)      # just a separator line
    print(child.tag)     # this is expected to be 'tr'
    print(child.attrib)  # no tr attributes => empty dictionary

    # The following children contain the information...
    for x in child:
        print(x.tag)     # 'th' or 'td'
        print(x.text)    # text from inside the element (but not the nested subelements)

It prints the following on my console (captured from the screen, hence the wrapped lines):

c:\tmp\_Python\mmalik15\Q_27511432>python a.py
Element tag: dummy
Children of the element are the tr elements with th and td elements
----------------------------------------------------------------------
tr
{}
th
Drug Name:
td
adrenaline tartrate 150/300 micrograms solution for injection in pre-filled pen
(Jext)
----------------------------------------------------------------------
tr
{}
th
SMC Drug ID:
td
687/11
----------------------------------------------------------------------
tr
{}
th
Manufacturer:
td
ALK-Abello Ltd
----------------------------------------------------------------------
tr
{}
th
Indication:
td

            Emergency treatment of allergic reactions (anaphylaxis) to insect st
ings or bites, foods, drugs and other allergens as well as idiopathic or exercis
e induced anaphylaxis
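
Once the fragment is parsed, the string requested in the question can be assembled from the tree. The following is only a sketch under a few assumptions: the content is a shortened copy of the fragment, each row holds exactly one th and one td, and the fields dictionary is a helper introduced here for illustration:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the fragment from the question.
content = '''<tr>
  <th width="25%" scope="row">SMC Drug ID: </th>
  <td >687/11</td>
</tr>
<tr>
  <th scope="row">Indication:</th>
  <td>
    Emergency treatment of allergic reactions
  </td>
</tr>'''

root = ET.fromstring('<dummy>' + content + '</dummy>')

# Map each header label (without the trailing colon) to its cell text.
fields = {}
for row in root.iter('tr'):
    th, td = row.find('th'), row.find('td')
    if th is not None and td is not None:
        label = (th.text or '').strip().rstrip(':')
        fields[label] = (td.text or '').strip()

# Follow every value with a full stop and a space, as the question asks.
values = [fields['SMC Drug ID'], fields['Indication']]
result = ''.join(value + '. ' for value in values)
print(repr(result))  # '687/11. Emergency treatment of allergic reactions. '
```

Looking the cells up by their header label means the result does not depend on the order of the rows.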

pepr

If you want to use the pattern intensively, it may be a good idea to compile it. You then get a regular expression object that holds the compiled pattern, and repeated searching is faster in such cases. The (?i) and (?s) options can then be passed as flags at compile time (see below).

Another good idea may be to name the groups in the pattern. See the following example:

c.py

import re

content = '''<tr>
          <th scope="row">Drug Name: </th>
          <td>adrenaline tartrate 150/300 micrograms solution for injection in pre-filled pen (Jext)</td>
        </tr>
                <tr>
          <th width="25%" scope="row">SMC Drug ID: </th>

          <td >687/11</td>
        </tr>
               <tr>
        <th scope="row">Manufacturer:</th>
        <td>ALK-Abello Ltd</td>
      </tr>
          <tr>

        <th scope="row">Indication:</th>
        <td>
            Emergency treatment of allergic reactions (anaphylaxis) to insect stings or bites, foods, drugs and other allergens as well as idiopathic or exercise induced anaphylaxis
        </td>
      </tr>'''


# Notice that (?i)(?s) was removed. The groups were given names
# via the (?P<name>...) syntax.
pattern = r'<th.*?>SMC Drug ID: </th>.*?<td ?>(?P<id>.*?)</td>.*?<th scope="row">Indication:</th>.*?<td>(?P<des>.*?)</td>'

regex = re.compile(pattern, re.I | re.S) # flags passed explicitly, not inside the pattern
m = regex.search(content)                # no pattern argument here
if m is not None:
    drugId = m.group('id').strip()       # now string identifiers can be used...
    description = m.group('des').strip() # ... instead of numbers.
    print('{0}. {1}'.format(drugId, description))  # C#-like formatting
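
The question asks for a full stop and a space after every value, including the last one. A possible variation, sketched here with a shortened copy of the content, loops over all captured groups instead of formatting them one by one. The parts and result names are introduced only for this illustration:

```python
import re

# Shortened copy of the fragment from the question.
content = '''<tr>
  <th width="25%" scope="row">SMC Drug ID: </th>
  <td >687/11</td>
</tr>
<tr>
  <th scope="row">Indication:</th>
  <td>
    Emergency treatment of allergic reactions
  </td>
</tr>'''

regex = re.compile(
    r'<th.*?>SMC Drug ID: </th>.*?<td ?>(?P<id>.*?)</td>'
    r'.*?<th scope="row">Indication:</th>.*?<td>(?P<des>.*?)</td>',
    re.I | re.S)

result = ''
m = regex.search(content)
if m is not None:
    # m.groups() returns all captured groups in pattern order.
    parts = [value.strip() for value in m.groups()]
    # Append '. ' to every part, so the last value also gets its full stop.
    result = ''.join(part + '. ' for part in parts)

print(repr(result))  # '687/11. Emergency treatment of allergic reactions. '
```

Because it iterates over m.groups(), the same code keeps working if more groups are added to the pattern later.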

mmalik15

ASKER

Awesome, buddy, thanks!

pepr

You are welcome. Thanks ;)