Concatenate and separate group values from a regex using Python

My regular expression returns two groups. I need to concatenate these group values, separating each value with a full stop and a space. For example, my string is shown below, followed by the regex. How can I get something like this:

"687/11. Emergency treatment of allergic reactions (anaphylaxis) to insect stings or bites, foods, drugs and other allergens as well as idiopathic or exercise induced anaphylaxis. "


 <tr>
          <th scope="row">Drug Name: </th>
          <td>adrenaline tartrate 150/300 micrograms solution for injection in pre-filled pen (Jext)</td>
        </tr>
                <tr>
          <th width="25%" scope="row">SMC Drug ID: </th>

          <td >687/11</td>
        </tr>
               <tr>
        <th scope="row">Manufacturer:</th>
        <td>ALK-Abello Ltd</td>
      </tr>
          <tr>

        <th scope="row">Indication:</th>
        <td>
            Emergency treatment of allergic reactions (anaphylaxis) to insect stings or bites, foods, drugs and other allergens as well as idiopathic or exercise induced anaphylaxis
        </td>
      </tr>
(?s)(?i)<th.*?>SMC Drug ID: </th>.*?<td ?>(.*?)</td>.*?<th scope="row">Indication:</th>.*?<td>(.*?)</td>


mmalik15Asked:
 
peprCommented:
OK, the simplest way is:

b.py
import re

content = '''<tr>
          <th scope="row">Drug Name: </th>
          <td>adrenaline tartrate 150/300 micrograms solution for injection in pre-filled pen (Jext)</td>
        </tr>
                <tr>
          <th width="25%" scope="row">SMC Drug ID: </th>

          <td >687/11</td>
        </tr>
               <tr>
        <th scope="row">Manufacturer:</th>
        <td>ALK-Abello Ltd</td>
      </tr>
          <tr>

        <th scope="row">Indication:</th>
        <td>
            Emergency treatment of allergic reactions (anaphylaxis) to insect stings or bites, foods, drugs and other allergens as well as idiopathic or exercise induced anaphylaxis
        </td>
      </tr>'''


pattern = r'(?s)(?i)<th.*?>SMC Drug ID: </th>.*?<td ?>(.*?)</td>.*?<th scope="row">Indication:</th>.*?<td>(.*?)</td>'

m = re.search(pattern, content)
if m is not None:
    print m.group(1)
    print m.group(2)
    
    print '-' * 70
    drugId = m.group(1).strip()      # remove leading and trailing spaces
    description = m.group(2).strip()
    print drugId + '. ' + description



It prints (wrapped):

c:\tmp\_Python\mmalik15\Q_27511432>python b.py
687/11

            Emergency treatment of allergic reactions (anaphylaxis) to insect st
ings or bites, foods, drugs and other allergens as well as idiopathic or exercis
e induced anaphylaxis

----------------------------------------------------------------------
687/11. Emergency treatment of allergic reactions (anaphylaxis) to insect stings
 or bites, foods, drugs and other allergens as well as idiopathic or exercise in
duced anaphylaxis



Basically, you use the re.search() function to apply the pattern to the string.  If it returns an object rather than None, that object is a so-called "match object", and its .group() method gives access to the matched groups.  See http://docs.python.org/library/re.html#re.search.

The other operations are just string operations, including the concatenation by the + operator.
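
As a small variation on the script above (only a sketch, not part of b.py): the match object also has a .groups() method that returns all captured groups as a tuple, so the values can be stripped and joined with '. ' in one step. The shortened content string is a stand-in for the real fragment, and the closing '. ' is my assumption based on the sample output in the question. The flags are passed as an argument here, which has the same effect as the inline (?s)(?i). print() with a single argument behaves the same in Python 2 and 3.

```python
import re

# Shortened stand-in for the HTML fragment from the question (assumption).
content = '''<tr><th width="25%" scope="row">SMC Drug ID: </th>
<td >687/11</td></tr>
<tr><th scope="row">Indication:</th>
<td>
    Emergency treatment of allergic reactions
</td></tr>'''

pattern = (r'<th.*?>SMC Drug ID: </th>.*?<td ?>(.*?)</td>'
           r'.*?<th scope="row">Indication:</th>.*?<td>(.*?)</td>')

# re.I | re.S does the same job as the inline (?i)(?s) in the original pattern.
m = re.search(pattern, content, re.I | re.S)
if m is not None:
    # groups() returns every captured group as a tuple of strings.
    parts = [g.strip() for g in m.groups()]
    # Separator between the values, plus a closing one as in the question.
    result = '. '.join(parts) + '. '
    print(result)
```

The join() approach keeps working unchanged if a third group is added to the pattern later.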
 
peprCommented:
My suggestion is not to focus on regular expressions in this case.  The reason is that regular expressions are not powerful enough to do the task reliably.  Regular expressions cannot describe patterns with arbitrarily nested pairs of delimiters (such as <tr>...</tr> or the like).  You need more powerful means for parsing that kind of information.

If the fragment comes from well-formed XML, then I suggest using the standard Python xml.etree.ElementTree module.  If the fragment comes from not-so-well-formed HTML, then BeautifulSoup may be the right tool for the task.
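
To illustrate the reliability concern, here is a minimal sketch with a made-up variation of the fragment: the ID cell carries a class attribute (my assumption, not something from the real page). The pattern from the question still matches, but it silently captures the wrong cell.

```python
import re

# Hypothetical variation of the fragment: the ID cell now has a class
# attribute, which the '<td ?>' part of the pattern does not allow for.
content = '''<tr><th scope="row">SMC Drug ID: </th><td class="id">687/11</td></tr>
<tr><th scope="row">Manufacturer:</th><td>ALK-Abello Ltd</td></tr>
<tr><th scope="row">Indication:</th><td>Emergency treatment</td></tr>'''

pattern = (r'<th.*?>SMC Drug ID: </th>.*?<td ?>(.*?)</td>'
           r'.*?<th scope="row">Indication:</th>.*?<td>(.*?)</td>')

m = re.search(pattern, content, re.I | re.S)

# The lazy '.*?' skips over <td class="id"> and stops at the next plain
# <td>, so the manufacturer is captured where the drug ID was expected.
wrong_id = m.group(1)
print(wrong_id)
```

No error is raised, which is what makes this kind of failure hard to notice.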
 
mmalik15Author Commented:
Thanks for the comment.

I'm a beginner in Python; I can do it easily in .NET. Could you give me a simple example of string concatenation and group extraction using a regex in Python? Thanks.



 
peprCommented:
Try the following:

a.py
import xml.etree.ElementTree as ET

content = '''<tr>
          <th scope="row">Drug Name: </th>
          <td>adrenaline tartrate 150/300 micrograms solution for injection in pre-filled pen (Jext)</td>
        </tr>
                <tr>
          <th width="25%" scope="row">SMC Drug ID: </th>

          <td >687/11</td>
        </tr>
               <tr>
        <th scope="row">Manufacturer:</th>
        <td>ALK-Abello Ltd</td>
      </tr>
          <tr>

        <th scope="row">Indication:</th>
        <td>
            Emergency treatment of allergic reactions (anaphylaxis) to insect stings or bites, foods, drugs and other allergens as well as idiopathic or exercise induced anaphylaxis
        </td>
      </tr>'''


# If you know it is a fragment, you want to add the dummy element tags
# around to get a single element fragment.      
e = ET.fromstring('<dummy>' + content + '</dummy>')

# Now, the fragment was parsed, and the element represents a structure 
# with the content.
print 'Element tag:', e.tag           # we know it is the dummy
print 'Children of the element are the tr elements with th and td elements'
for child in e:
    print '-' * 70       # just a separator line
    print child.tag      # this is expected to be 'tr'
    print child.attrib   # no tr attributes => empty dictionary structure

    # The following children contain the information...
    for x in child:
        print x.tag      # 'th' or 'td'
        print x.text     # text from inside the element (but not from nested subelements)



It prints on my console (captured from the screen, thus lines wrapped):

c:\tmp\_Python\mmalik15\Q_27511432>python a.py
Element tag: dummy
Children of the element are the tr elements with th and td elements
----------------------------------------------------------------------
tr
{}
th
Drug Name:
td
adrenaline tartrate 150/300 micrograms solution for injection in pre-filled pen
(Jext)
----------------------------------------------------------------------
tr
{}
th
SMC Drug ID:
td
687/11
----------------------------------------------------------------------
tr
{}
th
Manufacturer:
td
ALK-Abello Ltd
----------------------------------------------------------------------
tr
{}
th
Indication:
td

            Emergency treatment of allergic reactions (anaphylaxis) to insect st
ings or bites, foods, drugs and other allergens as well as idiopathic or exercis
e induced anaphylaxis
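
To get from that structure to the string asked for in the question, one possible sketch is to collect the rows into a dictionary keyed by the th label and then concatenate the two values. The helper name rows_to_dict and the shortened fragment are my own, for illustration only; the fragment must be well-formed XML for this to work.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the fragment (assumption).
content = '''<tr><th width="25%" scope="row">SMC Drug ID: </th>
<td >687/11</td></tr>
<tr><th scope="row">Indication:</th>
<td>
    Emergency treatment of allergic reactions
</td></tr>'''


def rows_to_dict(fragment):
    """Map each row's th label to its td text (hypothetical helper)."""
    # Wrap the fragment so that the parser sees a single root element.
    root = ET.fromstring('<dummy>' + fragment + '</dummy>')
    table = {}
    for tr in root.findall('tr'):
        th, td = tr.find('th'), tr.find('td')
        if th is not None and td is not None:
            # Drop surrounding whitespace and the trailing colon of the label.
            label = (th.text or '').strip().rstrip(':')
            table[label] = (td.text or '').strip()
    return table


table = rows_to_dict(content)
print(table['SMC Drug ID'] + '. ' + table['Indication'])
```

Looking the cells up by label means the order of the rows and extra attributes on the tags no longer matter.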


 
peprCommented:
If you want to use the pattern intensively, it may be a good idea to compile it.  Then you get a regular expression object that holds the compiled pattern.  Repeated searching can be faster that way, because the pattern does not have to be looked up (or recompiled) on every call.  The (?i) and (?s) options can then be passed as flags at compile time instead (see below).

Another good idea may be to name the groups in the pattern.  See the following example:

c.py
import re

content = '''<tr>
          <th scope="row">Drug Name: </th>
          <td>adrenaline tartrate 150/300 micrograms solution for injection in pre-filled pen (Jext)</td>
        </tr>
                <tr>
          <th width="25%" scope="row">SMC Drug ID: </th>

          <td >687/11</td>
        </tr>
               <tr>
        <th scope="row">Manufacturer:</th>
        <td>ALK-Abello Ltd</td>
      </tr>
          <tr>

        <th scope="row">Indication:</th>
        <td>
            Emergency treatment of allergic reactions (anaphylaxis) to insect stings or bites, foods, drugs and other allergens as well as idiopathic or exercise induced anaphylaxis
        </td>
      </tr>'''


# Notice that (?i)(?s) was removed. The groups were given names
# via the (?P<name>...) syntax.
pattern = r'<th.*?>SMC Drug ID: </th>.*?<td ?>(?P<id>.*?)</td>.*?<th scope="row">Indication:</th>.*?<td>(?P<des>.*?)</td>'

regex = re.compile(pattern, re.I | re.S) # flags passed explicitly, not inside the pattern
m = regex.search(content)                # no pattern argument here
if m is not None:
    drugId = m.group('id').strip()       # now string identifiers can be used...
    description = m.group('des').strip() # ... instead of numbers.
    print '{0}. {1}'.format(drugId, description)  # C#-like formatting
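
Where the compiled object pays off is repeated use. As a rough sketch, assuming one string holds several records (the two records below, including the second ID and both descriptions, are made-up placeholders), regex.finditer() walks over all matches and m.groupdict() returns the named groups as a dictionary:

```python
import re

# Two shortened, made-up records in one string (placeholder data only).
content = '''<tr><th scope="row">SMC Drug ID: </th><td>687/11</td></tr>
<tr><th scope="row">Indication:</th><td> First description </td></tr>
<tr><th scope="row">SMC Drug ID: </th><td>000/00</td></tr>
<tr><th scope="row">Indication:</th><td> Second description </td></tr>'''

regex = re.compile(
    r'<th.*?>SMC Drug ID: </th>.*?<td ?>(?P<id>.*?)</td>'
    r'.*?<th scope="row">Indication:</th>.*?<td>(?P<des>.*?)</td>',
    re.I | re.S)

lines = []
for m in regex.finditer(content):   # one match object per record
    d = m.groupdict()               # {'id': ..., 'des': ...}
    lines.append('{0}. {1}'.format(d['id'].strip(), d['des'].strip()))

for line in lines:
    print(line)
```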


 
mmalik15Author Commented:
Awesome buddy, thanks!
 
peprCommented:
You are welcome.  Thanks ;)
Question has a verified solution.
