concatenate and seperate group values from a regex using python

My regular expression would return two gruops. I need to concatenate these group values and seperate each value with a full stop and space. e.g. my string is  shown below and regex in code part. How can we have some thing like this

"687/11. Emergency treatment of allergic reactions (anaphylaxis) to insect stings or bites, foods, drugs and other allergens as well as idiopathic or exercise induced anaphylaxis. "


 <tr>
          <th scope="row">Drug Name: </th>
          <td>adrenaline tartrate 150/300 micrograms solution for injection in pre-filled pen (Jext)</td>
        </tr>
                <tr>
          <th width="25%" scope="row">SMC Drug ID: </th>

          <td >687/11</td>
        </tr>
               <tr>
        <th scope="row">Manufacturer:</th>
        <td>ALK-Abello Ltd</td>
      </tr>
          <tr>

        <th scope="row">Indication:</th>
        <td>
            Emergency treatment of allergic reactions (anaphylaxis) to insect stings or bites, foods, drugs and other allergens as well as idiopathic or exercise induced anaphylaxis
        </td>
      </tr>
(?s)(?i)<th.*?>SMC Drug ID: </th>.*?<td ?>(.*?)</td>.*?<th scope="row">Indication:</th>.*?<td>(.*?)</td>

Open in new window

mmalik15Asked:
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

x
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

peprCommented:
My suggestion is not to focus on regular expressions in the case.  The reason is that regular expressions are not powerful enough to do the task reliably.  Regular expressions are not capable to to describe patterns with nested pairs of braces (i.e. with <tr>...</tr> or the like.  You need more powerful means for parsing that kind of information.

If the fragment comes from well formed XML, then I suggest to use the standard Python xml.etree.ElementTree capabilities.  If the fragment comes from not-so-well formed HTML, then BeautifulSoup may be the right tool for the task.
0
mmalik15Author Commented:
Thanks for the comment.

I'm a begginer in python; cn do it easily in .net. Can you possibly give me a simple example of string concatination and group extraction using regex  in python thanks



0
peprCommented:
Try the following:

a.py
import xml.etree.ElementTree as ET

content = '''<tr>
          <th scope="row">Drug Name: </th>
          <td>adrenaline tartrate 150/300 micrograms solution for injection in pre-filled pen (Jext)</td>
        </tr>
                <tr>
          <th width="25%" scope="row">SMC Drug ID: </th>

          <td >687/11</td>
        </tr>
               <tr>
        <th scope="row">Manufacturer:</th>
        <td>ALK-Abello Ltd</td>
      </tr>
          <tr>

        <th scope="row">Indication:</th>
        <td>
            Emergency treatment of allergic reactions (anaphylaxis) to insect stings or bites, foods, drugs and other allergens as well as idiopathic or exercise induced anaphylaxis
        </td>
      </tr>'''


# If you know it is a fragment, you want to add the dummy element tags
# around to get a single element fragment.      
e = ET.fromstring('<dummy>' + content + '</dummy>')

# Now, the fragment was parsed, and the element represents a structure 
# with the content.
print 'Element tag:', e.tag           # we know it is the dummy
print 'Children of the element are the tr elements with th and td elements'
for child in e:
   print '-' * 70       # just a separator line
   print child.tag      # this is expected to be 'tr'
   print child.attrib   # no tr attributes => empty dictionary structure
   
   # The following children contain the information...
   for x in child:      
       print x.tag      # 'th' or 'td'
       print x.text     # text from inside the element (but not the nested subelements

Open in new window


It prints on my console (captured from the screen, thus lines wraped):

c:\tmp\_Python\mmalik15\Q_27511432>python a.py
Element tag: dummy
Children of the element are the tr elements with th and td elements
----------------------------------------------------------------------
tr
{}
th
Drug Name:
td
adrenaline tartrate 150/300 micrograms solution for injection in pre-filled pen
(Jext)
----------------------------------------------------------------------
tr
{}
th
SMC Drug ID:
td
687/11
----------------------------------------------------------------------
tr
{}
th
Manufacturer:
td
ALK-Abello Ltd
----------------------------------------------------------------------
tr
{}
th
Indication:
td

            Emergency treatment of allergic reactions (anaphylaxis) to insect st
ings or bites, foods, drugs and other allergens as well as idiopathic or exercis
e induced anaphylaxis

Open in new window

0
Determine the Perfect Price for Your IT Services

Do you wonder if your IT business is truly profitable or if you should raise your prices? Learn how to calculate your overhead burden with our free interactive tool and use it to determine the right price for your IT services. Download your free eBook now!

peprCommented:
OK the simplistic way is:

b.py
import re

content = '''<tr>
          <th scope="row">Drug Name: </th>
          <td>adrenaline tartrate 150/300 micrograms solution for injection in pre-filled pen (Jext)</td>
        </tr>
                <tr>
          <th width="25%" scope="row">SMC Drug ID: </th>

          <td >687/11</td>
        </tr>
               <tr>
        <th scope="row">Manufacturer:</th>
        <td>ALK-Abello Ltd</td>
      </tr>
          <tr>

        <th scope="row">Indication:</th>
        <td>
            Emergency treatment of allergic reactions (anaphylaxis) to insect stings or bites, foods, drugs and other allergens as well as idiopathic or exercise induced anaphylaxis
        </td>
      </tr>'''


pattern = r'(?s)(?i)<th.*?>SMC Drug ID: </th>.*?<td ?>(.*?)</td>.*?<th scope="row">Indication:</th>.*?<td>(.*?)</td>'

m = re.search(pattern, content)
if m is not None:
    print m.group(1)
    print m.group(2)
    
    print '-' * 70
    drugId = m.group(1).strip()      # remove leading and trailing spaces
    description = m.group(2).strip()
    print drugId + '. ' + description

Open in new window


It prints (wrapped):

c:\tmp\_Python\mmalik15\Q_27511432>python b.py
687/11

            Emergency treatment of allergic reactions (anaphylaxis) to insect st
ings or bites, foods, drugs and other allergens as well as idiopathic or exercis
e induced anaphylaxis

----------------------------------------------------------------------
687/11. Emergency treatment of allergic reactions (anaphylaxis) to insect stings
 or bites, foods, drugs and other allergens as well as idiopathic or exercise in
duced anaphylaxis

Open in new window


Basically, you want to use the re.search() operation to apply the pattern to the string.  If you get some object back, then it so called "match object".  Its .group() method can be used to access the matched groups.  See http://docs.python.org/library/re.html#re.search.

The other operations are just string operations, including the concatenation by the + operator.
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
peprCommented:
If you wan to use the regular pattern intensively, it may be a good idea to compile it.  Then you get a regular expression object that captures the compiled regular expression pattern inside.  Repeated searching is much faster in such cases.  The (?i) and (?s) can then be compiled inside also via flags (see below).

Another good idea may be to name the regular pattern groups.  See the following example:

c.py
import re

content = '''<tr>
          <th scope="row">Drug Name: </th>
          <td>adrenaline tartrate 150/300 micrograms solution for injection in pre-filled pen (Jext)</td>
        </tr>
                <tr>
          <th width="25%" scope="row">SMC Drug ID: </th>

          <td >687/11</td>
        </tr>
               <tr>
        <th scope="row">Manufacturer:</th>
        <td>ALK-Abello Ltd</td>
      </tr>
          <tr>

        <th scope="row">Indication:</th>
        <td>
            Emergency treatment of allergic reactions (anaphylaxis) to insect stings or bites, foods, drugs and other allergens as well as idiopathic or exercise induced anaphylaxis
        </td>
      </tr>'''


# Notice that (?i)(?s) was removed. The groups were given names
# via the (?P<name>...) syntax.
pattern = r'<th.*?>SMC Drug ID: </th>.*?<td ?>(?P<id>.*?)</td>.*?<th scope="row">Indication:</th>.*?<td>(?P<des>.*?)</td>'

regex = re.compile(pattern, re.I | re.S) # flags given not through patternexplicitly
m = regex.search(content)                # no pattern argument here
if m is not None:
    drugId = m.group('id').strip()       # now string identifiers can be used...
    description = m.group('des').strip() # ... instead of numbers.
    print '{0}. {1}'.format(drugId, description)  # C#-like formatting

Open in new window

0
mmalik15Author Commented:
Awesome buddy thank!
0
peprCommented:
You are welcome.  Thanks ;)
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Python

From novice to tech pro — start learning today.