How to extract all regex matches in Python

Hi all.

I have several occurrences of similar text in a very long email:

*1116 1200 ABC_Content_124853_124855 1117 1500
ABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*

The text can vary a lot, but I am able to get all the relevant matches using this regex:
(?s).*\s(\*\d+\s+\d+.*?\*)+.*

The problem is that there can be many occurrences of such a group to extract, and I need to implement this in Python.

Python says that (?s) is not a valid RegEx...

Therefore I've tried:
print(re.findall(r'.*\s(\*\d+\s+\d+.*?\*)+.*', my_very_long_text.replace('\n', ' ').replace('\r', '')))

But it only prints the LAST match, not ALL the matches.

Can you kindly help?
ltpitt asked:
 
Flabio Gates commented:
Here's a simple pattern that returns 3 matches from your sample text:
text = '''
Here's a more complete example of text, I'd basically need to write python code to extract all the occurrencies of the RegEx:

*1116 1200 ABC_Content_124853_124855 1117 1500
ABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*

text stuff stuff text

*2316 1212 ABC_Content_124853_124855 1117 1500
ABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*

more text more more stuff

*1234 4321 ABC_Content_124853_124855 1117 1500
ABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*

My regex is this:
'''
import re
for s in re.findall(r'(?s)(\*\d+.+?\*)', text):
    print(repr(s))

and the output:
'*1116 1200 ABC_Content_124853_124855 1117 1500\nABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*'
'*2316 1212 ABC_Content_124853_124855 1117 1500\nABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*'
'*1234 4321 ABC_Content_124853_124855 1117 1500\nABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*'
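The same idea can also be written with the re.DOTALL flag in place of the inline (?s), and re.finditer reports where each match sits in the text. This is only a sketch: the shortened sample string and the names sample and BLOCK are made up for illustration and are not from the original email.

```python
import re

# Shortened, made-up sample with the same shape as the text above.
sample = (
    "intro text\n"
    "*1116 1200 ABC_Content_1 1117 1500\n"
    "ABC_Content_2 1117 1000 - 85% to 99%*\n"
    "filler\n"
    "*2316 1212 ABC_Content_3 1117 1500*\n"
)

# re.DOTALL is the flag form of the inline (?s): '.' also matches newlines.
BLOCK = re.compile(r'\*\d+.+?\*', re.DOTALL)

# With no capture group in the pattern, findall returns the full matches.
matches = BLOCK.findall(sample)

for m in BLOCK.finditer(sample):
    # m.span() is the (start, end) offset of the match in the text.
    print(m.span(), repr(m.group(0)))
```

Compiling the pattern once is a matter of taste here; it mainly helps when the same pattern is applied to many emails.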


1
 
NerdsOfTech (Technology Scientist) commented:
I'm very new to Python so forgive my sad attempt...

I think re.search() should be used instead of re.findall()

print(re.search(r'(\s?).*\s(\*\d+\s+\d+.*?\*)+.*', my_very_long_text.replace('\n', ' ').replace('\r', '')))

https://docs.python.org/3.4/library/re.html#search-vs-match

Looking forward to seeing the correct answer...

To break down your regular expression:

(\s?)      zero or one whitespace character,
.*      zero or more of any character,
\s      a whitespace character,
(\*\d+\s+\d+.*?\*)
      {
      (an asterisk,
       1 or more digit(s),
       1 or more whitespace character(s),
       1 or more digit(s),
       zero or more of any character [non-greedy],
       another asterisk)
      }
+       the above {SET} 1 or more times,
.*       then finally zero or more of any character.
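A breakdown like this can also live inside the pattern itself by using re.VERBOSE, which ignores whitespace in the pattern and allows # comments. The following is a minimal sketch of the capture group only; the name GROUP and the test string are illustrative, and adding re.DOTALL is an assumption so that the group can span line breaks.

```python
import re

GROUP = re.compile(r'''
    (           # start of the capture group
      \*        # a literal asterisk
      \d+       # one or more digits
      \s+       # one or more whitespace characters
      \d+       # one or more digits
      .*?       # any characters, as few as possible
      \*        # the closing asterisk
    )
''', re.VERBOSE | re.DOTALL)

# Made-up test string with two blocks.
found = GROUP.findall("a *12 34 first* b *56 78 second* c")
print(found)
```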
0
 
Flabio Gates commented:
You need to post more of your text. You also need to describe in words what it is you are trying to match with your Regular Expression.
0

 
pepr commented:
The problem is probably that .* is "greedy": it consumes as many characters as it can, so it probably also swallows the sequences that you want to find. Try changing, say, .*\s to .*?\s (and likewise in the other places). The added question mark makes the regular expression engine stop as soon as possible (non-greedy). In this particular case, it will stop at the first \s from which the rest of the pattern can match. Better still, replace the dot with something more specific that matches fewer cases.
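A small sketch of the difference, using a made-up string s with two asterisk-delimited blocks. The third pattern is one possible reading of "something more specific": a character class that cannot cross an asterisk.

```python
import re

# Made-up string with two blocks between asterisks.
s = "x *1 2 a* y *3 4 b* z"

greedy = re.findall(r'\*.*\*', s)       # .* runs on to the last asterisk
lazy = re.findall(r'\*.*?\*', s)        # .*? stops at the next asterisk
specific = re.findall(r'\*[^*]*\*', s)  # [^*] can never cross an asterisk

print(greedy)
print(lazy)
print(specific)
```

The greedy version returns one long match from the first asterisk to the last, while the other two return each block separately.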
0
 
ltpitt (Author) commented:
@Flabio Gates
Here's a more complete example of the text. I basically need to write Python code to extract all the occurrences of the regex:

*1116 1200 ABC_Content_124853_124855 1117 1500
ABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*

text stuff stuff text

*2316 1212 ABC_Content_124853_124855 1117 1500
ABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*

more text more more stuff

*1234 4321 ABC_Content_124853_124855 1117 1500
ABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*

My regex is this:
(?s).*\s(\*\d+\s+\d+.*?\*)+.*

(?s) means treat all the text as if it were a single line (the dot also matches newlines)
.* selects everything
\s selects a whitespace character

Then my capture group starts with a *, continues with one or more digits, one or more whitespace characters and one or more digits again, then as little further text as possible, and ends with a *

The capture group is repeated one or more times with +, and the pattern then ends with .* selecting all the remaining text.

@pepr
I do understand what you say, but I need to extract ALL occurrences and not just the first one...
How can I do that?
0
 
NerdsOfTech (Technology Scientist) commented:
Are you saying that there might be other asterisks in the text? If not couldn't you just wildcard everything between the asterisks?

print(re.search(r'(\*.*?\*)', my_very_long_text.replace('\n', ' ').replace('\r', '')))

Or, do you need that exact pattern? If so, try the next line of code

Then my capture group starts with a *, continues with any number of digits, any number of spaces and any number of digits again and ends with a *

(
an asterisk,
1 or more digit(s),
1 or more space(s),
1 or more digit(s),
zero or more of any character [? = non-greedy],
another asterisk
)

try:
print(re.search(r'(\*\d+\s+\d+.*?\*)', my_very_long_text.replace('\n', ' ').replace('\r', '')))

With the newlines replaced by spaces, search sees the multiline text as a single line.
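For comparison, here is a small sketch of how re.search and re.findall behave on input that holds two blocks. re.search returns a match object for the first match only (or None), so getting every occurrence needs re.findall or re.finditer. The string text below is made up for illustration.

```python
import re

# Made-up input with two blocks, one per line.
text = "a *11 22 one* b\n*33 44 two* c"
pattern = r'(\*\d+\s+\d+.*?\*)'

first = re.search(pattern, text)    # match object for the first hit, or None
every = re.findall(pattern, text)   # list of every non-overlapping hit

print(first.group(1))
print(every)
```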
0
 
pepr commented:
Flabio has already shown how to do that.

I do understand what you say, but I need to extract ALL occurrences and not just the first one...
How can I do that?

The reason your regular expression (?s).*\s(\*\d+\s+\d+.*?\*)+.* returned only one occurrence is that it is wrapped in .* on both sides. The leading greedy .* consumes as much as it can and gives back only enough for the group to match the last occurrence; the trailing .* then consumes the rest of the content. The single match therefore covers the whole string, and the part eaten by the leading .* contains all the other occurrences. As Flabio shows, you should leave both .* out.
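The effect can be reproduced on a short made-up string. In this sketch, whole is the pattern from the question and bare is the same capture group without the surrounding .* parts; both names and the sample text are illustrative.

```python
import re

# Made-up single-line text with three blocks.
text = "intro *11 22 one* mid *33 44 two* end *55 66 three* tail"

whole = r'(?s).*\s(\*\d+\s+\d+.*?\*)+.*'   # pattern from the question
bare = r'(?s)\*\d+\s+\d+.*?\*'             # only the part to extract

whole_result = re.findall(whole, text)
bare_result = re.findall(bare, text)

print(whole_result)
print(bare_result)
```

With whole, the entire string is one match and the group holds only the last block; with bare, each block is its own match.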
1
 
NerdsOfTech (Technology Scientist) commented:
In other words, by replacing .* (greedy) with .*? (non-greedy) the search stops at the first minimum occurrence of the pattern (between the FIRST and SECOND asterisk), instead of the maximum occurrence (between the FIRST and LAST asterisk).
0
Question has a verified solution.
