How to extract all RegEx matches in Python

Hi all.

I have several occurrences of similar text in a very long email:

*1116 1200 ABC_Content_124853_124855 1117 1500
ABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*

The text can change much but I am able to get all the relevant matches using this regex:
(?s).*\s(\*\d+\s+\d+.*?\*)+.*

The problem is that there can be many different occurrences of such a group to extract, and I need to implement this in Python.

Python says that (?s) is not a valid RegEx...

Therefore I've tried:
print re.findall(r'.*\s(\*\d+\s+\d+.*?\*)+.*', my_very_long_text.replace('\n', ' ').replace('\r', ''))

But I only print the LAST match and not ALL the matches.

Can you kindly help?
ltpitt Asked:

NerdsOfTech (Technology Scientist) Commented:
I'm very new to Python so forgive my sad attempt...

I think re.search() should be used instead of re.findall()

print re.search(r'(\s?).*\s(\*\d+\s+\d+.*?\*)+.*', my_very_long_text.replace('\n', ' ').replace('\r', ''))

https://docs.python.org/3.4/library/re.html#search-vs-match

Looking forward to seeing the correct answer...

To break down your regular expression:

(\s?)      zero or one whitespace character,
.*      zero or more of any char,
\s      a whitespace character,
(\*\d+\s+\d+.*?\*)
      {
      (an asterisk,
       1 or more digit(s),
       1 or more whitespace character(s),
       1 or more digit(s),
       zero or more of any char [non-greedy],
       another asterisk)
      }
+       the above {SET} 1 or more times,
.*       then finally zero or more of any char.
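The breakdown of the capture group above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is only a sketch: the sample string is made up for illustration, and re.DOTALL is added on the assumption that a block may span several lines.

```python
import re

# The capture group from the breakdown, one piece per line.
BLOCK = re.compile(r"""
    \*      # an opening asterisk
    \d+     # one or more digits
    \s+     # one or more whitespace characters
    \d+     # one or more digits
    .*?     # as few characters as possible (non-greedy)
    \*      # the closing asterisk
""", re.VERBOSE | re.DOTALL)

# Hypothetical sample, shortened from the text in the question.
sample = "x *1116 1200 ABC\nSound 99%* y"
match = BLOCK.search(sample)
print(repr(match.group(0)) if match else None)
```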
Flabio Gates Commented:
You need to post more of your text. You also need to describe in words what it is you are trying to match with your Regular Expression.
pepr Commented:
The problem probably is that .* is "greedy". It eats as many characters as it can, so it probably also eats the sequences that you want to find. Try changing, say, .*\s to .*?\s (and likewise in the other cases). The added question mark makes the regular expression engine stop as soon as possible (non-greedy). In this particular case, it will stop at the first \s if the rest can match. Better still, replace the dot with something more specific that would not match so many cases.
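A small sketch of the greedy versus non-greedy difference described above. The sample string is invented for illustration and is much shorter than the real email.

```python
import re

# Hypothetical sample: two starred blocks separated by ordinary text.
sample = "intro *11 22 first* middle *33 44 second* outro"

# Greedy: .* runs on to the last asterisk, so both blocks merge into one match.
greedy = re.findall(r"\*\d+\s+\d+.*\*", sample)

# Non-greedy: .*? stops at the nearest closing asterisk, giving one match per block.
lazy = re.findall(r"\*\d+\s+\d+.*?\*", sample)

print(greedy)  # ['*11 22 first* middle *33 44 second*']
print(lazy)    # ['*11 22 first*', '*33 44 second*']
```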

ltpitt (Author) Commented:
@Flabio Gates
Here's a more complete example of the text. I basically need to write Python code to extract all the occurrences of the RegEx:

*1116 1200 ABC_Content_124853_124855 1117 1500
ABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*

text stuff stuff text

*2316 1212 ABC_Content_124853_124855 1117 1500
ABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*

more text more more stuff

*1234 4321 ABC_Content_124853_124855 1117 1500
ABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*

My regex is this:
(?s).*\s(\*\d+\s+\d+.*?\*)+.*

(?s) means treat all the text as if it were a single line (the dot also matches newlines)
.* select everything
\s select spaces

Then my capture group starts with a *, continues with any number of digits, any number of spaces and any number of digits again and ends with a *

The capture group is captured several times with + and then ends with .* selecting all remaining text.

@pepr
I do understand what you say, but I need to extract ALL occurrences and not just the 1st one...
How can I do that?
NerdsOfTech (Technology Scientist) Commented:
Are you saying that there might be other asterisks in the text? If not couldn't you just wildcard everything between the asterisks?

print re.search(r'(\*.*?\*)', my_very_long_text.replace('\n', ' ').replace('\r', ''))



Or, do you need that exact pattern? If so, try the next line of code

Then my capture group starts with a *, continues with any number of digits, any number of spaces and any number of digits again and ends with a *

(
an asterisk,
1 or more digit(s),
1 or more space(s),
1 or more digit(s),
zero or more of any char [? = non-greedy],
another asterisk
)

try:
print re.search(r'(\*\d+\s+\d+.*?\*)', my_very_long_text.replace('\n', ' ').replace('\r', ''))



With the newlines replaced by spaces, search should treat the multi-line text as one line
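One caveat worth sketching: re.search returns a single Match object (or None) for the first hit only, so printing its result does not show the matched text, let alone every match. The sample string and variable names below are invented for illustration.

```python
import re

# Hypothetical sample: two starred blocks on one line.
sample = "intro *11 22 first* middle *33 44 second* outro"
pattern = re.compile(r"\*\d+\s+\d+.*?\*")

# search: only the first hit, returned as a Match object (None if nothing matches).
first = pattern.search(sample)
first_text = first.group(0) if first else None

# finditer: one Match object per non-overlapping hit, left to right.
all_texts = [m.group(0) for m in pattern.finditer(sample)]

print(first_text)  # *11 22 first*
print(all_texts)   # ['*11 22 first*', '*33 44 second*']
```

To collect every occurrence, re.findall or re.finditer is the better fit; re.search suits the case where only the first one matters.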
Flabio Gates Commented:
Here's a simple pattern that returns 3 matches from your sample text:
text = '''
Here's a more complete example of text, I'd basically need to write python code to extract all the occurrencies of the RegEx:

*1116 1200 ABC_Content_124853_124855 1117 1500
ABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*

text stuff stuff text

*2316 1212 ABC_Content_124853_124855 1117 1500
ABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*

more text more more stuff

*1234 4321 ABC_Content_124853_124855 1117 1500
ABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*

My regex is this:
'''
import re
for s in re.findall(r'(?s)(\*\d+.+?\*)', text):
    print(repr(s))



and the output:
'*1116 1200 ABC_Content_124853_124855 1117 1500\nABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*'
'*2316 1212 ABC_Content_124853_124855 1117 1500\nABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*'
'*1234 4321 ABC_Content_124853_124855 1117 1500\nABC_Content_123456_ABC_124865_Sound 1117 1000 - Documentation - 75% to 84% and 85% to 99%*'
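A possible variation on the pattern above, offered as a sketch rather than a replacement: it uses the re.DOTALL flag in place of the inline (?s), keeps the stricter "two numbers after the asterisk" requirement from the original question, and reports where each block starts. The helper name extract_blocks and the shortened sample are hypothetical.

```python
import re

# re.DOTALL has the same effect as the inline (?s): the dot also matches newlines.
BLOCK = re.compile(r"\*\d+\s+\d+.*?\*", re.DOTALL)

def extract_blocks(text):
    """Return (start_offset, block_text) for every starred block in text."""
    return [(m.start(), m.group(0)) for m in BLOCK.finditer(text)]

# Hypothetical sample, shortened from the text in the question.
sample = "a\n*1116 1200 ABC\nDEF 99%*\nb\n*2316 1212 GHI\nJKL 99%*\n"
for offset, block in extract_blocks(sample):
    print(offset, repr(block))
```

Because the newlines are handled by the flag, the text does not need the replace('\n', ' ') step before matching.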


pepr Commented:
Flabio has already shown how to do that.

I do understand what you say, but I need to extract ALL occurrences and not just the 1st one...
How can I do that?

The reason why your regular expression (?s).*\s(\*\d+\s+\d+.*?\*)+.* returned only one occurrence is the greedy .* at both ends. The leading .* consumes as much as it can, so the capture group lands on the last occurrence, and the trailing .* then consumes the rest of the content. The whole string is therefore covered by a single match, and all the other occurrences sit inside the part eaten by .*. As Flabio shows, you should not put .* there.
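A short sketch that reproduces the behaviour on a made-up two-block sample (not the real email): the original pattern yields one match covering the whole string, with only the last block left in the capture group, while the same group without the surrounding .* yields every block.

```python
import re

# Hypothetical sample with two multi-line starred blocks.
text = "head\n*11 22 first\nline*\nmid\n*33 44 second\nline*\ntail"

# Pattern from the question: .* on both sides swallows everything else.
original = r"(?s).*\s(\*\d+\s+\d+.*?\*)+.*"
whole = re.findall(original, text)

# Same capture group without the leading and trailing .*
trimmed = r"(?s)\*\d+\s+\d+.*?\*"
each = re.findall(trimmed, text)

print(whole)  # ['*33 44 second\nline*']
print(each)   # ['*11 22 first\nline*', '*33 44 second\nline*']
```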
NerdsOfTech (Technology Scientist) Commented:
In other words, by replacing .* (greedy) with .*? (non-greedy) the search stops at the first minimum occurrence of the pattern (between the FIRST and SECOND asterisk), instead of the maximum occurrence (between the FIRST and LAST asterisk).