How to find both matches in patterns that "overlap"

Hi, I'm just getting my feet wet in Regex and am using the BRRE Delphi Regex library found here:

https://code.google.com/p/brre/

I have a very very simple regex that just looks for 10 digit phone numbers like so:

\d\d\d\s\d\d\d\s\d\d\d\d

(i.e. phone numbers that are in the format of 123-456-7890, 334.234.3872, etc.)

I apply the regex to this string:

'blahblah257.290.44449-888-2222blahblah'

The regex finds 257.290.4444, but it doesn't find 449-888-2222. Can anyone offer some insight?

Thanks!
    Shawn
shawn857 Asked:
ozo Commented:
'.' and '-' should not match \s, but depending on how your regex engine handles lookahead, you might use
(?=(\d\d\d\W\d\d\d\W\d\d\d\d))
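As a minimal sketch of this lookahead-with-capture idea, here it is in Python's re module rather than BRRE (BRRE's API is not shown in this thread, so this only illustrates the regex behaviour; the variable names are made up for the example):

```python
import re

text = "blahblah257.290.44449-888-2222blahblah"

# The lookahead asserts at a position without consuming characters, so the
# scan moves forward one position at a time and overlapping candidates
# each get tested.
pattern = re.compile(r"(?=(\d\d\d\W\d\d\d\W\d\d\d\d))")

# With exactly one capture group, findall returns that group's contents.
numbers = pattern.findall(text)
print(numbers)  # ['257.290.4444', '449-888-2222']
```

The overall match is empty each time; the phone numbers live in capture group 1.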
Dan Craciun (IT Consultant) Commented:
Or
\d{3}[.-]\d{3}[.-]\d{4}

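And if your question was why it does not find both matches, the answer is simple: the regex engine is eager to return a match, so it returns the first match it finds and then resumes searching after the end of that match. The "44" that starts 449-888-2222 has already been consumed as part of 257.290.4444, so the second number is never tried.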

HTH,
Dan
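A small Python sketch of that behaviour (Python's re module is used only for illustration, not BRRE): a consuming pattern reports the first number and then continues from where that match ended.

```python
import re

text = "blahblah257.290.44449-888-2222blahblah"

# An ordinary, consuming pattern: every matched character is used up.
pattern = re.compile(r"\d{3}[.-]\d{3}[.-]\d{4}")

# Record where each match starts and ends, plus the matched text.
matches = [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]
print(matches)  # [(8, 20, '257.290.4444')]
```

The search resumes at index 20, which is past the "44" at indexes 18 and 19 where the second number begins.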
Dan Craciun (IT Consultant) Commented:
@ozo: won't a lookahead by itself always return an empty string? Cause it does not consume any characters, if it's not followed by something, nothing will be captured/returned.
ozo Commented:
That's why I added capture parentheses inside it.
Dan Craciun (IT Consultant) Commented:
Even with the parentheses, it's the same result.
(?=(\d\d\d\W\d\d\d\W\d\d\d\d)).{12} will give you the correct result.
Just (?=(\d\d\d\W\d\d\d\W\d\d\d\d)) will give you an empty string, at least according to RegexBuddy.
ozo Commented:
That's why I qualified it with "depending on how your regex engine handles lookahead". Some implementations may require additional consumption, but a single . should be sufficient for that; .{12} would block overlapping matches.
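The difference between consuming one character and consuming twelve can be sketched in Python (an illustration with Python's re module, not BRRE; the names are invented for the example):

```python
import re

text = "blahblah257.290.44449-888-2222blahblah"
core = r"(?=(\d\d\d\W\d\d\d\W\d\d\d\d))"

# Consuming a single character leaves the rest of the number available,
# so a second number starting inside the first can still be found.
one_char = [m.group(1) for m in re.finditer(core + r".", text)]

# Consuming all twelve characters moves the scan past the overlap.
twelve = [m.group(1) for m in re.finditer(core + r".{12}", text)]

print(one_char)  # ['257.290.4444', '449-888-2222']
print(twelve)    # ['257.290.4444']
```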
shawn857 (Author) Commented:
Thanks guys... but it seems like none of those approaches work according to a quick test at Regexpal.com

Thanks
   Shawn

P.S: Ozo, you're right about \s ... my regex should have looked like this:

\d\d\d[-.]\d\d\d[-.]\d\d\d\d
ozo Commented:
(?=(\d\d\d\W\d\d\d\W\d\d\d\d)).
does find both in http://regexpal.com/
although I don't see an option to return the capture groups
shawn857 (Author) Commented:
Strange... doesn't work at all for me in Regexpal.com. Please see attached screenshot. Am I doing something wrong?

Also, what are "capture groups"?

Thanks
   Shawn
ozo Commented:
In regexpal and other common regex engines, parenthesized groups that don't start with ? capture:
(expr)       Capture expr for use with \1, etc.
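To make that quick-reference line concrete, here is a small sketch in Python (the re module and the sample string are assumptions chosen for illustration): the parentheses remember what they matched, and \1 refers back to it.

```python
import re

# (\d{3}) captures three digits; \1 must then match that same text again.
repeated = re.compile(r"(\d{3})-\1")

m = repeated.search("ticket 123-123 and 123-456")
whole = m.group(0)     # everything the pattern matched
captured = m.group(1)  # only what the parentheses captured
print(whole, captured)  # 123-123 123

# "123-456" alone does not match, because 456 differs from the captured 123.
no_match = repeated.search("123-456")
print(no_match)  # None
```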
shawn857 (Author) Commented:
Sorry Ozo, I don't quite follow you there. Are you saying I need to get rid of the opening parenthesis in the regex?

I didn't understand what you meant by this: "Capture expr for use with \1, etc. "

Thanks
   Shawn
ozo Commented:
If your regex engine respects capture groups, then you'll want to keep the parentheses.

"Capture expr for use with \1, etc. " was from http://regexpal.com/#quickReference
But since I don't see a way to display what was captured, I don't know how to demonstrate them on that site.
shawn857 (Author) Commented:
Ozo, I think the BRRE regex engine I'm using does have "capture groups"... I see in its source code, it has declarations and variables for "capture".

The rest of your post, sorry, but I still don't understand what you mean. I still don't know what "capture groups" are...

Thanks
   Shawn
ozo Commented:
Does the BRRE documentation explain its declarations and variables for "capture"?
shawn857 (Author) Commented:
No, there's very little documentation or comments in that BRRE.pas unit.

Ozo, I still don't understand how it worked fine for you in Regexpal.com, but not for me...

Thanks
   Shawn
ozo Commented:
As Dan Craciun suggested, some regex packages, apparently including regexpal.com, decide that "Cause it does not consume any characters, if it's not followed by something, nothing will be captured/returned."
So in my earlier comment (http:#a40472189), I just followed it with a .
shawn857 (Author) Commented:
ahhhh okay, I didn't know about this period '.' at the end of your regex. I didn't know that was meant to be included in the regex. Yes now it works in Regexpal.com!

Ozo, do you know if this approach (i.e. finding overlapping matches) "slows down" the execution of the regex very much... or is it negligible?

Thanks!
    Shawn
Dan Craciun (IT Consultant) Commented:
@ozo: maybe I did not understand correctly, but wasn't the original question: why the pattern returns only "257.290.4444" and not "449-888-2222" too?

(?=(\d\d\d\W\d\d\d\W\d\d\d\d)). will return "2" and "4". So you will know you have 2 matches, but you won't know what those matches are.

PS: I've tried most of the regex engines in RegexBuddy and cannot find one where a lookaround will return anything. It's a zero-length assertion and will only return "match" (and move the pointer to the beginning of the match) or "no match" (in which case it does not do anything).
ozo Commented:
I thought the original question was how to find both.

If the regex implementation returns capture groups, you can know what both matches are.
If the regex implementation returns the position of the matches, you can reference the original string to determine what both matches are.

On the other hand, perhaps it would be more useful to have a regex that finds neither, so that it only finds patterns that match \d{3}[.-]\d{3}[.-]\d{4} and do not match \d{4}[.-]\d{3}[.-]\d{4} or \d{3}[.-]\d{3}[.-]\d{5}
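One possible way to express "finds neither" is to forbid a digit directly before and after the number. This is a sketch in Python's re module and assumes an engine that supports lookbehind; whether BRRE does is not established in this thread.

```python
import re

# (?<!\d) rejects a digit just before the number (no \d{4} at the front);
# (?!\d) rejects a digit just after it (no \d{5} at the end).
strict = re.compile(r"(?<!\d)\d{3}[.-]\d{3}[.-]\d{4}(?!\d)")

run_together = strict.findall("blahblah257.290.44449-888-2222blahblah")
separated = strict.findall("call 257.290.4444 or 449-888-2222")

print(run_together)  # []
print(separated)     # ['257.290.4444', '449-888-2222']
```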
shawn857 (Author) Commented:
ahhh Dan, I guess you are right. You said:

"(?=(\d\d\d\W\d\d\d\W\d\d\d\d)). will return "2" and "4". So you will know you have 2 matches, but you won't know what those matches are."

Yes, when I tried it in Regexpal.com last night, it highlighted only the "2" and the "4". I thought this was Regexpal's way of saying that it found both "overlapping patterns".  But I guess I spoke too soon when I said it worked. So back to the drawing board. I thought this would be a trivial thing for Regex to do... given how it is naturally "greedy" by design, I thought it would pick up any and all possible matches.

Thanks
   Shawn
ozo Commented:
If the regex implementation returns capture groups, you can know what both matches are.
If the regex implementation returns the position of the matches, you can reference the original string to determine what both matches are.
Most regex packages should have means of returning both.
Regexpal.com says that it understands capture groups, but I don't see anything on the site that says how the contents of the capture groups are returned.
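The position-based route can be sketched as follows in Python (for illustration only, not BRRE; NUMBER_LENGTH is a name invented for this example and relies on the pattern always matching exactly 12 characters):

```python
import re

text = "blahblah257.290.44449-888-2222blahblah"

# A bare lookahead with no capture group: each match is empty, but its
# start position tells us where a number begins.
pattern = re.compile(r"(?=\d{3}[.-]\d{3}[.-]\d{4})")

NUMBER_LENGTH = 12  # 3 digits + separator + 3 digits + separator + 4 digits

starts = [m.start() for m in pattern.finditer(text)]
numbers = [text[s:s + NUMBER_LENGTH] for s in starts]

print(starts)   # [8, 18]
print(numbers)  # ['257.290.4444', '449-888-2222']
```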
Dan Craciun (IT Consultant) Commented:
I'm still confused.
Ozo, can you please tell me what regex engine you use where lookarounds return anything else than "match" or "no match"?
shawn857 (Author) Commented:
Thanks Ozo... I'm quite sure that the BRRE module I'm using has Capture Groups - the word "capture" occurs throughout the source code. I just don't know how to use capture groups, nor what they are. Is this all something I could include in my main reg expression... or is it more complicated?

Thanks
   Shawn
Dan Craciun (IT Consultant) Commented:
Look here for capture groups: http://www.regular-expressions.info/brackets.html

I would advise you read the whole tutorial from that site. It will be a few hours well spent if you're interested in regular expressions.
käµfm³d 👽 Commented:
> what regex engine you use where lookarounds return anything else than "match" or "no match"
It is true that a lookaround will return "match" or "no match", but as ozo stated earlier, some regex engines--.NET's for example--allow you to use capture groups within the lookahead. So you can in essence get around the zero-width of a lookahead by using a capture group--the caveat being that you have to inspect the capture group, not the match result itself.
Dan Craciun (IT Consultant) Commented:
I tested in PowerShell and you're right:
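$text = 'blahblah257.290.44449-888-2222blahblah'
# $found rather than $matches, which is an automatic variable in PowerShell
$found = ([regex]'(?=(\d{3}[.-]\d{3}[.-]\d{4}))').Matches($text)
foreach ($g in $found.Groups) {
  $g.Value   # Groups[0] is the zero-width match, Groups[1] the captured number
}

The only weird thing is that PS prints 4 values: for each find there is 1 empty string (group 0, the zero-width lookahead match itself) followed by the captured number (group 1).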
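The same pairing of an empty match with a filled capture group shows up in other engines too. A minimal Python sketch (for comparison only; the names are made up for the example):

```python
import re

text = "blahblah257.290.44449-888-2222blahblah"
pattern = re.compile(r"(?=(\d{3}[.-]\d{3}[.-]\d{4}))")

# group(0) is the whole match (zero-width here); group(1) is the capture.
rows = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]
print(rows)  # [('', '257.290.4444'), ('', '449-888-2222')]
```

Reading only group 1 of each match gives the two phone numbers without the empty strings.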
shawn857 (Author) Commented:
So if the BRRE engine I'm using does have capture groups... it's still a matter of another step involved to take the matches and do further processing on them? This is not something that can be done all in one pass within one regex expression?

Thanks
   Shawn
käµfm³d 👽 Commented:
BRRE does have capture groups as far as I can see--you have to use either the BRREMatch or BRREMatchAll functions and inspect the Captures parameter once the call returns. I would say that you can certainly achieve this in one pass; the only thing that's different is that you are inspecting captures, not matches.

I tried to test this library out in C#, but I'm not terribly great with interop code from C# to C++. In my opinion, you should go with another library. As you've already witnessed, the documentation is anything but good. (Let's face it: the documentation is basically, "Here, read this code.") Unfortunately, I don't have any alternatives to suggest, unless maybe you can import the Boost libraries into Delphi. Boost has a pretty good regex library. I don't know the intricacies of working with C++ code in Delphi, though.
shawn857 (Author) Commented:
Thanks Kaufmed. Does anyone know a good fast Delphi regex library that could do these "capture groups", and also provides decent docs and examples?

Thanks
   Shawn