Solved

How to find both matches in patterns that "overlap"

Posted on 2014-11-29
Last Modified: 2015-02-10
Hi, I'm just getting my feet wet in Regex and am using the BRRE Delphi Regex library found here:

https://code.google.com/p/brre/

I have a very very simple regex that just looks for 10 digit phone numbers like so:

\d\d\d\s\d\d\d\s\d\d\d\d

(i.e., phone numbers in a format such as 123-456-7890 or 334.234.3872)

I apply the regex to this string:

'blahblah257.290.44449-888-2222blahblah'

The regex finds 257.290.4444, but it doesn't find 449-888-2222. Can anyone offer some insight?

Thanks!
    Shawn
Question by:shawn857
29 Comments
 
LVL 84

Accepted Solution

by:
ozo earned 250 total points
ID: 40471955
'.' and '-' should not match \s, but depending on how your search engine handles lookahead, you might use
(?=(\d\d\d\W\d\d\d\W\d\d\d\d))
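A minimal sketch of this idea, written in Python's `re` module purely for illustration (the asker's engine is BRRE, whose behavior may differ). The lookahead succeeds at every position where a number starts without consuming it, and the capture group inside records the text. The name `find_overlapping_numbers` is hypothetical.

```python
import re

# Zero-width lookahead; the capture group inside it records the number.
OVERLAP_PATTERN = re.compile(r'(?=(\d\d\d\W\d\d\d\W\d\d\d\d))')

def find_overlapping_numbers(text):
    """Return every phone-like substring, including overlapping ones."""
    # With exactly one capture group, findall returns that group's text per match.
    return OVERLAP_PATTERN.findall(text)

sample = 'blahblah257.290.44449-888-2222blahblah'
print(find_overlapping_numbers(sample))
```

Because the overall match is empty, the useful result is the capture group, not the match itself.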
0
 
LVL 34

Assisted Solution

by:Dan Craciun
Dan Craciun earned 250 total points
ID: 40471974
Or
\d{3}[.-]\d{3}[.-]\d{4}

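And if your question was why it does not find both matches, the answer is simple: the regex engine returns the first match it finds and then resumes searching after the end of that match. The characters it has already consumed (here the trailing "44") cannot be part of a second match.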
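To make that concrete, here is a small sketch in Python (chosen only for illustration; the helper name `plain_matches` is hypothetical). An ordinary scan reports one match ending at offset 20, and the second number would have to start at offset 18, inside the text already consumed.

```python
import re

PLAIN_PATTERN = re.compile(r'\d{3}[.-]\d{3}[.-]\d{4}')

def plain_matches(text):
    """Return (start, end, matched_text) for each ordinary, non-overlapping match."""
    return [(m.start(), m.end(), m.group(0)) for m in PLAIN_PATTERN.finditer(text)]

sample = 'blahblah257.290.44449-888-2222blahblah'
# The scan resumes at the end of the first match, so '449-888-2222' is never tried.
print(plain_matches(sample))
```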

HTH,
Dan
0
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 40471991
@ozo: won't a lookahead by itself always return an empty string? Cause it does not consume any characters, if it's not followed by something, nothing will be captured/returned.
0
 
LVL 84

Expert Comment

by:ozo
ID: 40471993
That's why I added capture parentheses inside it.
0
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 40471997
Even with the parentheses, it's the same result.
(?=(\d\d\d\W\d\d\d\W\d\d\d\d)).{12} will give you the correct result.
Just (?=(\d\d\d\W\d\d\d\W\d\d\d\d)) will give you an empty string, at least according to RegexBuddy.
0
 
LVL 84

Expert Comment

by:ozo
ID: 40472015
That's why I qualified it with "depending on how your search engine handles lookahead". Some implementations may require additional consumption, but a single . should be sufficient for that; .{12} would block overlapping patterns.
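The difference between consuming one character and consuming twelve can be checked with a short sketch. Python is used only for illustration, and the helper `captured` is a hypothetical name.

```python
import re

sample = 'blahblah257.290.44449-888-2222blahblah'
LOOKAHEAD = r'(?=(\d\d\d\W\d\d\d\W\d\d\d\d))'

def captured(pattern, text):
    """Collect the text of capture group 1 from every match of pattern."""
    return [m.group(1) for m in re.finditer(pattern, text)]

# Consuming one character leaves the overlapping start position reachable.
one_char = captured(LOOKAHEAD + r'.', sample)
# Consuming twelve characters skips past the start of the second number.
twelve_chars = captured(LOOKAHEAD + r'.{12}', sample)

print(one_char)
print(twelve_chars)
```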
0
 

Author Comment

by:shawn857
ID: 40472108
Thanks guys... but it seems like none of those approaches work according to a quick test at Regexpal.com

Thanks
   Shawn

P.S: Ozo, you're right about \s ... my regex should have looked like this:

\d\d\d[-.]\d\d\d[-.]\d\d\d\d
0
 
LVL 84

Expert Comment

by:ozo
ID: 40472163
(?=(\d\d\d\W\d\d\d\W\d\d\d\d)).
does find both in http://regexpal.com/
although I don't see an option to return the capture groups
0
 

Author Comment

by:shawn857
ID: 40472178
Strange... doesn't work at all for me in Regexpal.com. Please see attached screenshot. Am I doing something wrong?

Also, what are "capture groups"?

Thanks
   Shawn
regexpal.JPG
0
 
LVL 84

Expert Comment

by:ozo
ID: 40472189
regexpal.png
In regexpal and other common regex engines, parenthesized groups that don't start with ? capture:
(expr)       Capture expr for use with \1, etc.
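As a rough illustration of what a capture group is, the sketch below uses Python's `re` module (an assumption made for the example; other engines expose groups through their own APIs). Each parenthesized part of the pattern is remembered separately, and `\1` inside a pattern refers back to whatever group 1 matched.

```python
import re

# Three capture groups: area code, exchange, line number.
m = re.search(r'(\d{3})[.-](\d{3})[.-](\d{4})', 'call 334.234.3872 today')
whole = m.group(0)   # the entire match
parts = m.groups()   # the text captured by groups 1, 2 and 3

# \1 must repeat what group 1 captured: here, the same separator twice.
same_sep = re.compile(r'\d{3}([.-])\d{3}\1\d{4}')
consistent = bool(same_sep.search('123-456-7890'))
mixed = bool(same_sep.search('123-456.7890'))

print(whole, parts, consistent, mixed)
```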
0
 

Author Comment

by:shawn857
ID: 40472192
Sorry Ozo, I don't quite follow you there. Are you saying I need to get rid of the opening parenthesis in the regex?

I didn't understand what you meant by this: "Capture expr for use with \1, etc. "

Thanks
   Shawn
0
 
LVL 84

Expert Comment

by:ozo
ID: 40472209
If your regex engine respects capture groups, then you'll want to keep the parentheses.

"Capture expr for use with \1, etc. " was from http://regexpal.com/#quickReference
But since I don't see a way to display what was captured, I don't know how to demonstrate them on that site.
0
 

Author Comment

by:shawn857
ID: 40472216
Ozo, I think the BRRE regex engine I'm using does have "capture groups"... I see in its source code, it has declarations and variables for "capture".

The rest of your post, sorry, but I still don't understand what you mean. I still don't know what "capture groups" are...

Thanks
   Shawn
0
 
LVL 84

Expert Comment

by:ozo
ID: 40472229
Does the BRRE documentation explain its declarations and variables for "capture"?
0
 

Author Comment

by:shawn857
ID: 40472232
No, there's very little documentation or comments in that BRRE.pas unit.

Ozo, I still don't understand how it worked fine for you in Regexpal.com, but not for me...

Thanks
   Shawn
0
 
LVL 84

Expert Comment

by:ozo
ID: 40472234
As Dan Craciun suggested, some regex packages, apparently including regexpal.com, decide that
"Cause it does not consume any characters, if it's not followed by something, nothing will be captured/returned."
So in http:#a40472189, I just followed it with a .
0
 

Author Comment

by:shawn857
ID: 40472238
ahhhh okay, I didn't know about this period '.' at the end of your regex. I didn't know that was meant to be included in the regex. Yes now it works in Regexpal.com!

Ozo, do you know if this approach (ie. finding overlapping matches) "slows down" the execution of the regex very much... or is negligible?

Thanks!
    Shawn
0
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 40472261
@ozo: maybe I did not understand correctly, but wasn't the original question: why the pattern returns only "257.290.4444" and not "449-888-2222" too?

(?=(\d\d\d\W\d\d\d\W\d\d\d\d)). will return "2" and "4". So you will know you have 2 matches, but you won't know what those matches are.

PS: I've tried most of the regex engines in RegexBuddy and cannot find one where a lookaround will return anything. It's a zero-length assertion and will only return "match" (and move the pointer to the beginning of the match) or "no match" (in which case it does not do anything).
0
 
LVL 84

Expert Comment

by:ozo
ID: 40472274
I thought the original question was how to find both.

If the regex implementation returns capture groups, you can know what both matches are.
If the regex implementation returns the position of the matches, you can reference the original string to determine what both matches are.
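The position-based approach can be sketched as follows, in Python for illustration only. The function name `matches_from_positions` and the fixed length of 12 are assumptions that fit this particular pattern, where every match is exactly twelve characters long.

```python
import re

# Lookahead only: no capture group, we rely on the match position instead.
START_PATTERN = re.compile(r'(?=\d{3}[.-]\d{3}[.-]\d{4})')

def matches_from_positions(text, length=12):
    """Pair each match position with the substring sliced from the original text."""
    starts = [m.start() for m in START_PATTERN.finditer(text)]
    return [(s, text[s:s + length]) for s in starts]

sample = 'blahblah257.290.44449-888-2222blahblah'
print(matches_from_positions(sample))
```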

On the other hand, perhaps it would be more useful to have a regex that finds neither, so that it only finds patterns that match \d{3}[.-]\d{3}[.-]\d{4} and do not match \d{4}[.-]\d{3}[.-]\d{4} or \d{3}[.-]\d{3}[.-]\d{5}
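One possible way to express that stricter rule, assuming the engine supports lookbehind and negative lookahead (Python's `re` does; BRRE is not confirmed here), is to forbid a digit immediately before and after the number. The name `STRICT_PATTERN` is hypothetical.

```python
import re

# A digit may not come directly before or directly after the number.
STRICT_PATTERN = re.compile(r'(?<!\d)\d{3}[.-]\d{3}[.-]\d{4}(?!\d)')

run_together = 'blahblah257.290.44449-888-2222blahblah'
separated = 'call 257.290.4444 or 449-888-2222'

# Both candidates in the first string touch an extra digit, so neither is accepted.
print(STRICT_PATTERN.findall(run_together))
print(STRICT_PATTERN.findall(separated))
```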
0
 

Author Comment

by:shawn857
ID: 40472790
ahhh Dan, I guess you are right. You said:

"(?=(\d\d\d\W\d\d\d\W\d\d\d\d)). will return "2" and "4". So you will know you have 2 matches, but you won't know what those matches are."

Yes, when I tried it in Regexpal.com last night, it highlighted only the "2" and the "4". I thought this was Regexpal's way of saying that it found both "overlapping patterns".  But I guess I spoke too soon when I said it worked. So back to the drawing board. I thought this would be a trivial thing for Regex to do... given how it is naturally "greedy" by design, I thought it would pick up any and all possible matches.

Thanks
   Shawn
0
 
LVL 84

Expert Comment

by:ozo
ID: 40472802
If the regex implementation returns capture groups, you can know what both matches are.
If the regex implementation returns the position of the matches, you can reference the original string to determine what both matches are.
Most regex packages should have means of returning both.
Regexpal.com says that it understands capture groups, but I don't see anything on the site that says how the contents of the capture groups are returned.
0
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 40472812
I'm still confused.
Ozo, can you please tell me what regex engine you use where lookarounds return anything else than "match" or "no match"?
0
 

Author Comment

by:shawn857
ID: 40472845
Thanks Ozo... I'm quite sure that the BRRE module I'm using has Capture Groups - the word "capture" occurs throughout the source code. I just don't know how to use capture groups, nor what they are. Is this all something I could include in my main reg expression... or is it more complicated?

Thanks
   Shawn
0
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 40472851
Look here for capture groups: http://www.regular-expressions.info/brackets.html

I would advise you read the whole tutorial from that site. It will be a few hours well spent if you're interested in regular expressions.
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 40475249
"what regex engine you use where lookarounds return anything else than "match" or "no match""
It is true that a lookaround will return "match" or "no match", but as ozo stated earlier, some regex engines--.NET's for example--allow you to use capture groups within the lookahead. So you can in essence get around the zero-width of a lookahead by using a capture group--the caveat being that you have to inspect the capture group, not the match result itself.
0
 
LVL 34

Expert Comment

by:Dan Craciun
ID: 40475583
I tested in Powershell and you're right:
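# Lookahead with a capture group, so overlapping numbers are both found
$text = 'blahblah257.290.44449-888-2222blahblah'
# Use a name other than $matches, which PowerShell populates for the -match operator
$found = ([regex]'(?=(\d{3}[.-]\d{3}[.-]\d{4}))').Matches($text)
# Groups holds group 0 (the whole match) and group 1 (the capture) for each match
foreach ($i in $found.Groups) {
  $i.Value
}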


The only weird thing is that PS finds 4 matches: 1 empty string (probably the lookahead) for each find.
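The four values are most likely two per match: in .NET, each match's Groups collection starts with group 0, the overall match, which is empty here because the lookahead consumes nothing, followed by group 1. The same shape can be reproduced in Python (used only as an illustration; the variable names are hypothetical):

```python
import re

sample = 'blahblah257.290.44449-888-2222blahblah'
pattern = re.compile(r'(?=(\d{3}[.-]\d{3}[.-]\d{4}))')

# Listing group 0 and group 1 for every match gives an empty string before each number.
all_groups = [g for m in pattern.finditer(sample) for g in (m.group(0), m.group(1))]

# Reading only group 1 gives just the two numbers.
only_numbers = [m.group(1) for m in pattern.finditer(sample)]

print(all_groups)
print(only_numbers)
```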
0
 

Author Comment

by:shawn857
ID: 40482292
So if the BRRE engine I'm using does have capture groups... it's still a matter of another step involved to take the matches and do further processing on them? This is not something that can be done all in one pass within one regex expression?

Thanks
   Shawn
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 40482299
BRRE does have capture groups as far as I can see--you have to use either the BRREMatch or BRREMatchAll functions and inspect the Captures parameter once the call returns. I would say that you can certainly achieve this in one pass; the only thing that's different is that you are inspecting captures, not matches.

I tried to test this library out in C#, but I'm not terribly great with interop code from C# to C++. I will say that in my opinion I think you should go with another library. As you've already witnessed, the documentation is anything but good. (Let's face it:  the documentation is basically, "Here, read this code.") Unfortunately, I don't have any other alternatives to suggest--that is, unless maybe you can import the Boost libraries into Delphi. Boost has a pretty good regex library. I don't know the intricacies of working with C++ code in Delphi, though.
0
 

Author Comment

by:shawn857
ID: 40494944
Thanks Kaufmed. Does anyone know a good fast Delphi regex library that could do these "capture groups", and also provides decent docs and examples?

Thanks
   Shawn
0
