Solved

Regular expression to extract 3-digit numbers and ignore longer numbers

Posted on 2010-11-23
Last Modified: 2012-05-10
I thought I got an answer in http://www.experts-exchange.com/Programming/Languages/Regular_Expressions/Q_26561210.html,
but the following code does not work.

I understand why it doesn't work, but I can't figure out how to fix it.  It fails because the first match is on '123,' which "steals" the comma. So, the 456 gets ignored because it is not preceded by a non-numeric or a word boundary.

I am trying to learn regular expressions, so please, only reply with regular expression solutions i.e. don't propose code that uses instr, mid(str,i,1),len(submatch(0)) etc.

Sub func()
    ' Set a reference to Microsoft VBScript Regular Expressions 5.5
    Dim reg As New RegExp
    Dim test As String
    Dim matches As MatchCollection
    Dim m As Match
   
    test = "123,456x789"
   
    With reg
        .Pattern = "(?:\D|\b)(\d{3})(?:\D|\b)"
        .Global = True
        Set matches = .Execute(test)
    End With
    Dim smtch As String
    For Each m In matches
        smtch = smtch & " " & m.SubMatches(0)
    Next
    MsgBox vbCrLf & smtch

End Sub
0
Question by:rberke
29 Comments
 
LVL 35

Accepted Solution

by:
Terry Woods earned 290 total points
ID: 34201662
In myregextester.com, the last 3 digits aren't being picked up by your pattern - are you getting the same behaviour?

I can't explain why the end of the line isn't picked up as a word boundary (\b), but it worked for me when I changed the last group to be a lookahead:
(?:\D|\b)(\d{3})(?=\D|\b)
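
For anyone who wants to try the two patterns outside VBA, here is a minimal sketch in Python's `re` module. This is a swapped-in engine, not the VBScript RegExp object from the question, but it only uses syntax that both support. The variable names are my own.

```python
import re

text = "123,456x789"

# Original pattern: the trailing group consumes the delimiter character.
consuming = re.compile(r"(?:\D|\b)(\d{3})(?:\D|\b)")
# Suggested fix: the trailing check is a lookahead, so nothing is consumed.
lookahead = re.compile(r"(?:\D|\b)(\d{3})(?=\D|\b)")

consuming_hits = consuming.findall(text)
lookahead_hits = lookahead.findall(text)

print(consuming_hits)  # ['123', '456']
print(lookahead_hits)  # ['123', '456', '789']
```

With the consuming version, the "x" is swallowed by the second match, so nothing is left to satisfy the leading `(?:\D|\b)` in front of 789.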
0
 
LVL 74

Assisted Solution

by:käµfm³d 👽
käµfm³d   👽 earned 190 total points
ID: 34201668
With regards to the comment on your last question, the middle number is extracted for me, but I agree that the last number is not. This is because the pattern matches

   non-digit or word-break | 5 digits | non-digit or word-break

Since the "x" matches the latter "non-digit or word-break" for the second match, there is nothing for the third number to satisfy the first "non-digit or word-break"--the "x" is already consumed by the time the engine begins inspecting the last 5-digit number. This is what the pattern finds:

    (word-break)12345(word-break)
    (space)23456x

We can modify the pattern to accommodate the new requirement, as below. Here, we use a positive lookahead

    (?= ... )

to validate "future" patterns, but not consume them or make them part of the capture. In other words, we take a peek at the next few characters to see if they match the pattern. In this case, we want to see if the 5 digits are followed by either a non-digit or a word-break.

Sub func2()
    ' Requires a reference to Microsoft VBScript Regular Expressions
    Dim reg As New RegExp
    Dim test As String
    Dim matches As MatchCollection
    Dim m As Match

    test = "12345 23456x12345"
    'test = "12345this is great!!!!"

    With reg
        ' The trailing check is a lookahead, so the delimiter is not consumed
        .Pattern = "(?:\b|\D)(\d{5}(?=\b|\D))"
        .Global = True
        Set matches = .Execute(test)
    End With

    ' Show each captured 5-digit number
    For Each m In matches
        MsgBox m.SubMatches(0)
    Next
End Sub

0
 
LVL 37

Expert Comment

by:TommySzalapski
ID: 34201669
Would you mind a solution that parses the string using VBA and no regex?
0
 
LVL 41

Assisted Solution

by:HonorGod
HonorGod earned 20 total points
ID: 34201679
Do you want the value of "123,456x789" to be a match, or not?

If so, what part(s) of it do you want to match?

This part of the pattern: (?:\D|\b) is used to match non-digits, or a word boundary.
So, the beginning of the string matches.

This part of the pattern: (\d{3}) is used to match the 3 digits (i.e., "123"), and this
group is captured.

The last part of the pattern: (?:\D|\b) is used to match non-digits (i.e., the comma).
So, as you indicate, the test string matches the pattern.

List out various values that you want to succeed, and fail, and then look for patterns that match.
For example, are commas valid?  Are they part of the number?

This library of RegExp can help a lot: http://regexlib.com/
For example, searching for "integer", we find some RegExp that may help you.

^([0-9]*\,?[0-9]+|[0-9]+\,?[0-9]*)?$  = European integers (comma instead of decimal point)

^([1-9]{1}[0-9]{0,7})+((,[1-9]{1}[0-9]{0,7}){0,1})+$ = Integers with optional commas
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34201680
@TommySzalapski

>>  Would you mind a solution that parses the string using VBA and no regex?

I'm thinking "no":

>>  I am trying to learn regular expressions, so please, only reply with regular expression solutions i.e. don't propose code that uses instr, mid(str,i,1),len(submatch(0)) etc.
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34201707
I should probably correct those match strings:

    (word-break)12345(space)
    (word-break)23456x

would be what your original pattern finds. My modified pattern (because I flipped the \D and \b--which doesn't change the pattern's intent) would match as I posted previously.
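
To see exactly which characters each match swallows, one option is to print the whole match (group 0) rather than the submatch. The sketch below does that in Python's `re` module, which is an assumption on my part since the thread uses VBScript; the helper name `consumed` is hypothetical.

```python
import re

text = "12345 23456x12345"

def consumed(pattern, s):
    """Return the full text of each match (group 0), delimiters included."""
    return [m.group(0) for m in re.finditer(pattern, s)]

# Original order of alternatives: non-digit first, then word boundary.
original_order = consumed(r"(?:\D|\b)(\d{5})(?:\D|\b)", text)
# Flipped order: word boundary first, then non-digit.
flipped_order = consumed(r"(?:\b|\D)(\d{5})(?:\b|\D)", text)

print(original_order)  # ['12345 ', '23456x']
print(flipped_order)   # ['12345', ' 23456x']
```

Either way the "x" ends up inside the second match, which is why the last number is never found until the trailing group becomes a lookahead.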
0
 
LVL 37

Expert Comment

by:TommySzalapski
ID: 34201734
Why don't you just add an "x" or something to the start and end of the string and find all occurrences of (non-digit)(5-digits)(non-digit)?
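
A quick check of the padding idea, sketched in Python's `re` module (not VBScript) and using the 3-digit test string from the question rather than 5 digits. Padding removes the need for \b, but as far as I can tell the trailing non-digit still has to be a lookahead when numbers are separated by a single character.

```python
import re

# Pad both ends so every number has a non-digit on each side.
padded = "x" + "123,456x789" + "x"

# Consuming a non-digit on both sides skips every other number.
both_consumed = re.findall(r"\D(\d{3})\D", padded)
# Peeking at the trailing non-digit leaves it for the next match.
trailing_lookahead = re.findall(r"\D(\d{3})(?=\D)", padded)

print(both_consumed)       # ['123', '789']
print(trailing_lookahead)  # ['123', '456', '789']
```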
0
 
LVL 37

Expert Comment

by:TommySzalapski
ID: 34201737
Maybe I came in too late to this one. If so, ignore me.
0
 
LVL 35

Expert Comment

by:Terry Woods
ID: 34201740
Nice explanation by the way, kaufmed. :-)
0
 
LVL 5

Author Comment

by:rberke
ID: 34201747
"no!"  only regular expression solutions please.
and 123x456x789  is NOT a single match.   I want 3 matches (or submatches): 123, 456, and 789.

Lucky for me, that is exactly what kaufmed's solution provides.

But, is this the FINAL solution?  Extra points to anybody who finds an example where kaufmed's solution does not work.
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34201750
@TerryAtOpus

I didn't refresh before posting, so after I read your comment, all I could say was, "great minds think alike"  :D
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34201758
>>  Lucky for me, that is exactly what kaufmed's solution provides.

I think technically Terry beat me out on this one  :)
0
 
LVL 35

Assisted Solution

by:Terry Woods
Terry Woods earned 290 total points
ID: 34201781
I'm pretty sure that pattern will cover exactly what you want - I can't think of any exceptions.

ps: I support a points split - kaufmed's explanation is valuable for understanding why the lookahead is necessary.
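
One way to hunt for exceptions is to run the pattern over a handful of edge cases. The sketch below uses Python's `re` module as a stand-in for VBScript, and the inputs are ones I made up; it is a spot check, not a proof.

```python
import re

# The 3-digit version of the lookahead pattern from this thread.
THREE_DIGITS = re.compile(r"(?:\D|\b)(\d{3})(?=\D|\b)")

def extract(s):
    """Return every run of exactly three digits found in s."""
    return THREE_DIGITS.findall(s)

print(extract("123"))          # ['123']
print(extract("1234"))         # []
print(extract("123456"))       # []
print(extract("12 123 1234"))  # ['123']
print(extract("abc123def"))    # ['123']
print(extract("007x1234x42"))  # ['007']
```

Note that digits glued to letters ("abc123def") still count, which matches the 123x456x789 requirement stated earlier in the thread.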
0
 
LVL 5

Author Comment

by:rberke
ID: 34201850
I understand 95% of the solution.  Clearly the positive lookahead will not "steal" or "consume" the digit, so I understand why that works.

But, I don't entirely understand kaufmed's reference to flipping \D|\b  into \b:\D

In this particular example, I don't think it makes any difference.

But, maybe he is saying that it is a good "rule of thumb" to put \b before \D because \D might consume a character? But,  I thought word boundaries were also consumable, or am I wrong about that?  

Speaking of good rules of thumb, is it usually a good idea to precede negative character sets with a (?=           ?

0

 
LVL 5

Author Comment

by:rberke
ID: 34201856
sorry for typo.  I meant      flipping   \D|\b    into   \b|\D
0
 
LVL 74

Assisted Solution

by:käµfm³d 👽
käµfm³d   👽 earned 190 total points
ID: 34201870
You can disregard my "flip". When I was testing solutions, I thought swapping the word-boundary and the non-digit would overcome the missing last number, but this was before I realized the consumption of the "x". Just know that flipping the two should not make a difference when it comes to the pattern matching appropriately.
0
 
LVL 5

Author Comment

by:rberke
ID: 34201896
We replaced a non-capturing parenthesis with a positive lookahead parenthesis.

i.e.   (?:   with  (?=

How do these different concepts interact?  Is there such a thing as a "non-capturing positive lookahead parenthesis"?
Or, conversely, is there such a thing as a "capturing positive lookahead parenthesis"?

And, I now see terry was the first responder, so he will get the "best solution" with nearly equal points going to every positive contribution.
0
 
LVL 5

Author Comment

by:rberke
ID: 34201903
Actually I have occasionally found that flipping alternatives DOES make a difference.  I am just not good at seeing when it will and when it won't. But, I am of the opinion that it will never make a difference when a lookahead is being used and   I suspect that is true for both positive and negative lookaheads.   Just a lot of gut instinct here, so I hope I am right.
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34201920
Lookaround by definition is non-capturing, although some engines/languages allow capturing parentheses [ ( ... ) ] within lookaround. .NET is one that does, I believe.
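
As an illustration of capturing parentheses inside a lookahead, here is a small sketch in Python's `re` module, which is one engine where this works. I have not checked how VBScript's RegExp behaves here, so treat that part as unverified.

```python
import re

# Group 1 is the number; group 2 sits inside the lookahead.
m = re.search(r"(\d{3})(?=(\D))", "123x456")

whole_match = m.group(0)  # text actually consumed by the match
peeked_char = m.group(2)  # captured by the lookahead, but not consumed
match_end = m.end()       # position where the next search would start

print(whole_match, peeked_char, match_end)  # 123 x 3
```

The "x" is captured, yet the match ends at position 3, before the "x".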
0
 
LVL 5

Author Comment

by:rberke
ID: 34201936
regarding the flip you could have added an important tidbit.

It should have said "I thought swapping the word-boundary and the non-digit would overcome the missing last number BUT IT DID NOT HELP".

I believe that proves that I am correct when I said that a word boundary is consumable.  If the word boundary was not consumable, flipping the two items would have solved the problem.

But, this is making my head hurt a little bit, so I am heading home.  I will pick up more posts in about an hour.  
0
 
LVL 74

Assisted Solution

by:käµfm³d 👽
käµfm³d   👽 earned 190 total points
ID: 34201967
Word boundaries aren't really "consumed" as they match positions and not characters--the same way ^ matches start of string/line (position) and $ matches end of string/line (again, position). Here's a quick example to demonstrate.

By your claim, the output I should see would be:

    1=>2=>3=>4=>5

each of the above characters in sequence. What you really get is

   1=>,=>2=>,=>3=>,=>4=>,=>5

again, each of the above in sequence, including the commas. In this way, I guess you could compare word-boundaries and ^ and $ to lookaround.
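
Sub func2()
    ' Requires a reference to Microsoft VBScript Regular Expressions
    Dim reg As New RegExp
    Dim test As String
    Dim matches As MatchCollection
    Dim m As Match

    test = "1,2,3,4,5"
    'test = "12345this is great!!!!"

    With reg
        ' One character of any kind, with a word boundary on each side
        .Pattern = "\b(\d|\D)\b"
        .Global = True
        Set matches = .Execute(test)
    End With

    ' Every digit and every comma is reported, one per message box
    For Each m In matches
        MsgBox m.SubMatches(0)
    Next
End Sub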
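
The same experiment can be sketched in Python's `re` module (assumed here as a stand-in for VBScript). It also lists the spans of bare `\b` matches on a shorter string of my own, to show that each boundary has zero width.

```python
import re

text = "1,2,3,4,5"

# Each character sits between two word boundaries, so all nine match.
pieces = re.findall(r"\b(\d|\D)\b", text)

# A bare \b matches positions only: every span starts and ends at the same index.
boundary_spans = [m.span() for m in re.finditer(r"\b", "1,2")]

print(pieces)          # ['1', ',', '2', ',', '3', ',', '4', ',', '5']
print(boundary_spans)  # [(0, 0), (1, 1), (2, 2), (3, 3)]
```

The boundary between "1" and "," closes one match and opens the next, which could not happen if the first match had consumed it.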

0
 
LVL 35

Assisted Solution

by:Terry Woods
Terry Woods earned 290 total points
ID: 34202214
I also tried to test whether they were consumed - it's an interesting concept:

I tried matching pattern:
word1\b\b\b\b\b word2
against
word1 word2

It matched (using myregextester.com, that is, which is PHP or .NET based - I tried both engines).... I wonder if it differs for other regex engines though?
0
 
LVL 35

Assisted Solution

by:Terry Woods
Terry Woods earned 290 total points
ID: 34202256
And order of alternatives definitely is important when capturing sub-patterns:

(wor)d1|(word.)
matched against
word1

captures
wor

Whereas in the opposite order:
(word.)|(wor)d1
captures
word1
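
The same comparison, sketched in Python's `re` module (an assumption, since the original test was run on myregextester.com). The regex engine tries alternatives left to right and stops at the first one that succeeds.

```python
import re

# First alternative wins, so only "wor" is captured (group 2 stays empty).
first_order = re.match(r"(wor)d1|(word.)", "word1").groups()
# With the alternatives swapped, the whole word is captured instead.
second_order = re.match(r"(word.)|(wor)d1", "word1").groups()

print(first_order)   # ('wor', None)
print(second_order)  # ('word1', None)
```

Both patterns match the same text overall; only the captured groups differ.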
0
 
LVL 74

Assisted Solution

by:käµfm³d 👽
käµfm³d   👽 earned 190 total points
ID: 34202310
In my understanding of regex, given

    word1\b\b\b\b\b word2

when you find the 1, even when you match the first word boundary, you are still "at" the position just past the 1--you have not advanced past this position. So I guess because you haven't advanced the position of the evaluator (term?), the remaining word boundaries still match because you are still at the same position. The word boundary, like the caret and dollar sign, is a zero-width token--even though you match one, you're still at the position where the last character match occurred; or rather just after it.
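
A small sketch of the repeated-boundary test in Python's `re` module, which is a third engine beyond the two tried above and not VBScript itself. The last case is my own addition as a control.

```python
import re

# Five boundaries in a row all test the same position after "word1".
repeated = re.fullmatch(r"word1\b\b\b\b\b word2", "word1 word2")
# A single boundary matches exactly the same span.
single = re.fullmatch(r"word1\b word2", "word1 word2")
# Control: no boundary exists between "1" and "w", so this fails.
no_boundary = re.fullmatch(r"word1\b\bword2", "word1word2")

print(repeated.span(), single.span(), no_boundary)  # (0, 11) (0, 11) None
```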
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34202319
I guess for the sake of this newest example you could think of them as being consumed, but because they are zero-width, you can still match something like

    1,2,3,4,5

against

    \b(\d|\D)\b

because of the zero-width-edness.
0
 
LVL 5

Author Comment

by:rberke
ID: 34202697
After reviewing the above, I am 99% sure I was totally off base, and that word boundaries are simply not consumed.

In fact, I don't even understand your 10:16 pm comment that seemed to say they could be considered as consumed in some way.  

I simply jumped to a totally unwarranted conclusion. The \b|\D was failing on the case of 123x345x456, which actually doesn't have any word boundaries to begin with.

This has been a very helpful discussion, here come the points.

By the way, thanks Terry for supplying some examples where the sequence of alternation matters.  



0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 34204697
>>  123x345x456 which actually doesn't have any word boundaries to begin with.

There are 2 word boundaries in that example:
123x345x456
^           ^


0
 
LVL 5

Author Comment

by:rberke
ID: 34207391
Being precise is incredibly important with regular expressions, so I appreciate your correction.

0
 
LVL 41

Expert Comment

by:HonorGod
ID: 34208850
Thanks for the assist and the points.

Good luck & have a great day.
0
