JohnyStyles577

Regular Expressions how to Logical AND

I recently posted a question here:
https://www.experts-exchange.com/questions/22895346/Regular-Expression-FAKE-Phone-Number-handling.html

This asked about how to do some pattern matching to deny suspect phone number submissions through my web pages.  That was answered properly and I ended up with 6 total regular expressions for different patterns.  When I went to implement these, I wanted to Logical AND them all together to give me a single test to use for valid phone numbers.  The 6 expressions I had are:
            1)      1212123333  - no pair of repeating digits 3 times or more
                        ^(?!.*(\d)\D*(\d)(\D*\1\D*\2){2})
            2)      1231233333  - no group of 3 repeating digits 2 times or more
                        ^(?!.*(\d)\D*(\d)\D*(\d)(\D*\1\D*\2\D*\3))
            3)      1234123433  - no group of 4 repeating digits 2 times or more
                        ^(?!.*(\d)\D*(\d)\D*(\d)\D*(\d)(\D*\1\D*\2\D*\3\D*\4))
            4)      1234512345  - no group of 5 repeating digits 2 times or more
                        ^(?!.*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)(\D*\1\D*\2\D*\3\D*\4\D*\5))
            5)      1234533333 or 5432133333 - no incrementing groups of 1 through 5 or 5 through 1.
                        ^(?!.*1\D*2\D*3\D*4\D*5|.*5\D*4\D*3\D*2\D*1)
            6)      9205551212 - no 555-1212 numbers allowed
                        ^(?!.*5\D*5\D*5\D*1\D*2\D*1\D*2)

I attempted to logical AND these together to get this:
^(?!.*(\d)\D*(\d)(\D*\1\D*\2){2})(?!.*(\d)\D*(\d)\D*(\d)(\D*\3\D*\4\D*\5))(?!.*(\d)\D*(\d)\D*(\d)\D*(\d)(\D*\6\D*\7\D*\8\D*\9))(?!.*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)(\D*\10\D*\11\D*\12\D*\13\D*\14))(?!.*1\D*2\D*3\D*4\D*5|.*5\D*4\D*3\D*2\D*1)(?!.*5\D*5\D*5\D*1\D*2\D*1\D*2).*$

IT IS VERY IMPORTANT TO NOTE THAT I AM USING JAVASCRIPT REGEX ENGINE TO RUN THIS.

Now everything looked fine, until I ran a final test case of something simple like:
12312

This failed!  This should have passed, but did not.  Through further testing, I found that the following would all fail:
12312
123412
1234512

Very odd.  If I were to remove the 2 from the end of any of those, it passes.  This is NOT the case when I run the expressions on the string individually!  So it seems like the "AND"ing of the expressions together is causing some odd behavior.

I am using the following Javascript regex tester to test:
http://www.regular-expressions.info/javascriptexample.html

Can someone please help me figure out what the issue is?
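
The failure described above can be reproduced with a small sketch. It assumes the cause is capture-group numbering: rule 1 on its own contains three capturing groups (the two digits plus the repeated group), so once rule 2 is appended its digits become groups 4 to 6 rather than 3 to 5. In JavaScript a backreference to a group that has not captured anything matches the empty string, which would explain why 12312 is rejected. The variable names are illustrative, and only rules 1 and 2 are combined to keep it short.

```javascript
// Rule 2 on its own: groups 1-3 are the digits, group 4 is the repeat.
const rule2Alone = /^(?!.*(\d)\D*(\d)\D*(\d)(\D*\1\D*\2\D*\3))/;

// Rules 1 and 2 concatenated as in the question. Rule 1 already uses groups
// 1-3, so the digits of rule 2 are really groups 4-6, not \3 \4 \5.
const combinedAsPosted =
  /^(?!.*(\d)\D*(\d)(\D*\1\D*\2){2})(?!.*(\d)\D*(\d)\D*(\d)(\D*\3\D*\4\D*\5)).*$/;

// Same two rules with the repeat groups made non-capturing, so the digit
// groups are numbered 1-2 and 3-5 and the backreferences line up.
const combinedFixed =
  /^(?!.*(\d)\D*(\d)(?:\D*\1\D*\2){2})(?!.*(\d)\D*(\d)\D*(\d)(?:\D*\3\D*\4\D*\5)).*$/;

console.log(rule2Alone.test('12312'));        // true  (accepted)
console.log(combinedAsPosted.test('12312'));  // false (wrongly rejected)
console.log(combinedAsPosted.test('1231'));   // true  (drop the trailing 2)
console.log(combinedFixed.test('12312'));     // true
console.log(combinedFixed.test('1231233333')); // false (123 repeated)
```

Renumbering the backreferences by hand would work as well; non-capturing groups just keep the numbers easier to count.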
SnowFlake

why not just | all of those alternatives and then use ! on the result ?
i.e. -
use
(^(?!.*(\d)\D*(\d)(\D*\1\D*\2){2}) |^(?!.*(\d)\D*(\d)\D*(\d)(\D*\1\D*\2\D*\3))|^(?!.*(\d)\D*(\d)\D*(\d)\D*(\d)(\D*\1\D*\2\D*\3\D*\4))|^(?!.*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)(\D*\1\D*\2\D*\3\D*\4\D*\5))|^(?!.*1\D*2\D*3\D*4\D*5|.*5\D*4\D*3\D*2\D*1)|^(?!.*5\D*5\D*5\D*1\D*2\D*1\D*2))

and then you know that if you have a MATCH then it's a bad phone number.

(you will probably also need some general expression to match anything that is not digits/spaces etc.)
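
A minimal sketch of the "OR everything, then negate" idea, assuming the alternation is built from the positive "looks fake" patterns rather than from the negative lookaheads (NOT A AND NOT B is the same as NOT (A OR B)). Only three of the six rules are used, and the names `fakePatterns`, `looksFake` and `looksValid` are made up for illustration.

```javascript
// Positive "looks fake" patterns (they match BAD numbers). Only the first one
// uses capture groups, so joining them does not disturb backreference numbers.
const fakePatterns = [
  /(\d)\D*(\d)(?:\D*\1\D*\2){2}/,         // digit pair repeated 3 times
  /1\D*2\D*3\D*4\D*5|5\D*4\D*3\D*2\D*1/,  // 12345 or 54321 run
  /5\D*5\D*5\D*1\D*2\D*1\D*2/,            // 555-1212
];

// OR the positive patterns together; a match means "bad number".
const anyFakeSource = fakePatterns.map((p) => `(?:${p.source})`).join('|');
const looksFake = new RegExp(anyFakeSource);

// Negating the whole alternation gives a single "valid" test.
const looksValid = new RegExp(`^(?!.*(?:${anyFakeSource}))`);

// By contrast, an alternation of NEGATIVE lookaheads matches as soon as any
// one rule passes, so a match there does not identify a bad number.
const orOfNegatives = /^(?!.*1\D*2\D*3\D*4\D*5)|^(?!.*5\D*5\D*5\D*1\D*2\D*1\D*2)/;

console.log(looksFake.test('920-555-1212'));  // true
console.log(looksValid.test('9208642946'));   // true
console.log(orOfNegatives.test('12345'));     // true, although 12345 breaks a rule
```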

SnowFlake
b.t.w. -
IMHO you are being way, way too strict.
What would you expect someone who actually has a phone number that matches your "suspicious" numbers to do?
You will get the following situation:
1) a fraud would just be forced to invent some other number,
   something like 9208642946   - see, that wasn't so hard :)
2) an honest person with the phone number
    87254654654 (almost like the phone number of a friend of mine (not really)) would have to bang his head against the wall, not understanding what you want from him.

SnowFlake
ASKER CERTIFIED SOLUTION
ozo

Hi JohnyStyles577,

Personally, I feel this is pushing regular expressions past their limit.

It's simpler to just loop through each condition, test them one at a time, and keep each expression readable. I still remember that the author of the O'Reilly book Mastering Regular Expressions commented that he too would sometimes split up his expressions for readability.


Cheers,
NicksonKoh
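
A rough sketch of the loop-per-condition approach, using positive forms of the six rules from the question. Because each rule is its own RegExp, the backreference numbers restart at 1 every time. The rule names and the `firstBrokenRule` helper are hypothetical.

```javascript
// Each rule matches a SUSPECT number. \D* lets separators sit between digits.
const rules = [
  { name: 'pair repeated 3 times', re: /(\d)\D*(\d)(?:\D*\1\D*\2){2}/ },
  { name: 'group of 3 repeated', re: /(\d)\D*(\d)\D*(\d)\D*\1\D*\2\D*\3/ },
  { name: 'group of 4 repeated', re: /(\d)\D*(\d)\D*(\d)\D*(\d)\D*\1\D*\2\D*\3\D*\4/ },
  { name: 'group of 5 repeated',
    re: /(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)\D*\1\D*\2\D*\3\D*\4\D*\5/ },
  { name: '12345 or 54321 run', re: /1\D*2\D*3\D*4\D*5|5\D*4\D*3\D*2\D*1/ },
  { name: '555-1212', re: /5\D*5\D*5\D*1\D*2\D*1\D*2/ },
];

// Returns the name of the first rule the number breaks, or null if it passes.
function firstBrokenRule(phone) {
  for (const rule of rules) {
    if (rule.re.test(phone)) return rule.name;
  }
  return null;
}

console.log(firstBrokenRule('12312'));         // null
console.log(firstBrokenRule('1234512345'));    // 'group of 5 repeated'
console.log(firstBrokenRule('920-555-1212'));  // '555-1212'
```

Returning the rule name also makes it easy to tell the user (or a log) which check rejected the number.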
One thing that you could do to greatly simplify your regular expressions would be to remove the non-digits using something like this first:

  str = str.replace( /\D/g, '' );

Then, your RE's would be much simpler.
Resulting in:

1: No pair of repeating digits 3 times or more
    ((\d\d)\1){2}
2: No group of 3 repeating digits 2 times or more
    ((\d)(\d)(\d))\1
3: No group of 4 repeating digits 2 times or more
    ((\d)(\d)(\d)(\d))\1
4: No group of 5 repeating digits 2 times or more
    ((\d)(\d)(\d)(\d)(\d))\1
5: No incrementing groups of 1 through 5 or 5 through 1
    (12345|54321)
6: No 555-1212 numbers allowed
    (5551212)

Combined:

((\d\d)\1){2}|((\d)(\d)(\d))\1|((\d)(\d)(\d)(\d))\1((\d)(\d)(\d)(\d)(\d))\1|(12345|54321)|(5551212)

By the way, my brother's home phone number matches rule #4... :-)
oops, I had a typo...  I was missing '|' after the pattern for #3... :-(

((\d\d)\1){2}|((\d)(\d)(\d))\1|((\d)(\d)(\d)(\d))\1|((\d)(\d)(\d)(\d)(\d))\1|(12345|54321)|(5551212)
I re-read my answers, and would like to simplify them a bit...

# First remove non-digits to make the RegExps simpler:

  str = str.replace( /\D/g, '' );

# Then, using:

1: No pair of repeating digits 3 times or more
    ((\d\d)\1){2}
2: No group of 3 repeating digits 2 times or more
    (\d{3})\1
3: No group of 4 repeating digits 2 times or more
    (\d{4})\1
4: No group of 5 repeating digits 2 times or more
    (\d{5})\1
5: No incrementing groups of 1 through 5 or 5 through 1
    (12345|54321)
6: No 555-1212 numbers allowed
    (5551212)

Resulting in:

    ((\d\d)\1){2}|(\d{3})\1|(\d{4})\1|(\d{5})\1|(12345|54321)|(5551212)

How does this look for you?  Better, I hope.
((\d\d)\1){2}|(\d{3})\3|(\d{4})\4|(\d{5})\5|(12345|54321)|(5551212)
Ah, thanks ozo.
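
Putting the strip-then-test idea together, as a sketch only. Two assumptions: rule 1 is written here as `(\d\d)\1{2}` (a pair followed by two more copies of it), which is my reading of "pair repeated 3 times" and not the form posted above, because as far as I can tell a JavaScript backreference to a group that is still open matches the empty string; and the outer parentheses around the literal alternatives are dropped so the groups number 1 to 4. `digitsOnly` and `isSuspect` are illustrative names.

```javascript
// Remove everything that is not a digit before testing.
const digitsOnly = (s) => s.replace(/\D/g, '');

// Group numbers: 1 = pair, 2 = group of 3, 3 = group of 4, 4 = group of 5.
const suspect = /(\d\d)\1{2}|(\d{3})\2|(\d{4})\3|(\d{5})\4|12345|54321|5551212/;

// True when the cleaned-up number matches any of the suspect patterns.
const isSuspect = (phone) => suspect.test(digitsOnly(phone));

console.log(isSuspect('(121) 212-3333'));  // true  (12 12 12)
console.log(isSuspect('123-123-3333'));    // true  (123 123)
console.log(isSuspect('920-555-1212'));    // true  (5551212)
console.log(isSuspect('920-864-2946'));    // false
```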
JohnyStyles577

ASKER

I thank you for ALL of your comments, and apologize for my slow response time to this post.  This week was quite busy, and I didn't have the time to review this.

In response to the comments about the Regex being too complicated and long, I appreciate your help in reducing the size of it.  However, in my case (using the built-in ASP.NET client-side validator controls), I do not have the option of cleaning up the input before submitting it to the validator regex.  I'm OK with the long expression, and will document it well in the code to break it up and explain its different parts.

In response to comments about my regex being too strict, I am OK with rejecting some valid numbers, as long as the amount of rejected numbers is low, and the gain is that I am blocking MANY non-valid numbers.  For my particular application, rejecting some valid numbers is acceptable.  What's a bit more important is that I am not allowing "junk" to get through.

I did however want to test the validator regex pattern on valid numbers to find out what percentage was being filtered correctly.  So I went and did some in depth analysis on this using a real database of submitted user phone numbers:

Sample Size:  8621 phone numbers
Definition:  False Positives (FP) - Numbers which were marked Invalid, but which ARE Valid

--------- Test 1 ---------

In my first run, I used exactly the expressions I outlined in my post here, this gave me the following results:
Total Failed Numbers:                     246
Percentage of Failed Numbers:        2.85%
Total Number of FPs:                     45
Percentage of FPs:                      0.52 %

So with these expressions, about 1 out of 200 VALID phone numbers FAILED validation.  This is way too high.

--------- Test 2 ---------

I found the biggest culprit pattern creating most of the FPs.  This was pattern #2 from my post.  I then changed this:
From:   ^(?!.*(\d)\D*(\d)\D*(\d)(\D*\1\D*\2\D*\3))
To:     ^(?!.*(\d)\D*(\d)\D*(\d)(\D*\1\D*\2\D*\3){2,})
So now, 1231239999 does not fail, but 1231231239 does fail.

So now I ran another test on the same sample with the above change:
Total Failed Numbers:                     204
Percentage of Failed Numbers:        2.37%
Total Number of FPs:                     7
Percentage of FPs:                     0.08 %

So with the revised expressions, roughly 1 out of 1230 VALID phone numbers (7 of 8621) FAILED validation.  This is much better.  The cost is that 4 "Bad Phone Numbers" fell through the cracks, and were not filtered out as bad numbers, but were instead allowed through as valid.  So 4 numbers became "False Negatives" (FN) due to this change.

--------- Test 3 ---------

I tried to find and fix the next bottleneck for the sake of being thorough.  This was pattern 1 from my post.  I changed this:
From:      ^(?!.*(\d)\D*(\d)(?:\D*\1\D*\2){2,})
To:      ^(?!.*(\d)\D*(\d)(?:\D*\1\D*\2){3,})
So now, 1212129999 does not fail, but 1212121299 does fail.

As I expected, the results from this change were not good.  Here are the stats:
Total Failed Numbers:                     155
Percentage of Failed Numbers:        1.80%
Total Number of FPs:                     3
Percentage of FPs:                      0.03 %

So with these revised expressions, only about 1 out of 2900 VALID phone numbers (3 of 8621) FAILED validation.  That part is much better.  However, the cost in this case is VERY high:  45 "Bad Phone Numbers" were not filtered out as bad numbers, but were instead allowed through as valid.  So 4 fixed FPs, but 45 NEW FNs.  NOT good.
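
For anyone who wants to re-check the arithmetic in the three tests, here is a small sketch. The counts are copied from the post; the rates are taken over the whole 8621-number sample, as the post does, and the one-in-N figures come from the raw counts rather than the rounded percentages, so they can differ a little from the prose. New false negatives are inferred as the drop in failed numbers minus the drop in FPs. All names are illustrative.

```javascript
const SAMPLE_SIZE = 8621;
const runs = [
  { test: 1, failed: 246, falsePositives: 45 },
  { test: 2, failed: 204, falsePositives: 7 },
  { test: 3, failed: 155, falsePositives: 3 },
];

// Percentages over the full sample, plus "1 in N" for false positives.
const summary = runs.map(({ test, failed, falsePositives }) => ({
  test,
  failedPct: ((100 * failed) / SAMPLE_SIZE).toFixed(2),
  fpPct: ((100 * falsePositives) / SAMPLE_SIZE).toFixed(2),
  oneIn: Math.round(SAMPLE_SIZE / falsePositives),
}));

// Bad numbers newly let through = fewer failures that were not fixed FPs.
const newFalseNegatives = runs.slice(1).map((run, i) => {
  const prev = runs[i];
  return (prev.failed - run.failed) - (prev.falsePositives - run.falsePositives);
});

console.log(summary);
// failedPct: 2.85, 2.37, 1.80; fpPct: 0.52, 0.08, 0.03; oneIn: 192, 1232, 2874
console.log(newFalseNegatives); // [ 4, 45 ]
```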

Here are some stats that may be of interest: the count and % of the sample that failed each pattern.  These are for my "Accepted" pattern given below:

Pattern        Pat 1    Pat 2    Pat 3    Pat 4    Pat 5    Pat 6
Failed         147      90       101      83       16       31
% of sample    1.71%    1.04%    1.17%    0.96%    0.19%    0.36%

Accepted Pattern:
^(?!.*(\d)\D*(\d)(?:\D*\1\D*\2){2,})(?!.*(\d)\D*(\d)\D*(\d)(?:\D*\3\D*\4\D*\5){2,})(?!.*(\d)\D*(\d)\D*(\d)\D*(\d)(?:\D*\6\D*\7\D*\8\D*\9))(?!.*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)(?:\D*\10\D*\11\D*\12\D*\13\D*\14))(?!.*1\D*2\D*3\D*4\D*5|.*5\D*4\D*3\D*2\D*1)(?!.*5\D*5\D*5\D*1\D*2\D*1\D*2)
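
One possible way to keep the accepted pattern documented in code is to assemble it from commented pieces. This is only a sketch: the pieces are copied from the pattern above, while the `parts` array, the group-number comments and the `validPhone` name are my own additions. It relies on `String.raw` so the backslashes survive unchanged.

```javascript
// Each lookahead rejects one kind of suspect number. The repeat groups are
// non-capturing, so only the digit groups are numbered (1 through 14).
const parts = [
  '^',
  // Rule 1 (groups 1-2): a digit pair followed by 2+ repeats, e.g. 121212
  String.raw`(?!.*(\d)\D*(\d)(?:\D*\1\D*\2){2,})`,
  // Rule 2 (groups 3-5): a 3-digit group followed by 2+ repeats, e.g. 123123123
  String.raw`(?!.*(\d)\D*(\d)\D*(\d)(?:\D*\3\D*\4\D*\5){2,})`,
  // Rule 3 (groups 6-9): a 4-digit group repeated, e.g. 12341234
  String.raw`(?!.*(\d)\D*(\d)\D*(\d)\D*(\d)(?:\D*\6\D*\7\D*\8\D*\9))`,
  // Rule 4 (groups 10-14): a 5-digit group repeated, e.g. 1234512345
  String.raw`(?!.*(\d)\D*(\d)\D*(\d)\D*(\d)\D*(\d)(?:\D*\10\D*\11\D*\12\D*\13\D*\14))`,
  // Rule 5: no 12345 or 54321 run
  String.raw`(?!.*1\D*2\D*3\D*4\D*5|.*5\D*4\D*3\D*2\D*1)`,
  // Rule 6: no 555-1212
  String.raw`(?!.*5\D*5\D*5\D*1\D*2\D*1\D*2)`,
];
const validPhone = new RegExp(parts.join(''));

console.log(validPhone.test('12312'));         // true
console.log(validPhone.test('1231239999'));    // true  (123 only twice)
console.log(validPhone.test('1231231239'));    // false (123 three times)
console.log(validPhone.test('1212123333'));    // false
console.log(validPhone.test('920-555-1212'));  // false
```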

Ozo, thanks for pointing out what the problem was with my regex.  Your response fixed my issue.  Thanks All!

Some shameless promotion:
http://www.iWebQuotes.com
Health Insurance Quotes you can Trust!

Thanks Much,
John Pequeno
Just wondering how come, after your entire response explaining how my comment drove you into action (and how it improved your results), you did not see fit to even mark my answer as one that assisted you ...
SnowFlake
Snowflake, fair question.  But you mis-read my response.  I said:
"I did however want to test the validator regex pattern on valid numbers to find out what percentage was being filtered correctly."  I did not say that your: "comment drove [me] into action".  

I was already planning on doing this research before actually implementing my changes, before I ever even opened this question.

I appreciate your comments and assistance, but the reason I gave the full credit to Ozo, was because he answered my actual post question.

Thanks again SnowFlake?
Typing while talking is never good.  I meant to say:
Thanks again Snowflake!
NOT:  Thanks again Snowflake?

That seemed kind of rude :)
You're welcome.