Solved

RegEx/Postini CC & SSN Filtering

Posted on 2011-09-29
Last Modified: 2012-05-12
Hello again. I am once more having problems with Postini either catching false positives or missing CC numbers.

Postini's built-in filters are not robust enough for us to meet PCI requirements, so I've been tasked with creating custom RegExes that will block emails. The problem, though, is that Postini's RegEx engine doesn't seem to conform to most standards.

After multiple variations, I've finally come up with the following RegEx for Visa, MasterCard and JCB credit card filtering:

(^|\s|:)(3|4|5)\d{3}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}($|\s|\.)

It basically states that the number should be at the beginning of a line, or preceded by a space or a colon; start with a block of four digits beginning with 3, 4, or 5; and continue with three more blocks of four digits, each separated by one of a variety of separators. I also have a similar RegEx that looks for a string of 16 digits starting with 3, 4, or 5, and various other similar RegExes for other cards and for SSNs.

For the most part, it seems to work fine, however there are two problems:

1. It catches false positives in the form of "5000" followed by NO other blocks of digits.

So, for example, an email like this:

"Hello, here are the short codes I'd like to order:

5123
5327
7347
2236
3456

Thanks"

It will flag it.

2. If a CC number is sent in this format: "5XXX XXXXXXXXXXXX" it will get by. Presumably, any variation on that will get by as well.

Any help would be appreciated.

Question by:bradyhummel
16 Comments
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 36815622
According to this site, ContentManager, which I assume is related to Postini since the page is hosted at Postini.com, supports PCRE. PCRE is a very rich set of regular expressions. Please forgive me if I am incorrect about ContentManager. I haven't used Postini before.

As to your issue, your pattern seems a bit difficult to read, and consequently a bit difficult to debug. What if we try cleaning it up a tad?

^[\s:]?[345]\d{3}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}[\s.]?$

Also, I believe the reason you get a false positive in your example is that you are using "\s" to mean a space, but "\s" matches any whitespace character, including newlines. Since your example has at least 4 groups of digits, each on its own line, the pattern still matches. The pattern I proposed accounts for this, but you can just as easily modify yours by replacing the "\s" occurrences with literal space characters.
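To illustrate the point about "\s", here is a minimal sketch using Python's re module as a stand-in. Postini's engine may behave differently, so this is an illustration of the regex logic only, not a test of Content Manager. The helper name build_pattern is hypothetical.

```python
import re

# Separator alternation from the original question; \s matches ANY whitespace,
# including the newline between the short codes.
SEP_ANY_WHITESPACE = r"(-|_|,|;|:|'|\.|\s|\\|/)"
# Same alternation with \s swapped for a literal space.
SEP_LITERAL_SPACE = r"(-|_|,|;|:|'|\.| |\\|/)"

def build_pattern(sep):
    """Assemble the 4-4-4-4 card pattern around the given separator group."""
    return re.compile(
        r"(^|\s|:)(3|4|5)\d{3}" + sep + r"\d{4}" + sep + r"\d{4}" + sep + r"\d{4}($|\s|\.)"
    )

short_codes = (
    "Hello, here are the short codes I'd like to order:\n\n"
    "5123\n5327\n7347\n2236\n3456\n\nThanks"
)
card_email = "Here's the credit card info: 4555 6631 3445 2304."

loose = build_pattern(SEP_ANY_WHITESPACE)
strict = build_pattern(SEP_LITERAL_SPACE)

print(bool(loose.search(short_codes)))   # newline-separated short codes are flagged
print(bool(strict.search(short_codes)))  # short codes are no longer flagged
print(bool(strict.search(card_email)))   # space-separated number is still flagged
```

Only the separator groups are changed here; the leading and trailing "\s" stay, since a newline before or after a real card number is harmless.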
 

Author Comment

by:bradyhummel
ID: 36815933
Thanks for the response. I've tested this against the email that was brought to my attention regarding false positives. However, when I test the RegEx in Postini (Content Manager is correct) against this sample email (this is a fake CC number):

"Hey guys, thanks for the placing the order.

Here's the credit card info: 4555 6631 3445 2304.

Thanks"

It doesn't catch what is blatantly a CC number.
 
LVL 74

Assisted Solution

by:käµfm³d 👽
käµfm³d   👽 earned 125 total points
ID: 36815979
I'm wondering if you really need the start-of-line ( ^ ) and end-of-line anchors ( $ ). What happens if you do:

You
(3|4|5)\d{3}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}


Me
[345]\d{3}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}
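A rough sketch of why the anchors matter, again using Python's re module as a stand-in for Postini's engine (an assumption; the two may differ). With ^ and $ the number has to be essentially the whole line, so a number embedded in a sentence is missed.

```python
import re

# Anchored version from the earlier comment: the number must span the whole line.
anchored = re.compile(
    r"^[\s:]?[345]\d{3}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}[\s.]?$",
    re.MULTILINE,  # let ^ and $ match at every line, the most generous reading
)
# Unanchored version: the number may appear anywhere in the text.
unanchored = re.compile(
    r"[345]\d{3}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}"
)

sample = (
    "Hey guys, thanks for the placing the order.\n\n"
    "Here's the credit card info: 4555 6631 3445 2304.\n\n"
    "Thanks"
)

print(anchored.search(sample))            # None: the line starts with "Here's"
print(unanchored.search(sample).group())  # 4555 6631 3445 2304
```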

 

Author Comment

by:bradyhummel
ID: 36816281
Thanks, I think that that may have done it. A few more questions, though.

1. My current RegEx for 16 straight digits for Visa, Mastercard, JCB is (^|\s|:)(3|4|5)\d{15}($|\s|\.), but following your example, it should be [345]\d{15}, is that correct?

2. Is there any way to prevent these combinations (X's being numbers) using one RegEx?

4XXX XXXXXXXXXXXX
4XXXXXXX XXXXXXXX
4XXXXXXXXXXX XXXX

 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 36816774
You could probably do something like this:

[345](?:\D*\d){15}

which equates to:

[345]      - 3, 4, or 5
(?: ... )  - non-capturing group
\D*        - zero-or-more ( * ) of any character NOT a digit ( \D )
\d         - single digit
{15}       - 15 of the thing to the left, which in this case is the entire group

This should block even occurrences such as:

There were 4 books on 3 bookshelves. Each book had 88 pages...

Essentially, it matches the digits even when they are split by any number of arbitrary non-digit characters. The only issue you could run into is valid text containing numeric values that are not actually credit card numbers. I would expect such an occurrence to be rare, though.
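As a quick sketch of the trade-off, the snippet below runs the pattern in Python's re module (assumed here to behave like Postini's engine for this syntax). It catches the three split formats from the question, and it also fires on ordinary prose once 16 digits have accumulated after a 3, 4, or 5. The sample sentences are made up for illustration.

```python
import re

# A 3, 4, or 5, then 15 more digits, each optionally preceded by non-digits.
scattered = re.compile(r"[345](?:\D*\d){15}")

split_formats = [
    "4111 111111111111",
    "41111111 11111111",
    "411111111111 1111",
]
for text in split_formats:
    print(text, bool(scattered.search(text)))  # each format is caught

# Prose with only four digits in total: not enough digits to match.
short_prose = "There were 4 books on 3 bookshelves. Each book had 88 pages."
# Prose with 17 digits in total: matches even though no card number is present.
long_prose = (
    "There were 4 books on 3 bookshelves. Each book had 88 pages and "
    "12 chapters. We sold 250 copies at 19.99 each in 2011."
)
print(bool(scattered.search(short_prose)))
print(bool(scattered.search(long_prose)))
```

Because \D also matches newlines, the digits can be spread across several lines of a message, which is where false positives are most likely to come from.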
 

Author Comment

by:bradyhummel
ID: 36818280
Alright... I think I'm making headway. Your second response seems to do the trick for the most part. If I gave you a few more RegExes that I've created, would you mind reviewing them as you did the given example? I have a separate RegEx for Discover Card, another for AmEx and Diners Club, and another for SSNs. They all follow the same format, but I'm very much a novice, and you seem to be very much an expert in this. Would that be OK?
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 36818661
Certainly, no problem. Even if I'm not the one to look at them, there are several experts who participate on the regex boards who are extremely knowledgeable in the area  = )
 

Author Comment

by:bradyhummel
ID: 36891848
Thanks! Here's what I have in addition to the one you already assisted me with:

DiscoverCard
(^|\s|:)(6011)(-|_|,|;|:|'|\.|\s|\\|/)\d{4}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}($|\s|\.)  OR
(^|\s|:)(6011)\d{12}($|\s|\.)

AmExDinersClub
(^|\s|:)(3\d{3})(-|_|,|;|:|'|\.|\s|\\|/)\d{6}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}($|\s|\.) OR
(^|\s|:)(3\d{13})($|\s|\.)

SSNFilter
(^|\s|:)([0-9]{3})(-|_|,|;|:|'|\.|\s|\\|/)([0-9]{2})(-|_|,|;|:|'|\.|\s|\\|/)([0-9]{4})($|\s|\.)



 
LVL 35

Expert Comment

by:Terry Woods
ID: 36900670
DiscoverCard
(^|\s|:)(6011)(-|_|,|;|:|'|\.|\s|\\|/)\d{4}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}($|\s|\.)  OR

Can be more concisely written as:
(^|[\s:])(6011)([-_,;:'.\s\\/])\d{4}([-_,;:'.\s\\/])\d{4}([-_,;:'.\s\\/])\d{4}($|[\s.])
or maybe even:
(^|[\s:])(6011)(([-_,;:'.\s\\/])\d{4}){3}($|[\s.])
though in the 2nd case, the capturing groups will be different (and may cause problems if you're making use of them)

(^|\s|:)(6011)\d{12}($|\s|\.)

Can be more neatly (debatable maybe!) written as:
(^|[\s:])(6011)\d{12}($|[\s.])

You could potentially combine the 2 different expressions to:
(^|[\s:])(6011)(([-_,;:'.\s\\/])?\d{4}){3}($|[\s.])
though it would also accept values missing some (but not all) of the delimiters like
6011-12345678.1234
and
601112345678:1234
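The behaviour of the combined expression can be checked with a short sketch in Python's re module (an assumption: Postini's engine may treat some constructs differently). It accepts fully delimited, undelimited, and partially delimited numbers alike.

```python
import re

# Combined DiscoverCard pattern: each 4-digit block has an OPTIONAL delimiter.
combined = re.compile(r"(^|[\s:])(6011)(([-_,;:'.\s\\/])?\d{4}){3}($|[\s.])")

samples = [
    "6011-1234-5678-1234",  # all delimiters present
    "6011123456781234",     # no delimiters
    "6011-12345678.1234",   # some delimiters missing
    "601112345678:1234",    # some delimiters missing
]
for text in samples:
    print(text, bool(combined.search(text)))  # every sample matches
```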

AmExDinersClub
(^|\s|:)(3\d{3})(-|_|,|;|:|'|\.|\s|\\|/)\d{6}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}($|\s|\.) OR

Can be more concisely written as:
(^|[\s:])(3\d{3})([-_,;:'.\s\\/])\d{6}([-_,;:'.\s\\/])\d{4}($|[\s.])

(^|\s|:)(3\d{13})($|\s|\.)

Can be more neatly (debatable maybe!) written as:
(^|[\s:])(3\d{13})($|[\s.])

SSNFilter
(^|\s|:)([0-9]{3})(-|_|,|;|:|'|\.|\s|\\|/)([0-9]{2})(-|_|,|;|:|'|\.|\s|\\|/)([0-9]{4})($|\s|\.)

Can be more concisely written as:
(^|[\s:])([0-9]{3})([-_,;:'.\s\\/])([0-9]{2})([-_,;:'.\s\\/])([0-9]{4})($|[\s.])
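If you want some reassurance that a concise rewrite behaves like the original, one option is to run both over the same inputs and compare. The sketch below does that for the SSN filter in Python's re module; the sample strings are invented for illustration, and agreement on a handful of samples is only a sanity check, not a proof of equivalence.

```python
import re

# Original SSN filter, with single characters written as alternations.
verbose = re.compile(
    r"(^|\s|:)([0-9]{3})(-|_|,|;|:|'|\.|\s|\\|/)([0-9]{2})"
    r"(-|_|,|;|:|'|\.|\s|\\|/)([0-9]{4})($|\s|\.)"
)
# Concise rewrite using character classes.
concise = re.compile(
    r"(^|[\s:])([0-9]{3})([-_,;:'.\s\\/])([0-9]{2})([-_,;:'.\s\\/])([0-9]{4})($|[\s.])"
)

samples = [
    "SSN: 123-45-6789",  # delimited, preceded by a space
    "123 45 6789.",      # at start of text, followed by a period
    "id 123456789",      # no delimiters
    "123-45-67890",      # one digit too many at the end
    "x123-45-6789",      # preceded by a letter
]
results_verbose = [bool(verbose.search(s)) for s in samples]
results_concise = [bool(concise.search(s)) for s in samples]
print(results_verbose == results_concise)  # both patterns agree on every sample
```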
 
LVL 35

Expert Comment

by:Terry Woods
ID: 36900686
Note also that by allowing : and . as characters for indicating the start and end of a value, you'll get a match on values like this:
6011.1234.1234.1234.5678
1234:6011:1234:1234:1234.5678
1234:6011:1234:1234:1234
The number matched in all of the above cases for the DiscoverCard pattern would be
6011 1234 1234 1234

And bear in mind that the - character within [] brackets is a special character unless it's the first or last one listed, so don't change:
[-_,;:'.\s\\/]
to
[=-_,;:'.\s\\/]
as it would match any character between = and _

Instead, you'd change it to:
[=\-_,;:'.\s\\/]
or
[-=_,;:'.\s\\/]
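A small sketch of the range pitfall, using Python's re module (character-class ranges work the same way in most engines, though Postini's is assumed rather than verified here). In the unescaped form, "=-_" is read as the range from "=" to "_", which covers the uppercase letters and no longer includes the hyphen itself.

```python
import re

buggy = re.compile(r"[=-_,;:'.\s\\/]")     # "=-_" is a range from '=' to '_'
escaped = re.compile(r"[=\-_,;:'.\s\\/]")  # hyphen escaped: literal '-'
leading = re.compile(r"[-=_,;:'.\s\\/]")   # hyphen first: literal '-'

for ch in ["A", "-", "="]:
    print(
        repr(ch),
        bool(buggy.fullmatch(ch)),
        bool(escaped.fullmatch(ch)),
        bool(leading.fullmatch(ch)),
    )
```

The buggy class accepts "A" and rejects "-", while the other two do the opposite; all three accept "=".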
 

Author Comment

by:bradyhummel
ID: 36913310
Thanks, everyone, for your help. kaufmed's example:

[345]\d{3}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}

works, but it also seems to catch the following:

5111 1111 1111 11111

It doesn't seem to matter where the extra digit is; it will catch it, so 51111 1111 1111 1111 flags as a CC too.

Any suggestions?
 
LVL 35

Expert Comment

by:Terry Woods
ID: 36913352
Are you trying to validate these in a single field value, or capture them from text?

For validation, you'd generally use start-of-line and end-of-line anchors, like this:
^[345]\d{3}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}$

When capturing them from text, you need to consider what you'll allow at either end of the pattern, e.g. using a negative lookbehind and a negative lookahead:
(?<!\d)[345]\d{3}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}(?!\d)
disallows:
5111 1111 1111 11111
but will match the 5111 1111 1111 1111 from:
5111 1111 1111 1111 1
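For engines that do support lookarounds, the behaviour described above can be reproduced with a short sketch in Python's re module (used here only as an example of such an engine):

```python
import re

# Card number that is neither preceded nor followed by another digit.
bounded = re.compile(
    r"(?<!\d)[345]\d{3}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}(?!\d)"
)

print(bounded.search("5111 1111 1111 11111"))  # None: extra trailing digit
print(bounded.search("51111 1111 1111 1111"))  # None: extra digit in first block

m = bounded.search("5111 1111 1111 1111 1")
print(m.group())  # 5111 1111 1111 1111 (the stray "1" is separated by a space)
```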
 

Author Comment

by:bradyhummel
ID: 36913371
To answer your question, we are trying to prevent customers from using credit card numbers (and socials) in emails to us due to PCI requirements.

We're using Postini's Content Manager to scan incoming email.

With the code you suggested (the second line, since we need to scan the text of a document), I get this error when I check the syntax in Postini: "Assertions starting with (? are not supported in (?<!\d), (?!\d)"
 
LVL 35

Accepted Solution

by:Terry Woods
Terry Woods earned 125 total points
ID: 36913388
Some regular expression engines don't support lookahead and lookbehind - sounds like that's one of them.

This might do it. It requires the CC number to be preceded and followed by a non-digit or by the start/end of the line (i.e. nothing).
(^|\D)[345]\d{3}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}(\D|$)
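Here is a minimal sketch of this lookaround-free version in Python's re module (a stand-in; whether Postini handles it identically is an assumption). One difference from the lookaround version is that the surrounding non-digit characters become part of the match, so the sketch wraps the number itself in an extra capturing group. That group and the helper find_card are additions for illustration, not part of the pattern above.

```python
import re

SEP = r"[-_,;:'. \\/]?"  # optional single delimiter between blocks
# (^|\D) and (\D|$) stand in for the unsupported lookbehind/lookahead.
CARD = re.compile(
    r"(^|\D)([345]\d{3}" + SEP + r"\d{4}" + SEP + r"\d{4}" + SEP + r"\d{4})(\D|$)"
)

def find_card(text):
    """Return the first card-like number in text, or None if there is none."""
    m = CARD.search(text)
    return m.group(2) if m else None

print(find_card("5111 1111 1111 11111"))  # None: extra trailing digit
print(find_card("51111 1111 1111 1111"))  # None: extra digit in first block
print(find_card("Here's the credit card info: 4555 6631 3445 2304."))
print(find_card("4555663134452304"))      # bare number, matched via ^ and $
```

For a filter that only needs a yes/no answer, the extra group is unnecessary and the pattern can be used exactly as posted.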
 

Author Comment

by:bradyhummel
ID: 36913499
That may have done it. Let me do some testing with this and I'll let you know.
 

Author Closing Comment

by:bradyhummel
ID: 36930895
Thanks for your help, guys. It seems to be working now. Appreciate the input. I'm clearly a novice when it comes to writing these.