bradyhummel

RegEx/Postini CC & SSN Filtering

Hello again. I am again having problems with Postini catching false positives, or not catching CC numbers.

Postini's built-in filters are not robust enough for us to meet PCI requirements, so I've been tasked with creating custom RegExes that will block emails. The problem, though, is that Postini's RegEx engine doesn't seem to conform to most standards.

After multiple variations, I've finally come up with the following RegEx for Visa, MasterCard and JCB credit card filtering:

(^|\s|:)(3|4|5)\d{3}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}($|\s|\.)

It basically states that the number should be at the beginning of a line, or preceded by a space or a colon, and should consist of a block of four digits starting with a 3, 4, or 5, followed by three more blocks of four digits, with the blocks separated by any of a variety of separators. I also have a similar RegEx looking for a string of 16 digits starting with a 3, 4, or 5, and various other similar RegExes for other cards and for SSNs.

For the most part, it seems to work fine, however there are two problems:

1. It catches false positives in the form of "5000" followed by NO other blocks of digits.

So, for example, an email like this:

"Hello, here are the short codes I'd like to order:

5123
5327
7347
2236
3456

Thanks"

It will flag it.

2. If a CC number is sent in this format: "5XXX XXXXXXXXXXXX" it will get by. Presumably, any variation on that will get by as well.

Any help would be appreciated.
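
The second problem can be reproduced outside Postini. The sketch below uses Python's `re` module as a stand-in, which is an assumption, since Postini's engine may behave differently. The 16-digit pattern is the one quoted further down the thread; the variable names and sample numbers are made up for illustration.

```python
import re

# The asker's two Visa/MasterCard/JCB patterns, rebuilt piece by piece.
SEP = r"(-|_|,|;|:|'|\.|\s|\\|/)"
GROUPED = re.compile(r"(^|\s|:)(3|4|5)\d{3}" + (SEP + r"\d{4}") * 3 + r"($|\s|\.)")
SOLID = re.compile(r"(^|\s|:)(3|4|5)\d{15}($|\s|\.)")

samples = [
    "5123 4567 8901 2345",  # four separated blocks
    "5123456789012345",     # sixteen digits in a row
    "5123 456789012345",    # one separator only
]
for text in samples:
    grouped = bool(GROUPED.search(text))
    solid = bool(SOLID.search(text))
    print(f"{text!r}: grouped={grouped} solid={solid}")
```

Each pattern accepts exactly one layout, so a number that mixes the two layouts is matched by neither.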

kaufmed

According to this site, ContentManager, which I assume is related to Postini since the page is hosted at Postini.com, supports PCRE. PCRE is a very rich set of regular expressions. Please forgive me if I am incorrect about ContentManager. I haven't used Postini before.

As to your issue, your pattern seems a bit difficult to read, and subsequently a bit difficult to debug. What if we try cleaning it up a tad?

^[\s:]?[345]\d{3}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}[\s.]?$


Also, I believe the reason you get a false positive in your example is that you are using "\s" to refer to spaces, but "\s" matches any whitespace, including newlines. Since the list contains at least four groups of four digits on consecutive lines, the pattern still matches. The pattern I proposed accounts for this, but you can just as easily modify yours by replacing the "\s" occurrences with literal space characters.
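
The whitespace point can be checked with a short Python sketch. Python's `re` is used here as a stand-in for Postini's engine, which is an assumption; the variable names are made up for illustration.

```python
import re

# The asker's original pattern, copied as posted.
ORIGINAL = (r"(^|\s|:)(3|4|5)\d{3}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}"
            r"(-|_|,|;|:|'|\.|\s|\\|/)\d{4}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}($|\s|\.)")
# The same pattern with every \s swapped for a literal space character.
SPACE_ONLY = ORIGINAL.replace(r"\s", " ")

short_codes = ("Hello, here are the short codes I'd like to order:\n\n"
               "5123\n5327\n7347\n2236\n3456\n\nThanks")
card_line = "Here's the credit card info: 4555 6631 3445 2304."

hit = re.search(ORIGINAL, short_codes)
print(repr(hit.group(0)))                      # the match runs across four lines
print(re.search(SPACE_ONLY, short_codes))      # None: newlines no longer count
print(bool(re.search(SPACE_ONLY, card_line)))  # True: a spaced card number still matches
```

One trade-off of the literal space is that a tab-separated number would no longer match.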
bradyhummel (ASKER)

Thanks for the response. I've tested this against the email that was brought to my attention regarding false positives. However, when I test the RegEx in Postini (Content Manager is correct) against this sample email (this is a fake CC number):

"Hey guys, thanks for the placing the order.

Here's the credit card info: 4555 6631 3445 2304.

Thanks"

It doesn't catch what is blatantly a CC number.
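
One plausible explanation, assuming Postini treats ^ and $ roughly the way Python's `re` does, is that the anchors in the cleaned-up pattern require the number to fill the whole message (or a whole line), while here it sits in the middle of a sentence. The sketch below is only an illustration of that idea, with made-up variable names.

```python
import re

SEP = r"[-_,;:'. \\/]?"
BODY = r"[345]\d{3}" + SEP + r"\d{4}" + SEP + r"\d{4}" + SEP + r"\d{4}"
# kaufmed's cleaned-up pattern: BODY wrapped in start and end anchors.
ANCHORED = r"^[\s:]?" + BODY + r"[\s.]?$"

email = ("Hey guys, thanks for the placing the order.\n\n"
         "Here's the credit card info: 4555 6631 3445 2304.\n\n"
         "Thanks")

print(re.search(ANCHORED, email))                # None: ^ only matches at the start of the text
print(re.search(ANCHORED, email, re.MULTILINE))  # None: the line starts with "Here's", not a digit
print(re.search(BODY, email).group(0))           # the number is found once the anchors are dropped
```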
SOLUTION
kaufmed

(This solution is only available to members.)

Thanks, I think that that may have done it. A few more questions, though.

1. My current RegEx for 16 straight digits for Visa, Mastercard, JCB is (^|\s|:)(3|4|5)\d{15}($|\s|\.), but following your example, it should be [345]\d{15}, is that correct?

2. Is there any way to prevent these combinations (X's being numbers) using one RegEx?

4XXX XXXXXXXXXXXX
4XXXXXXX XXXXXXXX
4XXXXXXXXXXX XXXX

You could probably do something like this:

[345](?:\D*\d){15}


which equates to:

[345]      - 3, 4, or 5
(?: ... )  - non-capturing group
\D*        - zero-or-more ( * ) of any character NOT a digit ( \D )
\d         - single digit
{15}       - 15 of the thing to the left, which in this case is the entire group


This should flag even occurrences such as:

There were 4 books on 3 bookshelves. Each book had 88 pages...

Essentially, it matches the digits even when they are split by any number of arbitrary non-digit characters. The only issue you could run into is valid text that contains numeric values which are not actually credit card numbers. I would expect such an occurrence to be rare, though.
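
A quick way to see both the benefit and the risk of this loose pattern is a Python sketch like the one below. Python's `re` is an assumption here, standing in for Postini, and the last sample sentence is made up for illustration.

```python
import re

# A leading 3, 4 or 5, then 15 more digits, each one optionally
# preceded by any run of non-digit characters.
LOOSE = re.compile(r"[345](?:\D*\d){15}")

split_cards = [
    "4555 663134452304",
    "45556631 34452304",
    "455566313445 2304",
]
for text in split_cards:
    print(text, "->", bool(LOOSE.search(text)))  # each layout is flagged

# As quoted in the thread, this sentence holds only four digits,
# so by itself it is not flagged.
few_digits = "There were 4 books on 3 bookshelves. Each book had 88 pages..."
print(bool(LOOSE.search(few_digits)))

# Ordinary prose with 16 or more digits after a leading 4 is flagged too.
busy_prose = "Call 4 times, wait 30 min, then dial 555 0100 ext 2217 by 1630."
print(bool(LOOSE.search(busy_prose)))
```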

Alright... I think I'm making headway. Your second response seems to do the trick for the most part. If I gave you a few more RegExes that I've created, would you mind reviewing them as you did the given example? I have a separate RegEx for Discover Card, another for AmEx and Diners Club, and another for SSNs. They all follow the same format, but I'm very much a novice, and you seem to be very much an expert in this. Would that be OK?

Certainly, no problem. Even if I'm not the one to look at them, there are several experts who participate on the regex boards who are extremely knowledgeable in the area  = )

Thanks! Here's what I have in addition to the one you already assisted me with:

DiscoverCard
(^|\s|:)(6011)(-|_|,|;|:|'|\.|\s|\\|/)\d{4}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}($|\s|\.)  OR
(^|\s|:)(6011)\d{12}($|\s|\.)

AmExDinersClub
(^|\s|:)(3\d{3})(-|_|,|;|:|'|\.|\s|\\|/)\d{6}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}($|\s|\.) OR
(^|\s|:)(3\d{13})($|\s|\.)

SSNFilter
(^|\s|:)([0-9]{3})(-|_|,|;|:|'|\.|\s|\\|/)([0-9]{2})(-|_|,|;|:|'|\.|\s|\\|/)([0-9]{4})($|\s|\.)


DiscoverCard
(^|\s|:)(6011)(-|_|,|;|:|'|\.|\s|\\|/)\d{4}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}($|\s|\.)  OR

Can be more concisely written as:
(^|[\s:])(6011)([-_,;:'.\s\\/])\d{4}([-_,;:'.\s\\/])\d{4}([-_,;:'.\s\\/])\d{4}($|[\s.])
or maybe even:
(^|[\s:])(6011)(([-_,;:'.\s\\/])\d{4}){3}($|[\s.])
though in the 2nd case, the capturing groups will be different (and may cause problems if you're making use of them)
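
The difference in capturing groups can be seen with a small Python sketch. Python's `re` is an assumption here, and the sample number is made up.

```python
import re

SEP = r"([-_,;:'.\s\\/])"
# Three separator groups written out one by one.
EXPANDED = (r"(^|[\s:])(6011)" + SEP + r"\d{4}" + SEP + r"\d{4}"
            + SEP + r"\d{4}" + r"($|[\s.])")
# The same match expressed with a repeated group.
REPEATED = r"(^|[\s:])(6011)(" + SEP + r"\d{4}){3}($|[\s.])"

sample = "6011-1111 2222.3333"

# Six groups: prefix, "6011", the three separators, suffix.
print(re.search(EXPANDED, sample).groups())
# Five groups: a repeated group only keeps its last iteration.
print(re.search(REPEATED, sample).groups())
```

Both patterns match the same text; only the numbering and content of the groups change.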

(^|\s|:)(6011)\d{12}($|\s|\.)

Can be more neatly (debatable maybe!) written as:
(^|[\s:])(6011)\d{12}($|[\s.])

You could potentially combine the 2 different expressions to:
(^|[\s:])(6011)(([-_,;:'.\s\\/])?\d{4}){3}($|[\s.])
though it would also accept values missing some (but not all) of the delimiters like
6011-12345678.1234
and
601112345678:1234
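
The behaviour of the combined expression can be checked with a few sample strings. This is a minimal sketch in Python, assumed to be close enough to Postini for this purpose; the sample numbers are illustrative.

```python
import re

COMBINED = re.compile(r"(^|[\s:])(6011)(([-_,;:'.\s\\/])?\d{4}){3}($|[\s.])")

samples = [
    "6011 1234 1234 1234",  # all delimiters present
    "6011123412341234",     # no delimiters
    "6011-12345678.1234",   # one delimiter missing
    "601112345678:1234",    # two delimiters missing
    "6011 1234 1234",       # only three blocks, so no match
]
for text in samples:
    print(f"{text!r}: {bool(COMBINED.search(text))}")
```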

AmExDinersClub
(^|\s|:)(3\d{3})(-|_|,|;|:|'|\.|\s|\\|/)\d{6}(-|_|,|;|:|'|\.|\s|\\|/)\d{4}($|\s|\.) OR

Can be more concisely written as:
(^|[\s:])(3\d{3})([-_,;:'.\s\\/])\d{6}([-_,;:'.\s\\/])\d{4}($|[\s.])

(^|\s|:)(3\d{13})($|\s|\.)

Can be more neatly (debatable maybe!) written as:
(^|[\s:])(3\d{13})($|[\s.])

SSNFilter
(^|\s|:)([0-9]{3})(-|_|,|;|:|'|\.|\s|\\|/)([0-9]{2})(-|_|,|;|:|'|\.|\s|\\|/)([0-9]{4})($|\s|\.)

Can be more concisely written as:
(^|[\s:])([0-9]{3})([-_,;:'.\s\\/])([0-9]{2})([-_,;:'.\s\\/])([0-9]{4})($|[\s.])
Note also that by allowing : and . as characters for indicating the start and end of a value, you'll get a match on values like this:
6011.1234.1234.1234.5678
1234:6011:1234:1234:1234.5678
1234:6011:1234:1234:1234
the number matched in all above cases for the DiscoverCard pattern would be
6011 1234 1234 1234
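
The three values above can be run through the concise DiscoverCard pattern to confirm this. The sketch uses Python's `re` as an assumed stand-in for Postini, and strips the non-digits from each match for display.

```python
import re

DISCOVER = re.compile(
    r"(^|[\s:])(6011)([-_,;:'.\s\\/])\d{4}([-_,;:'.\s\\/])\d{4}"
    r"([-_,;:'.\s\\/])\d{4}($|[\s.])"
)

samples = [
    "6011.1234.1234.1234.5678",
    "1234:6011:1234:1234:1234.5678",
    "1234:6011:1234:1234:1234",
]
for text in samples:
    match = DISCOVER.search(text)
    digits = re.sub(r"\D", "", match.group(0))  # keep only the digits of the match
    print(text, "->", digits)
```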

And bear in mind that the - character within [] brackets is a special character unless it's the first or last one listed, so don't change:
[-_,;:'.\s\\/]
to
[=-_,;:'.\s\\/]
as it would match any character between = and _

Instead, you'd change it to:
[=\-_,;:'.\s\\/]
or
[-=_,;:'.\s\\/]
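
The effect of the accidental range is easy to check. The following Python sketch compares the three character classes on single characters; the test characters are arbitrary examples.

```python
import re

RANGE_BUG = re.compile(r"[=-_,;:'.\s\\/]")  # "=-_" is read as a range from "=" to "_"
ESCAPED = re.compile(r"[=\-_,;:'.\s\\/]")   # hyphen escaped
LEADING = re.compile(r"[-=_,;:'.\s\\/]")    # hyphen listed first

for ch in ["-", "=", "A", "?"]:
    print(
        repr(ch),
        bool(RANGE_BUG.fullmatch(ch)),
        bool(ESCAPED.fullmatch(ch)),
        bool(LEADING.fullmatch(ch)),
    )
```

With the accidental range, "A" and "?" are accepted as separators while the hyphen itself is rejected; the escaped and leading-hyphen forms accept "-" and "=" and reject the other two.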
Thanks, everyone, for your help. It seems that kaufmed's example:

[345]\d{3}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}

works, but it also catches the following:

5111 1111 1111 11111

It doesn't seem to matter where the extra digit is, it will catch it, so 51111 1111 1111 1111 flags as a CC too.

Any suggestions?
Are you trying to validate these in a single field value, or capture them from text?

For validation, you'd generally use start-of-line and end-of-line placeholders as part of the validation like this:
^[345]\d{3}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}$

When capturing them from text, you need to consider what you'll allow at either end of the pattern, e.g. using a negative lookbehind and a negative lookahead:
(?<!\d)[345]\d{3}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}[-_,;:'. \\/]?\d{4}(?!\d)
disallows:
5111 1111 1111 11111
but will match the 5111 1111 1111 1111 from:
5111 1111 1111 1111 1
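
The lookaround behaviour can be tried in an engine that supports these assertions. The sketch below uses Python's `re`, which is an assumption about the test environment. Note that in Python the unbounded pattern does not match "51111 1111 1111 1111" at all, so a made-up variant with the extra digit in front, "15111 1111 1111 1111", is used to show the lookbehind.

```python
import re

SEP = r"[-_,;:'. \\/]?"
BODY = r"[345]\d{3}" + SEP + r"\d{4}" + SEP + r"\d{4}" + SEP + r"\d{4}"
UNBOUNDED = re.compile(BODY)
# Reject the match if a digit sits directly before or after it.
BOUNDED = re.compile(r"(?<!\d)" + BODY + r"(?!\d)")

samples = [
    "5111 1111 1111 11111",   # extra digit at the end
    "15111 1111 1111 1111",   # extra digit at the front
    "5111 1111 1111 1111 1",  # extra digit after a space
]
for text in samples:
    loose = UNBOUNDED.search(text)
    strict = BOUNDED.search(text)
    print(
        repr(text),
        "unbounded:", loose.group(0) if loose else None,
        "bounded:", strict.group(0) if strict else None,
    )
```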
To answer your question, we are trying to prevent customers from using credit card numbers (and socials) in emails to us due to PCI requirements.

We're using Postini's Content Manager to scan incoming email.

With the code you suggested (the second line, since we need to scan the text of a document), I get this error when I check the syntax in Postini: "Assertions starting with (? are not supported in (?<!\d), (?!\d)"
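
When lookarounds are unavailable, one common workaround is to consume the boundary characters with ordinary groups, using (^|\D) in front and ($|\D) behind. The sketch below shows that idea in Python. It is an assumption, not a confirmed Postini solution, and whether Content Manager accepts this syntax has not been verified here.

```python
import re

SEP = r"[-_,;:'. \\/]?"
BODY = r"[345]\d{3}" + SEP + r"\d{4}" + SEP + r"\d{4}" + SEP + r"\d{4}"
# The boundaries are matched as real characters instead of being asserted,
# so group 2 holds the number without its neighbours.
NO_LOOKAROUND = re.compile(r"(^|\D)(" + BODY + r")($|\D)")

samples = [
    "5111 1111 1111 11111",
    "15111 1111 1111 1111",
    "5111 1111 1111 1111 1",
    "Here's the credit card info: 4555 6631 3445 2304.",
]
for text in samples:
    match = NO_LOOKAROUND.search(text)
    print(repr(text), "->", match.group(2) if match else None)
```

Because the neighbouring characters are consumed, the full match is one or two characters wider than the number itself, which does not matter when the goal is only to flag a message.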
ASKER CERTIFIED SOLUTION

(This solution is only available to members.)

That may have done it. Let me do some testing with this and I'll let you know.
Thanks for your help, guys. It seems to be working now. Appreciate the input. I'm clearly a novice when it comes to writing these.