Petr Poleshko

Regex expression for email addresses with some exclusion logic

Hello,
I've hit a wall with a regular expression that is supposed to match email addresses while excluding addresses that appear in certain formats.

I need to apply a fairly standard regex:
\b[!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9][!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9\.]*[!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9]@[a-zA-Z0-9\-][a-zA-Z0-9\-\.]+[a-zA-Z0-9\-]\b

which should behave as follows:
### Include ###
anyvalidaddress@anyvalid.domain

### Exclude ###
<anyvalidaddress@anyvalid.domain
<anyvalidaddress@anyvalid.domain>
<mailto:anyvalidaddress@anyvalid.domain>

I deliberately didn't use real email addresses, because the regex should classify ANY valid email address one way or the other: include or exclude.

Updated 7:23 PM MSK 2/6/18
Below is sample text that needs to be processed:
The phrase regular expressions <anyvalidaddress@anyvalid.domain (and consequently, regexes) is often used to mean the specific, <anyvalidaddress@anyvalid.domain> standard textual syntax (distinct from the mathematical $A12345@example.com notation described below) for representing patterns that matching text <mailto:anyvalidaddress@anyvalid.domain> need to conform Abc\@def@example.com to. Each character in a regular expression (that is, each character in the string describing its pattern) is understood to be a metacharacter (with its special meaning), or a regular character (with its literal meaning). For example,
in anyvalid.address@anyu.any.anyvalid.domain the regex a. a <anyvalid.address@anyu.any.anyvalid.domain is a literal character which matches just 'a' and . is a meta character which matches <anyvalida.ddress@anyu.any.anyvalid.domain> every character except a Fred\ Bloggs@example.com newline. Therefore, this regex would match Joe.\\Blow@example.com for example 'a ' or 'ax' or 'a0'. Together, metacharacters and literal characters can be used to identify textual material of a given pattern, or process <mailto:anyvali.daddress@anyu.any.anyvalid.domain> a number "Abc@def"@example.com of instances anyvalidaddress@anyvalid.domain of it. Pattern-matches can vary from a
precise equality to a very general similarity (controlled by customer/department=shipping@example.com the metacharacters). For example, . is a very general pattern, [a-z] (match all letters from 'a' to 'z') is less general and a is a
precise pattern (match just 'a'). The metacharacter syntax is designed specifically to represent prescribed targets in a concise and flexible way to direct the automation of text processing of a variety of input data, in a form easy to type using a standard ASCII keyboard.
!def!xyz%abc@example.com
_somename@example.com

BOLD - addresses that should be CAUGHT
UNDERLINED - addresses that should be EXCLUDED
Rgonzo1971

Hi,

Please try:
(?<!<)(?<!<mailto:)\b[!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9][!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9\.]*[!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9]@[a-zA-Z0-9\-][a-zA-Z0-9\-\.]+[a-zA-Z0-9\-]\b

Regards
Petr Poleshko (ASKER)

Hi Rgonzo1971

Thank you for the quick response, but unfortunately this regex behaves as follows:
input: <name.lastname@anyvalid.domain>
matched: lastname@anyvalid.domain

input: <mailto:name.lastname@anyvalid.domain>
matched: lastname@anyvalid.domain

but these should be excluded, because the email address is inside <> or is preceded by 'mailto:'.
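
The behaviour reported above can be reproduced outside DLP. The following is a minimal sketch, assuming Python's `re` module as a stand-in engine (the thread itself tests on regex101.com and McAfee DLP); the names `LOCAL`, `DOMAIN`, `suggested` and `first_match` are illustrative. A negative lookbehind only inspects the characters immediately before the position where a match starts, so the engine is free to start the match later, right after the dot in the local part.

```python
import re

# Address pattern from the question, split into pieces for readability.
# (Escapes that are redundant inside a character class are dropped.)
LOCAL = (r"[!#$%&'*+\-/=?^_`{|}~a-zA-Z0-9]"
         r"[!#$%&'*+\-/=?^_`{|}~a-zA-Z0-9.]*"
         r"[!#$%&'*+\-/=?^_`{|}~a-zA-Z0-9]")
DOMAIN = r"[a-zA-Z0-9\-][a-zA-Z0-9\-.]+[a-zA-Z0-9\-]"

# The suggested regex: two lookbehinds in front of the original pattern.
suggested = re.compile(r"(?<!<)(?<!<mailto:)\b" + LOCAL + "@" + DOMAIN + r"\b")

def first_match(text):
    """Return the first matched substring, or None if nothing matches."""
    m = suggested.search(text)
    return m.group(0) if m else None

for sample in ("<name.lastname@anyvalid.domain>",
               "<mailto:name.lastname@anyvalid.domain>",
               "<anyvalidaddress@anyvalid.domain>"):
    print(sample, "->", first_match(sample))
```

With a dot in the local part the match starts after the dot, which is why the lookbehinds appear to have no effect; without a dot there is no later word boundary to start from, so the address is skipped as intended.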
Which engine are you using?
Try this and let me know if there are any cases that it doesn't work for:
^[a-zA-Z0-9\.!#\$%&'][^@]*@[a-zA-Z0-9!#\$%&'][^.]*.[a-zA-Z0-9]{2,}

Hi,
Rgonzo1971:
which engine are you using?
We use McAfee DLP, but I check the regex both on regex101.com and in DLP.
It works on regex101.com; you just have to turn on multi-line mode. See the screenshot.

I will check for a syntax that works with DLP.  What do you know about the programming language or the regex engine that they use?

[Screenshot: regex101_com.png]
I think this is because the sample text has line breaks in it.
Garfield Samuels,
Thank you for the quick response too, but unfortunately your version of the regex produces a lot of false positives. It matches:
... February 6, 2018 3:53 PM To: LastName, Name <Name.Lastname@anyvalid.domain>

but this shouldn't be matched:
1. "... February 6, 2018 3:53 PM To: LastName, Name " doesn't contain an email address at all
2. "<Name.Lastname@anyvalid.domain" should be excluded, as the email address begins with "<"
Garfield Samuels
I will check for a syntax that works with DLP.  What do you know about the programming language or the regex engine that they use?
McAfee DLP 10.x uses Google RE2
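
RE2 does not support lookbehind or lookahead, so the `(?<!...)` suggestions cannot be used there. One lookaround-free alternative is to let a first alternative consume the unwanted `<...` and `<mailto:...` forms and capture only what the second alternative finds. The sketch below uses Python's `re` as a stand-in but sticks to syntax that RE2 also accepts. It assumes the consumer can read capture group 1; whether McAfee DLP can do that is not stated in this thread. `PATTERN` and `bare_addresses` are hypothetical names.

```python
import re

# Pieces of the address pattern from the question.
LOCAL = (r"[!#$%&'*+\-/=?^_`{|}~a-zA-Z0-9]"
         r"[!#$%&'*+\-/=?^_`{|}~a-zA-Z0-9.]*"
         r"[!#$%&'*+\-/=?^_`{|}~a-zA-Z0-9]")
DOMAIN = r"[a-zA-Z0-9\-][a-zA-Z0-9\-.]+[a-zA-Z0-9\-]"
ADDR = LOCAL + "@" + DOMAIN + r"\b"

# Branch 1 swallows "<addr" and "<mailto:addr" without capturing anything.
# Branch 2 captures a bare address in group 1.
PATTERN = re.compile(r"<(?:mailto:)?" + ADDR + r"|\b(" + ADDR + r")")

def bare_addresses(text):
    """Return only the addresses matched by the capturing branch."""
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1)]

text = ("see anyvalid.address@anyvalid.domain or <anyvalid.address@anyvalid.domain "
        "and <anyvalid.address@anyvalid.domain> "
        "and <mailto:anyvalid.address@anyvalid.domain> end")
print(bare_addresses(text))
```

Because the scan moves left to right, an address that follows `<` is consumed whole by the first branch, so the second branch never gets the chance to start in the middle of it.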
Rgonzo1971:
https://regex101.com/r/btijFc/1/

the emails:
<anyvalid.address@anyvalid.domain
<anyvalid.address@anyvalid.domain>
<mailto:anyvalid.address@anyvalid.domain>

are DETECTED as:
address@anyvalid.domain
address@anyvalid.domain>
address@anyvalid.domain>

which is wrong.

Guys, I really appreciate all your help, but when testing, please check against every form of email address that the regex in the original question allows.
I read online that DLP uses Java, so as far as I know you can force multi-line mode with the following:

(?m)^[a-zA-Z0-9\.!#\$%&'^@]*@[a-zA-Z0-9!#\$%&'^.]*.[a-zA-Z0-9]{2,}

Give it a try and let me know.
Garfield Samuels

The last version is much closer.

<email... and <mailto:email... are now excluded as I need, but:
1. email:anyvali.daddress@anyu.any.anyvalid.domain is missed. I'd like such variants to be INCLUDED, as mentioned in the initial request.

All of the sample email addresses below are VALID per the RFC:
Abc\@def@example.com
Fred\ Bloggs@example.com
Joe.\\Blow@example.com
"Abc@def"@example.com
customer/department=shipping@example.com
_somename@example.com
$A12345@example.com
!def!xyz%abc@example.com
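
As a quick check, the samples above can be run through the regex from the original question. This is a rough sketch using Python's `re` as a stand-in engine and `fullmatch`, which is stricter than scanning running text; `QUESTION_REGEX`, `SAMPLES` and `fully_matched` are illustrative names. Under these assumptions only two of the eight samples are accepted in full: the quoted and backslash-escaped forms contain characters outside the character classes, and the leading `\b` cannot match in front of `$` or `!` at the start of the string.

```python
import re

# The regex from the original question, copied verbatim.
QUESTION_REGEX = re.compile(
    r"\b[!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9]"
    r"[!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9\.]*"
    r"[!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9]"
    r"@[a-zA-Z0-9\-][a-zA-Z0-9\-\.]+[a-zA-Z0-9\-]\b"
)

# Sample addresses from the list above, in the same order.
SAMPLES = [
    r"Abc\@def@example.com",
    r"Fred\ Bloggs@example.com",
    r"Joe.\\Blow@example.com",
    r'"Abc@def"@example.com',
    "customer/department=shipping@example.com",
    "_somename@example.com",
    "$A12345@example.com",
    "!def!xyz%abc@example.com",
]

def fully_matched(samples):
    """Return the samples that the regex accepts from first to last character."""
    return [s for s in samples if QUESTION_REGEX.fullmatch(s)]

print(fully_matched(SAMPLES))
```

When the same regex is used to search inside text rather than to validate a whole string, it can still report a partial match such as `A12345@example.com`.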


and only the two in bold are matched:
$A12345@example.com
!def!xyz%abc@example.com


The reason I mention these unusual addresses is that, besides the requirement to filter out <email... and <mailto:email..., the regex in the original question is also meant to cover these addresses, which are VALID email addresses per the RFC.
The devil is always in the details :D

Have a go at this and let me know.

(?m)^([a-zA-Z]*:|)[a-zA-Z0-9\.!#\$%&"_=\/\\ '^@]*@[a-zA-Z0-9!#\$%&'\.]*.[a-zA-Z0-9]{2,}

Then try:
(?<!<)(?<!<mailto:)(\b[!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9][!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9\.]*[!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9]@[a-zA-Z0-9\-][a-zA-Z0-9\-\.]+[a-zA-Z0-9\-])\b.*(?<!>)$

Garfield Samuels, Rgonzo1971,
I have updated the question itself with a sample of the text (just a random Wikipedia paragraph about regex) with email addresses inserted at random, both ones expected to be detected and ones expected to be excluded. I hope this gives you a better understanding of what I really need.