Petr Poleshko
asked on
Regex expression for email addresses with some exclusion logic
Hello,
I've hit a wall with a regular expression that is supposed to match email addresses while excluding addresses that appear in certain formats.
I need to extend a fairly standard email regex:
\b[!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9][!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9\.]*[!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9]@[a-zA-Z0-9\-][a-zA-Z0-9\-\.]+[a-zA-Z0-9\-]\b
so that it does the following:
### Include ###
anyvalidaddress@anyvalid.domain
### Exclude ###
<anyvalidaddress@anyvalid.domain
<anyvalidaddress@anyvalid.domain>
<mailto:anyvalidaddress@anyvalid.domain>
I deliberately didn't use real email addresses, because the regex should handle ANY valid email address one way or the other: include it or exclude it.
Updated 7:23 PM MSK 2/6/18
Below is the sample text that needs to be processed.
The phrase regular expressions <anyvalidaddress@anyvalid.domain (and consequently, regexes) is often used to mean the specific, <anyvalidaddress@anyvalid.domain> standard textual syntax (distinct from the mathematical $A12345@example.com notation described below) for representing patterns that matching text <mailto:anyvalidaddress@anyvalid.domain> need to conform Abc\@def@example.com to. Each character in a regular expression (that is, each character in the string describing its pattern) is understood to be a metacharacter (with its special meaning), or a regular character (with its literal meaning). For example,
in anyvalid.address@anyu.any.anyvalid.domain the regex a. a <anyvalid.address@anyu.any.anyvalid.domain is a literal character which matches just 'a' and . is a meta character which matches <anyvalida.ddress@anyu.any.anyvalid.domain> every character except a Fred\ Bloggs@example.com newline. Therefore, this regex would match Joe.\\Blow@example.com for example 'a ' or 'ax' or 'a0'. Together, metacharacters and literal characters can be used to identify textual material of a given pattern, or process <mailto:anyvali.daddress@anyu.any.anyvalid.domain> a number "Abc@def"@example.com of instances anyvalidaddress@anyvalid.domain of it. Pattern-matches can vary from a
precise equality to a very general similarity (controlled by customer/department=shipping@example.com the metacharacters). For example, . is a very general pattern, [a-z] (match all letters from 'a' to 'z') is less general and a is a
precise pattern (match just 'a'). The metacharacter syntax is designed specifically to represent prescribed targets in a concise and flexible way to direct the automation of text processing of a variety of input data, in a form easy to type using a standard ASCII keyboard.
!def!xyz%abc@example.com
_somename@example.com
BOLD - addresses that should be CAUGHT
UNDERLINED - addresses that should be EXCLUDED
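To see why the base expression alone is not enough, here is a minimal sketch that runs it against two of the "exclude" samples. Python's `re` module is used only for illustration, and the names `BASE` and `base_matches` are hypothetical.

```python
import re

# The expression from the question, copied verbatim.
BASE = re.compile(
    r"\b[!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9][!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9\.]*"
    r"[!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9]@[a-zA-Z0-9\-][a-zA-Z0-9\-\.]+[a-zA-Z0-9\-]\b"
)

def base_matches(text):
    """Return every substring matched by the unmodified base expression."""
    return BASE.findall(text)

# \b is satisfied between "<" (or ":") and the first letter, so the
# address inside the brackets is still reported.
print(base_matches("<anyvalidaddress@anyvalid.domain>"))
print(base_matches("<mailto:anyvalidaddress@anyvalid.domain>"))
```

Both calls report the bare address, which is exactly the behaviour the question wants to suppress: the word boundary does not care which punctuation character precedes the address.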
ASKER
Hi Rgonzo1971
Thank you for the quick response, but unfortunately this regex behaves as follows:
input: <name.lastname@anyvalid.domain>
matched: lastname@anyvalid.domain
input: <mailto:name.lastname@anyvalid.domain>
matched: lastname@anyvalid.domain
but these should be excluded, because the email address is inside <> or begins with 'mailto:'.
Which engine are you using?
Try this and let me know if there are any cases that it doesn't work for:
^[a-zA-Z0-9\.!#\$%&'][^@]*@[a-zA-Z0-9!#\$%&'][^.]*.[a-zA-Z0-9]{2,}
ASKER
Hi,
Rgonzo1971:
"Which engine are you using?"
We use McAfee DLP, but I check the regex both on regex101.com and in DLP.
See the example on regex101:
https://regex101.com/r/btijFc/1/
It works on regex101.com; you just have to turn on multi-line mode. See below.
I will check for a syntax that works with DLP. What do you know about the programming language or the regex engine that they use?
I think this is because the sample text has line breaks in it.
regex101_com.png
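The effect of multi-line mode on an expression anchored with ^ can be sketched as follows. This is only an illustration in Python: the sample text and the pattern `ANCHORED` are simplified, hypothetical stand-ins, not the expression suggested above. The inline `(?m)` flag is used because it is also part of RE2's documented syntax.

```python
import re

# Hypothetical sample: the address sits on the second line.
SAMPLE = "To: LastName, Name\nuser@example.com\n"

# Simplified stand-in pattern, not the expression from this thread.
ANCHORED = r"^\S+@\S+$"

def find_lines(pattern, text):
    """Return all non-overlapping matches of pattern in text."""
    return re.findall(pattern, text)

print(find_lines(ANCHORED, SAMPLE))           # default: ^ only at the start of the text
print(find_lines("(?m)" + ANCHORED, SAMPLE))  # multi-line: ^ and $ apply per line
```

Without the flag nothing is found, because ^ can only match at the very beginning of the text; with it, the address on the second line is found.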
ASKER
Garfield Samuels
Thank you for the quick response too, but unfortunately your version of the regex produces a lot of false positives, for example:
... February 6, 2018 3:53 PM To: LastName, Name <Name.Lastname@anyvalid.domain>
but this shouldn't be matched:
1. "... February 6, 2018 3:53 PM To: LastName, Name " - doesn't contain an email address at all
2. "<Name.Lastname@anyvalid.domain" should be excluded, as the email address begins with "<"
ASKER
Garfield Samuels
"I will check for a syntax that works with DLP. What do you know about the programming language or the regex engine that they use?"
McAfee DLP 10.x uses Google RE2.
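RE2 does not support lookbehind assertions such as (?<!<), so the exclusion has to be expressed without them. One common workaround, sketched below, is to let a first alternative consume the unwanted forms and to capture only what is left in a group. This is a sketch under assumptions, not a tested DLP rule: it uses only constructs that RE2's syntax documents (character classes, non-capturing groups, alternation), but it was written against Python's `re` module, and whether McAfee DLP lets a rule act on a capture group rather than on the whole match is not known here. The names `ADDRESS`, `PATTERN` and `plain_addresses` are hypothetical.

```python
import re

# Address part of the expression from the question, without the \b anchors.
ADDRESS = (
    r"[!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9][!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9\.]*"
    r"[!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9]@[a-zA-Z0-9\-][a-zA-Z0-9\-\.]+[a-zA-Z0-9\-]"
)

# First alternative swallows "<addr", "<addr>" and "<mailto:addr>";
# second alternative captures every other address in group 1.
PATTERN = re.compile(r"<(?:mailto:)?" + ADDRESS + r">?|(" + ADDRESS + r")")

def plain_addresses(text):
    """Return addresses that are not introduced by "<" or "<mailto:"."""
    return [m.group(1) for m in PATTERN.finditer(text) if m.group(1)]

sample = ("a <one@x.example> b <mailto:two@x.example> c three@x.example "
          "d email:four@x.example e <five@x.example f")
print(plain_addresses(sample))
```

On the sample string only three@x.example and four@x.example are returned. Dropping the leading \b also means that an address such as $A12345@example.com keeps its first character, which \b would otherwise cut off.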
ASKER
Rgonzo1971:
The emails:
<anyvalid.address@anyvalid.domain
<anyvalid.address@anyvalid.domain>
<mailto:anyvalid.address@anyvalid.domain>
are DETECTED as:
address@anyvalid.domain
address@anyvalid.domain>
address@anyvalid.domain>
which is wrong.
Guys, I really appreciate all your help here, but when testing, please check against every form of email address that the regex in the original question allows.
https://regex101.com/r/btijFc/1/
ASKER
Garfield Samuels
The last version is much closer: <email... and <mailto:email... are now excluded, as I need, but
1. email:anyvali.daddress@anyu.any.anyvalid.domain is missed. I'd like such a variant to be INCLUDED, as I mentioned in the initial request.
All of the sample email addresses below are VALID as per the RFC:
Abc\@def@example.com
Fred\ Bloggs@example.com
Joe.\\Blow@example.com
"Abc@def"@example.com
customer/department=shipping@example.com
_somename@example.com
$A12345@example.com
!def!xyz%abc@example.com
and only these two (shown in bold) are matched:
$A12345@example.com
!def!xyz%abc@example.com
The reason I mention these weird addresses is that, besides the requirement to filter out <email... and <mailto:email..., the regex in the original question also includes these weird addresses, which are VALID email addresses as per the RFC.
then try
(?<!<)(?<!<mailto:)(\b[!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9][!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9\.]*[!#\$%&'\*\+\-\/=\?\^_`{\|}~a-zA-Z0-9]@[a-zA-Z0-9\-][a-zA-Z0-9\-\.]+[a-zA-Z0-9\-])\b.*(?<!>)$
ASKER
Garfield Samuels & Rgonzo1971
I had to update the question itself with a sample text (just a random Wikipedia paragraph about regex) with email addresses inserted at random, both those expected to be detected and those expected to be excluded. I hope this gives you a better understanding of what I really need.
pls try
Regards