Flag or remove emails, obscured emails, or telephone numbers using PHP

hi

I'm writing a system that's free to use but requires users to subscribe before they can send messages containing contact info such as email addresses or phone numbers.

Looking for @ or (at) is simple enough, and should flag accounts that try that method to circumvent subscribing so we can check them. Does anyone have ideas on how I could also check for phone numbers this way, for example by looking for 4 or more digits in a row?

I'm using PHP by the way.

Thanks
Neil
Asked by Neil Thompson (Senior Systems Developer)
Sam Wallis (IT/Web Manager) commented:
You should use regular expressions; in PHP, that means preg_match().
Julian Hansen commented:
What if I send this

john(dot)smith(at)somewhere(dot)com
Or
john.smith @ somewhere . com

There are many ways of obscuring something - how meticulous do you want to get?
Neil Thompson (Senior Systems Developer, author) commented:
Hi Julian,

There is obviously an nth degree to which people can try to get round this, but I just want to catch a fair few "chancers". Anyone using @ I can grab, as most people won't use it in an "about me" text.

I can also check for the biggies (gmail, hotmail, yahoo, .uk, .com, (dot), (.)), but it would also be handy to look for groups of numbers, say 4 or more in a row, such as 07941 111222.
Julian Hansen commented:
Then you should be looking at using regular expressions.
preg_match()

The next question is - do you want to block the message from being sent OR do you want to remove the offending content and send the modified message on?
Neil Thompson (Senior Systems Developer, author) commented:
Ideally just remove and send as it then negates user intervention.
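
A minimal sketch of the "remove and send" option using preg_replace(). The function name redactContactInfo(), the "[removed]" placeholder, and both patterns are assumptions for illustration only; they catch plain addresses and longer digit runs, not deliberately obscured text.

```php
<?php
// Hypothetical helper: replaces likely contact details with a placeholder.
// The patterns are deliberately simple and will miss determined evasion.
function redactContactInfo(string $text, string $placeholder = '[removed]'): string
{
    $patterns = [
        // Plain email addresses such as bob@testemail.co.uk
        '/[A-Z0-9._%+-]+\s*@\s*[A-Z0-9.-]+\.[A-Z]{2,}/i',
        // A digit, then 5+ digits or common phone separators, ending in a digit
        '/\d[\d\-. ()]{5,}\d/',
    ];

    // preg_replace() returns null on a PCRE error; fall back to the input.
    return preg_replace($patterns, $placeholder, $text) ?? $text;
}

echo redactContactInfo("Mail bob@testemail.co.uk or call 07912 123456 today"), "\n";
// prints: Mail [removed] or call [removed] today
```

A short number such as a year is left alone here because the phone pattern needs at least seven characters.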
gr8gonzo (Consultant) commented:
My two cents - it's not worth trying to analyze every message for these patterns. Like other people have said, if someone WANTS to send this information, then there are too many ways around it. The simplest workaround a sender can use is to split the content across several messages:

ME: My email is
ME: johnsmith
ME: at gmail
ME: dot com

Since the messages could be spread out, and earlier portions could already be in the hands of the recipient by the time enough data has accumulated to recognize an attempt to send an email address, it's virtually impossible to stop without human moderation. (Think of the "radio broadcast delay", where an "editor" has 2-3 seconds to review message content and flag or block it before the data makes it to the recipient.)

If you wanted this kind of human-based, delay moderation, you'd need to implement something that looks for keywords in conversations so that the human moderator only has to review content that is potentially in violation of your terms and conditions. For example, look for @ and # and ( ) characters, and keywords like "phone", "phon", "fone", "ph#", etc... and then if you find those, simply flag the conversation to begin moderation so that the messages are routed through a human moderator first (until the moderator senses that the conversation is okay and turns off the flag).
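
As a rough sketch of that keyword check, something like the following could decide when to switch a conversation into moderation. The function name needsModeration() and the exact keyword list are assumptions; a loose list like this will flag plenty of innocent messages, which is acceptable only because a human reviews them.

```php
<?php
// Hypothetical helper: returns true when a message contains a character or
// keyword that hints at contact details, so the conversation can be routed
// through a human moderator. The keyword list is only an example.
function needsModeration(string $message): bool
{
    // Characters mentioned above: @ # ( )
    if (preg_match('/[@#()]/', $message) === 1) {
        return true;
    }

    // Loose spellings: "phon" also covers "phone"; "e-?mail" covers
    // both "email" and "e-mail". The "i" modifier ignores case.
    return preg_match('/phon|fone|e-?mail/i', $message) === 1;
}

var_dump(needsModeration("my fone is oh seven nine..."));  // bool(true)
var_dump(needsModeration("How was your weekend?"));        // bool(false)
```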

If you don't have the human resources, then your better bet might be to limit the number of messages or the overall number of words exchanged between two people. In theory, most people need some minimum amount of trust in a person before they exchange contact information, and that trust is normally gained through conversation. So after a certain amount of conversation, cut it off unless one of them is a subscriber. (Don't force both of them to be subscribers - if this is a matchmaking chat type of service, your site's success depends on enabling connections, so offering the least resistance to continuing a successful connection is in your best interest.)
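
The cap itself could be as simple as the sketch below. The function name canSendMessage(), its parameters, and the default limit of 20 messages are all assumptions; the right limit depends on your users.

```php
<?php
// Hypothetical check: free conversations are capped unless at least one of
// the two participants is a subscriber. The limit of 20 is arbitrary.
function canSendMessage(
    int $messagesExchanged,
    bool $senderSubscribed,
    bool $recipientSubscribed,
    int $freeLimit = 20
): bool {
    // One subscriber is enough to keep the conversation open.
    if ($senderSubscribed || $recipientSubscribed) {
        return true;
    }

    return $messagesExchanged < $freeLimit;
}

var_dump(canSendMessage(5, false, false));   // bool(true)
var_dump(canSendMessage(20, false, false));  // bool(false)
var_dump(canSendMessage(20, false, true));   // bool(true)
```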
Neil Thompson (Senior Systems Developer, author) commented:
Thanks all,

Taking on board your great comments, I'm going to try to just sniff out the obvious and flag that for moderation before delivery.

The code below obviously matches @, but how can I add more things to check? Or do I need a separate preg_match() for every one I want?

Ideally I would like something like this but it obviously doesn't work:
 preg_match('/(@)(hotmail)(gmail)(079)/', $userText , $matches, PREG_OFFSET_CAPTURE);

<?php
$userText = "hi, my name is bob send me an email bob @ hotmail .(dot) co DOT UK or call 07912 123456 or bob@testemail.co.uk";
preg_match('/(@)/', $userText , $matches, PREG_OFFSET_CAPTURE);
print_r($matches);
?> 

Array
(
    [0] => Array
        (
            [0] => @
            [1] => 55
        )

    [1] => Array
        (
            [0] => @
            [1] => 55
        )

)
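
For reference, the pattern with consecutive groups fails because groups written one after another must all appear in sequence, so it only matches the literal text "@hotmailgmail079". Alternation with | is the usual way to check several tokens in one call. A minimal sketch, where the function name findContactTokens() and the watch list are only examples:

```php
<?php
// Hypothetical helper: returns every token from a small watch list that
// appears in the text, in the order found.
function findContactTokens(string $text): array
{
    // "|" separates alternatives, so one call checks for all of them.
    // The "i" modifier makes the match case-insensitive (Hotmail, GMAIL, ...).
    preg_match_all('/@|hotmail|gmail|079/i', $text, $matches);

    return $matches[0];
}

$userText = "hi, my name is bob send me an email bob @ hotmail .(dot) co DOT UK or call 07912 123456 or bob@testemail.co.uk";
print_r(findContactTokens($userText));
// finds, in order: "@", "hotmail", "079", "@"
```

Passing PREG_OFFSET_CAPTURE as in the original call would additionally return the position of each token.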
 

gr8gonzo (Consultant) commented:
I would strongly recommend against trying to capture individual domains. There are just way too many, and if you're dealing with real-time chat, then performance is absolutely essential. You can't have PHP taking 2 seconds to analyze every message.

You can try to throw a wider net like this:
<?php
$userText = "hi, my name is bob send me an email bob @ hotmail .(dot) co DOT UK or call 07912 123456 or bob@testemail.co.uk or bob at gmail dot com or call 1-800-555-1234 or call (555) 1234. Wasn't 2017 a crazy year?";

if(preg_match_all('/(?:@|at)[^a-zA-Z]*[a-zA-Z]{3,}\s*(?:\bdot\b|\.|\.\(dot\))\s*(?:co|net|org|biz|us|me)/', $userText , $matches, PREG_OFFSET_CAPTURE))
{
  print_r($matches);
}
if(preg_match_all('/[0-9][0-9\-\. \(\)]{3,}\b/', $userText , $matches, PREG_OFFSET_CAPTURE))
{
  // Try to exclude 4-digit matches that are between 1970 and 2030 - likely years
  foreach($matches[0] as $idx => $match)
  {
    if(strlen(trim($match[0])) == 4)
    {
      $int = intval($match[0]);
      if(($int >= 1970) && ($int <= 2030))
      {
        unset($matches[0][$idx]);
      }
    }
  }
  print_r($matches);
}

That's using 2 regexes - one for loose email domains and one for runs of 4+ digits and phone punctuation (a digit followed by at least three more digits, spaces, dashes, dots, or parentheses) that might be phone numbers. Bear in mind that the "looser" the regex is, the more it will capture, which WILL include false positives. For example, if someone says, "So 2017 was a crazy year, right?", the "2017" will match the criteria for the phone number, which is why I added some code to strip those out afterwards.
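
One possible tightening of the email-domain pattern, offered as a sketch rather than a replacement: word boundaries around "at" stop it matching inside words such as "that", the "i" modifier also catches "AT" and "DOT", and a trailing \b stops "co" matching the start of a longer word. The function name looksLikeObscuredEmail() is hypothetical, and false positives are still possible.

```php
<?php
// Hypothetical variant of the loose email-domain check above.
function looksLikeObscuredEmail(string $text): bool
{
    $pattern = '/(?:@|\bat\b)'                      // "@" or the whole word "at"
             . '[^a-z]*[a-z]{3,}\s*'                // domain name, 3+ letters
             . '(?:\.\(dot\)|\(dot\)|\bdot\b|\.)'   // ".(dot)", "(dot)", "dot" or "."
             . '\s*(?:com?|net|org|biz|us|me)\b/i'; // a few common endings

    return preg_match($pattern, $text) === 1;
}

var_dump(looksLikeObscuredEmail("bob AT gmail DOT com"));  // bool(true)
var_dump(looksLikeObscuredEmail("look at park. cool"));    // bool(false)
```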
Neil Thompson (Senior Systems Developer, author) commented:
Excellent, many thanks for your code and thoughts