Need a wise strategy for validating names with special characters: preg_match()

Hello,

PHP beginner here again...

I want to validate names of people and cities, as user input. Some names will have special characters. (Jónsdóttir, Québec, etc.)

I realize I really cannot stop someone from inputting:
Mickey Mouse, Orlando, Florida
in the first-name, last-name, city, and state fields...

But, it would be nice to keep out
JJ#*#H, 1=1, DROP TABLE CUSTOMER--

In addition to using mysql_real_escape_string() on each field, what else makes sense to try to stop some nonsense input?

(The member does have to input a real email address, and a validation code is sent there. But, as we all know, someone can have as many email addresses as they want.)

Is this strategy (along with mysql_real_escape_string) enough:
preg_match("/^[A-Za-z'.àáâãäåæçèéêëìíîïðñò-]{2,50}$/u", $string);

That is just a partial list. Should I add in all the special characters that I want to allow - that is, special characters that could be in names of cities and people around the world?

Obviously, that list doesn't include Chinese, Japanese, Vietnamese, etc. characters... but I have not really seen a US-based website where people's names were shown in Mandarin Chinese, for example. I would think it is typical Americentric bad etiquette to force people to anglicize their names... but again, it is a US-based website.

Thanks for any ideas on how to handle this both from a (server-side php) security standpoint, and for a friendly way for the site to take and display the names of people using the special characters they normally use.

Dennis
dtleahyAsked:

Ray PaseurCommented:
You might want to learn about this function.
http://us.php.net/manual/en/function.filter-var.php

As far as regular expressions go, I think you might want a character class that looks like this: [^A-Z] -- that says match anything that is NOT a member of the range from A to Z.  Obviously you would want to put more characters into the class, including blanks, commas, dots, apostrophe, etc. (eg: Winston O'Churchill, Esq.).  Put every character you want to keep inside the brackets, and use preg_replace() to remove all the characters you do not want to keep.
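A minimal sketch of that approach, assuming the input is UTF-8. Instead of a hand-typed list of accented characters it uses PCRE's Unicode property classes (`\p{L}` for any letter, `\p{M}` for combining marks), which is a swapped-in technique and not something proposed in this thread. The function name `sanitize_name` is hypothetical.

```php
<?php
// Hypothetical helper: strip every character that is not a letter,
// combining mark, space, apostrophe, period, comma, or hyphen.
// Assumes $raw is UTF-8; the /u modifier makes PCRE work on whole
// characters instead of single bytes.
function sanitize_name($raw)
{
    $clean = preg_replace('/[^\p{L}\p{M} \'.,-]/u', '', $raw);

    // preg_replace() returns NULL on error (for example, invalid UTF-8)
    if ($clean === null) {
        return '';
    }
    return trim($clean);
}

echo sanitize_name("Jónsdóttir"), "\n";
echo sanitize_name("Winston O'Churchill, Esq."), "\n";
echo sanitize_name("JJ#*#H"), "\n";
echo sanitize_name("DROP TABLE CUSTOMER--"), "\n";
```

The last call is worth a look: "DROP TABLE CUSTOMER--" consists only of letters, spaces, and hyphens, so it passes through unchanged. Character filtering tidies up the data, but it does not replace the escaping step before the value goes into a query.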

After sanitizing the input values, use mysql_real_escape_string() and your database will be safe.

If you're dealing with human client input from strangers, you might want to consider having a "report inappropriate content" button, too.
dtleahyAuthor Commented:
Hi Ray, and thanks for the reply.

So, email would be handled like this:
$emailadr = trim($_POST['email']);
if(!filter_var($emailadr, FILTER_VALIDATE_EMAIL))
  {
  ## return an error message "E-mail is not valid";
  }
else
  {
	## escape the trimmed, validated value, not the raw POST value
	$emailadr=mysql_real_escape_string($emailadr);
  }



I'm not quite sure what you meant by this:
As far as regular expressions go, I think you might want a character class that looks like this: [^A-Z] -- that says match anything that is NOT a member of the range from A to Z.  Obviously you would want to put more characters into the class, including blanks, commas, dots, apostrophe, etc. (eg: Winston O'Churchill, Esq.).  Put every character you want to keep inside the brackets, and use preg_replace() to remove all the characters you do not want to keep.

Do you think it's a good idea to replace characters, or is it better to provide an error that says illegal characters were entered?

Rather than using preg_match, should I be using FILTER_VALIDATE_REGEXP ?

(The following pattern is an attempt to allow only a-z, A-Z, a bunch of accented characters, space, apostrophe, and hyphen.)

## hyphen goes last so it is a literal, ^...$ with + means the whole name must match, /u treats the input as UTF-8
$pattern = "/^[a-zA-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿŒœŠšŸŽžƒ'.\s-]+$/u";

$fname = trim($_POST['firstname']);

if(filter_var($fname, FILTER_VALIDATE_REGEXP, array("options"=>array("regexp"=>$pattern))) === false)
  {
	  ## return an error message "First Name has invalid characters";
  }
else
  {
	  $fname=mysql_real_escape_string($fname);
  }
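For comparison, here is a rough sketch of the "reject with an error" route using preg_match() directly, assuming UTF-8 input. It uses the Unicode property classes `\p{L}` (letters) and `\p{M}` (combining marks) rather than a typed-out list of accented characters, and the helper name `is_plausible_name` is made up for illustration.

```php
<?php
// Hypothetical helper: TRUE when $name is 2 to 50 characters long and
// contains only letters, combining marks, spaces, apostrophes, periods,
// and hyphens. Assumes UTF-8 input.
function is_plausible_name($name)
{
    // \z anchors at the very end of the string (unlike $, it does not
    // tolerate a trailing newline). preg_match() returns 1 on a match,
    // 0 on no match, and FALSE on error such as invalid UTF-8.
    return preg_match('/^[\p{L}\p{M} \'.-]{2,50}\z/u', $name) === 1;
}

$tests = array("Jónsdóttir", "Québec", "O'Brien", "JJ#*#H", "1=1", "X");
foreach ($tests as $t) {
    echo $t, ' => ', (is_plausible_name($t) ? 'accepted' : 'rejected'), "\n";
}
```

With /u the {2,50} quantifier counts characters rather than bytes, so an accented letter counts once toward the length limit.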



Am I on the right track, or maybe a better question is, do I have it?

Thanks!

Dennis
Ray PaseurCommented:
Here is my standard email validation example.
<?php // RAY_email_validation.php
error_reporting(E_ALL);


// A FUNCTION TO TEST FOR A VALID EMAIL ADDRESS, RETURN TRUE OR FALSE
// SEE MAN PAGE: http://php.net/manual/en/intro.filter.php
function check_valid_email($email, $rout=TRUE)
{
    // LIST OF BLOCKED DOMAINS
    $bogus = array
    ( '@unknown.com'
    , '@example.com'
    , '@gooseball.org'
    )
    ;

    // IF PHP 5.2 OR ABOVE, WE CAN USE THE FILTER
    if (strnatcmp(phpversion(),'5.2') >= 0)
    {
        if(filter_var($email, FILTER_VALIDATE_EMAIL) === FALSE) return FALSE;
    }

    // IF LOWER-LEVEL PHP, WE CAN CONSTRUCT A REGULAR EXPRESSION
    else
    {
        $regex
        = '/'                        // START REGEX DELIMITER
        . '^'                        // START STRING
        . '[A-Z0-9_-]'               // AN EMAIL - SOME CHARACTER(S)
        . '[A-Z0-9._-]*'             // AN EMAIL - SOME CHARACTER(S) PERMITS DOT
        . '@'                        // A SINGLE AT-SIGN
        . '([A-Z0-9][A-Z0-9-]*\.)+'  // A DOMAIN NAME PERMITS DOT, ENDS DOT
        . '[A-Z\.]'                  // A TOP-LEVEL DOMAIN PERMITS DOT
        . '{2,6}'                    // TLD LENGTH >= 2 AND =< 6
        . '$'                        // ENDOF STRING
        . '/'                        // ENDOF REGEX DELIMITER
        . 'i'                        // CASE INSENSITIVE
        ;
        // TEST THE STRING FORMAT
        if (!preg_match($regex, $email)) return FALSE;
    }

    // TEST TO SEE IF THE DOMAIN IS IN OUR BLOCKED LIST
    foreach ($bogus as $badguy)
    {
        // STRIPOS() RETURNS A POSITION (POSSIBLY 0) OR FALSE, SO COMPARE STRICTLY
        if (stripos($email, $badguy) !== FALSE) return FALSE;
    }

    // FILTER_VAR OR PREG_MATCH DOES NOT TEST IF THE DOMAIN IS ROUTABLE
    if ($rout)
    {
        $domain = explode('@', $email);

        // MAN PAGE: http://php.net/manual/en/function.checkdnsrr.php
        if ( checkdnsrr($domain[1], "MX") || checkdnsrr($domain[1], "A") ) return TRUE;

        // EMAIL IS NOT ROUTABLE
        return FALSE;
    }
    return TRUE;
}



// DEMONSTRATE THE FUNCTION IN ACTION
$e = NULL;
$safe = '';
if (!empty($_GET["e"]))
{
    $e = $_GET["e"];

    // ESCAPE THE INPUT BEFORE ECHOING IT BACK INTO HTML
    $safe = htmlspecialchars($e, ENT_QUOTES, 'UTF-8');
    if (check_valid_email($e))
    {
        echo "<br/>VALID: $safe \n";
    }
    else
    {
        echo "<br/>BOGUS: $safe \n";
    }
}


// END OF PROCESSING - CREATE THE FORM USING HEREDOC NOTATION
$form = <<<ENDFORM
<form>
TEST A STRING FOR A VALID EMAIL ADDRESS:
<input name="e" value="$safe" />
<input type="submit" />
</form>
ENDFORM;

echo $form;


As far as the illegal characters in the names go, I would just replace them.  Nobody is named ?/* and those characters can just be dropped out (replaced with an empty string).  Why bother with an error message?  Who cares?

I think this might be more on point for the pattern (not 100% sure, but it would be easy to test)
[$pattern
= "/"    // REGEX DELIMITER
. '['    // START CHARACTER CLASS
. '^'    // NEGATION (MATCH NONE OF THESE)
. 'a-zA-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿŒœŠšŸŽžƒ\-'
. "'"    // APOSTROPHE
. '\.'   // PERIOD
. '\s'   // WHITESPACE
. ']'    // END CHARACTER CLASS
. "/u"   // REGEX DELIMITER, UTF-8 MODE (ASSUMES UTF-8 SOURCE AND INPUT)
;


dtleahyAuthor Commented:
Thank you, thank you, Ray!

While I am working out the details, I'm such a newbie at php that I assume the leading opening bracket ( [ ) before $pattern is a typo... but if not, please school me on just what that is for.

so, is this correct?

$pattern
= "/"    // REGEX DELIMITER
. '['    // START CHARACTER CLASS
. '^'    // NEGATION (MATCH NONE OF THESE)
. 'a-zA-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿŒœŠšŸŽžƒ\-'
. "'"    // APOSTROPHE
. '\.'   // PERIOD
. '\s'   // WHITESPACE
. ']'    // END CHARACTER CLASS
. "/u"   // REGEX DELIMITER, UTF-8 MODE (ASSUMES UTF-8 SOURCE AND INPUT)
;  
 
$replace = "";
$new_string = preg_replace($pattern, $replace, $string);
$new_string = mysql_real_escape_string($new_string);



Thanks!

Dennis

{edited: removed something about smart quotes}
Ray PaseurCommented:
The left bracket is the "start of character class" indicator.  The ^ means start of string when it is outside the character class, and means negation when it is inside the character class.  A couple of examples may show what is going on here.

/RAY/ will match the three letters R,A,Y if and only if they are adjacent, spelling RAY.  It does not have to be a word, so StingRAY would match.
/^RAY/ will match the three letters RAY if they are adjacent and at the beginning of the string.  StingRAY would not match.
/[^RAY]/ will match any character that is not one of R,A,Y.  So preg_replace('/[^RAY]/', '', 'StingRAY') would eliminate every character the class matches, making StingRAY into RAY.
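The three patterns can be tried directly. This is just a small sketch that runs them against 'StingRAY', with an empty string as the replacement value:

```php
<?php
// The three patterns from the examples above, applied to 'StingRAY'.
$anywhere = preg_match('/RAY/', 'StingRAY');          // RAY found anywhere in the string
$at_start = preg_match('/^RAY/', 'StingRAY');         // string must begin with RAY
$stripped = preg_replace('/[^RAY]/', '', 'StingRAY'); // drop everything except R, A, Y

var_dump($anywhere);  // int(1)
var_dump($at_start);  // int(0)
var_dump($stripped);  // string(3) "RAY"
```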

Make sense?

Here's the PHP man page.
http://us2.php.net/manual/en/reference.pcre.pattern.syntax.php

Here is an article that is only tangentially about regular expressions (it's more about ways of approaching a problem) but it has some examples that might be helpful.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_7830-A-Quick-Tour-of-Test-Driven-Development.html

And in case you're interested, this article covers "magic quotes" - one of the worst ideas to get baked into PHP.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_6630-Magic-Quotes-a-bad-idea-from-day-one.html

Best regards, ~Ray
dtleahyAuthor Commented:
I stumbled across some info dealing with "magic quotes." Not pretty.

Thank you very much for all of your help. You are a very helpful person! I very much appreciate that you're not just supplying answers, but connecting me with good resources.

The opening bracket I was referring to was in your second reply (ID: 37812593), at the beginning of the second block of code:
[$pattern
That is not the normal start-of-class indicator.

I'm going to do a little bit more reading, then plug the code in and start testing.

Thanks!

Dennis
Ray PaseurCommented:
Yeah, the bracket in [$pattern is a typo.
dtleahyAuthor Commented:
Thanks again, Ray, for all the help and the resources that you pointed me to!

Dennis
Ray PaseurCommented:
Thanks for the points - great question! ~Ray