Need a wise strategy for validating names with special characters: preg_match()

Hello,

php beginner here again...

I want to validate names of people and cities, as user input. Some names will have special characters. (Jónsdóttir, Québec, etc.)

I realize I really cannot stop someone from inputting:
Mickey Mouse, Orlando, Florida
in the first-name, last-name, city, and state fields...

But, it would be nice to keep out
JJ#*#H, 1=1, DROP TABLE CUSTOMER--

In addition to using mysql_real_escape_string() on each field, what else makes sense to try to stop some nonsense input?

(The member does have to input a real email address, and a validation code is sent there. But, as we all know, someone can have as many email addresses as they want.)

Is this strategy (along with mysql_real_escape_string) enough:
preg_match("/^[A-Za-z'.àáâãäåæçèéêëìíîïðñò-]{2,50}$/u", $string);

That is just a partial list. Should I add in all the special characters that I want to allow - that is, special characters that could be in names of cities and people around the world?

Obviously, that list doesn't include Chinese, Japanese, Vietnamese, etc. characters... but I have not really seen a US-based website where people's names were shown in Mandarin Chinese, for example. I would think it is typical Americentric bad etiquette to force people to anglicize their names... but again, it is a US-based website.

Thanks for any ideas on how to handle this both from a (server-side php) security standpoint, and for a friendly way for the site to take and display the names of people using the special characters they normally use.

Dennis
Asked by: dtleahy
Ray Paseur Commented:
Here is my standard email validation example.
<?php // RAY_email_validation.php
error_reporting(E_ALL);


// A FUNCTION TO TEST FOR A VALID EMAIL ADDRESS, RETURN TRUE OR FALSE
// SEE MAN PAGE: http://php.net/manual/en/intro.filter.php
function check_valid_email($email, $rout=TRUE)
{
    // LIST OF BLOCKED DOMAINS
    $bogus = array
    ( '@unknown.com'
    , '@example.com'
    , '@gooseball.org'
    )
    ;

    // IF PHP 5.2 OR ABOVE, WE CAN USE THE FILTER
    if (strnatcmp(phpversion(),'5.2') >= 0)
    {
        if(filter_var($email, FILTER_VALIDATE_EMAIL) === FALSE) return FALSE;
    }

    // IF LOWER-LEVEL PHP, WE CAN CONSTRUCT A REGULAR EXPRESSION
    else
    {
        $regex
        = '/'                        // START REGEX DELIMITER
        . '^'                        // START STRING
        . '[A-Z0-9_-]'               // AN EMAIL - SOME CHARACTER(S)
        . '[A-Z0-9._-]*'             // AN EMAIL - SOME CHARACTER(S) PERMITS DOT
        . '@'                        // A SINGLE AT-SIGN
        . '([A-Z0-9][A-Z0-9-]*\.)+'  // A DOMAIN NAME PERMITS DOT, ENDS DOT
        . '[A-Z\.]'                  // A TOP-LEVEL DOMAIN PERMITS DOT
        . '{2,6}'                    // TLD LENGTH >= 2 AND =< 6
        . '$'                        // ENDOF STRING
        . '/'                        // ENDOF REGEX DELIMITER
        . 'i'                        // CASE INSENSITIVE
        ;
        // TEST THE STRING FORMAT
        if (!preg_match($regex, $email)) return FALSE;
    }

    // TEST TO SEE IF THE DOMAIN IS IN OUR BLOCKED LIST
    foreach ($bogus as $badguy)
    {
        if (stripos($email, $badguy) !== FALSE) return FALSE;
    }

    // FILTER_VAR OR PREG_MATCH DOES NOT TEST IF THE DOMAIN IS ROUTABLE
    if ($rout)
    {
        $domain = explode('@', $email);

        // MAN PAGE: http://php.net/manual/en/function.checkdnsrr.php
        if ( checkdnsrr($domain[1], "MX") || checkdnsrr($domain[1], "A") ) return TRUE;

        // EMAIL IS NOT ROUTABLE
        return FALSE;
    }
    return TRUE;
}



// DEMONSTRATE THE FUNCTION IN ACTION
$e    = NULL;
$safe = '';
if (!empty($_GET["e"]))
{
    $e = $_GET["e"];

    // ESCAPE THE VALUE BEFORE ECHOING IT BACK INTO THE HTML
    $safe = htmlspecialchars($e, ENT_QUOTES);
    if (check_valid_email($e))
    {
        echo "<br/>VALID: $safe \n";
    }
    else
    {
        echo "<br/>BOGUS: $safe \n";
    }
}


// END OF PROCESSING - CREATE THE FORM USING HEREDOC NOTATION
$form = <<<ENDFORM
<form>
TEST A STRING FOR A VALID EMAIL ADDRESS:
<input name="e" value="$safe" />
<input type="submit" />
</form>
ENDFORM;

echo $form;

As for the illegal characters in the names, I would just remove them.  Nobody is named ?/* and those characters can simply be dropped (replaced with an empty string).  Why bother with an error message?  Who cares?

I think this might be more on point for the pattern (not 100% sure, but it would be easy to test)
[$pattern
= "/"    // REGEX DELIMITER
. '['    // START CHARACTER CLASS
. '^'    // NEGATION (MATCH NONE OF THESE)
. 'a-zA-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùú ûüýþÿŒœŠšŸŽžƒ\-'
. "'"    // APOSTROPHE
. '\.'   // PERIOD
. '\s'   // WHITESPACE
. ']'    // END CHARACTER CLASS
. "/"    // REGEX DELIMITER
;
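
One thing worth testing with a pattern like this is how it behaves on UTF-8 input. Without the u modifier, PCRE works on bytes rather than characters, so accented letters may not be matched as intended. The following is a minimal sketch of an alternative, not the pattern above: it assumes UTF-8 input and uses the Unicode property classes \p{L} and \p{M} instead of listing every accented letter. The function name clean_name() is hypothetical.

```php
<?php
// Sketch only: strip everything except letters, combining marks, apostrophe,
// period, hyphen and whitespace. Assumes the input string is valid UTF-8.
function clean_name($raw)   // hypothetical helper name
{
    // \p{L} = any Unicode letter, \p{M} = combining mark. The u modifier
    // makes PCRE treat pattern and subject as UTF-8 rather than bytes.
    $cleaned = preg_replace("/[^\p{L}\p{M}'.\s-]/u", '', $raw);

    // preg_replace() returns NULL on error (for example, malformed UTF-8)
    if ($cleaned === null) return '';

    return trim($cleaned);
}

echo clean_name("Jónsdóttir"), "\n";
echo clean_name("JJ#*#H"), "\n";
echo clean_name("O'Brien-Smith Jr."), "\n";
echo clean_name("DROP TABLE CUSTOMER--"), "\n";
```

Note that the last input comes back unchanged, because it consists only of letters, spaces and hyphens. A character filter keeps out noise like #*#, but it is the escaping step that protects the query.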

Ray Paseur Commented:
You might want to learn about this function.
http://us.php.net/manual/en/function.filter-var.php

As far as regular expressions go, I think you might want a character class that looks like this: [^A-Z] -- that says match anything that is NOT a member of the range from A to Z.  Obviously you would want to put more characters into the class, including blanks, commas, dots, apostrophe, etc. (eg: Winston O'Churchill, Esq.).  Put every character you want to keep inside the brackets, and use preg_replace() to remove all the characters you do not want to keep.
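
As a rough illustration of the negated character class described above, here is a small sketch that keeps only letters, blanks, commas, dots and apostrophes. The variable names and the sample input are made up for the example. The same pattern can be used either to drop the unwanted characters or merely to detect them.

```php
<?php
// Negated class: matches any single character that is NOT a letter,
// blank, comma, dot or apostrophe. The i modifier covers lower case.
$unwanted = "/[^A-Z ,.']/i";

$input = "Winston?/* O'Churchill, Esq.";

// Option 1: silently drop the unwanted characters
$cleaned = preg_replace($unwanted, '', $input);

// Option 2: detect them, so the caller can show an error message instead
$has_unwanted = (preg_match($unwanted, $input) === 1);

echo $cleaned, "\n";
var_dump($has_unwanted);
```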

After sanitizing the input values, use MySQL_Real_Escape_String() and your data base will be safe.
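
On newer PHP versions the mysql_* functions are no longer available (they were removed in PHP 7.0), and the same protection is usually obtained with prepared statements. The sketch below is only an illustration under stated assumptions: it uses PDO with an in-memory SQLite database so that it runs without a server, and the customer table and its columns are hypothetical, not the site's real schema.

```php
<?php
// Sketch only: a parameterized INSERT. The in-memory SQLite database and the
// customer table are stand-ins for illustration (assumes pdo_sqlite is loaded).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE customer (first_name TEXT, city TEXT)');

$first_name = "Robert'); DROP TABLE customer;--";
$city       = "Québec";

// The placeholders keep the values out of the SQL text entirely
$stmt = $pdo->prepare('INSERT INTO customer (first_name, city) VALUES (?, ?)');
$stmt->execute(array($first_name, $city));

// The hostile string is stored as plain data and the table still exists
$count  = (int) $pdo->query('SELECT COUNT(*) FROM customer')->fetchColumn();
$stored = $pdo->query('SELECT first_name FROM customer')->fetchColumn();
echo $count, "\n";
echo $stored, "\n";
```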

If you're dealing with human client input from strangers, you might want to consider having a "report inappropriate content" button, too.
dtleahy (Author) Commented:
Hi Ray, and thanks for the reply.

So, email would be handled like this:
$emailadr = trim($_POST['email']);
if(!filter_var($emailadr, FILTER_VALIDATE_EMAIL))
  {
  ## return an error message "E-mail is not valid";
  }
else
  {  
	$emailadr=mysql_real_escape_string($emailadr);
  }


I'm not quite sure what you meant by this:
As far as regular expressions go, I think you might want a character class that looks like this: [^A-Z] -- that says match anything that is NOT a member of the range from A to Z.  Obviously you would want to put more characters into the class, including blanks, commas, dots, apostrophe, etc. (eg: Winston O'Churchill, Esq.).  Put every character you want to keep inside the brackets, and use preg_replace() to remove all the characters you do not want to keep.

Do you think it's a good idea to replace characters, or is it better to provide an error that says illegal characters were entered?

Rather than using preg_match, should I be using FILTER_VALIDATE_REGEXP ?

(The following pattern is an attempt to allow only a-z, A-Z, a bunch of accented characters, space, apostrophe, period, and hyphen.)

$pattern= "/^[a-zA-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿŒœŠšŸŽžƒ'.\s-]*$/u";

$fname = trim($_POST['firstname']);

if(filter_var($fname, FILTER_VALIDATE_REGEXP, array("options"=>array("regexp"=>$pattern))) === false)
  {
	  ## return an error message
	  ## echo "First Name has invalid characters";
  }
else
  {  
		$fname=mysql_real_escape_string($fname);
  }


Am I on the right track, or maybe a better question is, do I have it?

Thanks!

Dennis
dtleahy (Author) Commented:
Thank you, thank you, Ray!

While I am working out the details, I'm such a newbie at php that I assume the leading opening bracket ( [ ) before $pattern is a typo... but if not, please school me on just what that is for.

so, is this correct?

$pattern
= "/"    // REGEX DELIMITER
. '['    // START CHARACTER CLASS
. '^'    // NEGATION (MATCH NONE OF THESE)
. 'a-zA-ZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿŒœŠšŸŽžƒ\-'
. "'"    // APOSTROPHE
. '\.'   // PERIOD
. '\s'   // WHITESPACE
. ']'    // END CHARACTER CLASS
. "/"    // REGEX DELIMITER
;  
 
$replace = "";
$new_string = preg_replace($pattern, $replace, $string);
$new_string=mysql_real_escape_string($new_string);


Thanks!

Dennis

{edited: removed something about smart quotes}
Ray Paseur Commented:
The left bracket is the "start of character class" indicator.  The ^ means start of string when it is outside the character class, and means negation when it is inside the character class.  A couple of examples may show what is going on here.

/RAY/ will match the three letters R,A,Y if and only if they are adjacent, spelling RAY.  It does not have to be a word, so StingRAY would match.
/^RAY/ will match the three letters RAY if they are adjacent and at the beginning of the string.  StingRAY would not match.
/[^RAY]/ will match any letter that is not one of R,A,Y.  So preg_replace('/[^RAY]/', '', 'StingRAY') would eliminate the matched characters (everything that is not R, A or Y), making StingRAY into RAY.
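
The three cases above can be tried directly. This is just a small test script for experimenting; the variable names are chosen for the example.

```php
<?php
// The three patterns from the explanation, run against the same word.
$word = 'StingRAY';

$anywhere = preg_match('/RAY/',  $word);         // RAY found anywhere in the string
$at_start = preg_match('/^RAY/', $word);         // RAY must be at the start
$only_ray = preg_replace('/[^RAY]/', '', $word); // remove all but R, A, Y

var_dump($anywhere);   // int(1)
var_dump($at_start);   // int(0)
echo $only_ray, "\n";  // RAY
```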

Make sense?

Here's the PHP man page.
http://us2.php.net/manual/en/reference.pcre.pattern.syntax.php

Here is an article that is only tangentially about regular expressions (it's more about ways of approaching a problem) but it has some examples that might be helpful.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_7830-A-Quick-Tour-of-Test-Driven-Development.html

And in case you're interested, this article covers "magic quotes" - one of the worst ideas to get baked into PHP.
http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_6630-Magic-Quotes-a-bad-idea-from-day-one.html

Best regards, ~Ray
dtleahy (Author) Commented:
I stumbled across some info dealing with "magic quotes." Not pretty.

Thank you very much for all of your help. You are a very helpful person! I very much appreciate that you're not just supplying answers, but connecting me with good resources.

The opening bracket I was referring to was in your second reply (ID: 37812593), at the beginning of the second block of code:
[$pattern
That is not the normal start-of-class indicator.

I'm going to do a little bit more reading, then plug the code in and start testing.

Thanks!

Dennis
Ray Paseur Commented:
Yeah, the bracket in [$pattern is a typo.
dtleahy (Author) Commented:
Thanks again, Ray, for all the help and the resources that you pointed me to!

Dennis
Ray Paseur Commented:
Thanks for the points - great question! ~Ray