Solved

Perl Regular Expressions in PHP

Posted on 2011-09-08
Last Modified: 2012-05-12
I have a PHP script that uses preg_match to test for certain inputs. I had something like this:

elseif(preg_match("/^[^a-zA-Z,-_0-9\. ]+$/D", $string))
 return FALSE;

The intent was that anything but those characters would return false, but it wasn't behaving as I expected.

So for this question I'll break it down to the simplest example:
preg_match('|^[^0-9]{1,}$|', $string);

I understand this to mean that anything that does NOT start or end with a number will be matched.
So if $string is "gggg" it is matched. If $string is "9999" it is not matched (due to the caret in the bracket).

But if $string is "g0g" it is not matched. The string begins and ends with a letter, so my thought is that it would be matched. Why does adding a number between the two letters cause this not to match? To me it seems that the beginning- and end-of-line anchors are not respected.

Even passing characters like ^^^^ gets matched, but as soon as a number is added somewhere it is not matched.

I assume it's working correctly, but I'd like an explanation as to why it behaves like this. I am assuming that the anchors (^ and $) in this case do not actually mean "begins with" and "ends with"?
Question by:credog

Expert Comment

by:käµfm³d 👽
The way your patterns are constructed, the target string must be a series of characters that are:

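Pattern 1: Not a letter, not a number, not a period, not a space, and not any character in the range ,-_ (comma through underscore in ASCII, which also takes in @, ^, and the hyphen itself)
    Examples:
        !#$%&
        (*)(*)
        ~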

Pattern 2: Not a number:
    Examples:
        !@#$%^&
        (*)(*)
        -
        hello world
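
To see pattern 2 behave this way on the strings from the question, a minimal sketch along these lines could be run. The names $samples and $results are made up for the illustration.

```php
<?php
// Pattern 2 from the question: the WHOLE string must consist of non-digits.
$pattern = '|^[^0-9]{1,}$|';

// Sample inputs taken from the question.
$samples = array('gggg', '9999', 'g0g', '^^^^');

$results = array();
foreach ($samples as $s) {
    // preg_match() returns 1 on a match and 0 on no match.
    $results[$s] = preg_match($pattern, $s);
    echo $s . ' => ' . $results[$s] . PHP_EOL;
}
// gggg => 1, 9999 => 0, g0g => 0, ^^^^ => 1
```

"g0g" fails because the negated class has to account for every character between ^ and $, and the 0 in the middle is not allowed by [^0-9].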

I'm not entirely sure what the goal was for your first pattern. Perhaps you can elaborate.

In reading the description of what you'd like to achieve in the second pattern, it sounds like you want alternation, using the vertical bar ( | ). I would suggest, however, changing your pattern delimiters since the bar is a special character in regex. Try this change:

preg_match('#^\D|\D$#', $string);


which means:
#      -  Pattern delimiter
^      -  Beginning of string
\D     -  Any character NOT a digit
|      -  OR (alternation)
\D     -  Any character NOT a digit
$      -  End of string
#      -  Pattern delimiter
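
As a rough check of the alternation, the sketch below tries the pattern on a few made-up inputs (the variable names are only for illustration). Either branch alone is enough for a match.

```php
<?php
// Alternation: "starts with a non-digit" OR "ends with a non-digit".
$pattern = '#^\D|\D$#';

$startsOnly = preg_match($pattern, 'g99'); // 1: first branch matches "g"
$endsOnly   = preg_match($pattern, '99g'); // 1: second branch matches "g"
$both       = preg_match($pattern, 'g0g'); // 1: first branch already succeeds
$neither    = preg_match($pattern, '909'); // 0: digits at both ends

echo "$startsOnly $endsOnly $both $neither" . PHP_EOL; // 1 1 1 0
```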

Expert Comment

by:käµfm³d 👽
Actually, I think I misinterpreted the 2nd pattern's intent. I think this is what you are after:

preg_match('#^\D.*\D$#', $string);


and its meaning:
#      -  Pattern delimiter
^      -  Beginning of string
\D     -  Any character NOT a digit
.*     -  Zero-or-more ( * ) of any character ( . )
\D     -  Any character NOT a digit
$      -  End of string
#      -  Pattern delimiter
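
A small sketch of how this pattern treats a few inputs; the $checks and $results names are hypothetical. Note that because the pattern contains two separate \D tokens, it needs at least two characters to match.

```php
<?php
// Non-digit at the start, anything in between, non-digit at the end.
$pattern = '#^\D.*\D$#';

$checks  = array('g0g', 'g5g', 'gg', '5g', 'g5', 'g');
$results = array();
foreach ($checks as $s) {
    $results[$s] = preg_match($pattern, $s);
    echo $s . ' => ' . $results[$s] . PHP_EOL;
}
// g0g => 1, g5g => 1, gg => 1, 5g => 0, g5 => 0, g => 0
```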

Author Comment

by:credog
Good explanation, but I'm still confused about what the following does:

preg_match('#^[^0-9]{1,}$#', $string);

The caret inside the brackets says NOT a number. I get that.
The caret and the dollar outside the brackets I thought were anchors that would say:
Anything that does not begin or end in a number is matched. So the string g5g should be matched because the beginning and ending do not contain a number, but it is not matched. Obviously I'm confused by what the ^ and $ are actually doing outside the brackets.

It appears that if a number exists anywhere in the string then the pattern is not matched.

Accepted Solution

by:käµfm³d 👽 (earned 400 total points)
^ means start of string (or start of line if you turn on the appropriate modifier); $ means end of string (or end of line if you turn on the appropriate modifier). Combining the two (sans modifiers) essentially says, "match the entire string". For example, given the pattern:

^hello world$

and the string variable:

$value = "hello world";

your preg_match call would succeed. However, if you change the string variable to:

$value = "hello joe";

your preg_match call would fail because the pattern expects the entire string to be "hello world". Now if we keep the latter string variable:

$value = "hello joe";

but we change the pattern:

^hello|world$

now "hello joe" would match because our pattern says: any string that starts with "hello" ( ^hello ) or ( | ) ends with "world" ( world$ ).
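
The two patterns above can be compared side by side with a sketch like this one. The "/" delimiters and the variable names are assumptions added for the example, since the patterns were shown without delimiters.

```php
<?php
// Anchored on both sides: the entire string must be "hello world".
$whole = '/^hello world$/';
// Alternation: starts with "hello" OR ends with "world".
$either = '/^hello|world$/';

$wholeExact = preg_match($whole, 'hello world');     // 1
$wholeJoe   = preg_match($whole, 'hello joe');       // 0
$eitherJoe  = preg_match($either, 'hello joe');      // 1: starts with "hello"
$eitherEnd  = preg_match($either, 'big world');      // 1: ends with "world"
$eitherNone = preg_match($either, 'joe says hello'); // 0: neither branch applies

echo "$wholeExact $wholeJoe $eitherJoe $eitherEnd $eitherNone" . PHP_EOL; // 1 0 1 1 0
```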

Expert Comment

by:käµfm³d 👽
P.S.

"hello world" would also match with the last pattern. If you used a preg_match_all, you would actually see two matches: one for "hello" and one for "world".
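
A quick sketch of that preg_match_all call, assuming "/" delimiters and illustrative variable names:

```php
<?php
$matches = array();
// preg_match_all() returns the number of full-pattern matches found.
$count = preg_match_all('/^hello|world$/', 'hello world', $matches);

echo $count . PHP_EOL;                     // 2
echo implode(', ', $matches[0]) . PHP_EOL; // hello, world
```

$matches[0] holds the text of each full match, so the first entry comes from the ^hello branch and the second from the world$ branch.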

Assisted Solution

by:Ray Paseur (earned 100 total points)
Looking at this:

preg_match("/^[^a-zA-Z,-_0-9\. ]+$/D", $string)

I don't know about that trailing "D" - that is usually the location of a pattern modifier.

Let's assume for a moment that you want to disqualify any string that contains a character other than letters, numbers, the space, the underscore, the comma, the dash, and the dot. You could put these elements together into a character class that is wrapped in brackets. Using the caret ^ metacharacter as the first character inside a character class means negation - the regular expression will match anything that is not part of the class.

To make it more confusing, the caret ^ metacharacter, when used at the beginning of a regular expression, does not mean negation - it anchors the match to the first character of the string. If you omit the caret, the regex engine still starts looking at the first character of the string, but the match is then allowed to begin at any later position. And whenever you want a metacharacter inside a regular expression to stand for itself, you need an escape (backslash). The dash, though not technically a metacharacter, means "from this to that" inside a character class, so it needs to be escaped too (or placed first or last in the class) if it is to mean the literal character hyphen. Who thought this sort of syntax up? Oh, I guess it must have been a 1950s mathematician ;-)
http://en.wikipedia.org/wiki/Stephen_Cole_Kleene
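
The effect of the unescaped dash in the original ",-_" can be sketched as follows. The test character "@" and the variable names are chosen only for illustration.

```php
<?php
// Unescaped: ",-_" is a RANGE from comma (0x2C) to underscore (0x5F).
$asRange = '/^[,-_]+$/';
// Escaped: the class holds exactly three literals: comma, hyphen, underscore.
$asLiterals = '/^[,\-_]+$/';

$inRange   = preg_match($asRange, '@');    // 1: "@" (0x40) lies inside the range
$inLiteral = preg_match($asLiterals, '@'); // 0: "@" is not one of the three literals
$hyphenOk  = preg_match($asLiterals, '-'); // 1: the escaped hyphen matches itself

echo "$inRange $inLiteral $hyphenOk" . PHP_EOL; // 1 0 1
```

That range also covers digits, uppercase letters, and punctuation such as : ; < = > ? @ [ \ ] ^, which is why the original class accepted more characters than intended.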

See http://www.laprbass.com/RAY_temp_credog.php
Outputs something like:
This ought to work.
But this will fail! HAS BAD CHARACTER(S)
SOS ... --- ...
Pi or maybe Pie 3.14159
Pi or maybe Pie? 3.14159 HAS BAD CHARACTER(S)

The script:
<?php // RAY_temp_credog.php
error_reporting(E_ALL);

// A REGULAR EXPRESSION
$rgx
= '/'          // A REGEX DELIMITER
. '['          // START A CHARACTER CLASS
. '^'          // NONE OF THE FOLLOWING MATCH
. 'A-Z'        // LETTERS
. '0-9'        // NUMBERS
. ' _,\-\.'    // SPACE, UNDERSCORE, COMMA, (ESCAPED) DASH, (ESCAPED) DOT
. ']'          // END A CHARACTER CLASS
. '/'          // END REGEX DELIMITER
. 'i'          // MODIFIER FOR CASE-INSENSITIVE
;

// SOME TEST DATA
$dat = array
( 'This ought to work.'
, 'But this will fail!'
, 'SOS ... --- ...'
, 'Pi or maybe Pie 3.14159'
, 'Pi or maybe Pie? 3.14159'
)
;

// TEST THE DATA WITH THE REGEX TO FIND BAD STRINGS
echo "<pre>";
foreach ($dat as $str)
{
    echo PHP_EOL . $str;
    if (preg_match($rgx, $str))
    {
        echo " HAS BAD CHARACTER(S)";
    }
}

// SHOW THE REGEX WE USED
echo PHP_EOL . "THE REGEX CONTAINS: ";
echo htmlentities($rgx);

Grab yourself a copy of this.  Very helpful.
http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/

Best regards to all, ~Ray

Expert Comment

by:käµfm³d 👽
@Ray_Paseur: "I don't know about that trailing "D" - that is usually the location of a pattern modifier."

With regard to PHP, it actually is a pattern modifier, though I can't recall whether it affected this situation.

http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php
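
For what the D modifier changes, a minimal sketch: without it, $ also matches just before a newline at the very end of the subject; with it, $ matches only at the very end. The sample string and variable names are assumptions for the example.

```php
<?php
// A value with a trailing newline, e.g. from unsanitized input.
$withNewline = "abc\n";

$loose  = preg_match('/^[a-z]+$/', $withNewline);  // 1: $ matches before the final "\n"
$strict = preg_match('/^[a-z]+$/D', $withNewline); // 0: D makes $ match only at the very end

echo "$loose $strict" . PHP_EOL; // 1 0
```

So in the original pattern the D only matters when the input ends with a newline; it does not explain the "g0g" behaviour.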

Expert Comment

by:Ray Paseur
PCRE_DOLLAR_ENDONLY - up until now I had remained blissfully ignorant of that modifier!  But then, I tend to remain pedestrian when it comes to programming.  Thanks for the link!