I need help deciphering some regular expressions

I've never been an expert in regular expressions.  Below, I'm pasting in several preg_replace calls from a PHP script.  I'm hoping that someone here who knows regular expressions like the back of their hand can tell me what these are doing faster than I could decipher them on my own.

      $string = preg_replace('#(<[^>]+[\x00-\x20\"\'\/])(on|xmlns)[^>]*>#iUu', "$1>", $string);

      $string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iUu', '$1=$2nojavascript...', $string);
echo "<br>String is now {$string}<br>";
     
      $string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iUu', '$1=$2novbscript...', $string);
echo "<br>String is now {$string}<br>";
     
      $string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*-moz-binding[\x00-\x20]*:#Uu', '$1=$2nomozbinding...', $string);
      $string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*data[\x00-\x20]*:#Uu', '$1=$2nodata...', $string);

      $string = preg_replace('#(<[^>]+[\x00-\x20\"\'\/])style[^>]*>#iUu', "$1>", $string);

      $string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);

Thank you in advance for any assistance.

Gary.
Asked by: garyhoffmann
 
Dan Craciun (IT Consultant) Commented:
You can use RegexBuddy to quickly get an explanation of your expressions. For example:
#(<[^>]+[\x00-\x20\"\'\/])(on|xmlns)[^>]*>#iUu

Match the character “#” literally «#»
Match the regex below and capture its match into backreference number 1 «(<[^>]+[\x00-\x20\"\'\/])»
   Match the character “<” literally «<»
   Match any character that is NOT a “>” «[^>]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   Match a single character present in the list below «[\x00-\x20\"\'\/]»
      A character in the range between these two characters «\x00-\x20»
         The NULL character «\x00»
         The character “ ” which occupies position 0x20 (32 decimal) in the character set «\x20»
      The literal character “"” «\"»
      The literal character “'” «\'»
      The literal character “/” «\/»
Match the regex below and capture its match into backreference number 2 «(on|xmlns)»
   Match this alternative (attempting the next alternative only if this one fails) «on»
      Match the character string “on” literally (case insensitive) «on»
   Or match this alternative (the entire group fails if this one fails to match) «xmlns»
      Match the character string “xmlns” literally (case insensitive) «xmlns»
Match any character that is NOT a “>” «[^>]*»
   Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character string “>#iUu” literally (case insensitive) «>#iUu»

LE: I think this is a perfect example of why you (and by that I mean the original programmer) should always comment your code.
 
Ray Paseur Commented:
Link to purchase RegexBuddy here:
http://www.regexbuddy.com/tutorial.html

This is helpful, too:
http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/pdf/

FWIW, if you're originating the regular expressions, writing them like this with comments on separate lines helps with the understanding.

// FIND ANY WORD WITH A CHARACTER REPEATED 3 OR MORE TIMES
$rgx
= '#'          // REGEX DELIMITER
. '(\w)'       // GROUP OF ANY WORD CHARACTER
. '\1'         // BACKREFERENCE TO GROUP 1
. '{2,}'       // REPEATED TWO OR MORE TIMES
. '#'          // REGEX DELIMITER
;
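
As a quick check of the commented pattern above, here is a minimal sketch that runs it against two made-up sample strings. The sample strings and the variable names $hit, $miss and $m are assumptions for illustration only.

```php
<?php
// Same pattern as above, written on one line.
$rgx = '#(\w)\1{2,}#';

// preg_match() returns 1 on a match and 0 when nothing matches.
$hit  = preg_match($rgx, 'helllo world', $m); // "lll" is one character three times in a row
$miss = preg_match($rgx, 'hello world');      // "ll" is only two in a row

echo $hit, ' ', $m[0], PHP_EOL; // 1 lll
echo $miss, PHP_EOL;            // 0
```

Note that $m[0] holds only the repeated run, not the whole word that contains it.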

 
käµfm³d 👽 Commented:
@Dan Craciun

However, your description is a bit off. The hash marks (#) are not part of the pattern itself; they are the pattern delimiters used by PHP. Likewise, the trailing "iUu" is not part of the pattern. Those characters are modifiers: "i" makes matching case-insensitive; "U" reverses the "greediness" of quantifiers; "u" treats the pattern and subject string as UTF-8.
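
To make the delimiter and modifier points concrete, here is a small sketch. The first part shows what "U" does to a simple pattern; the second runs the first preg_replace from the question against a made-up tag. The sample strings and variable names are assumptions for illustration, not data from the original script.

```php
<?php
// The # characters only mark where the pattern starts and ends;
// anything after the closing # is a modifier.
$subject = '<b>bold</b>';

preg_match('#<.+>#', $subject, $greedy); // default: quantifiers are greedy
preg_match('#<.+>#U', $subject, $lazy);  // U: quantifiers become lazy

echo $greedy[0], PHP_EOL; // <b>bold</b>
echo $lazy[0], PHP_EOL;   // <b>

// First expression from the question, applied to a sample tag.
$input = '<a href="x" onclick="evil()">link</a>';
$clean = preg_replace(
    '#(<[^>]+[\x00-\x20\"\'\/])(on|xmlns)[^>]*>#iUu',
    '$1>',
    $input
);

echo $clean, PHP_EOL; // <a href="x" >link</a>
```

In this sample, everything from "on" up to the closing ">" is dropped, so any attributes that follow the event handler would be lost too. Because a quote character is allowed just before "on", the pattern can also match an attribute value that merely begins with "on", which may contribute to over-stripping.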
 
garyhoffmann (Author) Commented:
@Dan Craciun - RegexBuddy does seem like it would be very helpful. It appears to have a "PHP Mode", so I'm hoping it deals with things such as those @kaufmed pointed out.

@kaufmed - without your help, I was feeling that I was even more confused - thank you!
 
käµfm³d 👽 Commented:
For what it's worth, these patterns look to be doing some sort of XML/HTML parsing. Generally speaking, regex isn't the tool for this. You'd typically use a library that is set up specifically for handling XML/HTML.
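
As a rough illustration of the library approach, here is a minimal sketch using PHP's bundled DOM extension to drop "on..." and "style" attributes. The function name strip_risky_attributes() and the sample markup are hypothetical, and this is not a complete sanitizer (for example, it does nothing about javascript: URLs, and loadHTML() needs extra care with non-ASCII input).

```php
<?php
// Hypothetical helper: removes on* event attributes and style attributes.
function strip_risky_attributes(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings about imperfect markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $html . '</div>'); // wrapper div lets us read the fragment back
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the attribute nodes into an array before removing any of them.
    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->query('//@*')) as $attr) {
        $name = strtolower($attr->nodeName);
        if (strpos($name, 'on') === 0 || $name === 'style') {
            $attr->ownerElement->removeAttributeNode($attr);
        }
    }

    // Serialize only the children of the wrapper div.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$clean = strip_risky_attributes('<p onclick="evil()" style="color:red" class="c">Hi</p>');
echo $clean, PHP_EOL; // <p class="c">Hi</p>
```

Working on parsed attributes means a value that happens to contain "on" or ">" is left alone, which is hard to guarantee with a pattern match on raw markup.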

Glad to help  = )
 
garyhoffmann (Author) Commented:
@kaufmed - they are. They are trying to strip potentially dangerous items out of user-submitted forms, but the problem is that they were stripping almost anything entered (into a WYSIWYG editor) and returning blank strings most of the time.
 
Ray Paseur Commented:
Ahh -- XML parsing?  Maybe you can post a new question with some examples of the data you want to redact.  We can help with that, and there is no REGEX involved!
 
Alistair George Commented:
The first expression is exactly the one I have been having problems with. I believe its syntax is wrong, causing error '4' as reported by preg_last_error(), and that this is the reason the O.P. found many null returns.
Where it's going wrong I don't know, as I have not found a decoding utility that shows where a regex is wrong.
Can anyone hazard a guess at what's wrong in expression 1? It applies to the others as well; they all come up with an error from preg_last_error().
RegexBuddy is decoding the modifiers as part of the expression, so I'm thinking the expression itself has erroneous syntax.
I'd like to know what the # delimiters mean, as there are various ones, including /, but I can't find any reference elsewhere.
Alistair
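
On the error '4' mentioned above: in PHP, that value of preg_last_error() corresponds to the constant PREG_BAD_UTF8_ERROR, which reports malformed UTF-8 data when the "u" modifier is in use rather than a pattern syntax problem. The sketch below reproduces it with the first expression. The sample string, and the guess that the real input was ISO-8859-1, are assumptions for illustration.

```php
<?php
$pattern = '#(<[^>]+[\x00-\x20\"\'\/])(on|xmlns)[^>]*>#iUu';

// "\xE9" is e-acute in ISO-8859-1, but on its own it is not valid UTF-8.
$bad = "<p>caf\xE9</p>";

$result    = preg_replace($pattern, '$1>', $bad);
$errorCode = preg_last_error();

var_dump($result);                            // NULL
var_dump($errorCode === PREG_BAD_UTF8_ERROR); // bool(true)

// Assumption: the input really is ISO-8859-1. Convert it before matching.
$fixedInput  = iconv('ISO-8859-1', 'UTF-8', $bad);
$fixedResult = preg_replace($pattern, '$1>', $fixedInput);

var_dump($fixedResult === $fixedInput);       // bool(true): nothing to strip here
```

If the subject string is not valid UTF-8, every expression carrying the "u" modifier would return NULL in the same way, which would be consistent with the blank results described earlier in the thread.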
 
käµfm³d 👽 Commented:
@Alistair George

I would suggest opening a new thread  = )