I need help deciphering some regular expressions

I've never been an expert in regular expressions.  Below, I'm pasting in several preg_replace calls from a PHP script.  I'm hoping that someone here who knows regular expressions like the back of their hand can tell me what these are doing faster than I could possibly decipher them on my own.

      $string = preg_replace('#(<[^>]+[\x00-\x20\"\'\/])(on|xmlns)[^>]*>#iUu', "$1>", $string);

      $string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iUu', '$1=$2nojavascript...', $string);
echo "<br>String is now {$string}<br>";
     
      $string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iUu', '$1=$2novbscript...', $string);
echo "<br>String is now {$string}<br>";
     
      $string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*-moz-binding[\x00-\x20]*:#Uu', '$1=$2nomozbinding...', $string);
      $string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*data[\x00-\x20]*:#Uu', '$1=$2nodata...', $string);

      $string = preg_replace('#(<[^>]+[\x00-\x20\"\'\/])style[^>]*>#iUu', "$1>", $string);

      $string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);

Thank you in advance for any assistance.

Gary.
garyhoffmann Asked:

Dan Craciun, IT Consultant, Commented:
You can use RegexBuddy to quickly get an explanation of your expressions. For example:
#(<[^>]+[\x00-\x20\"\'\/])(on|xmlns)[^>]*>#iUu

Match the character “#” literally «#»
Match the regex below and capture its match into backreference number 1 «(<[^>]+[\x00-\x20\"\'\/])»
   Match the character “<” literally «<»
   Match any character that is NOT a “>” «[^>]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   Match a single character present in the list below «[\x00-\x20\"\'\/]»
      A character in the range between these two characters «\x00-\x20»
         The NULL character «\x00»
         The character “ ” which occupies position 0x20 (32 decimal) in the character set «\x20»
      The literal character “"” «\"»
      The literal character “'” «\'»
      The literal character “/” «\/»
Match the regex below and capture its match into backreference number 2 «(on|xmlns)»
   Match this alternative (attempting the next alternative only if this one fails) «on»
      Match the character string “on” literally (case insensitive) «on»
   Or match this alternative (the entire group fails if this one fails to match) «xmlns»
      Match the character string “xmlns” literally (case insensitive) «xmlns»
Match any character that is NOT a “>” «[^>]*»
   Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character string “>#iUu” literally (case insensitive) «>#iUu»
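
For a quick sanity check, here is a minimal sketch that runs the first pattern against two made-up sample strings (the sample HTML is my assumption, not taken from the original script). It suggests the pattern drops everything from the first "on..." or "xmlns" match through the end of the tag, and that it can also fire on an attribute value that merely starts with "on".

```php
<?php
// The first pattern from the question, unchanged.
$pattern = '#(<[^>]+[\x00-\x20\"\'\/])(on|xmlns)[^>]*>#iUu';

// Hypothetical sample inputs, chosen only for illustration.
$withHandler = '<a href="x" onclick="alert(1)">link</a>';
$harmless    = '<p class="one">text</p>';

// The event handler is dropped; $1 (everything before "on") is kept.
$cleaned = preg_replace($pattern, '$1>', $withHandler);
echo $cleaned, "\n";   // <a href="x" >link</a>

// False positive: the quote before "one" satisfies the character class,
// so the attribute value is cut off mid-way.
$damaged = preg_replace($pattern, '$1>', $harmless);
echo $damaged, "\n";   // <p class=">text</p>
```

The second result may go some way toward explaining over-aggressive stripping on ordinary markup.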

Later edit: I think this is a perfect example of why you (and by that I mean the original programmer) should always comment your code.
Ray Paseur Commented:
Link to purchase RegexBuddy here:
http://www.regexbuddy.com/tutorial.html

This is helpful, too:
http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/pdf/

FWIW, if you're originating the regular expressions, writing them like this with comments on separate lines helps with the understanding.

// FIND ANY WORD WITH A CHARACTER REPEATED 3 OR MORE TIMES
$rgx
= '#'          // REGEX DELIMITER
. '(\w)'       // GROUP OF ANY WORD CHARACTER
. '\1'         // BACKREFERENCE TO GROUP 1
. '{2,}'       // REPEATED TWO OR MORE TIMES
. '#'          // REGEX DELIMITER
;
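
Another option, if you would rather keep the comments inside the pattern itself, is PCRE's x (extended) modifier, which ignores unescaped whitespace in the pattern and treats # as the start of a comment. Here is a small sketch of the same idea; the ~ delimiter and the test words are my own choices, picked so that # stays free for comments.

```php
<?php
// Same idea as the concatenated version, using the x (extended) modifier.
$rgx = '~
    (\w)     # group 1: any single word character
    \1{2,}   # that same character again, two or more times
~x';

// Hypothetical test words, for illustration only.
var_dump(preg_match($rgx, 'helllo', $m)); // int(1), with $m[0] === 'lll'
var_dump(preg_match($rgx, 'hello'));      // int(0): "ll" is only two in a row
```

Either layout works; the concatenated form keeps the comments in PHP, while the x form keeps them in the regex.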

käµfm³d 👽 Commented:
@Dan Craciun

However, your description is a bit off. The hash marks (#) are not a part of the pattern itself. Rather, they are the pattern delimiters used by PHP. Likewise, the trailing "iUu" is not a part of the pattern; those letters are modifiers: "i" makes the match case-insensitive; "U" reverses the "greediness" of quantifiers; "u" treats the pattern and subject string as UTF-8.
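
To make that concrete, here is a short sketch of delimiters and of each of the three modifiers in isolation. The sample strings are assumptions picked for illustration; only the preg_match calls and modifier letters come from PHP itself.

```php
<?php
$subject = '<b>one</b><b>two</b>';

// A delimiter can be any non-alphanumeric, non-backslash, non-whitespace
// character; # and ~ behave identically here.
preg_match('#<b>.*</b>#', $subject, $hash);
preg_match('~<b>.*</b>~', $subject, $tilde);
var_dump($hash[0] === $tilde[0]);   // bool(true)

// Default (greedy): .* grabs as much as it can.
echo $hash[0], "\n";                // <b>one</b><b>two</b>

// U flips quantifiers to lazy: .* now stops at the first </b>.
preg_match('#<b>.*</b>#U', $subject, $lazy);
echo $lazy[0], "\n";                // <b>one</b>

// i makes the match case-insensitive.
var_dump(preg_match('#onclick#i', 'ONCLICK'));     // int(1)

// u makes . match one UTF-8 character instead of one byte.
// "\xC3\xA9" is the two-byte UTF-8 encoding of "é".
var_dump(preg_match('#^.$#', "\xC3\xA9"));         // int(0)
var_dump(preg_match('#^.$#u', "\xC3\xA9"));        // int(1)
```

Using # rather than / as the delimiter is convenient for HTML patterns, since a literal / in a closing tag then needs no escaping.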
garyhoffmann (Author) Commented:
@Dan Craciun - RegexBuddy does seem like it would be very helpful - it appears to have a "PHP Mode", so I'm hoping it deals with things such as those @kaufmed pointed out.

@kaufmed - without your help, I was feeling that I was even more confused - thank you!
käµfm³d 👽 Commented:
For what it's worth, these patterns look to be doing some sort of XML/HTML parsing. Generally speaking, regex isn't the right tool for this. You'd typically use a library that is set up specifically for handling XML/HTML.

Glad to help  = )
garyhoffmann (Author) Commented:
@kaufmed - they are. They are trying to strip potentially dangerous items out of user-submitted forms, but the problem is that they were stripping almost anything entered (into a WYSIWYG editor) and returning blank strings most of the time.
Ray Paseur Commented:
Ahh -- XML parsing?  Maybe you can post a new question with some examples of the data you want to redact.  We can help with that, and there is no REGEX involved!
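
As a rough sketch of the non-regex route (my own illustration, not code from the original script): PHP's bundled DOM extension can parse the markup so that attributes are removed by name rather than by pattern. The function name stripRiskyAttributes and the sample input are hypothetical. This only covers on* and style attributes; it does not handle javascript: URLs or anything else, so it is not a complete XSS filter.

```php
<?php
// Hypothetical helper: remove event-handler (on*) and style attributes.
// Assumes the DOM extension is enabled and $html is a UTF-8 fragment.
function stripRiskyAttributes(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>'
        . $html
        . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('*') as $element) {
        // Copy the names first so the attribute map is not modified mid-iteration.
        $names = [];
        foreach ($element->attributes as $attribute) {
            $names[] = $attribute->nodeName;
        }
        foreach ($names as $name) {
            $lower = strtolower($name);
            if (strpos($lower, 'on') === 0 || $lower === 'style') {
                $element->removeAttribute($name);
            }
        }
    }

    // Return only what was inside <body>, i.e. the original fragment.
    $body = $doc->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// class="one" should survive; onclick and style should not.
echo stripRiskyAttributes('<p class="one" onclick="alert(1)" style="color:red">Hello</p>'), "\n";
```

Because the check is made against the attribute name, a value such as class="one" is left alone. For production use, a vetted sanitizing library would be a safer choice than a hand-rolled filter like this one.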
Alistair George (Mr) Commented:
The first expression is exactly the one I have been having problems with, because its syntax is wrong, causing error '4' as reported by preg_last_error(), and this is the reason the O.P. found many null returns.
Where it's going wrong I don't know, as there is no decoding utility that shows where a regex is wrong.
Would anyone hazard a guess at what's wrong in expression 1? It applies to the others as well; they all come up with an error from preg_last_error().
RegexBuddy is decoding the modifiers as part of the expression, so I'm thinking the expression itself has erroneous syntax.
I'd like to know what the # delimiters mean, as there are various others including /, but I can't find any reference elsewhere.
Alistair
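
For reference, the value 4 from preg_last_error() corresponds to PHP's PREG_BAD_UTF8_ERROR constant, which points at the subject string rather than the pattern syntax. One plausible explanation, and it is only an assumption since the actual input isn't shown in this thread, is that the text being filtered is not valid UTF-8 while the patterns carry the u modifier. A minimal sketch with a made-up input:

```php
<?php
// "caf\xE9" is "café" encoded as ISO-8859-1; the lone \xE9 byte is not valid UTF-8.
// (Hypothetical input, for illustration only.)
$latin1 = "caf\xE9";

// With the u modifier, PCRE refuses a subject that is not valid UTF-8.
$result = preg_replace('#caf#u', 'tea', $latin1);
var_dump($result);                                    // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);  // bool(true); the constant's value is 4

// An empty pattern with u works as a cheap validity check
// before the real replacements run.
var_dump(preg_match('//u', $latin1));                 // bool(false)
var_dump(preg_match('//u', "caf\xC3\xA9"));           // int(1)
```

If that check fails on real input, converting the text to UTF-8 before filtering (or dropping the u modifier, if the data is known to be single-byte) would be the things to try.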
käµfm³d 👽 Commented:
@Alistair George

I would suggest opening a new thread  = )