I need help deciphering some regular expressions

Posted on 2014-08-27
Last Modified: 2014-08-28
I've never been an expert in regular expressions. Below, I'm pasting in several preg_replace calls from a PHP script. I'm hoping that someone here who knows regular expressions like the back of their hand can tell me what these are doing faster than I could decipher them on my own.

      $string = preg_replace('#(<[^>]+[\x00-\x20\"\'\/])(on|xmlns)[^>]*>#iUu', "$1>", $string);

      $string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iUu', '$1=$2nojavascript...', $string);
echo "<br>String is now {$string}<br>";
      $string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iUu', '$1=$2novbscript...', $string);
echo "<br>String is now {$string}<br>";
      $string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*-moz-binding[\x00-\x20]*:#Uu', '$1=$2nomozbinding...', $string);
      $string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*data[\x00-\x20]*:#Uu', '$1=$2nodata...', $string);

      $string = preg_replace('#(<[^>]+[\x00-\x20\"\'\/])style[^>]*>#iUu', "$1>", $string);

      $string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);

Thank you in advance for any assistance.

Question by:garyhoffmann
    LVL 34

    Assisted Solution

    by:Dan Craciun
    You can use RegexBuddy to quickly get an explanation of your expressions. For example:
    Match the character “#” literally «#»
    Match the regex below and capture its match into backreference number 1 «(<[^>]+[\x00-\x20\"\'\/])»
       Match the character “<” literally «<»
       Match any character that is NOT a “>” «[^>]+»
          Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
       Match a single character present in the list below «[\x00-\x20\"\'\/]»
          A character in the range between these two characters «\x00-\x20»
             The NULL character «\x00»
             The character “ ” which occupies position 0x20 (32 decimal) in the character set «\x20»
          The literal character “"” «\"»
          The literal character “'” «\'»
          The literal character “/” «\/»
    Match the regex below and capture its match into backreference number 2 «(on|xmlns)»
       Match this alternative (attempting the next alternative only if this one fails) «on»
          Match the character string “on” literally (case insensitive) «on»
       Or match this alternative (the entire group fails if this one fails to match) «xmlns»
          Match the character string “xmlns” literally (case insensitive) «xmlns»
    Match any character that is NOT a “>” «[^>]*»
       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
    Match the character string “>#iUu” literally (case insensitive) «>#iUu»
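A quick way to check a breakdown like this is to run the pattern on a couple of small inputs. The sketch below uses the first pattern from the question unchanged; the two sample strings are made up for illustration and are not from the original script.

```php
<?php
// First pattern from the question, unchanged.
$pattern = '#(<[^>]+[\x00-\x20\"\'\/])(on|xmlns)[^>]*>#iUu';

// Hypothetical sample inputs (assumptions, not from the original script).
$samples = [
    '<a href="x" onclick="evil()">link</a>',
    "<img alt='only' src=\"a.png\">",
];

$results = [];
foreach ($samples as $html) {
    // Keep everything up to the character before "on"/"xmlns", then close the tag.
    $results[] = preg_replace($pattern, '$1>', $html);
}

print_r($results);
```

The first sample loses its onclick handler, as intended. The second shows how broad the pattern is: the quote in front of "only" is enough to trigger the "on" alternative, so the rest of the tag, including the src attribute, is cut.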


    LE: I think this is a perfect example of why you (and by that I mean the original programmer) should always comment your code.
    LVL 107

    Expert Comment

    by:Ray Paseur
    Link to purchase RegexBuddy here:

    This is helpful, too:

    FWIW, if you're originating the regular expressions, writing them like this with comments on separate lines helps with the understanding.

    $regex         // VARIABLE NAME ASSUMED; THE ORIGINAL POST OMITS IT
    = '#'          // REGEX DELIMITER
    . '(\w)'       // GROUP OF ANY WORD CHARACTER
    . '\1'         // BACKREFERENCE TO GROUP 1
    . '{2,}'       // REPEATED TWO OR MORE TIMES
    . '#'          // REGEX DELIMITER
    ;
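Concatenated, those pieces form the single pattern `#(\w)\1{2,}#`. A minimal sketch of what it matches follows; the variable names and test strings are assumptions chosen for illustration.

```php
<?php
// The pieces listed above, joined into one string.
$assembled = '#(\w)\1{2,}#';

// One captured character plus at least two repeats of it,
// so the same character must appear three or more times in a row.
$hits = [];
foreach (['aa', 'aaa', 'xyzzzy'] as $subject) {
    $hits[$subject] = preg_match($assembled, $subject);
}

print_r($hits);
```

Only the strings containing a run of three identical word characters match.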


    LVL 74

    Accepted Solution

    @Dan Craciun

    However, your description is a bit off. The hash marks (#) are not part of the pattern itself; they are the pattern delimiters that PHP requires. Likewise, the trailing "iUu" is not part of the pattern. Those letters are modifiers: "i" makes matching case-insensitive; "U" inverts the greediness of quantifiers, so they are lazy by default; "u" treats the pattern and subject string as UTF-8.
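To make the delimiter and modifier behavior concrete, here is a small sketch. The patterns and subjects are invented for illustration; only the delimiters and the i, U and u modifiers come from the thread.

```php
<?php
// Delimiters: "#" and "/" are interchangeable wrappers around the same pattern.
$withHash  = preg_match('#^ab+$#', 'abbb');
$withSlash = preg_match('/^ab+$/', 'abbb');

// "i": case-insensitive matching.
$caseless = preg_match('#hello#i', 'HELLO');

// "U": quantifiers are lazy by default instead of greedy.
preg_match('#<.+>#',  '<b>bold</b>', $greedy);
preg_match('#<.+>#U', '<b>bold</b>', $lazy);

// "u": pattern and subject are read as UTF-8, so "." consumes a whole
// character. "\xC3\xA9" is the two-byte UTF-8 encoding of "é".
$bytes = preg_match('#^.$#',  "\xC3\xA9");
$utf8  = preg_match('#^.$#u', "\xC3\xA9");

var_dump($withHash, $withSlash, $caseless, $greedy[0], $lazy[0], $bytes, $utf8);
```

Using "#" as the delimiter is a common choice for HTML-related patterns because "/" occurs in closing tags and would otherwise need escaping.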

    Author Comment

    @Dan Craciun - RegexBuddy does seem like it would be very helpful - it appears to have a "PHP Mode", so I'm hoping it deals with the things that @kaufmed pointed out.

    @kaufmed - without your help, I was feeling that I was even more confused - thank you!
    LVL 74

    Expert Comment

    by:käµfm³d 👽
    For what it's worth, these patterns look to be doing some sort of XML/HTML parsing. Generally speaking, regex isn't the tool for this. You'd typically use a library that is set up specifically for handling XML/HTML.
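As a rough sketch of the library approach, the following uses PHP's bundled DOM extension to drop event-handler and style attributes. The function name, the wrapper element and the sample markup are assumptions made for illustration; it assumes ASCII input and is not a complete sanitizer (it does not look at javascript: URLs, for instance).

```php
<?php
// Hypothetical helper, not part of the original script.
function strip_risky_attributes(string $html): string
{
    $previous = libxml_use_internal_errors(true); // keep sloppy markup from raising warnings
    $doc = new DOMDocument();
    // Wrap the fragment in a <div> so there is a single root element to read back from.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Copy the attribute nodes to an array first so removal does not disturb iteration.
    $attributes = iterator_to_array($xpath->query('//@*'));
    foreach ($attributes as $attr) {
        $name = strtolower($attr->nodeName);
        if (strpos($name, 'on') === 0 || $name === 'style') {
            $attr->ownerElement->removeAttributeNode($attr);
        }
    }

    // Serialize only the children of the wrapper <div>.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$clean = strip_risky_attributes('<p style="color:red" onclick="evil()" class="note">Hi</p>');
echo $clean, "\n";
```

Because the parser works on real attribute nodes, a value such as alt='only' is left alone; only attributes whose names start with "on" or equal "style" are removed.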

    Glad to help  = )

    Author Comment

    @kaufmed - they are - they are trying to strip potentially dangerous items out of user-submitted forms, but the problem is that they were stripping almost anything entered (into a WYSIWYG editor) and returning blank strings most of the time.
    LVL 107

    Expert Comment

    by:Ray Paseur
    Ahh -- XML parsing?  Maybe you can post a new question with some examples of the data you want to redact.  We can help with that, and there is no REGEX involved!

    Expert Comment

    by:Alistair George
    The first expression is exactly the one I have been having problems with, because its syntax is wrong, causing error '4' as reported by preg_last_error(), and this is the reason the O.P. found many null returns.
    Where it's going wrong I don't know, as there is no decoding utility which shows where a regex is wrong.
    Can anyone hazard a guess at what's wrong in expression 1? It applies to the others as well; they all come up with a last error.
    RegexBuddy is decoding the modifiers as part of the expression, so I'm thinking the expression itself has erroneous syntax.
    I'd like to know what the # delimiters mean, as there are various ones including /, but I can't find any reference elsewhere.
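One thing worth checking here: in PHP, the value 4 returned by preg_last_error() is the constant PREG_BAD_UTF8_ERROR, which concerns the subject string under the "u" modifier rather than the pattern's syntax. The sketch below reproduces a NULL result that way; the sample inputs are assumptions, and whether this is what happened in the original script is not confirmed by the thread.

```php
<?php
// The style-stripping pattern from the question, unchanged.
$pattern = '#(<[^>]+[\x00-\x20\"\'\/])style[^>]*>#iUu';

// Hypothetical input: "\xE9" is "é" in ISO-8859-1, which is not valid UTF-8.
$badSubject = "<p style=\"x\">caf\xE9</p>";

$result = preg_replace($pattern, '$1>', $badSubject);
$code   = preg_last_error();

var_dump($result);                       // NULL: the call failed
var_dump($code === PREG_BAD_UTF8_ERROR); // the error code is 4

// The same text encoded as valid UTF-8 ("\xC3\xA9") goes through normally.
$goodSubject = "<p style=\"x\">caf\xC3\xA9</p>";
$fixed = preg_replace($pattern, '$1>', $goodSubject);
var_dump($fixed);
```

If this is the cause, converting the input to UTF-8 before the preg_replace calls, or checking for a NULL return, would be the places to start.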
    LVL 74

    Expert Comment

    by:käµfm³d 👽
    @Alistair George

    I would suggest opening a new thread  = )
