Russ Suter
asked on
Need help with my Regex statement
I have the following Regular Expression:
(?:^|\s|[\(\)])(and|as|assert|break|class|continue|def|del|elif|else|except|False|finally|for|from|global|if|import|in|is|lambda|None|nonlocal|not|or|pass|raise|return|True|try|while|with|yield)(?:$|\s|[\(\)])
It is used for finding keywords in a Python script. Unfortunately, it also finds them inside quoted strings and after comment characters. (In Python the # character starts a comment, and nothing after it should be matched UNLESS the # is inside quotes, in which case it is treated as a literal.)
What do I need to do to this Regex to force it to not match if there is a non-quoted # character anywhere on the line before the keyword? Also, what do I have to do to make sure the keywords are ignored if they are enclosed in quotes?
In the following example:
# The following is used for iteration
for row in table.Rows
myVariable = "What is this text for anyway?"
There should be no matches for the first line since it is preceded by a '#' character.
The second line should match the words "for" and "in" since they are keywords not considered part of a comment or a quoted string.
There should be no matches for the third line since the keywords "is" and "for" are enclosed in quotes.
(?:#.*$)|(?:".*$)|(?:^|\s|[\(\)])(and|as|assert|break|class|continue|def|del|elif|else|except|False|finally|for|from|global|if|import|in|is|lambda|None|nonlocal|not|or|pass|raise|return|True|try|while|with|yield)(?:$|\s|[\(\)])
This will pass your samples, but it's a very simplistic way to treat quotes.
ASKER
That doesn't work. That means it will match the whole string. I need it to not match. The fact that it's in a non-capturing group doesn't mean it doesn't still match.
Yes, it will match.
And then you can test if capturing group 1 is empty or not.
Anyway, that was my best idea for tonight. I'll let the others give it a shot :)
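For what it's worth, the "test capturing group 1" idea could be sketched like this in Python. It assumes Python's re module and line-by-line processing; the helper name find_keywords is made up for illustration. The comment and quote alternatives swallow the rest of the line, so only matches with a non-empty group 1 count as keywords.

```python
import re

# Keyword list copied from the regex in the question.
KEYWORDS = (
    "and|as|assert|break|class|continue|def|del|elif|else|except|False|"
    "finally|for|from|global|if|import|in|is|lambda|None|nonlocal|not|or|"
    "pass|raise|return|True|try|while|with|yield"
)

# The first two alternatives consume a comment, or everything from the first
# double quote to the end of the line; only the third one fills group 1.
PATTERN = re.compile(
    r'(?:#.*$)|(?:".*$)|(?:^|\s|[()])(' + KEYWORDS + r')(?:$|\s|[()])'
)


def find_keywords(line):
    """Hypothetical helper: keep only matches where group 1 is non-empty."""
    return [m.group(1) for m in PATTERN.finditer(line) if m.group(1)]


for sample in (
    "# The following is used for iteration",
    "for row in table.Rows",
    'myVariable = "What is this text for anyway?"',
):
    print(repr(sample), find_keywords(sample))
```

The quote handling really is simplistic: anything after the first double quote on a line is skipped, including keywords that follow a closed string.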
Instead of a non-capturing group, you can maybe use a negative lookahead or lookbehind.
I have work to do right now, but I'll look a bit later if you don't figure it out before.
ASKER CERTIFIED SOLUTION
If your regex engine doesn't support lookbehind, you can try this:
(?-s)(?:(\#.*\r\n)*[^#"]*?([^#"]*?"[^"]*")*?[^#"]*?)(?:^|\s|[\(\)])(and|as|assert|break|class|continue|def|del|elif|else|except|False|finally|for|from|global|if|import|in|is|lambda|None|nonlocal|not|or|pass|raise|return|True|try|while|with|yield)(?=$|\s|[\(\)])
ASKER
Winner! I'm not entirely certain why this works but it does. If you're feeling generous perhaps a little explanation of what exactly is going on here? If not, I still very much appreciate the assistance.
I added three things.
The first might not be necessary: (?-s).
This ensures that the . does not match newline characters.
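If you happen to be testing with Python's re module (an assumption on my part, your engine may differ), the corresponding switch is the re.DOTALL flag, which is off by default. A small sketch of the difference, with made-up sample text:

```python
import re

text = "a = 1  # first comment\nb = 2  # second comment"

# Default: "." stops at the newline, so each comment is matched on its own line.
print(re.findall(r"#.*", text))

# With DOTALL, "." also matches "\n", so the first match runs to the end of the text.
print(re.findall(r"#.*", text, re.DOTALL))
```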
The second and third are negative lookbehind expressions.
A negative lookbehind starts with (?<!
A positive lookbehind would start with (?<=
A lookbehind expression checks the part of the string before the current scan point of the regex.
An example of positive lookbehind. This would look for any instance of "st" which is preceded by a digit:
(?<=\d)st
A negative lookbehind. This would look for "st" which is not preceded by a digit:
(?<!\d)st
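To see the two side by side, here is a quick sketch using Python's re module (my choice for the illustration; the sample string is made up). It prints the start positions of each "st" that passes the lookbehind.

```python
import re

text = "1st first 21st"

# Positive lookbehind: "st" only when the previous character is a digit.
after_digit = [m.start() for m in re.finditer(r"(?<=\d)st", text)]

# Negative lookbehind: "st" only when the previous character is not a digit.
not_after_digit = [m.start() for m in re.finditer(r"(?<!\d)st", text)]

print(after_digit)      # positions of "st" in "1st" and "21st"
print(not_after_digit)  # position of "st" in "first"
```

In both cases the digit itself is not part of the match; the lookbehind only checks it.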
You could have used a positive lookbehind instead of your first non-capturing group.
There are also positive and negative lookaheads, (?= and (?! respectively, which check that what follows the current scan point matches or doesn't match an expression.
A lookahead expression could be used instead of your last non-capturing group.
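A sketch of why that swap can matter, assuming Python's re module. One caveat: Python's re only accepts fixed-width lookbehind, so instead of a positive lookbehind on "start of line, whitespace or parenthesis" the sketch uses the negative form (?<![^\s()]), which is my substitution and not part of the original expression.

```python
import re

# Keyword list copied from the regex in the question.
KEYWORDS = (
    "and|as|assert|break|class|continue|def|del|elif|else|except|False|"
    "finally|for|from|global|if|import|in|is|lambda|None|nonlocal|not|or|"
    "pass|raise|return|True|try|while|with|yield"
)

# Original style: the boundary characters are consumed by the match.
consuming = re.compile(r"(?:^|\s|[()])(" + KEYWORDS + r")(?:$|\s|[()])")

# Lookaround style: the boundaries are only checked, never consumed.
# (?<![^\s()]) means "not preceded by anything other than whitespace or a
# parenthesis", which also holds at the very start of the line.
lookaround = re.compile(r"(?<![^\s()])(" + KEYWORDS + r")(?=$|\s|[()])")

line = "if x not in y"
print(consuming.findall(line))   # the space after "not" is used up, so "in" is missed
print(lookaround.findall(line))  # all three keywords are found
```

With the non-capturing groups, two keywords separated by a single space share that space, and only the first one can claim it.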
Let's go back to your expression.
The first lookbehind I used checks that your matched text is not preceded by a # anywhere in the same line.
The second lookbehind checks that your matched text is not preceded by an odd number of quotes (start of line, text, quote, zero or more pairs of quotes, text).
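Lookbehinds like these have to look back over a variable amount of text, which Python's built-in re module does not accept. If you ever need the same two checks there, one rough equivalent is to find keyword candidates first and then scan the text before each one. This swaps the lookbehinds for a plain loop; the helper names are hypothetical, and it assumes double-quoted strings only, with no escaped quotes.

```python
import re

# Keyword list copied from the regex in the question.
KEYWORDS = (
    "and|as|assert|break|class|continue|def|del|elif|else|except|False|"
    "finally|for|from|global|if|import|in|is|lambda|None|nonlocal|not|or|"
    "pass|raise|return|True|try|while|with|yield"
)

# Candidate keywords, delimited by start/end of line, whitespace or parentheses.
CANDIDATE = re.compile(r"(?<![^\s()])(" + KEYWORDS + r")(?=$|\s|[()])")


def inside_comment_or_string(prefix):
    """Return True if the prefix contains an unquoted #, or ends inside an
    open double-quoted string (an odd number of quotes so far)."""
    in_string = False
    for ch in prefix:
        if ch == '"':
            in_string = not in_string
        elif ch == "#" and not in_string:
            return True
    return in_string


def find_keywords(line):
    """Keep only the candidates that are not in a comment or a string."""
    return [
        m.group(1)
        for m in CANDIDATE.finditer(line)
        if not inside_comment_or_string(line[: m.start()])
    ]


print(find_keywords('x = "# not a comment" if y else z'))
```

Unlike the "swallow everything after the first quote" version, this one still sees keywords that come after a closed string, and it treats a # inside quotes as a literal.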
(?:#.*$)|
at the beginning. It will ignore the part after # up to the end of line.
With ", that is the tricky part. How do you know whether the keyword comes after an even number of " characters or not?
HTH,
Dan