Russ Suter

asked on

Need help with my Regex statement

I have the following Regular Expression:
(?:^|\s|[\(\)])(and|as|assert|break|class|continue|def|del|elif|else|except|False|finally|for|from|global|if|import|in|is|lambda|None|nonlocal|not|or|pass|raise|return|True|try|while|with|yield)(?:$|\s|[\(\)])

It is used for finding keywords in a Python script. Unfortunately, it also finds them inside quoted strings and after comment characters. (In Python the # character is the start of a comment and nothing after that character should be matched UNLESS that character is inside quotes in which case it is treated as a literal)

What do I need to do to this Regex to force it to not match if there is a non-quoted # character anywhere on the line before the keyword? Also, what do I have to do to make sure the keywords are ignored if they are enclosed in quotes?

In the following example:
# The following is used for iteration
  for row in table.Rows
    myVariable = "What is this text for anyway?"
There should be no matches on the first line, since it begins with a '#' character.
The second line should match the words "for" and "in", since they are keywords that are not part of a comment or a quoted string.
There should be no matches on the third line, since the keywords "is" and "for" are enclosed in quotes.
Dan Craciun

For the # part, you could simply use
(?:#.*$)|
at the beginning. It will ignore the part after # up to the end of line.

With ", that is the tricky part. How do you know whether the keyword comes after an even number of " or not?

HTH,
Dan
(?:#.*$)|(?:".*$)|(?:^|\s|[\(\)])(and|as|assert|break|class|continue|def|del|elif|else|except|False|finally|for|from|global|if|import|in|is|lambda|None|nonlocal|not|or|pass|raise|return|True|try|while|with|yield)(?:$|\s|[\(\)])

This will pass your samples, but it's a very simplistic way to treat quotes.
Russ Suter

ASKER

That doesn't work. That means it will match the whole string. I need it to not match. The fact that it's in a non-capturing group doesn't mean it doesn't still match.
Yes, it will match.
And then you can test if capturing group 1 is empty or not.
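
The "match it, then test group 1" idea can be sketched in Python roughly as follows. This is an illustration under assumptions, not the exact expression posted above: the keyword list is shortened, the quote branch matches a closed "..." pair rather than everything up to the end of the line, and the delimiters are written as lookarounds so they are not consumed. The name find_keywords is made up for this example.

```python
import re

# Shortened keyword list for illustration; the full list is in the question.
KEYWORDS = r"(for|in|is|if|else)"

# Branches are tried left to right at each position. Comments and quoted
# text are consumed by the first two branches, so the keyword branch never
# sees them. Only the keyword branch fills capturing group 1.
PATTERN = re.compile(
    r'#.*$'                 # comment: from # to the end of the line
    r'|"[^"\n]*"'           # double-quoted text on a single line
    r'|(?<![^\s()])' + KEYWORDS + r'(?![^\s()])',
    re.MULTILINE,
)

def find_keywords(source):
    """Return keywords found outside comments and double-quoted strings."""
    return [m.group(1) for m in PATTERN.finditer(source) if m.group(1)]

sample = (
    "# The following is used for iteration\n"
    "  for row in table.Rows\n"
    '    myVariable = "What is this text for anyway?"\n'
)
print(find_keywords(sample))  # ['for', 'in']
```

The regex still matches the comment and the quoted string, but those matches leave group 1 empty and are filtered out. Single-quoted and triple-quoted strings, escaped quotes and unclosed quotes are not handled in this sketch.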

Anyway, that was my best idea for tonight. I'll let the others give it a shot :)
Instead of a non-capturing group, you can maybe use a negative lookahead or lookbehind.
I have work to do right now, but I'll look a bit later if you don't figure it out before.
ASKER CERTIFIED SOLUTION
louisfr

If your regex engine doesn't support lookbehind, you can try this:
(?-s)(?:(\#.*\r\n)*[^#"]*?([^#"]*?"[^"]*")*?[^#"]*?)(?:^|\s|[\(\)])(and|as|assert|break|class|continue|def|del|elif|else|except|False|finally|for|from|global|if|import|in|is|lambda|None|nonlocal|not|or|pass|raise|return|True|try|while|with|yield)(?=$|\s|[\(\)])

Winner! I'm not entirely certain why this works but it does. If you're feeling generous perhaps a little explanation of what exactly is going on here? If not, I still very much appreciate the assistance.
I added three things.

The first might not be necessary: (?-s).
This ensures that the . does not match newline characters.
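
To see what the s flag changes, here is a small Python sketch. In Python's re module the same flag is spelled re.DOTALL (or (?s) inside the pattern), and it is off by default. The sample text is made up for illustration.

```python
import re

text = "x = 1  # a comment\ny = 2"

# Default: . stops at the newline, so the match ends with the comment.
default = re.search(r'#.*', text).group()

# With DOTALL (the s flag), . also matches the newline and runs on.
dotall = re.search(r'#.*', text, re.DOTALL).group()

print(repr(default))  # '# a comment'
print(repr(dotall))   # '# a comment\ny = 2'
```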

The second and third are negative lookbehind expressions.
A negative lookbehind starts with (?<!
A positive lookbehind would start with (?<=
A lookbehind expression checks the part of the string before the current scan point of the regex.

An example of positive lookbehind. This would look for any instance of "st" which is preceded by a digit:
(?<=\d)st

A negative lookbehind. This would look for "st" which is not preceded by a digit:
(?<!\d)st
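
Both "st" examples can be tried in Python, whose re module supports fixed-width lookbehind. The sample text is made up for illustration; the printed numbers are the start offsets of each matched "st".

```python
import re

text = "1st first 21st"

# Positive lookbehind: "st" only when the previous character is a digit.
after_digit = [m.start() for m in re.finditer(r'(?<=\d)st', text)]

# Negative lookbehind: "st" only when the previous character is not a digit.
not_after_digit = [m.start() for m in re.finditer(r'(?<!\d)st', text)]

print(after_digit)      # [1, 12]
print(not_after_digit)  # [7]
```

In both cases the match itself is just "st"; the digit is checked but is not part of the match.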

You could have used a positive lookbehind instead of your first non-capturing group.

There are also positive and negative lookaheads, (?= and (?! respectively, which check that what follows the current scan point matches or doesn't match an expression.
A lookahead expression could be used instead of your last non-capturing group.
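
A short Python sketch of why this matters: when the delimiter groups consume the surrounding whitespace, two adjacent keywords cannot both match. The keyword list is shortened to two words for illustration. Python's re only accepts fixed-width lookbehind, so the leading check is written here as (?<![^\s()]), meaning "not preceded by anything other than whitespace or a parenthesis"; that spelling is an assumption of this sketch, not part of the original expression.

```python
import re

# Delimiters in non-capturing groups: the space after "not" is consumed,
# so "in" no longer has a delimiter in front of it.
consuming = re.compile(r'(?:^|\s|[()])(not|in)(?:$|\s|[()])')

# Delimiters as lookarounds: checked, but left in place for the next match.
lookaround = re.compile(r'(?<![^\s()])(not|in)(?=$|\s|[()])')

text = "x not in y"
print(consuming.findall(text))   # ['not']
print(lookaround.findall(text))  # ['not', 'in']
```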

Let's go back to your expression.
The first lookbehind I used checks that your matched text is not preceded by a # anywhere in the same line.
The second lookbehind checks that your matched text is not preceded by an odd number of quotes (start of line, text, quote, zero or more pairs of quotes, text).
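
The two checks can also be written out step by step. The sketch below is in Python, whose re module does not allow variable-length lookbehind, so it applies the same two conditions in ordinary code instead of inside the regex. It is not the expression from the accepted answer: the keyword list is shortened, only double quotes are handled, and the function name keywords_in_line is made up for this example.

```python
import re

# Shortened keyword list for illustration; the full list is in the question.
KEYWORD = re.compile(r'(?<![^\s()])(for|in|is)(?![^\s()])')

def keywords_in_line(line):
    """Return keywords on one line that are outside comments and "..." strings."""
    found = []
    for m in KEYWORD.finditer(line):
        before = line[:m.start()]
        # Remove complete "..." pairs, so a # inside quotes is not counted
        # and only an unmatched (odd) quote can remain.
        unquoted = re.sub(r'"[^"]*"', '', before)
        # A leftover # means a comment has started (or sits in an unclosed
        # string); a leftover quote means an odd number of quotes precede
        # the keyword. Either way the keyword is rejected.
        if '#' in unquoted or '"' in unquoted:
            continue
        found.append(m.group(1))
    return found

print(keywords_in_line('# The following is used for iteration'))  # []
print(keywords_in_line('  for row in table.Rows'))                 # ['for', 'in']
print(keywords_in_line('    myVariable = "What is this text for anyway?"'))  # []
```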