Need help with my Regex statement

I have the following Regular Expression:

(?:^|\s|[\(\)])(and|as|assert|break|class|continue|def|del|elif|else|except|False|finally|for|from|global|if|import|in|is|lambda|None|nonlocal|not|or|pass|raise|return|True|try|while|with|yield)(?:$|\s|[\(\)])

It is used for finding keywords in a Python script. Unfortunately, it also finds them inside quoted strings and after comment characters. (In Python, the # character starts a comment, so nothing after it should be matched, unless the # is itself inside quotes, in which case it is treated as a literal character.)

What do I need to do to this Regex to force it to not match if there is a non-quoted # character anywhere on the line before the keyword? Also, what do I have to do to make sure the keywords are ignored if they are enclosed in quotes?

In the following example:

# The following is used for iteration
for row in table.Rows
myVariable = "What is this text for anyway?"

There should be no matches on the first line, since its text is preceded by a '#' character.
The second line should match the words "for" and "in", since they are keywords that are not part of a comment or a quoted string.
There should be no matches on the third line, since the keywords "is" and "for" are enclosed in quotes.

Dan Craciun

For the # part, you could simply use
(?:#.*$)|
at the beginning. It will ignore the part after # up to the end of line.

With ", that is the tricky part. How do you know whether the keyword comes after an even number of " characters or not?

HTH,
Dan

Dan Craciun

(?:#.*$)|(?:".*$)|(?:^|\s|[\(\)])(and|as|assert|break|class|continue|def|del|elif|else|except|False|finally|for|from|global|if|import|in|is|lambda|None|nonlocal|not|or|pass|raise|return|True|try|while|with|yield)(?:$|\s|[\(\)])

This will pass your samples, but it's a very simplistic way to treat quotes.

Russ Suter

ASKER

That doesn't work. That means it will match the whole string. I need it to not match. The fact that it's in a non-capturing group doesn't mean it doesn't still match.

Dan Craciun

Yes, it will match.
And then you can test if capturing group 1 is empty or not.
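A minimal sketch of that group-1 test, assuming Python's re module and line-by-line processing (the thread does not say which regex engine the asker uses). The helper name keywords_in is made up for illustration. Comments and quoted text still produce matches, but group 1 is empty for them, so they are filtered out:

```python
import re

KEYWORDS = ("and|as|assert|break|class|continue|def|del|elif|else|except|False|"
            "finally|for|from|global|if|import|in|is|lambda|None|nonlocal|not|or|"
            "pass|raise|return|True|try|while|with|yield")

# Same shape as the expression above: the first two alternatives swallow a
# comment or everything from the first double quote to the end of the line.
PATTERN = re.compile(
    r'(?:#.*$)|(?:".*$)|(?:^|\s|[\(\)])(' + KEYWORDS + r')(?:$|\s|[\(\)])'
)

def keywords_in(line):
    """Return keywords found on one line (hypothetical helper, not from the thread)."""
    # group(1) is None when the comment or quote alternative matched.
    return [m.group(1) for m in PATTERN.finditer(line) if m.group(1) is not None]

samples = [
    "# The following is used for iteration",
    "for row in table.Rows",
    'myVariable = "What is this text for anyway?"',
]
for line in samples:
    print(keywords_in(line))
```

As noted, the quote handling is simplistic: anything after the first " on a line is skipped, including real keywords that follow a closed string.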

Anyway, that was my best idea for tonight. I'll let the others give it a shot :)

louisfr

Instead of a non-capturing group, you can maybe use a negative lookahead or lookbehind.
I have work to do right now, but I'll look a bit later if you don't figure it out before.

ASKER CERTIFIED SOLUTION

louisfr

This solution is only available to members.

louisfr

If your regex engine doesn't support lookbehind, you can try this:

(?-s)(?:(\#.*\r\n)*[^#"]*?([^#"]*?"[^"]*")*?[^#"]*?)(?:^|\s|[\(\)])(and|as|assert|break|class|continue|def|del|elif|else|except|False|finally|for|from|global|if|import|in|is|lambda|None|nonlocal|not|or|pass|raise|return|True|try|while|with|yield)(?=$|\s|[\(\)])

Russ Suter

ASKER

Winner! I'm not entirely certain why this works but it does. If you're feeling generous perhaps a little explanation of what exactly is going on here? If not, I still very much appreciate the assistance.

louisfr

I added three things.

The first might not be necessary: (?-s).
This ensures that the . does not match newline characters.
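To illustrate what the s option changes, here is a small sketch in Python, which is an assumption since the thread does not name the engine. In Python's re module the dot already excludes newlines by default, and re.DOTALL plays the role of turning the s option on:

```python
import re

text = "x = 1  # set x\nfor y in z"

# Default behaviour: . stops at the newline, so only the comment is matched.
without_s = re.search(r"#.*", text).group(0)

# With DOTALL (the "s" option on), . also matches the newline and the
# match runs on into the next line.
with_s = re.search(r"#.*", text, re.DOTALL).group(0)

print(repr(without_s))
print(repr(with_s))
```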

The second and third are negative lookbehind expressions.
A negative lookbehind starts with (?<!
A positive lookbehind starts with (?<=
A lookbehind expression checks the part of the string before the current scan point of the regex.

An example of positive lookbehind. This would look for any instance of "st" which is preceded by a digit:

(?<=\d)st

A negative lookbehind. This would look for "st" which is not preceded by a digit:

(?<!\d)st
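The two lookbehind examples can be tried out with a short sketch like this one, assuming Python's re module. The sample text and the helper name starts are made up for illustration:

```python
import re

text = "1st place, first stop"

def starts(pattern, s):
    """Return the start offset of every match (hypothetical helper)."""
    return [m.start() for m in re.finditer(pattern, s)]

# Positive lookbehind: only the "st" right after the digit in "1st".
after_digit = starts(r"(?<=\d)st", text)

# Negative lookbehind: the "st" in "first" and in "stop", but not in "1st".
not_after_digit = starts(r"(?<!\d)st", text)

print(after_digit, not_after_digit)
```

In both cases the match itself is just "st"; the digit is checked but never becomes part of the match.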

You could have used a positive lookbehind instead of your first non-capturing group.

There are also positive and negative lookaheads, (?= and (?! respectively, which check that what follows the current scan point matches or doesn't match an expression.
A lookahead expression could be used instead of your last non-capturing group.
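Here is a rough sketch of why that swap can matter, assuming Python's re module and a keyword list cut down to two words for brevity. A non-capturing group consumes the delimiter, so the space after one keyword is no longer available as the delimiter before the next. Python only accepts fixed-width lookbehinds, so the leading check is written as (?:^|(?<=[\s()])) here, which is my adaptation rather than anything from the thread:

```python
import re

line = "x not in y"

# Delimiters are consumed by the non-capturing groups.
consuming = re.compile(r"(?:^|\s|[()])(not|in)(?:$|\s|[()])")

# Delimiters are only checked, never consumed.
lookaround = re.compile(r"(?:^|(?<=[\s()]))(not|in)(?=$|\s|[()])")

consumed_result = consuming.findall(line)
lookaround_result = lookaround.findall(line)

print(consumed_result)
print(lookaround_result)
```

With the consuming version, the space after "not" is used up, so "in" is missed. The lookaround version finds both.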

Let's go back to your expression.
The first lookbehind I used checks that your matched text is not preceded by a # anywhere in the same line.
The second lookbehind checks that your matched text is not preceded by an odd number of quotes (start of line, text, quote, zero or more pairs of quotes, text).
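Those two checks can also be written out in plain code, which may make the logic easier to follow. This is only a sketch under assumptions, not a reconstruction of the regex itself: it uses Python, whose re module does not allow variable-length lookbehinds, so the "what comes before the keyword" test is done by scanning the prefix. The names in_comment_or_string and find_keywords are hypothetical, and only double quotes are handled (no single quotes, escapes or triple-quoted strings):

```python
import re

KEYWORDS = ("and|as|assert|break|class|continue|def|del|elif|else|except|False|"
            "finally|for|from|global|if|import|in|is|lambda|None|nonlocal|not|or|"
            "pass|raise|return|True|try|while|with|yield")

# Delimiters on both sides are checked with lookarounds, not consumed.
KEYWORD_RE = re.compile(r"(?:^|(?<=[\s()]))(" + KEYWORDS + r")(?=$|\s|[()])")

def in_comment_or_string(prefix):
    """True if prefix contains an unquoted # or ends inside an open "..." string."""
    in_string = False
    for ch in prefix:
        if ch == '"':
            in_string = not in_string      # odd number of quotes so far = inside a string
        elif ch == "#" and not in_string:
            return True                    # unquoted #: the rest of the line is a comment
    return in_string

def find_keywords(line):
    """Keywords on one line that are outside comments and double-quoted strings."""
    return [m.group(1) for m in KEYWORD_RE.finditer(line)
            if not in_comment_or_string(line[:m.start(1)])]

for sample in ("# The following is used for iteration",
               "for row in table.Rows",
               'myVariable = "What is this text for anyway?"',
               'x = "#" if y else z'):
    print(find_keywords(sample))
```

The last sample shows the quoted # case from the question: the # inside the string is treated as a literal, so "if" and "else" are still reported.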