Russ Suter


Need help with Regex to extract a parameter list

I'm trying to extract a list of parameter names from a Python script. Here's an example of what I'm looking at:
def foo(cmd
    ,pIncludeAll #BOOL
    ,pOrderByDisplayOrder #BOOL
    ) :
    try:
        ....


Currently, I'm using a Regex that grabs everything between the parentheses then just splitting on the comma. This doesn't work in the above case since there are comments after 2 of the 3 parameters. My split string ends up looking like this:
cmd
pIncludeAll #BOOL
pOrderByDisplayOrder #BOOL


What I need is a Regex that will produce a match result that contains each of the parameters without the comment like this:
cmd
pIncludeAll
pOrderByDisplayOrder


I know I need to delimit the Regex match on commas, whitespace, and pound signs. I just don't know how to write the expression so that it will return a proper match against an arbitrary number of arguments.
HonorGod

You could do a trivial 2nd step and strip everything after "#" on each line.

What does your current RegEx look like?
Russ Suter

ASKER

My current Regex looks like this:
def\s+\w+\s*\((?<args>[^\)]+)\)


I'm aware of the 2nd step option but for some underlying technical reasons I cannot use that option. What I need is a Regex that returns a match with multiple groups. Right now the Regex returns a single group named "args" which looks like this:
args: cmd[CR][LF], param1 #bool[CR][LF], param2 #int[CR][LF], param3[CR][LF], param4 #date[CR][LF]


What I ideally need is a Regex that returns this:
args: cmd
args: param1
args: param2
args: param3
args: param4


I would be OK with returning values with included whitespace because I can just do a Trim() on that.
@Shaun Vermaak
That doesn't seem to work at all. I ran it through Expresso and it returned no matches.
If it doesn't support look-ahead etc. you will not be able to do it with RegEx
Both Expresso and C# (which I'm ultimately using) support look ahead. The Regex provided just doesn't work.
This should do what you want. \R just matches any line ending, so if it is not supported, replace it with something else that matches line endings (either generally or specifically in your file).
def\s+\w+\s*\(?:(\w+)(?:\s*#[^\R,\)]*)?(?:\s*,\s*(w+)(?:\s*#[^\R,\)]*)?)*\)


@wilcoxon

That looked so promising. Alas it didn't match anything at all. I did have to replace \R with \n (for newline) to get it to even execute without throwing an error but the end result is a failed match.
import re

# Grab everything between the parentheses of a "def" line (may span several lines).
defRE = re.compile(r"def\s+\w+\s*\((.*)\)", re.MULTILINE | re.DOTALL)

text = '''
def foo(cmd
    ,pIncludeAll #BOOL
    ,pOrderByDisplayOrder #BOOL
    ) :
    try:
        ...
'''

mo = defRE.search(text)
if mo:
    info = mo.group(1)
    print("Before:", info, type(info))
    print(" After:")
    for line in info.splitlines():
        # Second step: drop everything from "#" to the end of the line.
        print(re.sub('#.*$', '', line))
else:
    print('no match')
@HonorGod

That misses the point of my question. I know how to solve this problem in other ways. What I NEED is a Regex that does as I stated above.
Ah, a single regex to rule them all... ok.  Sorry.  I'll be watching.
Sorry - a couple typos - fixed...
def\s+\w+\s*\((\w+)(?:\s*#[^\n,\)]*)?(?:\s*,\s*(\w+)(?:\s*#[^\n,\)]*)?)*\s*\)


@wilcoxon

Oh, I feel like we're getting close, but having run it through C# I get the following result set:
[Screenshot]
As you can see, the 2nd parameter is missing from the capture groups.
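A likely explanation for the missing parameter, sketched here in Python rather than C#: when a capturing group sits inside a repeated (?:...)* clause, the group's value is only its final iteration. Python's re module behaves this way, and .NET's Groups[n].Value likewise reports the last capture. The sample text and the names PATTERN and SAMPLE are illustrative, not taken from the thread.

```python
import re

# The corrected pattern from the post above, split over two lines for readability.
PATTERN = re.compile(
    r"def\s+\w+\s*\((\w+)(?:\s*#[^\n,\)]*)?"
    r"(?:\s*,\s*(\w+)(?:\s*#[^\n,\)]*)?)*\s*\)"
)

SAMPLE = """def foo(cmd
    ,pIncludeAll #BOOL
    ,pOrderByDisplayOrder #BOOL
    ) :
"""

match = PATTERN.search(SAMPLE)
# Group 2 is matched twice (once per repetition), but only the last value is kept.
print(match.groups())  # ('cmd', 'pOrderByDisplayOrder')
```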
ASKER CERTIFIED SOLUTION
wilcoxon

This solution is only available to members.
On second thought, you could do it without | clauses but it still gets longer for each argument you want to handle.  Here would be a way to handle 1, 2 or 3 arguments.
def\s+\w+\s*\((\w+)(?:\s*#[^\n,\)]*)?(?:\s*,\s*(\w+)(?:\s*#[^\n,\)]*)?)?(?:\s*,\s*(\w+)(?:\s*#[^\n,\)]*)?)?\s*\)


It probably would work to just add more (?:...)? copied clauses to handle more arguments but it's not very robust.
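A minimal Python sketch of the fixed-count approach, using the three-argument pattern exactly as posted. The helper name fixed_params and the sample text are illustrative. Optional groups that did not take part in the match come back as None and are filtered out.

```python
import re

# One capture group per supported argument position (1, 2 or 3 arguments).
THREE_ARGS = re.compile(
    r"def\s+\w+\s*\((\w+)(?:\s*#[^\n,\)]*)?"
    r"(?:\s*,\s*(\w+)(?:\s*#[^\n,\)]*)?)?"
    r"(?:\s*,\s*(\w+)(?:\s*#[^\n,\)]*)?)?\s*\)"
)

def fixed_params(source):
    """Return the parameter names, or [] if the pattern does not match."""
    match = THREE_ARGS.search(source)
    if match is None:
        return []
    return [name for name in match.groups() if name is not None]

SAMPLE = """def foo(cmd
    ,pIncludeAll #BOOL
    ,pOrderByDisplayOrder #BOOL
    ) :
"""

print(fixed_params(SAMPLE))
print(fixed_params("def bar(x):"))
print(fixed_params("def baz(a, b, c, d):"))
```

The last call illustrates the robustness concern: a fourth argument makes the whole pattern fail rather than returning the first three names.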
OK, thanks for that last bit of info. In C#, repeated matches are exposed through Groups[x].Captures[y]. I was able to find the above parameters like this:
[Screenshot]
It's slightly fragmented in that the first parameter shows up in its own group and all subsequent parameters seem to show up in the second group. Is there a way to fix that or am I just going to have to live with it?
there is no other way to do it

Except that there is another way to do it:

(?<=def\s+\w+\s*\((?:\s*\w+(\s*#\w+)?\s*,\s*)*)\w+



string targetString = "the stuff";
string pattern = @"(?<=def\s+\w+\s*\((?:\s*\w+(\s*#\w+)?\s*,\s*)*)\w+";
MatchCollection matches = Regex.Matches(targetString, pattern);

foreach (Match m in matches)
{
    Console.WriteLine(m.Value);
}



[Screenshot]
This works by using a positive lookbehind to find the initial function declaration, followed by a sequence of zero or more complete parameters (each with an optional #WHATEVER after the param name, then a comma). A \w+ therefore matches only when everything between the "def" and that word is the opening parenthesis plus earlier parameters. Regex.Matches resumes scanning from the end of the previous match, so each call to the pattern yields the next parameter name as its own match.
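For readers without a .NET engine at hand, the same check can be approximated in Python without any lookbehind. This is a swapped-in technique, not the single-regex answer above: it finds the header once, then walks the list with matches anchored at the end of the previous parameter, which is the condition the lookbehind verifies. The names HEADER, PARAM and param_names are hypothetical, and comments are limited to #\w+ to mirror the pattern above.

```python
import re

HEADER = re.compile(r"def\s+\w+\s*\(")
# One parameter: name, optional "#WORD" comment, optional trailing comma.
PARAM = re.compile(r"\s*(\w+)(?:\s*#\w+)?\s*(,?)\s*")

def param_names(source):
    """Collect parameter names of the first def header found in source."""
    header = HEADER.search(source)
    if header is None:
        return []
    names = []
    pos = header.end()
    while True:
        # match() is anchored at pos, so nothing may be skipped between parameters.
        m = PARAM.match(source, pos)
        if m is None:
            break
        names.append(m.group(1))
        pos = m.end()
        if not m.group(2):  # no trailing comma: this was the last parameter
            break
    return names

SAMPLE = """def foo(cmd
    ,pIncludeAll #BOOL
    ,pOrderByDisplayOrder #BOOL
    ) :
"""

print(param_names(SAMPLE))
```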
kaufmed, at least according to https://regex101.com/, you can't have quantifiers in lookbehind (for both perl and python regex).  I don't use Python or C# so can't double-check and, since the question was about Python and C#, did not check Perl.
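The Python half of that observation can be checked directly. Python's standard re module requires lookbehinds to have a fixed width, so the pattern above is rejected at compile time. The helper name compiles_in_python_re is illustrative.

```python
import re

# The lookbehind pattern from the post above, unchanged.
LOOKBEHIND_PATTERN = r"(?<=def\s+\w+\s*\((?:\s*\w+(\s*#\w+)?\s*,\s*)*)\w+"

def compiles_in_python_re(pattern):
    """Return True if the standard re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

# Variable-width lookbehind (\s+, \w+, (...)*) is rejected by re.
print(compiles_in_python_re(LOOKBEHIND_PATTERN))
# A fixed-width lookbehind is accepted.
print(compiles_in_python_re(r"(?<=abc)\w+"))
```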

Russ Suter, you can probably get it to work by breaking the regex slightly.  This will work for your sample data but is definitely not as robust and may match bogus data.
def\s+\w+\s*\((?:(\w+)(?:\s*\#[^\n,\)]*)?\s*,?\s*)+\s*\)


@wilcoxon

In C# you can certainly have quantifiers in a lookbehind, which is why I went that route. It's one of the few engines that does support quantifiers in a lookbehind. I mean, I posted a screenshot with working code, after all  = )