
python get location in string

Dennie
Hi experts,

I need to find the locations of the variable separator (,) in a PHP function call.

Some example function calls are shown below. The separator locations are underlined.
test($var1, $var2, $var3);
test(array('hello, bye'), $bla);
test("string with 'quotes' \"quotes\" ",'ok',$bla);

How can I get the positions of the separators?
What is your original intention? Do you want to split the call at ',' to get the separate tokens?

I am not sure whether there is a standard Python module or function for parsing such arguments. I usually write my own parser based on a finite automaton. It may not be that short. It is actually equivalent to a regular expression; in my opinion, however, a finite automaton is easier to modify than a complex regular expression.

If that is acceptable, I can show you how to write such a parser. It is not difficult, but the code may be longer than you expect. Even though the result looks complicated, the parser is quite fast in use.
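For comparison, here is a rough sketch of the regular-expression route mentioned above. It is only an illustration: the names TOKEN and comma_positions are made up for this example, and the pattern assumes that string literals use backslash escapes and never span lines. String literals are matched first, so commas inside quotes are consumed as part of the literal and never reported.

```python
import re

# Hypothetical pattern for this sketch: a quoted string (with backslash
# escapes) or a bare comma. The string alternative comes first so that
# commas inside quotes are swallowed by the literal.
TOKEN = re.compile(r'''
    (?P<string>"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')
  | (?P<comma>,)
''', re.VERBOSE)


def comma_positions(line):
    """Return the positions of all commas that lie outside string literals."""
    return [m.start() for m in TOKEN.finditer(line) if m.lastgroup == 'comma']
```

Like any purely regular approach, this does not know about nested parentheses, so commas inside an inner function call are reported too.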
Here is a simple lexical analyzer (not optimized, particularly the branch for status == 1):
b.py
def lexAnalyzer(s):
    status = 0
    for n, c in enumerate(s):

        if status == 0:
            # single chars with meaning
            if c == '(':
                yield (n, 'lpar', '(')
            elif c == ')':
                yield (n, 'rpar', ')')
            elif c == ',':
                yield (n, 'comma', ',')
            elif c == ';':
                yield (n, 'semic', ';')
            elif c == '$':
                yield (n, 'dollarsign', '$')
            elif c == '\n':
                yield (n, 'newline', '\n')
            elif c in ' \t':                     # whitespace
                yield (n, 'whitespace', c)
            elif c == '"' or c == "'": # string literal starts
                qc = c       # quoting character
                value = []   # init
                start = n
                status = 3
            else:  # more of other characters that form an identifier
                value = [c]  # first of the chars
                start = n
                status = 1

        elif status == 1:  # collecting identifier
            # single chars with meaning
            if c == '(':
                yield (start, 'identifier', ''.join(value))
                yield (n, 'lpar', '(')
                status = 0
            elif c == ')':
                yield (start, 'identifier', ''.join(value))
                yield (n, 'rpar', ')')
                status = 0
            elif c == ',':
                yield (start, 'identifier', ''.join(value))
                yield (n, 'comma', ',')
                status = 0
            elif c == ';':
                yield (start, 'identifier', ''.join(value))
                yield (n, 'semic', ';')
                status = 0
            elif c == '$':
                yield (start, 'identifier', ''.join(value))
                yield (n, 'dollarsign', '$')
                status = 0
            elif c == '\n':
                yield (start, 'identifier', ''.join(value))
                yield (n, 'newline', '\n')
                status = 0
            elif c in ' \t':                     # whitespace
                yield (start, 'identifier', ''.join(value))
                yield (n, 'whitespace', c)
                status = 0
            else:
                value.append(c)

        elif status == 3:  # string literal
            if c == qc:      # enclosing quote character
                yield (start, 'stringliteral', ''.join(value))
                status = 0
            elif c == '\\':  # next character is escaped
                value.append(c)
                status = 6
            else:            # collect the characters
                value.append(c)

        elif status == 6:  # escaped char in a string literal
            value.append(c)
            status = 3       # back to the string literal

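        else:
            yield (n, 'unexpectedstatus', status)
            break

    # Flush an identifier that is still being collected when the input ends
    # (for example, a last line without a trailing newline).
    if status == 1:
        yield (start, 'identifier', ''.join(value))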


if __name__ == '__main__':
    # Each print() call takes a single argument, so this runs unchanged
    # on both Python 2 and Python 3.
    with open('data.txt') as f:
        for line in f:
            print('---------------------------')
            print(line.rstrip())
            lst = []
            for item in lexAnalyzer(line):
                print(item)
                if item[1] == 'comma':
                    lst.append(item[0])  # another comma position
            print('comma positions: %s' % lst)



With your test lines stored in data.txt, it prints the following on my console:
c:\tmp\___python\Dennie\Q_27676426>python b.py
---------------------------
test($var1, $var2, $var3);
(0, 'identifier', 'test')
(4, 'lpar', '(')
(5, 'dollarsign', '$')
(6, 'identifier', 'var1')
(10, 'comma', ',')
(11, 'whitespace', ' ')
(12, 'dollarsign', '$')
(13, 'identifier', 'var2')
(17, 'comma', ',')
(18, 'whitespace', ' ')
(19, 'dollarsign', '$')
(20, 'identifier', 'var3')
(24, 'rpar', ')')
(25, 'semic', ';')
(26, 'newline', '\n')
comma positions: [10, 17]
---------------------------
test(array('hello, bye'), $bla);
(0, 'identifier', 'test')
(4, 'lpar', '(')
(5, 'identifier', 'array')
(10, 'lpar', '(')
(11, 'stringliteral', 'hello, bye')
(23, 'rpar', ')')
(24, 'comma', ',')
(25, 'whitespace', ' ')
(26, 'dollarsign', '$')
(27, 'identifier', 'bla')
(30, 'rpar', ')')
(31, 'semic', ';')
(32, 'newline', '\n')
comma positions: [24]
---------------------------
test("string with 'quotes' \"quotes\" ",'ok',$bla);
(0, 'identifier', 'test')
(4, 'lpar', '(')
(5, 'stringliteral', 'string with \'quotes\' \\"quotes\\" ')
(39, 'comma', ',')
(40, 'stringliteral', 'ok')
(44, 'comma', ',')
(45, 'dollarsign', '$')
(46, 'identifier', 'bla')
(49, 'rpar', ')')
(50, 'semic', ';')
(51, 'newline', '\n')
comma positions: [39, 44]



At first glance, it seems to solve your problem. However, you did not ask about a case such as a nested function call:
---------------------------
test(myfunction(fa1, fa2), arg2, arg3);
(0, 'identifier', 'test')
(4, 'lpar', '(')
(5, 'identifier', 'myfunction')
(15, 'lpar', '(')
(16, 'identifier', 'fa1')
(19, 'comma', ',')
(20, 'whitespace', ' ')
(21, 'identifier', 'fa2')
(24, 'rpar', ')')
(25, 'comma', ',')
(26, 'whitespace', ' ')
(27, 'identifier', 'arg2')
(31, 'comma', ',')
(32, 'whitespace', ' ')
(33, 'identifier', 'arg3')
(37, 'rpar', ')')
(38, 'semic', ';')
(39, 'newline', '\n')
comma positions: [19, 25, 31]



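For that case, the lexical analyzer alone is not powerful enough: it also reports the comma at position 19, which belongs to the inner call to myfunction, alongside the real separators at 25 and 31.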
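One possible way to handle the nested case is to track the parenthesis depth as well and keep only the commas found at depth 1. The following is a minimal sketch, not a full PHP parser: the function name top_level_commas is hypothetical, and it assumes the line contains a single outer call whose string literals use backslash escapes.

```python
def top_level_commas(call):
    """Return positions of the commas that separate the outermost call's arguments."""
    positions = []
    depth = 0        # current parenthesis nesting level
    quote = None     # active quote character, or None outside a string literal
    escaped = False  # True right after a backslash inside a string literal
    for n, c in enumerate(call):
        if quote is not None:
            if escaped:
                escaped = False      # this character was escaped; skip it
            elif c == '\\':
                escaped = True       # the next character is escaped
            elif c == quote:
                quote = None         # closing quote ends the literal
        elif c == '"' or c == "'":
            quote = c                # opening quote starts a literal
        elif c == '(':
            depth += 1
        elif c == ')':
            depth -= 1
        elif c == ',' and depth == 1:
            positions.append(n)      # comma directly inside the outer call
    return positions
```

For the nested example above, this should skip the comma at position 19 and return only 25 and 31, while the three original test lines give the same positions as before.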

Ask about anything you do not understand ;)

Commented:
This code might be what you are looking for. However, it is not very flexible.
teststr = "test($var1, $var2, $var3);"
literal_list = ["'",'"']
delims = [',']
literal = ''

# Note: backslash-escaped quotes inside a string literal are not handled here.
for index, char in enumerate(teststr):
    if literal == '':
        if char in delims:
            print(index)
        elif char in literal_list:
            literal = char   # entering a string literal
    elif char == literal:
        literal = ''         # leaving the string literal



If you want to look only at the arguments rather than the entire call, replace

teststr = "test($var1, $var2, $var3);"


with
teststr = "test($var1, $var2, $var3);".split('(',1)[1].rsplit(')',1)[0]
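Keep in mind that positions found in the shortened string are relative to it, not to the original call. A small sketch of how the offset could be carried along is shown here; the helper name argument_span is hypothetical, and the plain comma search is only adequate for this input because it contains no string literals.

```python
def argument_span(call):
    """Return (offset, args): the text between the first '(' and the last ')'."""
    start = call.index('(') + 1   # first character after the opening parenthesis
    end = call.rindex(')')        # position of the last closing parenthesis
    return start, call[start:end]


offset, args = argument_span("test($var1, $var2, $var3);")
# Comma positions inside the argument text only.
relative = [i for i, ch in enumerate(args) if ch == ',']
# Shift them back so they refer to the original call string.
absolute = [offset + i for i in relative]
```

Here args holds the same text as the split/rsplit expression produces, and adding offset maps each position back to the full call.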



If none of the posts has what you need, you might want to explain a little more about what you are trying to do and why.