Link to home
Start Free TrialLog in
Avatar of dshrenik
dshrenikFlag for United States of America

asked on

Regular Expression Parser

I am planning to implement a Regular Expression parser (for my own checktyoes) in Python.

My regular expression is of the following form:
Checktype1{n1,n2} Checktype2{n1,n2} ... Checktypen{n1,n2}

This regular expression is used for word matching (match a collection of words in a paragraph of text). Each checktype is an atomic function that tries to match a complete word, and see if it has the required properties. {n1,n2} have the conventional meaning of a Regular Expression - lower and upper limit for the number of occurrences in series.  

I have not yet decided if I will use the 'or' operator ( '|' operator in a standard RegEx)
And I must obviously allow grouping of checktypes with parentheses.

I am not sure about the approach as well. If I'm not wrong, I must first define a grammar for this and collect groups of checktypes such that each group tries to match a single word.
eg: ( C1{1,2} or C2{0,2} )  C3{1,2}
This RegEx tries to match 2 successive words (though it has 3 checktypes)
So. I must break down this RegEx into 2 groups - ( C1{1,2} or C2{0,2} ), and  C3{1,2}
I can then feed these to a Regular Expression matcher. Is my approach correct?

I am not sure as to how I must implement the same. I have read about NFA/DFA and backtracking. I know that backtracking is less efficient, but I don't think my regular expressions will require too much of backtracking. Hence, I feel that it may not be a bad idea.

Please let me know of other kinds of implementations as well. I also want to know which implementation will be most efficient. Any source code (preferably in Python) / sample code / references will be great.
ASKER CERTIFIED SOLUTION
Avatar of LeeeRussell
LeeeRussell
Flag of United Kingdom of Great Britain and Northern Ireland image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of dshrenik

ASKER

Thank you. I did find the file re.py that implements regular expression module. But, it does not have enough documentation so that I can understand the whole code.

I have attached a code snipper. I would want to do something similar, except that I would want to replace \w+, which represents any word,  with my own checktype. Any idea how I can achieve that?
>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)       # The entire match
'Isaac Newton'
>>> m.group(1)       # The first parenthesized subgroup.
'Isaac'
>>> m.group(2)       # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')

Open in new window

It will be nice if I could get a step by step approach to my problem.
Can I just clarify what you need?

Do you want to rewrite the python module so that when you use \w+ it replaces that with your own code?
It looks like if I could replace \w with my own checktype (function), that will do my job. As a first step, yes, I want to know if rewriting the python module so that I can replace \w with my own code is possible..

Also, the other thing is, how do I match 1 or more \w. I know that I can use r"(\w+)*", but then this will collect a series of \w into a single group. I would want each want each \w to be collected in a different group.