dshrenik
asked on
Regular Expression Parser
I am planning to implement a Regular Expression parser (for my own checktyoes) in Python.
My regular expression is of the following form:
Checktype1{n1,n2} Checktype2{n1,n2} ... Checktypen{n1,n2}
This regular expression is used for word matching (match a collection of words in a paragraph of text). Each checktype is an atomic function that tries to match a complete word, and see if it has the required properties. {n1,n2} have the conventional meaning of a Regular Expression - lower and upper limit for the number of occurrences in series.
I have not yet decided if I will use the 'or' operator ( '|' operator in a standard RegEx)
And I must obviously allow grouping of checktypes with parentheses.
I am not sure about the approach as well. If I'm not wrong, I must first define a grammar for this and collect groups of checktypes such that each group tries to match a single word.
eg: ( C1{1,2} or C2{0,2} ) C3{1,2}
This RegEx tries to match 2 successive words (though it has 3 checktypes)
So. I must break down this RegEx into 2 groups - ( C1{1,2} or C2{0,2} ), and C3{1,2}
I can then feed these to a Regular Expression matcher. Is my approach correct?
I am not sure as to how I must implement the same. I have read about NFA/DFA and backtracking. I know that backtracking is less efficient, but I don't think my regular expressions will require too much of backtracking. Hence, I feel that it may not be a bad idea.
Please let me know of other kinds of implementations as well. I also want to know which implementation will be most efficient. Any source code (preferably in Python) / sample code / references will be great.
My regular expression is of the following form:
Checktype1{n1,n2} Checktype2{n1,n2} ... Checktypen{n1,n2}
This regular expression is used for word matching (match a collection of words in a paragraph of text). Each checktype is an atomic function that tries to match a complete word, and see if it has the required properties. {n1,n2} have the conventional meaning of a Regular Expression - lower and upper limit for the number of occurrences in series.
I have not yet decided if I will use the 'or' operator ( '|' operator in a standard RegEx)
And I must obviously allow grouping of checktypes with parentheses.
I am not sure about the approach as well. If I'm not wrong, I must first define a grammar for this and collect groups of checktypes such that each group tries to match a single word.
eg: ( C1{1,2} or C2{0,2} ) C3{1,2}
This RegEx tries to match 2 successive words (though it has 3 checktypes)
So. I must break down this RegEx into 2 groups - ( C1{1,2} or C2{0,2} ), and C3{1,2}
I can then feed these to a Regular Expression matcher. Is my approach correct?
I am not sure as to how I must implement the same. I have read about NFA/DFA and backtracking. I know that backtracking is less efficient, but I don't think my regular expressions will require too much of backtracking. Hence, I feel that it may not be a bad idea.
Please let me know of other kinds of implementations as well. I also want to know which implementation will be most efficient. Any source code (preferably in Python) / sample code / references will be great.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
It will be nice if I could get a step by step approach to my problem.
Can I just clarify what you need?
Do you want to rewrite the python module so that when you use \w+ it replaces that with your own code?
Do you want to rewrite the python module so that when you use \w+ it replaces that with your own code?
ASKER
It looks like if I could replace \w with my own checktype (function), that will do my job. As a first step, yes, I want to know if rewriting the python module so that I can replace \w with my own code is possible..
Also, the other thing is, how do I match 1 or more \w. I know that I can use r"(\w+)*", but then this will collect a series of \w into a single group. I would want each want each \w to be collected in a different group.
Also, the other thing is, how do I match 1 or more \w. I know that I can use r"(\w+)*", but then this will collect a series of \w into a single group. I would want each want each \w to be collected in a different group.
ASKER
I have attached a code snipper. I would want to do something similar, except that I would want to replace \w+, which represents any word, with my own checktype. Any idea how I can achieve that?
Open in new window