newton-allan
re2c example program
I'd like to evaluate the re2c utility (regular expression 2 c). The documentation indicates it can take a script that defines a regular expression and generate the "scanner" routine.
www.re2c.org
There are some examples, but I'm finding them more obscure than I care to wrestle with. I want to be able to evaluate re2c and compare it to the boost regex-related libraries (regex, xpressive, and spirit) and to "hand-tuned" recognizers.
Here's a "stub" of the program I would like to get working:
char* scanner(char* str) // generated by re2c
{
}
void main(void)
{
char testStr[] =
"Alternate days of the week are Tue and Thursday and Sat and Monday. "
"And then Monday and Wed and Friday and Sun. ";
boost::regex reg("((Sunday|Sun)|(Monday|Mon)|(Tuesday|Tue)|"
"(Wednesday|Wed)|(Thursday|Thu)|(Friday|Fri)|(Saturday|Sat))");
int pos = 0;
char* result;
while ((result = scanner(&testStr[pos])) != NULL) {
printf("%.10s\n", result); // not aware of what is returned
pos += strlen(result); // not aware of how to advance to next search
}
}
ASKER
Wow ...
Couple of glitches, but otherwise does almost all of what I asked:
* bzero is non-standard (from mks?)
* Needs .h files for printf, memcpy, and memset (to replace bzero)
* redefinition of NULL gives warning
Two remaining questions:
* What should YYFILL(n) be? This becomes a "nop" with a compiler warning about "if statement being empty ... empty controlled statement found; is this the intent?".
Also, I think this #define has something to do with the result always being three letters long instead of the complete token:
res=Tue¦¦¦¦¦¦¦
res=Thu¦¦¦¦¦¦¦
res=Sat¦¦¦¦¦¦¦
res=Mon¦¦¦¦¦¦¦
res=Mon¦¦¦¦¦¦¦
etc.
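On the YYFILL question: as far as I understand the re2c documentation, the generated scanner calls YYFILL(n) from a bounds check of the form `if ((YYLIMIT - YYCURSOR) < n) YYFILL(n);`, which would explain why an empty macro leaves an empty `if` body and triggers the warning. The sketch below imitates that check by hand. It is not re2c output, and `has_at_least` is a hypothetical helper name. It assumes the whole NUL-terminated input is already in memory, so there is nothing to refill and YYFILL can expand to a harmless expression instead of to nothing.

```cpp
#include <cstddef>
#include <cstring>

// Nothing to refill for an in-memory string, but give the `if` a real body
// so the compiler no longer sees an empty controlled statement.
#define YYFILL(n) ((void)0)

// Hypothetical helper imitating the bounds check a generated scanner performs
// before it reads up to `n` more characters starting at `input + offset`.
// `offset` must not exceed the length of `input`.
// Returns true when at least `n` characters remain before the limit.
bool has_at_least(const char* input, std::size_t offset, std::ptrdiff_t n)
{
    const char* YYCURSOR = input + offset;
    const char* YYLIMIT = input + std::strlen(input); // end of the buffer
    if ((YYLIMIT - YYCURSOR) < n) YYFILL(n);          // no-op, but not empty
    return (YYLIMIT - YYCURSOR) >= n;
}
```

Note that this sketch points YYLIMIT at the end of the string rather than at the cursor. Whether that matters for the generated code depends on how the scanner is set up, so treat it as an assumption to check against the re2c manual.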
And finally (outside of the original question) ... can the scanner figure out and make known with "out" reference variables the "match-index", position/offset, and length of the token that was matched?
MatchIndex 0 = Sunday or Sun
MatchIndex 1 = Monday or Mon
MatchIndex 2 = Tuesday or Tue
etc.
int matchIndex, len, pos;
char* res = scan(curr, &matchIndex, &len, &pos);
So that something like the following could be shown:
Found:     Pos      Length   MatchIndex
---------  -------  -------  ----------
Tue        31       3        2
Thursday   39       8        4
etc.
Thanks VERY MUCH for your help on this. I was baffled.
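To make the requested interface concrete, here is a hand-written sketch (plain C++, not re2c output) of a `scan` that reports the match index, offset, and length through "out" variables. The signature is hypothetical: it is modelled on the call above but takes the whole text plus a start offset so that the position can be computed. In re2c itself the same effect could presumably be had by giving each day its own rule whose action sets the index, but that is an assumption I have not tested.

```cpp
#include <cstddef>
#include <cstring>

// Day names in MatchIndex order. The long form is tried before the short one
// so that "Thursday" is not reported as "Thu".
static const char* const kDays[7][2] = {
    {"Sunday", "Sun"},    {"Monday", "Mon"},   {"Tuesday", "Tue"},
    {"Wednesday", "Wed"}, {"Thursday", "Thu"}, {"Friday", "Fri"},
    {"Saturday", "Sat"}};

// Hypothetical signature modelled on the call in the question.
// Searches `text` from offset `start` (which must not exceed its length).
// On success returns a pointer to the match and fills the out-variables;
// returns NULL when no further day name is found.
const char* scan(const char* text, std::size_t start,
                 int* matchIndex, int* len, int* pos)
{
    for (const char* p = text + start; *p != '\0'; ++p) {
        for (int day = 0; day < 7; ++day) {
            for (int form = 0; form < 2; ++form) {
                const std::size_t n = std::strlen(kDays[day][form]);
                if (std::strncmp(p, kDays[day][form], n) == 0) {
                    *matchIndex = day;
                    *len = static_cast<int>(n);
                    *pos = static_cast<int>(p - text);
                    return p;
                }
            }
        }
    }
    return NULL;
}
```

A caller would loop with `start = pos + len` after each hit and print a row with something like `printf("%-9.*s  %-7d  %-7d  %d\n", len, res, pos, len, matchIndex);`.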
ASKER
Terrific. Thanks for your help. I'm going to post a very similar question to get an example program that illustrates the -f flag (which involves YYGETSTATE and YYSETSTATE).
ASKER
I stared at your suggested code (which I'll call DayOfWeekRecognizer.re) some more, and revised and simplified it to the version below, based on your very valuable recommendations. It provides the length of the matching string as an "out" variable, and returns true if a match was found.
// File DayOfWeekRecognizer.re
// re2c command line to generate
// re2c -s -i -b -s -oDayOfWeekRecognizer.cpp DayOfWeekRecognizer.re
// vc7.1 command line to compile/link:
// cl -O2 /DNDEBUG /D_CONSOLE /DWIN32 /D_MBCS DayOfWeekRecognizer.cpp
#include "stdio.h"
#include "string.h"
static char *pBacktrackInfo;
#define YYCTYPE char
#define YYCURSOR pStrToScan
#define YYLIMIT pStrToScan
#define YYMARKER pBacktrackInfo
#define YYFILL(n)
bool RecognizeDayOfWeek(char *pStrToScan, int* pLen)
{
char* pOrigStr = pStrToScan;
/*!re2c
(("Sunday"|"Sun")|
("Monday"|"Mon")|
("Tuesday"|"Tues")|
("Wednesday"|"Wed")|
("Thursday"|"Thu")|
("Friday"|"Fri")|
("Saturday"|"Sat"))
{
*pLen = YYCURSOR - pOrigStr;
return true;
}
[\000-\377] {return false;}
*/
}
void main(void)
{
char *testStr =
"Alternate days of the week are Tues and Thursday and Sat and Monday. "
"And then Monday and Wed and Friday and Sun. ";
bool bMatch;
char *pCurTestStrPos = testStr;
int len;
while (*pCurTestStrPos != '\0') {
bMatch = RecognizeDayOfWeek(pCurTestStrPos, &len);
if (bMatch)
{
printf("Day=%.*s len: %d\n", len, pCurTestStrPos, len);
}
pCurTestStrPos++;
}
}
Thank you.
As for the stateful algorithm and the -f flag, I recommend looking at the sources of projects that already use re2c (links to some of them are listed on the re2c site).
The scanner matches starting at the first character of the string it is given, so you cannot use '^' in your regular expressions.
Here is a working example for your regex:
#define NULL ((char*) 0)
static char *q;
char *scan(char *p){
#define YYCTYPE char
#define YYCURSOR p
#define YYLIMIT p
#define YYMARKER q
#define YYFILL(n)
/*!re2c
(("Sunday"|"Sun")|("Monday
[\000-\377] {return NULL;}
*/
}
int
main()
{
char *testStr =
"Alternate days of the week are Tue and Thursday and Sat and Monday. " "And then Monday and Wed and Friday and Sun. ";
char *match;
char *curr;
char buff[32]; /* the longest possible match */
curr=testStr;
while (*curr != '\0') {
match=scan(curr);
if (match)
{
bzero(buff, sizeof buff);
memcpy(buff, curr, q-curr);
printf("res=%.10s\n", buff); // not aware of what is returned
}
curr++;
}
}
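A possible reason for the three-letter results reported above: the example copies `q - curr` bytes, and `q` is YYMARKER. As I understand re2c, YYMARKER records the input position of the most recent shorter accepted match so that the scanner can back up if a longer alternative fails, while the end of the final match is YYCURSOR. The hand-written sketch below mimics that behaviour for the pair "Thursday"|"Thu". It is not re2c output, and `match_thursday`, `cursor`, and `marker` are illustrative names only.

```cpp
#include <cstddef>
#include <cstring>

// Matches "Thursday" or "Thu" at the start of `s` the way a longest-match
// scanner would: remember the short match, then try to extend it.
// *cursor receives the length of the match that is finally accepted;
// *marker receives the length saved at the backtrack point.
// Returns false (leaving both untouched) if `s` does not start with "Thu".
bool match_thursday(const char* s, std::size_t* cursor, std::size_t* marker)
{
    if (std::strncmp(s, "Thu", 3) != 0) return false;
    *marker = 3;                       // "Thu" accepted; remember where it ended
    if (std::strncmp(s + 3, "rsday", 5) == 0) {
        *cursor = 8;                   // the longer alternative succeeded
    } else {
        *cursor = *marker;             // back up to the saved short match
    }
    return true;
}
```

For the input "Thursday and" this leaves the marker at 3 and the cursor at 8. If the generated scanner behaves the same way, computing the length from the cursor, as in `YYCURSOR - pOrigStr`, should give the complete token.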