newton-allan

re2c example program

I'd like to evaluate the re2c utility (regular expression 2 c). The documentation indicates it can take a script that defines a regular expression and generate the "scanner" routine.
www.re2c.org

There are some examples, but I'm finding them more obscure than I care to wrestle with. I want to be able to evaluate re2c and compare it to the boost regex-related libraries (regex, xpressive, and spirit) and to "hand-tuned" recognizers.

Here's a "stub" of the program I would like to get working:

char* scanner(char* str)  // generated by re2c
{
}

int main(void)
{
   char  testStr[] =
          "Alternate days of the week are Tue and Thursday and Sat and Monday. "
          "And then Monday and Wed and Friday and Sun. ";

   boost::regex reg("((Sunday|Sun)|(Monday|Mon)|(Tuesday|Tue)|"
             "(Wednesday|Wed)|(Thursday|Thu)|(Friday|Fri)|(Saturday|Sat))");

  int pos = 0;
  char* result;

  while ((result = scanner(&testStr[pos])) != NULL) {
     printf("%.10s\n", result);  // not aware of what is returned
     pos += strlen(result);       // not aware of how to advance to next search
  }
}
Arty K

I've just read the manual, and how to use the scanner seems fairly clear to me.
It matches starting at the first character of the input, so you cannot use '^' in your regular expressions.

Here is a working example for your regex:

#define NULL            ((char*) 0)

static char *q;

char *scan(char *p){
#define YYCTYPE         char
#define YYCURSOR        p
#define YYLIMIT         p
#define YYMARKER        q
#define YYFILL(n)
/*!re2c
  (("Sunday"|"Sun")|("Monday"|"Mon")|("Tuesday"|"Tue")|("Wednesday"|"Wed")|("Thursday"|"Thu")|("Friday"|"Fri")|("Saturday"|"Sat"))          {return YYCURSOR;}
  [\000-\377]     {return NULL;}
*/
}

int
main()
{
   char  *testStr =
          "Alternate days of the week are Tue and Thursday and Sat and Monday. "
          "And then Monday and Wed and Friday and Sun. ";
  char *match;
  char *curr;
  char buff[32]; /* the longest possible match */

  curr=testStr;
  while (*curr != '\0') {
     match=scan(curr);
     if (match)
     {
      bzero(buff, sizeof buff);
      memcpy(buff, curr, q-curr);
      printf("res=%.10s\n", buff);  // not aware of what is returned
     }
     curr++;
  }
}
newton-allan

ASKER

Wow ...

Couple of glitches, but otherwise does almost all of what I asked:

* bzero is non-standard (from mks?)
* Needs .h files for printf, memcpy, and memset (to replace bzero)
* redefinition of NULL gives warning

Two remaining questions:

* What should YYFILL(n) be? This becomes a "nop" with a compiler warning about "if statement being empty  ... empty controlled statement found; is this the intent?".

Also, I think this #define has something to do with the result always being three letters long instead of the complete token:
res=Tue¦¦¦¦¦¦¦
res=Thu¦¦¦¦¦¦¦
res=Sat¦¦¦¦¦¦¦
res=Mon¦¦¦¦¦¦¦
res=Mon¦¦¦¦¦¦¦
etc.
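
On the `YYFILL(n)` question: as far as I know, re2c calls `YYFILL(n)` when fewer than `n` characters remain before `YYLIMIT`, so that a buffered scanner can read more input. For a NUL-terminated string that is already in memory, one option is to switch the check off with the `re2c:yyfill:enable = 0;` configuration (available in newer re2c releases), which should also remove the empty-statement warning. The three-letter results probably have a different cause: the caller measures the match with `q` (`YYMARKER`, the backtrack position saved after the short form) rather than with the cursor returned by `scan`. A minimal sketch under those assumptions follows; `scan_day` and `marker` are hypothetical names, and the file is re2c input, so it has to go through re2c before it is compiled.

```c
/* re2c input sketch (run it through re2c first); scan_day is a hypothetical name. */
#define YYCTYPE   char
#define YYCURSOR  p
#define YYMARKER  marker

/* Returns the position just past the day name that starts at p, or 0 if none. */
static const char *scan_day(const char *p)
{
    const char *marker;   /* backtrack position maintained by the generated code */
/*!re2c
  re2c:yyfill:enable = 0;

  (("Sunday"|"Sun")|
   ("Monday"|"Mon")|
   ("Tuesday"|"Tue")|
   ("Wednesday"|"Wed")|
   ("Thursday"|"Thu")|
   ("Friday"|"Fri")|
   ("Saturday"|"Sat"))    { return YYCURSOR; }
  [\000-\377]             { return 0; }
*/
}
```

With this shape the caller can compute the match length as `end - curr` from the returned pointer instead of reading `YYMARKER`.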

And finally (outside of the original question) ... can the scanner figure out and make known with "out" reference variables the "match-index", position/offset, and length of the token that was matched?
MatchIndex 0 = Sunday or Sun
MatchIndex 1 = Monday or Mon
2 = Tuesday or Tue
etc.

int matchIndex, len, pos;
char* res = scan(curr, &matchIndex, &len, &pos);

So that something like the following could be shown:
Found:       Pos     Length  MatchIndex
---------  -------    -------  ------
Tue            31         3          2
Thursday    39         8          4
etc.

Thanks VERY MUCH for your help on this. I was baffled.

ASKER CERTIFIED SOLUTION
Arty K
This solution is only available to members.
Terrific. Thanks for your help. I'm going to post a very similar question to get an example program that illustrates their -f flag (which involves YYGETSTATE and YYSETSTATE).
I stared at your suggested code (which I'll call DayOfWeekRecognizer.re) some more, and revised and simplified it to the following, based on your very valuable recommendations. It reports the length of the matching string through an "out" variable and returns a bool indicating whether a match was found.

// File DayOfWeekRecognizer.re
// re2c command line to generate
// re2c -s -i -b -oDayOfWeekRecognizer.cpp DayOfWeekRecognizer.re
// vc7.1 command line to compile/link:
// cl -O2 /DNDEBUG /D_CONSOLE /DWIN32 /D_MBCS DayOfWeekRecognizer.cpp
#include <stdio.h>
#include <string.h>

static char *pBacktrackInfo;

#define YYCTYPE         char
#define YYCURSOR        pStrToScan
#define YYLIMIT         pStrToScan
#define YYMARKER        pBacktrackInfo
#define YYFILL(n)

bool RecognizeDayOfWeek(char *pStrToScan, int* pLen)
{
   char* pOrigStr = pStrToScan;
/*!re2c
  (("Sunday"|"Sun")|
   ("Monday"|"Mon")|
   ("Tuesday"|"Tues")|
   ("Wednesday"|"Wed")|
   ("Thursday"|"Thu")|
   ("Friday"|"Fri")|
   ("Saturday"|"Sat"))
  {
     *pLen = YYCURSOR - pOrigStr;
     return true;
  }
  [\000-\377]     {return false;}
*/
}

int main(void)
{
   char  *testStr =
          "Alternate days of the week are Tues and Thursday and Sat and Monday. "         
          "And then Monday and Wed and Friday and Sun. ";
  bool  bMatch;
  char *pCurTestStrPos = testStr;
  int  len;

  while (*pCurTestStrPos != '\0') {
     bMatch = RecognizeDayOfWeek(pCurTestStrPos, &len);
     if (bMatch)
     {
        printf("Day=%.*s  len: %d\n", len, pCurTestStrPos, len);
     }
     pCurTestStrPos++;
  }
}
Thank you.
As for the stateful algorithm and the -f flag, I recommend looking at the sources of projects
that already use re2c (links to some of them are listed on the re2c site).
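
The earlier request for a match index, position, and length is not covered by the code posted in this thread. To show what such an interface could look like, here is a hand-written sketch in plain C. It is not re2c output, and every name in it (`kDays`, `match_day_at`, `find_day`) is hypothetical. The long form of each day is tried first, so "Thursday" is reported rather than "Thu".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table: row index = match index (0 = Sunday ... 6 = Saturday).
   The long form comes first so it wins over the abbreviation. */
static const char *const kDays[7][2] = {
    {"Sunday", "Sun"},       {"Monday", "Mon"},     {"Tuesday", "Tue"},
    {"Wednesday", "Wed"},    {"Thursday", "Thu"},   {"Friday", "Fri"},
    {"Saturday", "Sat"}
};

/* Try to match a day name starting exactly at s.
   On success, store the match index and length and return 1; otherwise return 0. */
static int match_day_at(const char *s, int *matchIndex, int *len)
{
    int i, j;
    assert(s != NULL && matchIndex != NULL && len != NULL);
    for (i = 0; i < 7; ++i) {
        for (j = 0; j < 2; ++j) {
            size_t n = strlen(kDays[i][j]);
            if (strncmp(s, kDays[i][j], n) == 0) {
                *matchIndex = i;
                *len = (int)n;
                return 1;
            }
        }
    }
    return 0;
}

/* Find the next day name at or after offset start (start must not exceed
   the length of text). Returns a pointer to the match and fills the three
   out parameters, or returns NULL if no day name is found. */
static const char *find_day(const char *text, size_t start,
                            int *matchIndex, int *len, int *pos)
{
    size_t i;
    assert(text != NULL && pos != NULL);
    for (i = start; text[i] != '\0'; ++i) {
        if (match_day_at(text + i, matchIndex, len)) {
            *pos = (int)i;
            return text + i;
        }
    }
    return NULL;
}
```

A caller would loop with `start = pos + len` after each hit and print `%.*s` with `len` and the returned pointer. On the first test sentence from the question this yields position 31, length 3, index 2 for "Tue" and position 39, length 8, index 4 for "Thursday".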