unknown_

pass code to lexer

Hello,

How can I tweak the code [code snippet] so that it reads ASCII text and passes that text to the lexer, which should then return one token on each call?

At the moment it has a function which reads a file.

Thanks in advance for any help !!!
unknown_

ASKER


%{
		char *buf;
		
		char *read_file(const char *fn)
		{
			FILE *fp;
			char *buf;
			size_t size;
	
			fp = fopen(fn, "rb");
			if (!fp)
			return NULL;
			
			fseek(fp, 0, SEEK_END);
			size = ftell(fp);
			fseek(fp, 0, SEEK_SET);
			
			buf = (char *) malloc(size + 1);
		    if (!buf) {
		    fclose(fp);
		    return NULL;
		}
		
		    size = fread(buf, 1, size, fp);
		    buf[size] = '\0';
		    fclose(fp);
		
		return buf;
	
%}
 
LETTER          [a-zA-Z_]
DIGIT           [0-9]
LETTERDIGIT     [a-zA-Z0-9_]
SIGN            [-+]
STRINGCONSTANT  \"[^"\n]*["\n]
CHARCONSTANT    \'[^'\n]*\'
RANKSPEC        \[[,]*\]
INTEGER			{digit}+
VARIABLE        [a-z_]({LETTERDIGIT})*
 
						  
%%
						  
"+"                  { return PLUS;       }
"-"                  { return MINUS;      }
"*"                  { return TIMES;      }
"/"                  { return SLASH;      }
"("                  { return LPAREN;     }
")"                  { return RPAREN;     }
";"                  { return SEMICOLON;  }
","                  { return COMMA;      }
"="                  { return EQL;        }
"|"                  { return OR;         }
"&"                  { return AND;        }
"&&"                 { return AND2;       }
 
"if"                 { return K_IF;       }
"else"               { return K_ELSE;     }
"do"                 { return K_DO;       }
"int"                { return K_INT;      }
"return"             { return K_RETURN;   }
"void"               { return K_VOID;     }
"float"              { return K_FLOAT;    }
"while"              { return WHILESYM;   }
						  
						  
{LETTER}{LETTERDIGIT}* {
		yylval.id = new Identifier(yytext); 
		yylval.id->line = line;
		return(IDENTIFIER);
	}
 
{VARIABLE}* {
		yylval.lit = new Liter(yytext); 
	    yylval.lit->type = t_var;
		return(LITERAL);
	}
 
							  
{SIGN}?{DIGIT}+"."{DIGIT}+ {    
		yylval.lit = new Literal(yytext); 
		yylval.lit->type = t_float;
		return(LITERAL);
	}
							  
{SIGN}?{DIGIT}+ {    
	    yylval.lit = new Literal(yytext); 
		yylval.lit->type = t_int32;
		return(LITERAL);
	}
							  
{STRINGCONSTANT} {
	    yylval.lit = new Literal(yytext); 
		yylval.lit->type = t_string;
		return(LITERAL);
	}
						  
						  
[ \t\n\r]            /* skip whitespace */
						  
.                    { printf("Unknown character [%c]\n",yytext[0]);
					   return UNKNOWN;    }
 
%%
 
int yywrap(void){return 1;}


Not sure what you are asking. Can you clarify?

What do you mean "read an ascii code/text" ?

By default, lex/flex reads from the FILE * yyin, which is set to STDIN. Are you able to run your lexer on a file or on standard console input?

mylexer < testfile.txt

I mean, how can my lexer actually read the input file and do the appropriate tokenization?
At the moment the code only describes the lexical elements of a C-like language.
You need a main() that calls the lexer.

Have you learned about yylex() yet? yylex() is how the lexer is called. If you are working towards a full compiler, eventually you won't call yylex() directly, but for now, you need to.

Where is your main program?

If you don't write an explicit main, you can link with the lex/flex library (-ll or -lfl), which provides a default main() that simply calls the lexer and returns.

Have you even compiled this yet?

I ran it through flex and gcc and it has C syntax errors.


Your section of C code in between the %{ %} delimiters won't compile in its current form. You are missing a closing bracket for your read_file() function.
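
For reference, here is a minimal sketch of how read_file() might look once the brace is closed, written as plain C so it can be compiled and checked separately from the grammar. The ftell()/fseek() error checks, the use of long for the size, and the assert are additions for illustration, not part of the posted code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read a whole file into a NUL-terminated heap buffer.
 * Returns NULL on failure; the caller must free() the result. */
char *read_file(const char *fn)
{
    FILE *fp;
    char *buf;
    long size;
    size_t got;

    assert(fn != NULL);

    fp = fopen(fn, "rb");
    if (!fp)
        return NULL;

    /* Find the file size by seeking to the end, then rewind. */
    if (fseek(fp, 0, SEEK_END) != 0 || (size = ftell(fp)) < 0 ||
        fseek(fp, 0, SEEK_SET) != 0) {
        fclose(fp);
        return NULL;
    }

    buf = malloc((size_t)size + 1);
    if (!buf) {
        fclose(fp);
        return NULL;
    }

    /* Terminate at the number of bytes actually read. */
    got = fread(buf, 1, (size_t)size, fp);
    buf[got] = '\0';
    fclose(fp);

    return buf;
}   /* this closing brace is the one missing from the posted code */
```

With flex specifically (not POSIX lex), a NUL-terminated buffer like this can be handed to yy_scan_string() so that yylex() scans the string instead of yyin.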

I can't test the rest of it because I don't have your token declarations. Where are your tokens declared (like PLUS, MINUS, K_IF, K_ELSE) ? Normally you declare/define them in your parser grammar (yacc/bison) and use #include in your lex grammar to pull them in. If you aren't using a parser grammar yet, then you need to define them otherwise.
I haven't started working on the yacc/bison part yet, but it should be something like this, right?
extern "C"
{
	int yyparse(void);
	int yylex(void);  
	int yywrap()
	{
		return 1;
	}
	
}
 
extern int yydebug;
 
main()
{
	yydebug=1;
	yyparse();
}
 
 
statement:
			expression                      
			| VARIABLE '=' expression       
;
 
expression: INTEGER 
			|
			VARIABLE
			|
			exp '+' exp
			|
			exp '-' exp
			|
			exp '*' exp
			|
			exp '/' exp
			| '(' expression ')'
;
 
operators_punctuation:  '+'
					    | '-'
						| '*'
						| '('
						| ')'
						| ','
						| ';'
						| '='
						| '/'
						| '%'
						| '||'
						| '|'
						| '&&'
						| '&'
			
;


Yes, but back to my original response, you need to declare your tokens to compile your lexer. Right now, your lexer does not compile. Just because you can run lex/flex on it does not mean it generated a valid C program. You have to compile the program that lex generates.

lex lex.l
cc lex.yy.c

I recommend taking a step back and focusing on fixing your lexer properly, without any fancy stuff like opening a file, and just making it work. I think you are getting ahead of yourself by adding tokens to your grammar before your code compiles. If you have a grammar with 1 token and invalid C code, it's not worth anything except maybe to satisfy your professor's visual check. :)

I would start by declaring all of your tokens as #define in your lexer grammar, like so. Until you declare your tokens, your lexer will never compile into a valid executable program.


%{
 
#define PLUS 1
#define MINUS 2
// ...
#define K_IF 100
#define K_ELSE 101
 
%}


So if I understood your response correctly, you meant something like this, right?

If so, does the lexer still need yylex() in order to be complete, or not?
%{
 
#define COMMENT		1
#define VARIABLE	2
#define INTEGER		3
#define FLOAT		4
#define STRING		5
#define K_IF		6
#define K_ELSE		7
#define K_WHILE		8
#define K_INT		9
#define K_VOID		10
#define K_RETURN	11
#define K_FLOAT		12
#define PLUS		13
#define MINUS		14
#define TIMES		15
#define SLASH		16
#define LPAREN		17
#define RPAREN		18	
#define SEMICOLON	19
#define COMMA		20
#define EQL			21
#define OR			22
#define OR2			23
#define AND			24
#define AND2		25
		
	
		
%}
	
	LETTER          [a-zA-Z_]
	DIGIT           [0-9]
	LETTERDIGIT     [a-zA-Z0-9_]
	SIGN            [-+]
	STRINGCONSTANT  \"[^"\n]*["\n]
    CHARCONSTANT    \'[^'\n]*\'
	RANKSPEC        \[[,]*\]
	INTEGER			{digit}+
	VARIABLE        [a-z_]({LETTERDIGIT})*
							  
							  
    %%
							  
  "+"                  { return PLUS;       }
  "-"                  { return MINUS;      }
  "*"                  { return TIMES;      }
  "/"                  { return SLASH;      }
  "("                  { return LPAREN;     }
  ")"                  { return RPAREN;     }
  ";"                  { return SEMICOLON;  }
  ","                  { return COMMA;      }
  "="                  { return EQL;        }
  "|"                  { return OR;         }
  "||"                 { return OR2;        }
  "&"                  { return AND;        }
  "&&"                 { return AND2;       }
  
  "if"                 { return K_IF;       }
  "else"               { return K_ELSE;     }
  "do"                 { return K_DO;       }
  "int"                { return K_INT;      }
  "return"             { return K_RETURN;   }
  "void"               { return K_VOID;     }
  "float"              { return K_FLOAT;    }
  "while"              { return WHILESYM;   }
  
  
  {LETTER}{LETTERDIGIT}* {
  yylval.id = new Identifier(yytext); 
  yylval.id->line = line;
  return(IDENTIFIER);
  }
  
  {VARIABLE}* {
  yylval.lit = new Liter(yytext); 
  yylval.lit->type = t_var;
  return(LITERAL);
  }
  
  
  {SIGN}?{DIGIT}+"."{DIGIT}+ {    
  yylval.lit = new Literal(yytext); 
  yylval.lit->type = t_float;
  return(LITERAL);
  }
  
  {SIGN}?{DIGIT}+ {    
  yylval.lit = new Literal(yytext); 
  yylval.lit->type = t_int32;
  return(LITERAL);
  }
  
  {STRINGCONSTANT} {
  yylval.lit = new Literal(yytext); 
  yylval.lit->type = t_string;
  return(LITERAL);
  }
  
  
  [ \t\n\r]            /* skip whitespace */
  
  .                    { printf("Unknown character [%c]\n",yytext[0]);
						return UNKNOWN;    }
 
	%%
	
int yywrap(void){return 1;}


>So if I understood your response correctly, you meant something like this, right?

Yes, you are now on the right track. Now what happens when you generate your C program and try to compile it? Did you try it? You'll see that you didn't declare LITERAL and IDENTIFIER tokens.


>If so, does the lexer still need yylex() in order to be complete, or not?

Just so you are clear, your lexer IS yylex(). However, you do need to call it explicitly from somewhere, such as your main(), if you want to consume and print tokens. For example:

int token;
while(token = yylex()) {
  printf("lexed token: %d\n", token);
}
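
The loop above depends on yylex() returning 0 at end of input, which is why no token should be given the value 0. As a rough sketch, the same driver can be written as a plain C function that takes the scanner as a function pointer, so it can be tried without lex. The names drain_tokens() and fake_yylex() are hypothetical, and the stand-in scanner simply replays a fixed token list.

```c
#include <assert.h>
#include <stdio.h>

/* Call the scanner until it reports end of input (0).
 * Prints each token and returns how many were seen. */
int drain_tokens(int (*next_token)(void))
{
    int token;
    int count = 0;

    assert(next_token != NULL);
    while ((token = next_token()) != 0) {
        printf("lexed token: %d\n", token);
        count++;
    }
    return count;
}

/* Stand-in scanner for trying the loop without lex:
 * yields 6, then 28, then 0 forever. */
static const int fake_tokens[] = { 6, 28, 0 };
static int fake_pos = 0;

int fake_yylex(void)
{
    int t = fake_tokens[fake_pos];
    if (t != 0)
        fake_pos++;
    return t;
}
```

In the generated scanner you would pass the real function instead, as in drain_tokens(yylex).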

Also, please look at the snippet below that is in your program. This was a sample I gave you in another question, but you added it with no changes to your grammar and expected it to work. You cannot do that. My sample had types such as Identifier, Literal, t_literal, t_float, t_string. You haven't declared/defined those types. Remove all that code for now, so we can fix the lexer. Just leave empty rules with a token return value.

Make SURE to declare your TOKENS! :)


  {LETTER}{LETTERDIGIT}* {
  yylval.id = new Identifier(yytext);  <-- REMOVE LINE
  yylval.id->line = line; <-- REMOVE LINE
  return(IDENTIFIER);
  }
  
  {VARIABLE}* {
  yylval.lit = new Liter(yytext);  <-- REMOVE LINE
  yylval.lit->type = t_var; <-- REMOVE LINE
  return(LITERAL);
  }
  
  
  {SIGN}?{DIGIT}+"."{DIGIT}+ {    
  yylval.lit = new Literal(yytext);  <-- REMOVE LINE
  yylval.lit->type = t_float; <-- REMOVE LINE
  return(LITERAL);
  }
  
  {SIGN}?{DIGIT}+ {    
  yylval.lit = new Literal(yytext);  <-- REMOVE LINE
  yylval.lit->type = t_int32; <-- REMOVE LINE
  return(LITERAL);
  }
  
  {STRINGCONSTANT} {
  yylval.lit = new Literal(yytext);  <-- REMOVE LINE
  yylval.lit->type = t_string; <-- REMOVE LINE
  return(LITERAL);
  }
  


what about now  ?
%{
 
#define COMMENT		1
#define VARIABLE	2
#define INTEGER		3
#define FLOAT		4
#define STRING		5
#define K_IF		6
#define K_ELSE		7
#define K_WHILE		8
#define K_INT		9
#define K_VOID		10
#define K_RETURN	11
#define K_FLOAT		12
#define PLUS		13
#define MINUS		14
#define TIMES		15
#define SLASH		16
#define LPAREN		17
#define RPAREN		18	
#define SEMICOLON	19
#define COMMA		20
#define EQL			21
#define OR			22
#define OR2			23
#define AND			24
#define AND2		25
#define LITERAL	    26
#define IDENTIFIER	27
 
		
	
		
%}
	
	LETTER          [a-zA-Z_]
	DIGIT           [0-9]
	LETTERDIGIT     [a-zA-Z0-9_]
	SIGN            [-+]
	STRINGCONSTANT  \"[^"\n]*["\n]
    CHARCONSTANT    \'[^'\n]*\'
	RANKSPEC        \[[,]*\]
	INTEGER			{digit}+
	VARIABLE        [a-z_]({LETTERDIGIT})*
    COMMENT			"/*""/"*([^*/]|[^*]"/"|"*"[^/])*"*"*"*/"
							  
    %%
							  
  "+"                  { return PLUS;       }
  "-"                  { return MINUS;      }
  "*"                  { return TIMES;      }
  "/"                  { return SLASH;      }
  "("                  { return LPAREN;     }
  ")"                  { return RPAREN;     }
  ";"                  { return SEMICOLON;  }
  ","                  { return COMMA;      }
  "="                  { return EQL;        }
  "|"                  { return OR;         }
  "||"                 { return OR2;        }
  "&"                  { return AND;        }
  "&&"                 { return AND2;       }
  
  "if"                 { return K_IF;       }
  "else"               { return K_ELSE;     }
  "do"                 { return K_DO;       }
  "int"                { return K_INT;      }
  "return"             { return K_RETURN;   }
  "void"               { return K_VOID;     }
  "float"              { return K_FLOAT;    }
  "while"              { return WHILESYM;   }
  
  
  {LETTER}{LETTERDIGIT}* {
  yylval.id = new Identifier(yytext); 
  yylval.id->line = line;
  return(IDENTIFIER);
  }
  
  {VARIABLE}* {
  yylval.lit = new Liter(yytext); 
  yylval.lit->type = t_var;
  return(LITERAL);
  }
  
  {COMMENT} {
	yylval.lit = new Liter(yytext); 
	yylval.lit->type = t_comment;
	return(LITERAL);
  }
  
  {SIGN}?{DIGIT}+"."{DIGIT}+ {    
  yylval.lit = new Literal(yytext); 
  yylval.lit->type = t_float;
  return(LITERAL);
  }
  
  {SIGN}?{DIGIT}+ {    
  yylval.lit = new Literal(yytext); 
  yylval.lit->type = t_int32;
  return(LITERAL);
  }
  
  {STRINGCONSTANT} {
  yylval.lit = new Literal(yytext); 
  yylval.lit->type = t_string;
  return(LITERAL);
  }
  
  
  [ \t\n\r]            /* skip whitespace */
  
  .                    { printf("Unknown character [%c]\n",yytext[0]);
						return UNKNOWN;    }
 
	%%
	
int main(void){
							  int token;
							  while(token = yylex()) {
							  printf("lexed token: %d\n", token);
}
 
int yywrap(void){return 1;}


[skip my previous post]

ok i removed them !
%{
 
#define COMMENT		1
#define VARIABLE	2
#define INTEGER		3
#define FLOAT		4
#define STRING		5
#define K_IF		6
#define K_ELSE		7
#define K_WHILE		8
#define K_INT		9
#define K_VOID		10
#define K_RETURN	11
#define K_FLOAT		12
#define PLUS		13
#define MINUS		14
#define TIMES		15
#define SLASH		16
#define LPAREN		17
#define RPAREN		18	
#define SEMICOLON	19
#define COMMA		20
#define EQL			21
#define OR			22
#define OR2			23
#define AND			24
#define AND2		25
#define LITERAL	    26
#define IDENTIFIER	27
 
		
	
		
%}
	
	LETTER          [a-zA-Z_]
	DIGIT           [0-9]
	LETTERDIGIT     [a-zA-Z0-9_]
	SIGN            [-+]
	STRINGCONSTANT  \"[^"\n]*["\n]
    CHARCONSTANT    \'[^'\n]*\'
	RANKSPEC        \[[,]*\]
	INTEGER			{digit}+
	VARIABLE        [a-z_]({LETTERDIGIT})*
    COMMENT			"/*""/"*([^*/]|[^*]"/"|"*"[^/])*"*"*"*/"
							  
    %%
							  
  "+"                  { return PLUS;       }
  "-"                  { return MINUS;      }
  "*"                  { return TIMES;      }
  "/"                  { return SLASH;      }
  "("                  { return LPAREN;     }
  ")"                  { return RPAREN;     }
  ";"                  { return SEMICOLON;  }
  ","                  { return COMMA;      }
  "="                  { return EQL;        }
  "|"                  { return OR;         }
  "||"                 { return OR2;        }
  "&"                  { return AND;        }
  "&&"                 { return AND2;       }
  
  "if"                 { return K_IF;       }
  "else"               { return K_ELSE;     }
  "do"                 { return K_DO;       }
  "int"                { return K_INT;      }
  "return"             { return K_RETURN;   }
  "void"               { return K_VOID;     }
  "float"              { return K_FLOAT;    }
  "while"              { return WHILESYM;   }
  
  
  {LETTER}{LETTERDIGIT}* {
  return(IDENTIFIER);
  }
  
  {VARIABLE}* {
  return(LITERAL);
  }
  
  {COMMENT} {
	return(LITERAL);
  }
  
  {SIGN}?{DIGIT}+"."{DIGIT}+ {    
  return(LITERAL);
  }
  
  {SIGN}?{DIGIT}+ {    
  return(LITERAL);
  }
  
  {STRINGCONSTANT} {
  return(LITERAL);
  }
  
  
  [ \t\n\r]            /* skip whitespace */
  
  .                    { printf("Unknown character [%c]\n",yytext[0]);
						return UNKNOWN;    }
 
	%%
	
int main(void){
							  int token;
							  while(token = yylex()) {
							  printf("lexed token: %d\n", token);
}
 
int yywrap(void){return 1;}


The integer rule could be :
 
  {SIGN}?{DIGIT}+ {  
        printf("INTEGER\n"); sscanf(yytext,"%d", &(yyval.value));
        return(LITERAL);
  }

Is that right ?
Have you tried compiling your lexer yet?
I haven't tried yet because I don't know how to do that on Mac OS X.
%{
 
#define COMMENT		1
#define VARIABLE	2
#define INTEGER		3
#define FLOAT		4
#define STRING		5
#define T_IF		6
#define T_ELSE		7
#define T_WHILE		8
#define T_INT		9
#define T_VOID		10
#define T_RETURN	11
#define T_FLOAT		12
#define PLUS		13
#define MINUS		14
#define TIMES		15
#define SLASH		16
#define LPAREN		17
#define RPAREN		18	
#define SEMICOLON	19
#define COMMA		20
#define EQL			21
#define OR			22
#define OR2			23
#define AND			24
#define AND2		25
#define LITERAL	    26
#define IDENTIFIER	27
 
		
	
		
%}
	
	LETTER          [a-zA-Z_]
	DIGIT           [0-9]
	LETTERDIGIT     [a-zA-Z0-9_]
	SIGN            [-+]
	STRINGCONSTANT  \"[^"\n]*["\n]
    CHARCONSTANT    \'[^'\n]*\'
	RANKSPEC        \[[,]*\]
	INTEGER			{digit}+
	VARIABLE        [a-z_]({LETTERDIGIT})*
    COMMENT			"/*""/"*([^*/]|[^*]"/"|"*"[^/])*"*"*"*/"
							  
    %%
							  
  "+"                  { return PLUS;       }
  "-"                  { return MINUS;      }
  "*"                  { return TIMES;      }
  "/"                  { return SLASH;      }
  "("                  { return LPAREN;     }
  ")"                  { return RPAREN;     }
  ";"                  { return SEMICOLON;  }
  ","                  { return COMMA;      }
  "="                  { return EQL;        }
  "|"                  { return OR;         }
  "||"                 { return OR2;        }
  "&"                  { return AND;        }
  "&&"                 { return AND2;       }
  
  "if"                 { return T_IF;       }
  "else"               { return T_ELSE;     }
  "do"                 { return T_DO;       }
  "int"                { return T_INT;      }
  "return"             { return T_RETURN;   }
  "void"               { return T_VOID;     }
  "float"              { return T_FLOAT;    }
  "while"              { return T_WHILE;   }
  
  
  {LETTER}{LETTERDIGIT}* {
  return(IDENTIFIER);
  }
  
  {VARIABLE}* { 
	  printf("VARIABLE\n"); yylval.lexeme=(char*)malloc(yyleng+1);
	  strcpy(yyval.lexeme, yytext); 
	  /*return T_VARIABLE*/
	  return(LITERAL);
  }
  
  {COMMENT} {
	return(LITERAL);
  }
  
  {SIGN}?{DIGIT}+"."{DIGIT}+ {   
	  printf("FLOAT\n"); sscanf(yytext,"%d", &(yyval.value));
	  return(LITERAL);
  }
  
  {SIGN}?{DIGIT}+ {  
	  printf("INTEGER\n"); sscanf(yytext,"%d", &(yyval.value));
	  return(LITERAL);
  }
  
  {STRINGCONSTANT} {
  return(LITERAL);
  }
  
  
  [ \t\n\r]            /* skip whitespace */
  
  .                    { printf("Unknown character [%c]\n",yytext[0]);
						return UNKNOWN;    }
 
	%%
	
int main(void){
							  int token;
							  while(token = yylex()) {
							  printf("lexed token: %d\n", token);
}
 
int yywrap(void){return 1;}


Doesn't OS X have standard lex, or flex? And gcc? If you want much more help from me, you'd better figure out how to run your tools, or switch OS ;).

Try this:

> lex mygrammar.l

It should produce an output file; look for a .c file (lex.yy.c by default).

Then do:

> cc lex.yy.c

I ran lex mygrammar.l, but it reported four errors:
mygrammar.l:115: bad character: }
mygrammar.l:117: name defined twice
mygrammar.l:119: bad character: }
mygrammar.l:120: premature EOF

Why do I get these errors?
%{
 
#define COMMENT		1
#define VARIABLE	2
#define INTEGER		3
#define FLOAT		4
#define STRING		5
#define T_IF		6
#define T_ELSE		7
#define T_WHILE		8
#define T_INT		9
#define T_VOID		10
#define T_RETURN	11
#define T_FLOAT		12
#define PLUS		13
#define MINUS		14
#define TIMES		15
#define SLASH		16
#define LPAREN		17
#define RPAREN		18	
#define SEMICOLON	19
#define COMMA		20
#define EQL			21
#define OR			22
#define OR2			23
#define AND			24
#define AND2		25
#define LITERAL	    26
#define IDENTIFIER	27
 
		
	
		
%}
	
	LETTER          [a-zA-Z_]
	DIGIT           [0-9]
	LETTERDIGIT     [a-zA-Z0-9_]
	SIGN            [-+]
	STRINGCONSTANT  \"[^"\n]*["\n]
    CHARCONSTANT    \'[^'\n]*\'
	RANKSPEC        \[[,]*\]
	INTEGER			{digit}+
	VARIABLE        [a-z_]({LETTERDIGIT})*
    COMMENT			"/*""/"*([^*/]|[^*]"/"|"*"[^/])*"*"*"*/"
							  
    %%
							  
  "+"                  { return PLUS;       }
  "-"                  { return MINUS;      }
  "*"                  { return TIMES;      }
  "/"                  { return SLASH;      }
  "("                  { return LPAREN;     }
  ")"                  { return RPAREN;     }
  ";"                  { return SEMICOLON;  }
  ","                  { return COMMA;      }
  "="                  { return EQL;        }
  "|"                  { return OR;         }
  "||"                 { return OR2;        }
  "&"                  { return AND;        }
  "&&"                 { return AND2;       }
  
  "if"                 { return T_IF;       }
  "else"               { return T_ELSE;     }
  "do"                 { return T_DO;       }
  "int"                { return T_INT;      }
  "return"             { return T_RETURN;   }
  "void"               { return T_VOID;     }
  "float"              { return T_FLOAT;    }
  "while"              { return T_WHILE;   }
  
  
  {LETTER}{LETTERDIGIT}* {
  return(IDENTIFIER);
  }
  
  {VARIABLE}* { 
	  printf("VARIABLE\n"); yylval.lexeme=(char*)malloc(yyleng+1);
	  strcpy(yyval.lexeme, yytext); 
	  /*return T_VARIABLE*/
	  return(LITERAL);
  }
  
  {COMMENT} {
	return(LITERAL);
  }
  
  {SIGN}?{DIGIT}+"."{DIGIT}+ {   
	  printf("FLOAT\n"); sscanf(yytext,"%d", &(yyval.value));
	  return(LITERAL);
  }
  
  {SIGN}?{DIGIT}+ {  
	  printf("INTEGER\n"); sscanf(yytext,"%d", &(yyval.value));
	  return(LITERAL);
  }
  
  {STRINGCONSTANT} {
  return(LITERAL);
  }
  
  
  [ \t\n\r]            /* skip whitespace */
  
  .                    { printf("Unknown character [%c]\n",yytext[0]);
						return UNKNOWN;    }
 
	%%
	
int main(void){
							  int token;
							  while(token = yylex()) {
							  printf("lexed token: %d\n", token);
							  }
}
 
int yywrap(void){
	return 1;
}
							  


Your formatting is messed up, and formatting really matters in a grammar like this.

For one, your %% needs to be at the beginning of a line.

You should also not have whitespace (spaces or tabs) at the beginning of a line that holds a definition or a regular expression / rule. Lex treats an indented line in the definitions or rules section as C code to copy into the output, not as a definition or rule.

Edit your whole file and take out the leading whitespace for all definitions and rules, as shown below. I did it on my local copy and lex compiles it.

You have:
 
	LETTER          [a-zA-Z_]
	DIGIT           [0-9]
 
Fix as:
 
LETTER          [a-zA-Z_]
DIGIT           [0-9]
 
 
You have:
 
  {LETTER}{LETTERDIGIT}* {
  return(IDENTIFIER);
  }
 
Fix as:
 
{LETTER}{LETTERDIGIT}* {
  return(IDENTIFIER);
  }

OK, now lex mygrammar.l works! But cc lex.yy.c reports multiple errors:

mygrammar.l: In function yylex:
mygrammar.l:80: error: yylval undeclared (first use in this function)
mygrammar.l:80: error: (Each undeclared identifier is reported only once
mygrammar.l:80: error: for each function it appears in.)
mygrammar.l:81: error: yyval undeclared (first use in this function)
mygrammar.l: At top level:
mygrammar.l:118: error: syntax error before % token


As for yylval and yyval, I tried defining them with #define YYVAL, but it didn't work.


%{
 
#define COMMENT		1
#define VARIABLE	2
#define INTEGER		3
#define FLOAT		4
#define STRING		5
#define T_IF		6
#define T_ELSE		7
#define T_WHILE		8
#define T_INT		9
#define T_VOID		10
#define T_DO		11
#define T_RETURN	12
#define T_FLOAT		13
#define PLUS		14
#define MINUS		15
#define TIMES		16
#define SLASH		17
#define LPAREN		18
#define RPAREN		19	
#define SEMICOLON	20
#define COMMA		21
#define EQL		22
#define OR		23
#define OR2		24
#define AND		25
#define AND2		26
#define LITERAL	        27
#define IDENTIFIER	28
#define UNKNOWN  	29
 
 
		
	
		
%}
	
LETTER          [a-zA-Z_]
DIGIT           [0-9]
LETTERDIGIT     [a-zA-Z0-9_]
SIGN            [-+]
STRINGCONSTANT  \"[^"\n]*["\n]
CHARCONSTANT    \'[^'\n]*\'
RANKSPEC        \[[,]*\]
INTEGER		{digit}+
VARIABLE        [a-z_]({LETTERDIGIT})*
COMMENT		"/*""/"*([^*/]|[^*]"/"|"*"[^/])*"*"*"*/"
							  
%%
							  
"+"                  { return PLUS;       }
"-"                  { return MINUS;      }
"*"                  { return TIMES;      }
"/"                  { return SLASH;      }
"("                  { return LPAREN;     }
")"                  { return RPAREN;     }
";"                  { return SEMICOLON;  }
","                  { return COMMA;      }
"="                  { return EQL;        }
"|"                  { return OR;         }
"||"                 { return OR2;        }
"&"                  { return AND;        }
"&&"                 { return AND2;       }
"if"                 { return T_IF;       }
"else"               { return T_ELSE;     }
"do"                 { return T_DO;       }
"int"                { return T_INT;      }
"return"             { return T_RETURN;   }
"void"               { return T_VOID;     }
"float"              { return T_FLOAT;    }
"while"              { return T_WHILE;   }
  
  
{LETTER}{LETTERDIGIT}* {
 return(IDENTIFIER);
 }
  
{VARIABLE}* { 
 printf("VARIABLE\n"); yylval.lexeme=(char*)malloc(yyleng+1);
 strcpy(yyval.lexeme, yytext); 
 /*return T_VARIABLE*/
 return(LITERAL);
 }
  
{COMMENT} {
 return(LITERAL);
 }
  
{SIGN}?{DIGIT}+"."{DIGIT}+ {   
 printf("FLOAT\n"); sscanf(yytext,"%d", &(yyval.value));
 return(LITERAL);
 }
  
{SIGN}?{DIGIT}+ {  
 printf("INTEGER\n"); sscanf(yytext,"%d", &(yyval.value));
 return(LITERAL);
 }
  
{STRINGCONSTANT} {
 return(LITERAL);
}
  
[ \t\n\r]            /* skip whitespace */
  
.                    { printf("Unknown character [%c]\n",yytext[0]);
		       return UNKNOWN;    }
 
%%
int main(void){
    int token;
    while(token = yylex()) {
    printf("lexed token: %d\n", token);
   }
}
 
%%
 
int yywrap(void){
  return 1;
}


After you fix your whitespace indenting, try to compile the grammar to a .c file, then when you try to compile the .c file, you'll see you have missing declarations.

T_DO is not defined as a token.

yyval is undefined, it is actually yylval that you want, but I recommend commenting out any references to yylval until you start integrating your parser with yacc or bison. yylval comes from yacc, not lex, and unless you define your own yylval structure, it won't exist.
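
If you do want yylval before the parser exists, one option is to declare it yourself. The sketch below is plain C and assumes a union with the two members your rules refer to (value and lexeme). The member types and the helper names set_int_value() and set_lexeme() are assumptions made for illustration; with yacc/bison the type would normally come from the generated header instead. Note that it stores into yylval, not yyval.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the value type yacc/bison would normally generate. */
typedef union {
    int   value;    /* for integer literals */
    char *lexeme;   /* for names: heap-allocated copy of the matched text */
} YYSTYPE;

YYSTYPE yylval;

/* Hypothetical helper for the integer rule: returns 1 on success. */
int set_int_value(const char *text)
{
    assert(text != NULL);
    return sscanf(text, "%d", &yylval.value) == 1;
}

/* Hypothetical helper for the variable rule: copies len bytes of text.
 * Returns 1 on success, 0 if allocation fails. */
int set_lexeme(const char *text, size_t len)
{
    assert(text != NULL);
    yylval.lexeme = malloc(len + 1);
    if (!yylval.lexeme)
        return 0;
    memcpy(yylval.lexeme, text, len);
    yylval.lexeme[len] = '\0';
    return 1;
}
```

In a rule action these would be called as set_int_value(yytext) or set_lexeme(yytext, yyleng). A float rule would need its own member and a floating-point conversion such as "%lf" with a double, rather than "%d".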

I recommend you stop adding or changing code until you can compile the .l file to a .c file, and then compile the .c file to an executable. I have posted the directions; please work through the last 2 posts. I will follow up tomorrow.
The reason I recommend commenting out or removing yylval for now is that you are trying to build a LEXER first. So all you want is to convert the strings in your language to discrete integer tokens. That's the purpose of a lexer. It has to return the token value to a parser so the parser knows what to do.

"do" converts to T_DO (value 11)
"while" converts to T_WHILE (value 8)

You at least want to be able to run your lexer and have it print out all of the token values.

int token;
while(token = yylex())
   printf("TOKEN %d\n", token);
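
To make the string-to-token mapping described above concrete, here is a hand-written sketch in plain C of what the keyword rules amount to. It is only an illustration, not what lex generates. keyword_token() and the table are hypothetical names, and the values are copied from the #define list in the posted grammar.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Token values copied from the #define list in the grammar. */
#define T_IF        6
#define T_ELSE      7
#define T_WHILE     8
#define T_DO        11
#define IDENTIFIER  28

struct keyword {
    const char *text;
    int token;
};

static const struct keyword keywords[] = {
    { "if", T_IF }, { "else", T_ELSE }, { "while", T_WHILE }, { "do", T_DO },
};

/* Hypothetical helper: map one lexeme to its integer token. */
int keyword_token(const char *lexeme)
{
    size_t i;

    assert(lexeme != NULL);
    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (strcmp(lexeme, keywords[i].text) == 0)
            return keywords[i].token;
    }
    return IDENTIFIER;  /* anything else is treated as an identifier here */
}
```

The generated scanner does the same job with a state machine built from your regular expressions, which is why the keyword rules must come before the general identifier rule.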


Once you get that far, you have a working lexer, and it is time to move on to the parser.
I commented out the yylval lines, but I still get the last error:
mygrammar.l:117: error: syntax error before % token

I tried commenting out the int yywrap ... line, but then the link step reported:
Undefined symbols:
  "_yywrap", referenced from:
      _yylex in ccPDTz1b.o
      _input in ccPDTz1b.o
ld: symbol(s) not found
collect2: ld returned 1 exit status


%{
 
#define COMMENT		1
#define VARIABLE	2
#define INTEGER		3
#define FLOAT		4
#define STRING		5
#define T_IF		6
#define T_ELSE		7
#define T_WHILE		8
#define T_INT		9
#define T_VOID		10
#define T_DO		11
#define T_RETURN	12
#define T_FLOAT		13
#define PLUS		14
#define MINUS		15
#define TIMES		16
#define SLASH		17
#define LPAREN		18
#define RPAREN		19	
#define SEMICOLON	20
#define COMMA		21
#define EQL		22
#define OR		23
#define OR2		24
#define AND		25
#define AND2		26
#define LITERAL	        27
#define IDENTIFIER	28
#define UNKNOWN  	29
 
 
		
	
		
%}
	
LETTER          [a-zA-Z_]
DIGIT           [0-9]
LETTERDIGIT     [a-zA-Z0-9_]
SIGN            [-+]
STRINGCONSTANT  \"[^"\n]*["\n]
CHARCONSTANT    \'[^'\n]*\'
RANKSPEC        \[[,]*\]
INTEGER		{digit}+
VARIABLE        [a-z_]({LETTERDIGIT})*
COMMENT		"/*""/"*([^*/]|[^*]"/"|"*"[^/])*"*"*"*/"
							  
%%
							  
"+"                  { return PLUS;       }
"-"                  { return MINUS;      }
"*"                  { return TIMES;      }
"/"                  { return SLASH;      }
"("                  { return LPAREN;     }
")"                  { return RPAREN;     }
";"                  { return SEMICOLON;  }
","                  { return COMMA;      }
"="                  { return EQL;        }
"|"                  { return OR;         }
"||"                 { return OR2;        }
"&"                  { return AND;        }
"&&"                 { return AND2;       }
"if"                 { return T_IF;       }
"else"               { return T_ELSE;     }
"do"                 { return T_DO;       }
"int"                { return T_INT;      }
"return"             { return T_RETURN;   }
"void"               { return T_VOID;     }
"float"              { return T_FLOAT;    }
"while"              { return T_WHILE;   }
  
  
{LETTER}{LETTERDIGIT}* {
 return(IDENTIFIER);
 }
  
{VARIABLE}* { 
// printf("VARIABLE\n"); yylval.lexeme=(char*)malloc(yyleng+1);
// strcpy(yyval.lexeme, yytext); 
 /*return T_VARIABLE*/
 return(LITERAL);
 }
  
{COMMENT} {
 return(LITERAL);
 }
  
{SIGN}?{DIGIT}+"."{DIGIT}+ {   
// printf("FLOAT\n");sscanf(yytext,"%d", &(yyval.value)); 
 return(LITERAL);
 }
  
{SIGN}?{DIGIT}+ {  
// printf("INTEGER\n"); sscanf(yytext,"%d", &(yyval.value)); 
 return(LITERAL);
 }
  
{STRINGCONSTANT} {
 return(LITERAL);
}
  
[ \t\n\r]            /* skip whitespace */
  
.                    { printf("Unknown character [%c]\n",yytext[0]);
		       return UNKNOWN;    }
 
%%
int main(void){
    int token;
    while(token = yylex()) {
    printf("lexed token: %d\n", token);
   }
}
%%
int yywrap(void){return 1;}


ASKER CERTIFIED SOLUTION
mrjoltcola
OK, now it compiles!
Congratulations. You now have a lexer. Now run it and start typing some test strings.

I ran it:

[msmith@vice ~]$ flex grammar.l
[msmith@vice ~]$ gcc lex.yy.c
[msmith@vice ~]$ a.out
[msmith@vice ~]$ ./a.out
if
lexed token: 6
test
lexed token: 28

Make sure, before proceeding:

1) Make a backup of this file! :) If you get it screwed up in the next phase, you can backtrack at least.

2) Every time you make a change or significant addition, test compile it to make sure. Don't make large sweeping changes without compiling incrementally; it is easier to fix problems that way until you become more comfortable with debugging grammar files.

I think you have accomplished your task. If you need help on the next step (parser) then you can open a new question and I'll be happy to help.