Link to home
Start Free TrialLog in
Avatar of DahaR
DahaR

asked on

scanning keywords, identifiers through a file

i am workin on making a scanner , and wanted to know how to scan identifiers , keywords etc from a file (that is opened for reading)...sum1 told me that i have 2 generate tokens and then match those tokens with the keywords , identifiers etc ( already mentioned in the program).. but i dunno how to generate tokens and match them. well i am in need of a huge help here...if any1 of u can come up with a program / sample code or even answer my question ..i will be grateful......i hope u all understand me here !!!
Avatar of bcladd
bcladd

A compiler/interpreter typically runs in phases. I am going to leave out detail here but typically a compiler has the following phases:

(1) Scanner (lexical analyzer): Converts a stream of characters (the input file(s)) into a stream of tokens. A typical interface is through a nextToken() function that is called by the Parser when it needs a new token.

(2) Parser: Matches the stream of tokens against a description of the language to be parsed (a grammar), generating a parse tree or some other intermediate representation.

(3) Back-end: The parser's output of an abstract syntax tree (AST) or some other intermediate form is traversed and processed to produce the target program.

The lexical analyzer is usually in charge of figuring out where the various parts of an expression begin and end. For example, given

  myToy.say("Hello, world!", 3.4);

The scanner's output might look like this:

  IDENT:myToy
  DOT:.
  IDENT:say
  LPAREN:(
  STRING:"Hello, World!"
  COMMA:,
  FLOAT:3.4
  RPAREN:)
  SEMI:;

The parser would then assemble an appropriate AST so that this represents a method call with two parameters. As part of the actions of building the AST (or generating the target program) it might also check to make sure the types of the expressions passed into the function's actual parameters match the types expected by the function's formal parameters.

So that is how it works, in general.

Building a lexical analyzer requires a description of what each type of token looks like. For example, an identiifier might be any string of the form [a-z_][a-z0-9_]* (case sensitive or case insensitive). The scanner also needs a description of what constitutes whitespace (spaces, tabs, newlines, comments) so it can skip from token to token.

Given an identifier, you might need to check whether or not it is a keyword (so you return the appropriate token). This can be done with a table of keywords that you compare against.

You want sample code for a lexan?  The following is the nextToken routine from a sample Lisp lexical analyzer. I am not including the definition of Token but it should be fairly clear from the context that a constructor for Token takes a type, some text, a line number, column number, and the name of the file being scanned.

Token Scanner::nextToken()
{
      // the characters that cannot be part of an identifier (defined in a
      // string so that we can use the find member function to find if a
      // character is one of these characters).
  const static std::string notIdentifierCharacters = "\"'`()[]\\#.;:?";
 

      // character being processed
  char ch = currChar();
 
      // consume whitespace
  while (isspace(ch))
    ch = nextChar();

      // if at end of file, return appropriate token
  if (done_) {
    token_ = Token(Token::EndOfFile, "", line_, bufferOffset_, fname_);
    return token_;
  }

      // Type of Token we have seen so far
  Token::Type tt = Token::None;
      // record starting position of token
  Offset start = bufferOffset_;
  Offset startLine = line_;

      /* The following loop only _looks_ infinite. As each character is
         read, part of the processing looks for the end of a token and
         breaks out of the loop.
      */
  while (true) {
    if (isspace(ch) || done_)
      break;
    else if (ch == '(') {
      if (tt == Token::None) {
        tt = Token::LeftParen;
        nextChar();
      }
      break;
    } else if (ch == ')') {
      if (tt == Token::None) {
        tt = Token::RightParen;
        nextChar();
      }
      break;
    } else if (isdigit(ch) && ((tt == Token::None) || (tt == Token::Number))) {
      tt = Token::Number;
    } else if (notIdentifierCharacters.find(ch) == std::string::npos) {
      tt = Token::Identifier;
    } else if (tt == Token::None) { // saw a non-identifier token
      std::string errorMessage = string("Illegal character '") + ch +"'";
      ErrorManager::report(errorMessage, ErrorManager::Error,
                           fname_, line_, bufferOffset_);
      nextChar();
      break;
    } else {
      break;
    }
    ch = nextChar();
  } // while (true)
 
      // extract the substring associated with the Token
  std::string text = buffer_.substr(start, bufferOffset_-start);

  if (tt == Token::Identifier)
    tt = isReservedWord(text)? Token::Reserved : Token::Identifier;

  token_ = (tt != Token::None) ?
           Token(tt, text, startLine, start+1, fname_) :
           Token(Token::Error, "", line_, start, fname_);
    return token_;
} // nextToken()

Hope this helps, -bcl
Avatar of DahaR

ASKER

can i get a simple scanner PROGRAM/SOURCE CODE/help without using classes
What langauge do you want to scan? What is the structure of a token? If tokens are divided by whitespace how about

string str;
while (cin >> str) {
  // process str
}

I need more information on what you are planning on parsing. There are tools for building lexans from a description of the tokens (flex, lex) and even parsers from descriptions of the grammar (yacc, bison, LALRGen, RecDescent(in Perl)).

-bcl
Avatar of DahaR

ASKER

thanks for answering back...language is c++ and lets say i have a file that contains :

begin
write
x=x1+3;
read
y1=x+4;
end

after going through the whole file(EOF)the keywords,identifiers,digits,special symbols,strings,spaces should be separated and displayed.

by tokens i means that it, ex: will take the word BEGIN as 1 separate token and similarly for all ( ex: x,=,write, etc...)

i hope u got what u wanted to ask....thanks once again..
Avatar of DahaR

ASKER

thanks for answering back...language is c++ and lets say i have a file that contains :

begin
write
x=x1+3;
read
y1=x+4;
end

after going through the whole file(EOF)the keywords,identifiers,digits,operators,special symbols,strings,spaces should be separated and displayed.

by tokens i means that it, ex: will take the word BEGIN as 1 separate token and similarly for all ( ex: x,=,write, etc...)

i hope u got what u wanted to ask....thanks once again..
The language looks more like Pascal than C++. What is important in writing a scanner is the source language to be scanned rather than the implementation language of the scanner. So, let's look at your sample. Note that I have played with the indentation (I am assuming the languge is free form):

begin
    write
    x=x1+3;
    read
    y1=x+4;
end

Looking at this it looks like variables have a form like "an initial letter followed by an optional integer" and the keywords are begin, write, read, end, and the operators are = and + and that semicolons appear sometimes (that is a parsing problem rather than a scanning problem).

Your scanner needs to read one character at a time, simulating a finite automaton that keeps track of the current type of the token. Consider the following (I will use nextChar() to get the next character so that I don't have to worry about the exact input method you're using):

  char ch = nextChar();
  while (isspace(ch))
    ch = nextChar();
  // skip whitespace

  if (ch == '=') {
    return EQUAL_SIGN_TOKEN_VALUE;
  } else if (ch == ';') {
    return SEMICOLON_TOKEN_VALUE;
  ... and so on for single character tokens ...
  } else if (isalpha(ch)) { // we're looking at a word
    /* golbal string */ tokenValue = "";
   
    while (isalpha(ch) || isdigit(ch)) {
      tokenValue += ch;
      ch = nextChar();
    }
    putbackChar(ch); // we read one too many characters so have some way to put it back
   
    // time to determine whether it is a key word or a variable
    if (tokenValue == "begin") {
      return BEGIN_KEYWORD_TOKEN_VALUE;
    } else if (tokenValue == "end") {
      return END_TOKEN_VALUE;
    ... and so on ...
    } else {
      return VARIABLE_TOKEN_VALUE;
    }
  }

The above code goes inside the tokenizing function. It returns the token type. tokenValue contains the text of the last keyword or identifier read.

This is just a sample; you will need to adapt it for your needs. This should put you on the way to understanding how a scanner works.

-bcl
Avatar of DahaR

ASKER

hey thanks , that was just what i wanted....i am using the code and its working....thanks again !!!!
hope 2 keep on communicating with u again ...
Not to be whiny but if the code works, would you be good enough to accept my answer?

Thanks, -bcl
Avatar of DahaR

ASKER

yeh , i'll accept with pleasure...tell me 1 thing....what if we come around a word....like y1 ( thats a identifier ) .with the program u sent and with sum changes...the output i get is that it take "y" as a idenifier and "1" as a digit , separately...how to solve this ambugity......and also what if we come around :=  , will this word take as a whole ( delimiter ) or separate....if yes or no...then how??????
ASKER CERTIFIED SOLUTION
Avatar of bcladd
bcladd

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of DahaR

ASKER

thanks man , can i get a favour here....i have done with my program of this stuff ( i mean ) , i am nearly finished with it....well there are sum anbugites and stuff....cud u mail me such kind of a program if u have written ..i just wanted to look at it...i wud be grateful if u do..: )
The code from above hacked so it compiles and runs:


#include <iostream>
#include <cctype>
#include <string>
using namespace std;

int main()
{
  string tokenValue;
  char ch;
  while (true) {
    ch = cin.get();
   
    while (isspace(ch))
      ch = cin.get();
        // skip whitespace
   
    if (ch == '=') {
      cout << "equals" << endl;
    } else if (ch == ';') {
      cout << "semicolon" << endl;
    } else if (ch == ':') {
      ch = cin.get();
      if (ch == '=') {
        cout << "colon equals" << endl;
      } else {
        cin.putback(ch);
        cout << "colon" << endl;
      }
    } else if (isalpha(ch)) { // we're looking at a word
      /* golbal string */ tokenValue = "";
     
      while (isalpha(ch) || isdigit(ch)) {
        tokenValue += ch;
        ch = cin.get();
      }
      cin.putback(ch);
     
          // time to determine whether it is a key word or a variable
      if (tokenValue == "begin") {
        cout << "keyword begin" << endl;
      } else if (tokenValue == "end") {
        cout << "keyword end" << endl;
      } else {
        cout << "variable:" << tokenValue << endl;
      }
    }
  }
} /** End of main **/

Avatar of DahaR

ASKER

thanks for the help on the lexer thing..i have made a program for it and it works great..now i wanna do the next step , that is the PARSER...i know that in its code we have to call our LEXER function...wud ya give me sum pointers on it and sum sample code part on how to start coding it......thanks
Avatar of DahaR

ASKER

hey hi , remem me i asked ya questions about lexical analyzer and the parser.........well can u give me some sample pointers on symbol table , post fix and code generation........small sample code like last times will also help.............anyways hows it going........take care.........