Still celebrating National IT Professionals Day with 3 months of free Premium Membership. Use Code ITDAY17

x
?
Solved

scanning keywords, identifiers through a file

Posted on 2003-03-09
14
Medium Priority
?
276 Views
Last Modified: 2010-07-27
i am workin on making a scanner , and wanted to know how to scan identifiers , keywords etc from a file (that is opened for reading)...sum1 told me that i have 2 generate tokens and then match those tokens with the keywords , identifiers etc ( already mentioned in the program).. but i dunno how to generate tokens and match them. well i am in need of a huge help here...if any1 of u can come up with a program / sample code or even answer my question ..i will be grateful......i hope u all understand me here !!!
0
Comment
Question by:DahaR
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 8
  • 6
14 Comments
 
LVL 11

Expert Comment

by:bcladd
ID: 8098670
A compiler/interpreter typically runs in phases. I am going to leave out detail here but typically a compiler has the following phases:

(1) Scanner (lexical analyzer): Converts a stream of characters (the input file(s)) into a stream of tokens. A typical interface is through a nextToken() function that is called by the Parser when it needs a new token.

(2) Parser: Matches the stream of tokens against a description of the language to be parsed (a grammar), generating a parse tree or some other intermediate representation.

(3) Back-end: The parser's output of an abstract syntax tree (AST) or some other intermediate form is traversed and processed to produce the target program.

The lexical analyzer is usually in charge of figuring out where the various parts of an expression begin and end. For example, given

  myToy.say("Hello, world!", 3.4);

The scanner's output might look like this:

  IDENT:myToy
  DOT:.
  IDENT:say
  LPAREN:(
  STRING:"Hello, World!"
  COMMA:,
  FLOAT:3.4
  RPAREN:)
  SEMI:;

The parser would then assemble an appropriate AST so that this represents a method call with two parameters. As part of the actions of building the AST (or generating the target program) it might also check to make sure the types of the expressions passed into the function's actual parameters match the types expected by the function's formal parameters.

So that is how it works, in general.

Building a lexical analyzer requires a description of what each type of token looks like. For example, an identiifier might be any string of the form [a-z_][a-z0-9_]* (case sensitive or case insensitive). The scanner also needs a description of what constitutes whitespace (spaces, tabs, newlines, comments) so it can skip from token to token.

Given an identifier, you might need to check whether or not it is a keyword (so you return the appropriate token). This can be done with a table of keywords that you compare against.

You want sample code for a lexan?  The following is the nextToken routine from a sample Lisp lexical analyzer. I am not including the definition of Token but it should be fairly clear from the context that a constructor for Token takes a type, some text, a line number, column number, and the name of the file being scanned.

Token Scanner::nextToken()
{
      // the characters that cannot be part of an identifier (defined in a
      // string so that we can use the find member function to find if a
      // character is one of these characters).
  const static std::string notIdentifierCharacters = "\"'`()[]\\#.;:?";
 

      // character being processed
  char ch = currChar();
 
      // consume whitespace
  while (isspace(ch))
    ch = nextChar();

      // if at end of file, return appropriate token
  if (done_) {
    token_ = Token(Token::EndOfFile, "", line_, bufferOffset_, fname_);
    return token_;
  }

      // Type of Token we have seen so far
  Token::Type tt = Token::None;
      // record starting position of token
  Offset start = bufferOffset_;
  Offset startLine = line_;

      /* The following loop only _looks_ infinite. As each character is
         read, part of the processing looks for the end of a token and
         breaks out of the loop.
      */
  while (true) {
    if (isspace(ch) || done_)
      break;
    else if (ch == '(') {
      if (tt == Token::None) {
        tt = Token::LeftParen;
        nextChar();
      }
      break;
    } else if (ch == ')') {
      if (tt == Token::None) {
        tt = Token::RightParen;
        nextChar();
      }
      break;
    } else if (isdigit(ch) && ((tt == Token::None) || (tt == Token::Number))) {
      tt = Token::Number;
    } else if (notIdentifierCharacters.find(ch) == std::string::npos) {
      tt = Token::Identifier;
    } else if (tt == Token::None) { // saw a non-identifier token
      std::string errorMessage = string("Illegal character '") + ch +"'";
      ErrorManager::report(errorMessage, ErrorManager::Error,
                           fname_, line_, bufferOffset_);
      nextChar();
      break;
    } else {
      break;
    }
    ch = nextChar();
  } // while (true)
 
      // extract the substring associated with the Token
  std::string text = buffer_.substr(start, bufferOffset_-start);

  if (tt == Token::Identifier)
    tt = isReservedWord(text)? Token::Reserved : Token::Identifier;

  token_ = (tt != Token::None) ?
           Token(tt, text, startLine, start+1, fname_) :
           Token(Token::Error, "", line_, start, fname_);
    return token_;
} // nextToken()

Hope this helps, -bcl
0
 
LVL 1

Author Comment

by:DahaR
ID: 8105404
can i get a simple scanner PROGRAM/SOURCE CODE/help without using classes
0
 
LVL 11

Expert Comment

by:bcladd
ID: 8105466
What langauge do you want to scan? What is the structure of a token? If tokens are divided by whitespace how about

string str;
while (cin >> str) {
  // process str
}

I need more information on what you are planning on parsing. There are tools for building lexans from a description of the tokens (flex, lex) and even parsers from descriptions of the grammar (yacc, bison, LALRGen, RecDescent(in Perl)).

-bcl
0
What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

 
LVL 1

Author Comment

by:DahaR
ID: 8106564
thanks for answering back...language is c++ and lets say i have a file that contains :

begin
write
x=x1+3;
read
y1=x+4;
end

after going through the whole file(EOF)the keywords,identifiers,digits,special symbols,strings,spaces should be separated and displayed.

by tokens i means that it, ex: will take the word BEGIN as 1 separate token and similarly for all ( ex: x,=,write, etc...)

i hope u got what u wanted to ask....thanks once again..
0
 
LVL 1

Author Comment

by:DahaR
ID: 8106571
thanks for answering back...language is c++ and lets say i have a file that contains :

begin
write
x=x1+3;
read
y1=x+4;
end

after going through the whole file(EOF)the keywords,identifiers,digits,operators,special symbols,strings,spaces should be separated and displayed.

by tokens i means that it, ex: will take the word BEGIN as 1 separate token and similarly for all ( ex: x,=,write, etc...)

i hope u got what u wanted to ask....thanks once again..
0
 
LVL 11

Expert Comment

by:bcladd
ID: 8107261
The language looks more like Pascal than C++. What is important in writing a scanner is the source language to be scanned rather than the implementation language of the scanner. So, let's look at your sample. Note that I have played with the indentation (I am assuming the languge is free form):

begin
    write
    x=x1+3;
    read
    y1=x+4;
end

Looking at this it looks like variables have a form like "an initial letter followed by an optional integer" and the keywords are begin, write, read, end, and the operators are = and + and that semicolons appear sometimes (that is a parsing problem rather than a scanning problem).

Your scanner needs to read one character at a time, simulating a finite automaton that keeps track of the current type of the token. Consider the following (I will use nextChar() to get the next character so that I don't have to worry about the exact input method you're using):

  char ch = nextChar();
  while (isspace(ch))
    ch = nextChar();
  // skip whitespace

  if (ch == '=') {
    return EQUAL_SIGN_TOKEN_VALUE;
  } else if (ch == ';') {
    return SEMICOLON_TOKEN_VALUE;
  ... and so on for single character tokens ...
  } else if (isalpha(ch)) { // we're looking at a word
    /* golbal string */ tokenValue = "";
   
    while (isalpha(ch) || isdigit(ch)) {
      tokenValue += ch;
      ch = nextChar();
    }
    putbackChar(ch); // we read one too many characters so have some way to put it back
   
    // time to determine whether it is a key word or a variable
    if (tokenValue == "begin") {
      return BEGIN_KEYWORD_TOKEN_VALUE;
    } else if (tokenValue == "end") {
      return END_TOKEN_VALUE;
    ... and so on ...
    } else {
      return VARIABLE_TOKEN_VALUE;
    }
  }

The above code goes inside the tokenizing function. It returns the token type. tokenValue contains the text of the last keyword or identifier read.

This is just a sample; you will need to adapt it for your needs. This should put you on the way to understanding how a scanner works.

-bcl
0
 
LVL 1

Author Comment

by:DahaR
ID: 8118441
hey thanks , that was just what i wanted....i am using the code and its working....thanks again !!!!
hope 2 keep on communicating with u again ...
0
 
LVL 11

Expert Comment

by:bcladd
ID: 8139896
Not to be whiny but if the code works, would you be good enough to accept my answer?

Thanks, -bcl
0
 
LVL 1

Author Comment

by:DahaR
ID: 8147672
yeh , i'll accept with pleasure...tell me 1 thing....what if we come around a word....like y1 ( thats a identifier ) .with the program u sent and with sum changes...the output i get is that it take "y" as a idenifier and "1" as a digit , separately...how to solve this ambugity......and also what if we come around :=  , will this word take as a whole ( delimiter ) or separate....if yes or no...then how??????
0
 
LVL 11

Accepted Solution

by:
bcladd earned 750 total points
ID: 8148615
To parse two character tokens, look ahead on the first character to see if the right second charcter is there, something like this:

if (ch == ':') {
  ch = nextChar();
  if (ch == '=')
    return COLON_EQUALS_TOKEN_VALUE;
  else {
    putBackChar(ch); // it isn't :=, it is ':' and something else
    return COLON_TOKEN_VALUE;
  }
}

I typed in the code and it parses y1 as y1, a single identifier rather than as two separate tokens; the problem is most likely with the while loop in the variable branch of the if...not sure but the code works for me (cut, pased, and changed return to cout).

-bcl
0
 
LVL 1

Author Comment

by:DahaR
ID: 8150584
thanks man , can i get a favour here....i have done with my program of this stuff ( i mean ) , i am nearly finished with it....well there are sum anbugites and stuff....cud u mail me such kind of a program if u have written ..i just wanted to look at it...i wud be grateful if u do..: )
0
 
LVL 11

Expert Comment

by:bcladd
ID: 8170684
The code from above hacked so it compiles and runs:


#include <iostream>
#include <cctype>
#include <string>
using namespace std;

int main()
{
  string tokenValue;
  char ch;
  while (true) {
    ch = cin.get();
   
    while (isspace(ch))
      ch = cin.get();
        // skip whitespace
   
    if (ch == '=') {
      cout << "equals" << endl;
    } else if (ch == ';') {
      cout << "semicolon" << endl;
    } else if (ch == ':') {
      ch = cin.get();
      if (ch == '=') {
        cout << "colon equals" << endl;
      } else {
        cin.putback(ch);
        cout << "colon" << endl;
      }
    } else if (isalpha(ch)) { // we're looking at a word
      /* golbal string */ tokenValue = "";
     
      while (isalpha(ch) || isdigit(ch)) {
        tokenValue += ch;
        ch = cin.get();
      }
      cin.putback(ch);
     
          // time to determine whether it is a key word or a variable
      if (tokenValue == "begin") {
        cout << "keyword begin" << endl;
      } else if (tokenValue == "end") {
        cout << "keyword end" << endl;
      } else {
        cout << "variable:" << tokenValue << endl;
      }
    }
  }
} /** End of main **/

0
 
LVL 1

Author Comment

by:DahaR
ID: 8190163
thanks for the help on the lexer thing..i have made a program for it and it works great..now i wanna do the next step , that is the PARSER...i know that in its code we have to call our LEXER function...wud ya give me sum pointers on it and sum sample code part on how to start coding it......thanks
0
 
LVL 1

Author Comment

by:DahaR
ID: 8634728
hey hi , remem me i asked ya questions about lexical analyzer and the parser.........well can u give me some sample pointers on symbol table , post fix and code generation........small sample code like last times will also help.............anyways hows it going........take care.........
0

Featured Post

Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Go is an acronym of golang, is a programming language developed Google in 2007. Go is a new language that is mostly in the C family, with significant input from Pascal/Modula/Oberon family. Hence Go arisen as low-level language with fast compilation…
Container Orchestration platforms empower organizations to scale their apps at an exceptional rate. This is the reason numerous innovation-driven companies are moving apps to an appropriated datacenter wide platform that empowers them to scale at a …
The goal of the tutorial is to teach the user how to use functions in C++. The video will cover how to define functions, how to call functions and how to create functions prototypes. Microsoft Visual C++ 2010 Express will be used as a text editor an…
The viewer will learn additional member functions of the vector class. Specifically, the capacity and swap member functions will be introduced.
Suggested Courses

721 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question