Building a compiler

Posted on 2004-03-29
Last Modified: 2008-02-01

I'm in the middle of contruction a compiler for new programming language.

I have managed to build a basic compiler consisting of the following:


#include <stdio.h>
#include <ctype.h>

#define BSIZE 128
#define NONE -1
#define EOS '\0'

#define NUM 256
#define DIV 257
#define MOD 258
#define ID 259
#define DONE 260

int tokenval;
int lineno;

struct entry {
  char *lexptr;
  int token;

struct entry symtable[100];


#include "global.h"

emit(t, tval)
     int t, tval;
  switch(t) {
  case '+': case '-': case '*': case '/':
    printf("%c\n", t); break;
  case DIV:
    printf("DIV\n"); break;
  case NUM:
    printf("%d\n", tval); break;
  case ID:
    printf("%s\n", symtable[tval].lexptr); break;
    printf("token %d, tokenval %d\n", t, tval);


#include "global.h"

     char *m;
  fprintf(stderr, "line %d: %s\n", lineno, m);


#include "global.h"

struct entry keywords[] = {
  "div", DIV,
  "mod", MOD,
  0, 0

  struct entry *p;
  for (p = keywords; p->token; p++)
    insert(p->lexptr, p->token);

#include "global.h"

char lexbuf[BSIZE];
int lineno = 1;
int tokenval = NONE;

int lexan()
  int t;
  while(1) {
    t = getchar();
    if (t == ' ' || t == '\t');
    else if (t == '\n')
      lineno = lineno + 1;
    else if (isdigit(t)) {
      ungetc(t, stdin);
      scanf("%d", &tokenval);
      return NUM;
    else if (isalpha(t)) {
      int p, b = 0;
      while (isalnum(t)) {
      lexbuf[b] = t;
      t = getchar();
      b = b + 1;
      if (b >= BSIZE)
        error("We have a compiler error ha ha");
      lexbuf[b] = EOS;
      if (t != EOF)
      ungetc (t, stdin);
      p = lookup(lexbuf);
      if( p == 0)
      p = insert(lexbuf, ID);
      tokenval = p;
      return symtable[p].token;
    else if ( t == EOF)
      return DONE;

    else {
      tokenval = NONE;
      return t;

#include "global.h"


#include "global.h"

int lookahead;

  lookahead = lexan();
  while (lookahead != DONE ) {
    expr(); match(';');
  int t;
    switch (lookahead) {
    case '+': case '-':
      t = lookahead;
      match(lookahead); term(); emit(t, NONE);
  int t;
    switch(lookahead) {
    case '*': case '/': case DIV: case MOD:
      t = lookahead;
      match(lookahead); factor(); emit(t,NONE);
  switch(lookahead) {
  case '(':
    match('('); expr(); match(')'); break;
  case NUM:
    emit(NUM, tokenval); match(NUM); break;
  case ID:
    emit(ID, tokenval); match(ID); break;
    error("syntax error ha ha");
     int t;
  if (lookahead == t)
    lookahead = lexan();
  else error("syntax error again haha");


#include "global.h"

#define STRMAX 999
#define SYMMAX 100

char lexemes[STRMAX];
int lastchar = -1;
struct entry symtable[SYMMAX];
int lastentry = 0;

int lookup(s)
     char s[];
  int p;
  for (p = lastentry; p > 0; p = p - 1)
    if (strcmp(symtable[p].lexptr, s) == 0)
      return p;
  return 0;
int insert(s, tok)
char s[];
int tok;
  int len;
  len = strlen(s);
  if (lastentry + 1 >= SYMMAX)
    error("symbol table full");
  if (lastchar + len + 1 >= STRMAX)
    error("lexemes array is full");
  lastentry = lastentry + 1;
  symtable[lastentry].token = tok;
  symtable[lastentry].lexptr = &lexemes[lastchar + 1];
  lastchar = lastchar + len + 1;
  strcpy(symtable[lastentry].lexptr, s);
  return lastentry;


I have a couple of questions: How do I make the compiler read-in keywords and statement from a specific file-format .oebj ?

Secondly then programming I have noticed that keywords are highlighted after they have been typed-in. Do I need implement anything specific in-order to get this to work in my programming language ?

Sincerely Frank
Question by:Frank-22
  • 4
  • 3
  • 2
  • +2

Accepted Solution

ankuratvb earned 84 total points
ID: 10709970
>I have a couple of questions: How do I make the compiler read-in keywords and >statement from a specific file-format .oebj ?

There are plenty of ways.After all,u will be the one storing the keyword in the .oebj initially
(or is it someone else's),One suggestion,i u want to search this file for keywords,keep this
file sorted,then u can use bsearch() to do a binary search which is much faster.

>Secondly then programming I have noticed that keywords are highlighted after they have >been typed-in. Do I need implement anything specific in-order to get this to work in my >programming language ?

In ur IDE,u just have to check that the typed word is a keyword or not,if it is,change its color.

What u can do is whenever there is a space pressed,check backwards to the last space or till the start of line and store the word in between.Then ,search this word whether its a keyword or not.If it is ,use gotoxy() or its equivalent function in ur environment(one exists for most of them) since u know the lengtrh of the word,u can easily calc. the offset to go back.Then change the foreground color,print the word and set the foreground color back to default.

Of course,this is just a simple explanation,it will require a few more checks(for e.g. when the user presses backspace on a keyword,remove its highlight etc)
LVL 45

Expert Comment

ID: 10710018
Hi Frank,

Compiler construction is one of the few frontiers left in the software development world.  And the available tools make it far less challenging (and fun) than it used to be.

If you'll post the "specific file format" we'll be glad to offer suggestions.


Author Comment

ID: 10710826

The file-format I would like to use for my programming language is .oeb .


LVL 45

Expert Comment

ID: 10713136
Hi Frank,

Which "oeb" format are you trying to use?  I know of at least three different formats that use the .oeb extension.  (I assume that you mean Open EBook, but I'd like to make sure before wandering into the deep end of the pool.)


Author Comment

ID: 10715574
Hi again,

i didn't that that file-extension was in use. I'm implementing a theoretical programming language which has not been implemented before. Therefore I find neccersery to come up with a completely new file-extension.
I found that .oebj isn't in use so I choosen that on insteed.

What I would like to be able to do is the following:

I have compiled all the files shown in my intial post into one compiler-program called "coebj.exe".  I would like to be able to write "coebj test-file.oebj" . This would compile the imaginary source-file test-file.oebj into a .exe program.

Any ideers on how I do that ?  



Comprehensive Backup Solutions for Microsoft

Acronis protects the complete Microsoft technology stack: Windows Server, Windows PC, laptop and Surface data; Microsoft business applications; Microsoft Hyper-V; Azure VMs; Microsoft Windows Server 2016; Microsoft Exchange 2016 and SQL Server 2016.

LVL 45

Expert Comment

ID: 10715937

Now I better understand.

Your code looks a lot more like a parser than a compiler.  (The parser, of course, comes first.)  I don't see anything that indicates any type of code generation.

Code generation is, of course, machine dependent.  If you're writing to an Intel or AMD processor you'll generate one set of instructions (binary data).  If you're writing to IBM mainframe, IBM RS6000, HP PA-RISC, or other processor you'll have to generate instructions for those processors.

Are you familiar with code generation?


Assisted Solution

timbauer earned 83 total points
ID: 10716691
Hi Frank,
I have written a toy compiler before (for a class). It was quite fun.
I don't know what your skill is in this field, so I apologize if I include
information you already know.

Compilers are generally split into two halves.
The "front-end" is the lexer, parser, and part of the optimizer.
The "back-end" is the actual code generation and platform specific instructions (assembly).

I would recommending spitting the project into these two parts.
To me the fun part was the front end.
The back end requires one to know, not only assembly code and how to assembly instructions
for your platform, but executable file formats and info on linkers and loaders so you can assemble
a properly formatted executable. This can also be fun (call me a masochist),
but its definitely a different flavor from the front end stuff.

If you are going to start on the 'front end" I would recommend compiling to a "psuedo-assmebly" text file.
A commonly accepted "psuedo-compiled" form is called a "quad".
Three addresses and an operand (Aho et al).
The pseudocode should be super simple and and very easily translated into any platform specific
code later. You can output it to a text file so it is easily readable.

"ADD A, B, C" might mean add variable A to B and store into C.

A label just might be a line like.

Jump to L1 if A is 0 otherwise continue on
"JZ L1"

So your compiler would change high-level code like
a = b + c + d
if( a!=0 )
  a = 1

into text like:
//a = b + c + d
// if a==0 we want to jump OVER the if block

Then you might make a simple interpreter to test your code.
I would include a simple "PRINT" instruction too help here.
PRINT X might print the value of X. Your interpreter could
just call "printf".
The format I have here is quite arbitrary and completely up to you.

For the back end, all you would have to do is write an
assembler to take your quads and convert them to machine specific code.

I recommend tsplit approach because it is an awful lot of work to attack both problems at once.
This simple approach allows you to focus on the parts of the project you like
without wasting time on the parts you feel to be boring.

A venerable book on the topic is,
"Compilers: Principles, Techniques, and Tools" by Aho, Sethi, Ullman

However, much of this infomration is probably accessible on the web.

Good Luck,
- Tim

Author Comment

ID: 10718416
Hi Kdo,

I'm a bit familiar with code-generation. To make easier I would like to make my programming language run Intel, AMD machines.

As of now the code lack some basic - abilities such as recognizing characters, numbers, deliminators and basic keywords such as "if, else, while, do" etc.

How and where do I define this ?

Do I define this the parser.c file ?

Sincerely Frank


Assisted Solution

gkatz earned 83 total points
ID: 10719116
Hi Frank,
   timbauer started to mention it above the portion of the compilier that you are creating above is the lexer.  In order to create a true language you need to create a grammar for the language.  The grammar is more than having an order for the words its creating different types of words and the grammar structure behind them.  For instance a parser might look at a line of code below and say:

line of code : a = b+c;

the parser would see
var = var + var ;

the grammar parser might have a grammar that looks like this:
assignment => var '='  exp ';'
exp => exp + exp | var

so it would see a tree that looks like this
             /     |      \      \
         var    =     exp      ;
                        /  |  \
                    exp  +  exp
                    |             |
                  var           var

This would allow the compilier to understand a wide range of expressions as opposed to just understanding the specific statements that you lay out for it.  
Writing grammars are not easy and writing a grammar parser is difficult too.  Luckily there are languages such as lex and yacc which take much of the work out of this.  
If you are serious about creating an actual new language then it is necessary to create a grammar for the language also.  Once you have the grammar created it is easy to have it locate key words (although this can be done in a parser also) and keep track of indentations etc.  There are many good books on creating programming languages.  If you would like suggestions please let me know.  

good luck

Expert Comment

ID: 10719336
You can handcraft the parser if the grammar is not too compilcated.
The most popular way of doing that is to write a "recursive descent parser".
You may be able to find an online working example.


Author Comment

ID: 10726067

I have almost written my grammar maybe You Guys can help and maybe route out error if there are any.

Here are the the basics:

0,9 -> int

aZ, Az -> char

+,-,*/,=,<,>, =>, <=,^ .   -> Arithmetic operators.

(),[],{} , | -> Deliminators.

/{ \}  -> Comments (Must be ignored by the compiler).


a + b = c      
a  - b  |    
a * b  |      
a / b   |    
a + b < c
a  - b  |    
a * b  |      
a / b   |    

a + b > c
a - b  |
a * b |
a / b  |

(a +b) +c  = d  -> Evaluating Algebraic properties of regular expressions.
(a - b) * c |
a * (b - c) |

 X := expr; -> A variable is declaired with X as its name seperated by a := followed by the type of the variable and then its value and finally ';' that indicates the end of variable declairation.

a :  = type b;  -> Declairing a variable of a specific type (Int).

b:  =  type 'x';  -> Declairing a variable of  a specific type (Char).

c : =  type "y";   -> Declairing a variable of a specific type (String).

If an expression is starts with a '(' or '[' a scanning for ')' is conducted.

Reserved keywords are: if, then, else, while, do, begin, end, Func, Proc, Bool, Int, String.

This is my grammar as of now. Its lacks procedures, if,then,else , while, do.

Anybody get a good surgestion on how to do a grammar for those ?

Again if the are any inconsistensies in my grammar please correct me !



Featured Post

Netscaler Common Configuration How To guides

If you use NetScaler you will want to see these guides. The NetScaler How To Guides show administrators how to get NetScaler up and configured by providing instructions for common scenarios and some not so common ones.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Summary: This tutorial covers some basics of pointer, pointer arithmetic and function pointer. What is a pointer: A pointer is a variable which holds an address. This address might be address of another variable/address of devices/address of fu…
Windows programmers of the C/C++ variety, how many of you realise that since Window 9x Microsoft has been lying to you about what constitutes Unicode ( They will have you believe that Unicode requires you to use…
The goal of this video is to provide viewers with basic examples to understand recursion in the C programming language.
The goal of this video is to provide viewers with basic examples to understand how to use strings and some functions related to them in the C programming language.

863 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

18 Experts available now in Live!

Get 1:1 Help Now