A C Compiler

Introduction

This is a C compiler that performs some lexical, syntax and semantic error checking with the help of Flex (lexer) and Bison (YACC parser) and then converts the C code to 8086 assembly language code. It is not a complete C compiler but covers most of the basic features of the language. For more details, see here.

Lexical Analyser

Lexer returns the following tokens to the parser:

Keywords

  Matched lexeme  :       Returned Token

  if              :       IF
  else            :       ELSE
  for             :       FOR
  while           :       WHILE
  do              :       DO
  break           :       BREAK
  int             :       INT
  char            :       CHAR
  float           :       FLOAT
  double          :       DOUBLE
  void            :       VOID
  return          :       RETURN
  switch          :       SWITCH
  case            :       CASE
  default         :       DEFAULT
  continue        :       CONTINUE

Constants

  Returned Tokens

  CONST_INT
  CONST_FLOAT
  CONST_CHAR

Operators and punctuators

  Matched Lexeme          :       Returned Token

  +, -                    :       ADDOP           *
  *, /, %                 :       MULOP           *
  ++, --                  :       INCOP           *
  <, <=, >, >=, ==, !=    :       RELOP           *
  =                       :       ASSIGNOP
  &&, ||                  :       LOGICOP         *
  !                       :       NOT
  (                       :       LPAREN
  )                       :       RPAREN
  {                       :       LCURL
  }                       :       RCURL
  [                       :       LTHIRD
  ]                       :       RTHIRD
  ,                       :       COMMA
  ;                       :       SEMICOLON

Identifiers

  ID                                              *

Strings
```
  STRING
```
Whitespaces and comments are identified by the lexer but these are not passed to the parser.
Lexer also counts the line numbers when it finds a newline.

Tokens with * are passed as SymbolInfo objects to the parser. The SymbolInfo contains <matched lexeme, returned token> of the lexeme

Lexical Errors:

Detect lexical errors in the source program and reports it along with corresponding line no. Detects the following type of errors:

Too many decimal point error for character sequence like 1.2.345
Ill formed number such as 1E10.7
Invalid Suffix on numeric constant or invalid prefix on identifier for character sequence like 12abcd
Multi character constant error for character sequence like ‘ab’
Unfinished character such as ‘a , ‘\n or ‘\’
Empty character constant error ''
Unfinished string like "this is an unfinished string
Unfinished comment like /* This is an unfinished comment
Unrecognized character (Any character that does not match any defined regular expressions)
Also counts the total number of errors.

back to top

Syntax Analyser

Our chosen subset of the C language has the following characteristics:

There can be multiple functions. No two functions will have the same name. A function needs to be defined or declared before it is called. Also, a function and a global variable cannot have the same symbol.
There will be no pre-processing directives like #include or #define.
Variables can be declared at suitable places inside a function. Variables can also be declared in the global scope.
Precedence and associativity rules are as per standard. Although we will ignore consecutive logical operators or consecutive relational operators like, a && b && c, a < b < c.
No break statement and switch-case statement will be used.
println(int n) is used instead of printf(“%d\n”, n) to simplify the analysis.

Error recovery:

Some common syntax errors are handled and recovered so that the parser does not stop parsing immediately after recongnizing an error.

back to top

Semantic Analyser

Following semantics are checked in the compiler:

Type Checking

Generates error message if operands of an assignment operator are not consistent with each other. The second operand of the assignment operator will be an expression that may contain numbers, variables, function calls, etc.
Generates an error message if the index of an array is not an integer.
Both the operands of the modulus operator should be integers.
A void function cannot be called as a part of an expression.

Type Conversion

Conversion from float to integer in any expression generates an error. Also, the result of RELOP and LOGICOP operations are integers.

Uniqueness Checking

Checks whether a variable used in an expression is declared or not. Also, checks whether there are multiple declarations of variables with the same ID in the same scope.

Array Index

Checks whether there is an index used with array and vice versa.

Function Parameters

Check whether a function is called with appropriate number of parameters with appropriate types. Function definitions should also be consistent with declaration if there is any. Besides that, a function call cannot be made with non-function type identifier.

Non terminal data types used:

Used to check the consistancy of different terms and expressions with variable types.

Trivial data types:

  CONST_INT
  CONST_FLOAT
  CONST_CHAR
  CONST_INT*
  CONST_FLOAT*
  CONST_CHAR*

Other data types: UNDEC (If an ID is found which has never been declared) ERROR (If an expression contains error/s) FUNC_VOID (If the return type of a function is void. This is only used in function calls to check if a void function is used in an expression or not) VARIABLE (Any other type of non terminals.)

back to top

Intermediate Code Generation

After the syntax analyser and the semantic analyser confirms that the source program is correct, the compiler generates the intermediate code. Ideally a three address code is generated in real life compilers. But we have used 8086 Assembly Language as our intermediate code so that we can run it in emu 8086 and justify that our compilation is correct.

On the fly intermediate code generation strategy:

The Algorithm:

We have generated the intermediate code on the fly. Which means that, instead of using any data structure and passing the whole code one after another to the production rules of the grammar, we have generated the intermediate code as soon as we match a rule and write it in the code.asm file. To do that, we have to use the PUSH and POP instructions in the assembly code which utilize the stack.

Let's understand the algorithm with an example. Let's say we have a grammar like this:
```
    E -> E + T
    E -> T
    T -> T * F
    T -> F
    F -> id
    F -> digit
```

While bison evaluates any string like a + 2 * c the lexer will first read the string and generate the tokens like id, +, digit, *, id. So the order of the matched rules will be as follows:

  a found : F -> id reduced
  + found : nothing reduced (no rules matched yet)
  2 found : F -> digit reduced
  * found : nothing reduced (no rules matched yet)
  c found : F -> id reduced

At first only the production rules of F matched. So, we are going to push the id and digit to the stack of 8086.

        PUSH a
        PUSH 2
        PUSH c

The stack will look like this now:

    SP -> c
          2
          a

    * here SP points to the top of the stack

Now that all the lexemes are matched, the parser will match the production rules of E and T as follows:
As '2' and 'a' reduced to 'F' previously, now, for the F of '2', T -> F is reduced. Now, it's finally the time to match the rule corresponding to '*'
```
    2 * c found : T -> T * F reduced
```
As this rule has been matched, we are going to pop the ID c and digit 2 from the stack, evaluate the multiplication operation and push the result to the stack.

        POP BX          ;this will pop c from the stack
        POP AX          ;this will pop 2 from the stack
        MUL BX          ;this will multiply 2 with c and store the result in AX
        PUSH AX         ;this will push the result of 2 * c to the stack

Now the stack will look like this:
```
  SP -> 2*c
        a
```
Similarly E -> E + T will now be reduced. So, we can pop the last two values from the stack and simply add them.

        POP BX          ;this will pop 2*c from the stack
        POP AX          ;this will pop a from the stack
        ADD AX, BX      ;this will add a and 2*c and store the result in AX
        PUSH AX         ;this will push the result of a+2*c  to the stack

Note 1: After the whole expression is evaluated, the value of the expression is already pushed in the stack. So, we can use it in our next expression if needed or we can just pop it. For example, if we had d = a + 2 * c, we could use the value of a + 2 * c in d by popping it out. The stack will look like this:
```
  SP -> a + 2*c
        d

  * here 'd' was pushed because it matched F -> id first. But this value of 'd' is useless. So, we can pop it out.
```
Here is the required asm code to evaluate the expression

        POP BX          ;this will pop a+2*c from the stack
        POP AX          ;this will pop the (useless) value of d from the stack
        MOV d, BX       ;this will store the value of a+2*c in d
        PUSH d         ;this will push the value of d to the stack

The stack will look like this now:
```
  SP -> d
```
Note2: With the above algorithm, we can evaluate any expression. But the catch is that, there is always an extra PUSH at the end of every expression. So, when we don't need that value, we can just pop it out. i.e; when we get a SEMICOLON ; after the expression above, that will mean that the expression is over. For the example above, if it ends with a semicolon then we don't need the value of d anymore. So, we can pop it out like this:

        POP BX          ;this will pop d from the stack

The stack is empty again.
So the complete code for evaluating d = a + 2 * c; looks like this:

        PUSH d
        PUSH a
        PUSH 2
        PUSH c

        POP BX          ;this will pop c from the stack
        POP AX          ;this will pop 2 from the stack
        MUL BX          ;this will multiply 2 with c and store the result in AX
        PUSH AX         ;this will push the result of 2 * c to the stack

        POP BX          ;this will pop 2*c from the stack
        POP AX          ;this will pop a from the stack
        ADD AX, BX      ;this will add a and 2*c and store the result in AX
        PUSH AX         ;this will push the result of a+2*c  to the stack

        POP BX          ;this will pop a+2*c from the stack
        POP AX          ;this will pop the (useless) value of d from the stack
        MOV d, BX       ;this will store the value of a+2*c in d
        PUSH d         ;this will push the value of d to the stack

        POP BX          ;this will pop d (useless) from the stack. This was pushed at the start of the expression.

If there was no semicolon in the expression, then we would not pop out the last value of d. This helps in passing parameters to functions or evaluating if statements.

back to top

Evaluating `for` loop is somewhat tricky:

Though the above algorithm works for every expression used in the code, it requires a small trick to evaluate for loops. A for loop's grammar looks like this:

expression -> FOR LPAREN expression_statement expression_statement expression RPAREN LCURL statements RCURL
expression_statement -> expression SEMICOLON

The problem in evaluating for loops is that we need to evaluate the third expression after the statements in the curly braces are executed. Here, we have done this by saving the line number of the starting of third expression and then inserting the codes corresponding the statements from that line. The following pseudo code might help:

expression -> FOR LPAREN expression_statement expression_statement
        {
                isInForLoop = true;
                lineNo = currentLineNo; // Saves the line number before third expression
        }
        expression RPAREN LCURL statements RCURL

statements -> ... {
                if(isInForLoop) {
                        insertCode(lineNo);
                } else {
                        insertCode(currentLineNo);
                }
        }

/* You might need to write a function to insert text in the middle of a file. */

Evaluating `functions`:

Please read pages 303-305 (14.5.3 Using the stack for procedures) of the book Assembly Language Programming and Organization of the IBM PC by Ytha Yu, Charles Marut

back to top

Declaring variables:

Global variables: We have inserted the global variables in the data segment. This is done by saving the end line number of the data segment in a variable and then inserting the code corresponding to the global variables using the writeAt(filename, lineNo) function of fileUtils.h header.
Global arrays: We have done this the same way as we have done for global variables. But the syntax is a little different. Follow the example below.
Local variables: Whenever a local variable is declared inside a function, we have PUSHed a dummy value to the stack. Then we just saved the offset of that stack address with respect to BP (bottom pointer referenced in the book here) in the IdInfo of that id.
Local Arrays: Here we have followed the same technique as local variables but the stack is pushed "size of the array" times. Offset of the first element along with the size of the array is saved in the IdInfo of the array id.
- Note: A better approach is to subtract the byte size of the array from SP. It will just move the SP at the end of the array.
Example:

        int a, b[3];
        int main() {
            int c, d[3], e;
        }

        ...
        .DATA
            a DW ?
            b DW 3 DUP(?)
        .CODE
        main PROC
            PUSH AX         ;A garbage value pushed for c
                            ;Offset is -2
            ;PUSH AX
            ;PUSH AX
            ;PUSH AX        ;3 garbage values pushed for d[3]
                            ;offset is -4
            ;better way of allocating space for an array
            SUB SP, 6       ;SP is moved to the end of d[3]

            PUSH AX         ;A garbage value pushed for e
                            ;offset is -10
        main ENDP
        END MAIN

Accessing variables:

Global variables: We just check if it's a global variable or not. If yes then we just used the name of the variable.
Global arrays: We have to use the name of the array and the index of the array. See the example below.
Local variables: We have to use the offset of the stack address with respect to BP.
Local Arrays: We have to use the offset of the stack address with respect to BP and then add the array index times 2 with it.
Example: Continued from the previous example.

        a               ;global variable a is accessed
        b[0]            ;global array b at index 0 is accessed
        b[2]            ;global array b at index 1 is accessed

        [BP + -2]       ;local variable c is accessed
        [BP + -4]       ;local array d at index 0 is accessed
        [BP + -8]       ;local array d at index 2 is accessed
        [BP + -10]      ;local variable e is accessed

Conclusion

It requires a lot of push and pop instructions in this approach. So, sometimes the same value is pushed and popped consecutively. For that reason, an optimization is required. This is called peephole optimization. We have done it in the second pass (after all the expressions are evaluated).

back to top

Optimizing assembly code

As the above algorithm can generate some redundant instructions, we have to optimize the code. The following optimizations are done:

Remove redundant push and pop instructions.

If the first instruction is push and the second is pop and those contains the same address or register then we can remove both the instructions. For exammple,

        *code.asm                               *optimized_code.asm
        PUSH AX                 ->              ;PUSH AX
        POP AX                                  ;POP AX

If the first is push and the second is a pop containing a register or an address then we can replace the two instructions with one MOV instruction. For example,

        *code.asm                               *optimized_code.asm
        PUSH [BP + -2]          ->              MOV AX, [BP + -2]
        POP AX

Remove redundant move instructions

If a move instruction has the same source and destination then we can remove the instruction. For example,

        *code.asm                               *optimized_code.asm
        MOV AX, AX              ->              ;MOV AX, AX

If consecutive two instructions are move and the first instruction contains the same register or address as the second instruction then we can remove the first instruction. For example,

        *code.asm                              *optimized_code.asm
        MOV AX, BX              ->             ;MOV AX, BX
        MOV AX, CX                             MOV AX, CX

If consecutive two instructions are move and the source of the first instruction is the destination of the second instruction and the source of the second one is the destination of the first then we can remove the second instruction. For example,

        *code.asm                              *optimized_code.asm
        MOV AX, BX              ->             MOV AX, BX
        MOV BX, AX                             ;MOV BX, AX

Remove jump instructions followed by a label that is used by that jump

            *code.asm                              *optimized_code.asm
            CMP AX, BX              ->              CMP AX, BX
            JE L1                                   ;JE L1
            L1:                                     L1:

Remove compare instructions that are not followed by any jump instruction

It is not likely to occur in the first pass of optimization. But it might occur in the next passes if the jump instructions that followed the compare instructions were removed in a previous pass. For example, in the first pass,

        *code.asm                              *optimized_code.asm
        CMP AX, BX              ->              CMP AX, BX
        JE L1                                   ;JE L1
        L1:                                     L1:

Now, the compare instruction is useless.

        *code.asm                              *optimized_code.asm
        CMP AX, BX              ->              ;CMP AX, BX
        ;JE L1                                  ;JE L1
        L1:                                     L1:

back to top

Name		Name	Last commit message	Last commit date
Latest commit History 60 Commits
.github/ISSUE_TEMPLATE		.github/ISSUE_TEMPLATE
Offline-1 (Scope Table)		Offline-1 (Scope Table)
Offline-2 (Lexical Analysis)		Offline-2 (Lexical Analysis)
Offline-3 (Syntax and Semantic Analysis)		Offline-3 (Syntax and Semantic Analysis)
Offline-4 (Intermediate Code Generation)		Offline-4 (Intermediate Code Generation)
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

A C Compiler

Introduction

Table of Contents

Lexical Analyser

Lexer returns the following tokens to the parser:

Lexical Errors:

Syntax Analyser

Our chosen subset of the C language has the following characteristics:

Error recovery:

Semantic Analyser

Following semantics are checked in the compiler:

Non terminal data types used:

Intermediate Code Generation

On the fly intermediate code generation strategy:

Table of Contents:

The Algorithm:

Evaluating `for` loop is somewhat tricky:

Evaluating `functions`:

Declaring variables:

Accessing variables:

Conclusion

Optimizing assembly code

About

Languages

License

fardinanam/A-C-Compiler

Folders and files

Latest commit

History

Repository files navigation

A C Compiler

Introduction

Table of Contents

Lexical Analyser

Lexer returns the following tokens to the parser:

Lexical Errors:

Syntax Analyser

Our chosen subset of the C language has the following characteristics:

Error recovery:

Semantic Analyser

Following semantics are checked in the compiler:

Non terminal data types used:

Intermediate Code Generation

On the fly intermediate code generation strategy:

Table of Contents:

The Algorithm:

Evaluating for loop is somewhat tricky:

Evaluating functions:

Declaring variables:

Accessing variables:

Conclusion

Optimizing assembly code

About

Topics

Resources

License

Stars

Watchers

Forks

Languages

Evaluating `for` loop is somewhat tricky:

Evaluating `functions`: