Over the course of the semester, I built a functional compiler for a subset of the C language, written in Lex, Bison, & C.
To run the compiler, navigate to /Parser
and type make compile
. Optional fields input
, output
, and error
can be used to change the default input, output, and error files from stdin, stdout, and stderr respectively.
The compiler was completed in stages:
The first stage of the compiler is the lexical analyzer, which was built using Lex. This stage reads in one character at a time and groups them into tokens, determining the token semantic value when applicable. File names and line numbers are also ingested by the lexer for error reporting.
The second stage of the compiler is the parser, which was built using Bison. The parser consumes tokens from the lexer and uses a grammar to perform syntactical & semantic analysis. As the grammar is used to determine the relationship between tokens, an abstract syntax tree is built.
This stage was built in increments to support a majority of the C language:
The parser is able to properly handle most of the C expression syntax.
The parser is able to properly handle most C declarations. The parser is also able to properly store variables within the symbol table, differentiating by namespace and scope, as well as keep track of the current scope and search for identifiers throughout the scope stack.
The parser is able to properly handle most C statements, including expression statements, for statements, if-else statements, etc.
In this stage, the abstract syntax tree nodes generated by the parser are used to create quads, an intermediary representation of the program which can then be translated into assembly code.
Finally, the quads are translated into X86 - 32 bit assembly code, which can then be run by the computer.
An abstract syntax tree is a data structure that graphs the abstract relationship between elements of the language as they are parsed. An example can be seen below for the expression xyz = (1024 + abc);
.
In my implementation, an AST node is represented as a union of various AST node structs, each containing fields relevant to the node type. For example, a binary operation AST node contains pointers to the left and right expression AST nodes, as well as a field to store the operation.
The symbol table is the data structure used to contain a mapping between variables and values for each scope and namespace. When a variable is added to the symbol table, its identifier, or symbol, is hashed and a pointer to its AST node is inserted at the hashed index.
When the compiler begins parsing the program, a scope stack is created. Each layer, or scope, contains the type of scope (file, block, function, or prototype) and a symbol table for each namespace (label, tag, & other). Additionally, each scope contains the file and line on which it begins for better error reporting.
Variables are always inserted into the innermost scope and specified namespace, and when searching for a variable, the search begins in the innermost scope and specified namespace and works up the scope stack.
Certain areas of the C language were not implemented, including:
- Initialized declarations
- Expressions within an array declarator (i.e a[2+2])
- C99 variable-length arrays
- Type Qualifiers (the keywords are still accepted in the grammar)
- Enums
- Bit Fields
- Prototype / Formal Parameter Lists
- Typedef