Zebu

Introduction

Zebu is an LL parser-generator written in C.

It does the work of both GNU Bison & Flex by allowing the user to put regular expressions directly in the grammar rules.

It builds on Bison's input syntax, keeping it largely the same while adding features.

TODO: Document the non-static functions, globals, and structs that will be generated by zebu.

Currently, this program only builds on Linux; it can, however, be cross-compiled for Windows (see below).

The parser and the tokenizer(s) are generated simultaneously, resulting in context-sensitive tokenization: for each parser state, a tokenizer is generated that recognizes only the tokens that would be valid to read next.

TODO: Talk about parser templates...

TODO: Mention that zebu generates C-structs to build the parse-tree based on tags.

This enables reading multiple languages (with different keywords or comment styles) with the same parser (CSS embedded in HTML, for instance).

Command-Line Arguments (Usage)

  • -i <PATH> (--input=<PATH>): Specifies the path to the initial input file.
  • -o <PATH> (--output=<PATH>): Specifies path for output files without the file extension. For instance, if one wanted zebu to produce "dir/foo.c" and "dir/foo.h", one would invoke zebu with -o dir/foo.
  • -p <PREFIX> (--prefix=<PREFIX>): Sets the prefix for the names used in the generated code. This can be used to avoid symbol conflicts when multiple generated parsers are linked into the same program. The default prefix is "zebu".
  • -m (--minimize-lexer): Tells zebu to combine and simplify all tokenizers generated across all parser-contexts. This operation may take time, depending on the language. By default it is not performed.
  • -M (--make-dependencies): Tells zebu to print into a separate file a make rule suitable for describing the dependencies of the output file. Zebu outputs one make rule containing the output file name, a colon, and the names of all the included files. The default dependency-file path is the output path (set with the -o flag) with a .d suffix appended. This can be overridden with:
  • -F <PATH> (--make-dependencies-file=<PATH>): Sets the path zebu will output the dependency information into. This flag implies -M.
  • -t <TEMPLATE> (--template=<TEMPLATE>): Sets the parser-template for generating output. The default value is "just-tables".
  • -T <PATH> (--custom-template=<PATH>): TODO
  • -v (--verbose): Enables progress-percentage print-outs to the terminal for the most time-consuming algorithms: NFA-to-DFA, DFA simplification, lexer minimization (see above), and LL parser generation.
  • -h (--help): Prints this help message.
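
For example, drawing only on the flags above (the input file name grammar.zb is hypothetical), a typical invocation might look like:

zebu -i grammar.zb -o gen/parser -p json -t fileio -v

This would read grammar.zb, write gen/parser.c and gen/parser.h with generated symbols prefixed "json", use the fileio parser-template, and print progress while doing so.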

Building from Source

Dependencies

This program has very few build-time dependencies (namely GNU Make, Glibc & GCC), and even fewer run-time dependencies (none at all). The release builds are statically linked against the standard C library.

Building (Linux)

Zebu builds with the command:

make

or

make -j `nproc`

After a build using default settings, the executable can be found at the path bin/release-build/yes-verbose/no-dotout/zebu. Installation can be done with the command:

make install

or

make install PREFIX=<PREFIX>

PREFIX sets the installation prefix; the default value is "${HOME}/bin".

Cross-Compiling for Windows

Zebu can be built for MS Windows on Linux using MinGW's GCC for Windows. On Ubuntu, this can be installed with sudo apt install -y gcc-mingw-w64-x86-64.

The command to build zebu for Windows is:

make platform=windows

or

make -j `nproc` platform=windows

After the build is complete, the executable can be found at bin/windows-platform/release-build/no-verbose/no-dotout/zebu.exe.

Input File Specification

An input file lists a set of directives, named character-sets, named regular-expressions, and named grammar rules; these are described below. For each of these items, names cannot be reused.

Directives

  • %start: Defines the "start" grammar of the language. After this directive and a colon, a grammar rule follows, ending with a semicolon.
    • Example: %start: "abc" | "def";
  • %skip: After a colon, a regular expression follows, ending with a semicolon. This regular expression describes the set of characters or strings that should be ignored by the tokenizer when scanning for tokens. Note that the specified "skip" directive will affect only the tokens that are described below it; previously-described tokens will not be affected. Common usages are listed below:
    • %skip: ' ' | '\t' | '\n'; (Ignores space, tab and newline characters)
    • %skip: ' ' | '\t' | '\n' | "#" [!'\n']* "\n"; (Ignores shell-style comments and space, tab and newline characters)
    • %skip: ' ' | '\t' | '\n' | "//" [!'\n']* "\n"; (Ignores C++-style comments and space, tab and newline characters)
    • %skip: [128 - 255]; (Ignores all bytes with their high bit set)
  • %extra_field: TODO: grammar-name, type (as string), field-name (as identifier), optional function name.
  • %include: Gives a path to read using either "path" or <path> syntax. The former will resolve the path relative to the current file, and the latter will resolve the path relative to the first file zebu was given on the command-line. Each file does not have its own namespace; all of the defined character sets, regular expressions, and grammars will be copied into the global scope. A file included more than once will not be read more than once.
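
Putting these together, a minimal input file using the %skip and %start directives (plus a named regular expression, described in a later section) might look like:

%skip: ' ' | '\t' | '\n';

`word`: ['a' - 'z']+;

%start: `word` ("," `word`)*;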

Character Sets

Inside of either a regular-expression or grammar-rule context, one enters the character-set-expression context using '[' and ']'. One can name a character set using [name]: <character-set-expression>; syntax at the root level of an input file.

Remember that one can also refer to the value of a defined character set by name inside of other character-set expressions. The referenced character set must be defined before its usage.

C-style literals, both as characters in single quotes and as integers, can be used inside character sets. For example, the letter 'a' can be indicated by 'a', 0141, 97, or 0x61. Any unquoted whitespace is ignored.

Parentheses (( & )) can be used to raise the precedence of operators.

Operators for Character Sets

  1. !(...): Complement (unary operator). Results in a character set of all the characters not in the given character set.
  2. (...) - (...): Range (binary operator, not associative). If given two literals, returns a character set containing those literals and any characters between them. If the first operand is a character set, it will use the minimum character contained in the set. If the second operand is a character set, it will use the maximum character contained in the set. Examples:
  • ['a'-'c'] yields ['a','b','c']
  • [('a','b') - ('d','e')] yields ['a','b','c','d','e']
  3. (...) & (...): Intersection (binary operator, left-associative). Returns a character set of all the elements that are contained in both of the two given character sets. Examples:
  • [('a','b','c') & ('b','c','d')] yields ['b','c']
  • [('a','b','c') & !'a'] yields ['b','c']
  4. (...) ^ (...): Symmetric difference (binary operator, left-associative). Returns a character set of all the elements that are contained in only one of the two given character sets.
  • Example: [('a','b','c') ^ ('b','c','d')] yields ['a','d']
  5. (...) | (...) or (...), (...) or (...) (...): Union (binary operator, left-associative). Results in a character set of all elements that are contained in either given character set. This is the default behavior if no operator is given between two character sets.
  • Example: [('a','b','c') | ('b','c','d')] yields ['a','b','c','d'].

Character-set Examples

  • ['a' 'b' 'c'] or ['a','b','c'] or ['a'|'b'|'c'] or ['a'-'c'] or [97 - 99]: Describes a character set containing the letters 'a', 'b' and 'c'.
  • [('a','b') ^ ('b','c')]: Describes a character set containing the letters 'a' and 'c'.
  • ['a'-'c' | 'b'-'d'] or [!(!('a'-'c') & !('b'-'d'))]: Describes a character set containing 'a', 'b', 'c' and 'd'.
  • ['a'-'c' & 'b'-'d'] or [!(!('a'-'c') | !('b'-'d'))]: Describes a character set containing 'b' and 'c'.
  • ['a'-'z' & !('a','e','i','o','u')]: Describes a character set containing all lowercase English consonants.
  • [0 - 127]: Describes any ASCII character.

Regular Expressions

Inside of a grammar-rule context, one can enter the regular-expression context using a backtick (`). One can also name a regular expression using `name`: value; syntax on the root level of an input file, and subsequently refer to this expression by name within other regular expressions.

C-style character literals, integer literals, and string literals can be used in regular expressions. The dot (.) literal can be used as a shorthand for [0 - 255], matching any valid next character. Any unquoted whitespace is ignored.

Square brackets ([ & ]) can be used to enclose a character-set expression, described above.

One can refer to the value of a named regular expression inside other regular expressions. The expression to which one refers must be defined before it can be used.
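
For instance, mirroring the `string` definition in the JSON sample later in this document, one might define and reuse a named expression (this sketch assumes that a named expression is referenced with the same backtick syntax inside other expressions):

`digit`: ['0' - '9'];

`number`: `digit`+ ('.' `digit`*)?;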

Parentheses (( & )) can be used to raise the precedence of operators.

Operators for Regular Expressions

  1. (...)?, (...)*, (...)+, (...){N}, (...){N,}, (...){,N}, (...){N,M}: Repetition (unary operators). Each returns a regular expression to match strings that repeat 0 or 1 times, 0 or more times, 1 or more times, N times, N or more times, up through N times, and from N through M times, respectively. ('N' and 'M' must be numeric literals.) Examples:
    • 'a'? matches "", "a"
    • 'a'* matches "", "a", "aa", "aaa", etc.
    • 'a'+ matches "a", "aa", "aaa", etc.
    • 'a'{3} matches "aaa"
    • 'a'{,3} matches "", "a", "aa", "aaa"
    • 'a'{3,} matches "aaa", "aaaa", "aaaaa", etc.
    • 'a'{2,4} matches "aa", "aaa", "aaaa"
  2. (...) (...): Concatenation/Juxtaposition (binary operator, left-associative). Returns a regular expression to match the concatenation of all matches of the first given regular expression with all matches of the second given regular expression.
    • Example: 'a'+ 'b'+ matches "ab", "aab", "abb", etc.
  3. (...) & (...) or (...) & !(...): Intersection and Difference respectively (binary operators, left-associative). Intersection returns a regular expression to match both the first and the second given regular expressions. Difference returns a regular expression to match the first but not the second given regular expression. Examples:
    • "aa"* & "aaa"* matches "", "aaaaaa", "aaaaaaaaaaaa", etc.
    • "aa"* &! "aaa"* matches "aa", "aaaa", "aaaaaaaa", etc.
  4. (...) | (...): Union (binary operator, left-associative). Returns a regular expression to match either the first or second given regular expression. Examples:
    • "a" | "b" matches "a" or "b".
    • ("a" | "b")* matches "", "a", "b", "ab", "ba", etc.

Regular-expression Examples

  • [!'b']* 'b' [!'b']* 'b' [!'b']*: Matches any string containing exactly two 'b's.
  • ['0'-'9']+ ('.' ['0'-'9']*)?: Matches any decimal number, with or without digits past the decimal point.
  • 'a'+'b'+'c'+ & (..)*: Describes all even-length strings with one or more 'a's, one or more 'b's, and one or more 'c's, consecutively.
  • 'a'+'b'+'c'+ &! 'a''b'+'c': Describes all strings with one or more 'a's, one or more 'b's, and one or more 'c's, consecutively, in which 'a', 'c', or both occur more than once.
  • 'a'+'b'+'c'+ &! "abc": Describes all strings with one or more 'a's, one or more 'b's, and one or more 'c's, consecutively, in which at least one letter occurs more than once.
  • 'a'? 'b'? 'c'?: Describes all the alphabetically-sorted strings made only from 'a', 'b' and 'c' where no letter can be repeated.
  • [0x00-0x7F]|[0xC0-0xDF][0x80-0xBF]{1}|[0xE0-0xEF][0x80-0xBF]{2}|[0xF0-0xF7][0x80-0xBF]{3}|[0xF8-0xFB][0x80-0xBF]{4}|[0xFC-0xFD][0x80-0xBF]{5}: Describes any single UTF-8-encoded character, from one byte (ASCII) up to the original six-byte maximum.

Grammar Rules

TODO: Briefly review what a grammar rule is exactly...

C-style character literals, integer literals, and string literals can be used in grammar rules. Square-brackets ([ & ]) can be used to enclose a character-set expression, described above. Backticks (`) can be used to enclose a regular expression, described above.

Parentheses (( & )) can be used to raise the precedence of operators.

Operators for Grammar Rules

Grammar-rule operators follow syntax and behavior similar to the regular-expression operators above. Grammar rules do not (currently) support the intersection operator.

  1. (...)?, (...)*, (...)+, (...){N}, (...){N,}, (...){,N}, (...){N,M}: Repetition (unary operators). Each describes a grammar rule that would accept the strings that repeat 0 or 1 times, 0 or more times, 1 or more times, N times, N or more times, up through N times, and from N through M times, respectively. ('N' and 'M' must be numeric literals.) Examples:
    • A: 'a'; B: 'b'; %start: A? B; matches "b", "ab".
    • A: 'a'; B: 'b'; %start: A* B; matches "b", "ab", "aab", "aaab", etc.
    • A: 'a'; B: 'b'; %start: A+ B; matches "ab", "aab", "aaab", etc.
    • A: 'a'; B: 'b'; %start: A{3} B; matches just "aaab".
    • A: 'a'; B: 'b'; %start: A{,3} B; matches "b", "ab", "aab", "aaab".
    • A: 'a'; B: 'b'; %start: A{3,} B; matches "aaab", "aaaab", "aaaaab", etc.
    • A: 'a'; B: 'b'; %start: A{2,4} B; matches just "aab", "aaab", "aaaab".
  2. %<name>:<expression>, <name>:<expression>, ...: (...): Subdefinitions.
  3. (...) (...): Concatenation/Juxtaposition (binary operator, left-associative).
  4. (...) | (...): Union (binary operator, left-associative). Parses the grammars described by either subgrammar rule.

Tags

Tags provide a way to "name" the data that is getting parsed.
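
As a rough sketch grounded in the JSON sample later in this document: in

keyvalue: `string` #key ":" value #value;

the tags #key and #value name the string and the nested value within the node built for keyvalue, while the #name[] form used in the other samples appears to collect repeated matches under a single name.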

TODO: Mention that format strings can come before the end of a gravemark-context.

TODO: Talk about datatypes, and struct generation.

Inline-Grammars

TODO

Examples of Grammar Rules

TODO

Examples of Language Input

This section provides a small catalog of sample input files for zebu.

JSON

%skip: ' ' | '\t' | '\n';

%start: value #value;

`string`: '\"' ['a' - 'z']+ '\"';

keyvalue: `string` #key ":" value #value;

value
	: "true"  #boolean
	| "false" #boolean
	| `['0' - '9']+` #number
	| `string` #string
	| "[" value #elements[] ("," value #elements[])* "]"
	| ("{" keyvalue #dict[] ("," keyvalue #dict[])* "}")
	;
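
For reference, this simplified grammar (strings are restricted to non-empty runs of lowercase letters) would accept input such as:

{ "name": "zebu", "tags": ["parser", "lexer"], "stars": 42 }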

CSV

`cell`: [!',','\"','\n']* ([!',','\"','\n']* '\"' (('\"' '\"') | [!'\"','\n'])* '\"')* [!',','\"','\n']*;

row: `cell` #cells[] (',' `cell` #cells[])* `'\n'+`;

%start: (row #rows[])+;

Math Expressions

%skip: ' ';

highest: `['0' - '9']+` #literal | '(' root #subexpression ')' ;

multiply: highest #base ('*' highest #multiplymes[] | '/' highest #dividemes[])*;

addition: multiply #base ('+' multiply #addmes[] | '-' multiply #subtractmes[])*;

root: addition #root;

%start: root #root;
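
This grammar accepts ordinary infix expressions such as:

(1 + 2) * 34 / 5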

Parser Templates

  • -t just-tables (--template=just-tables): Outputs the tables used for tokenization (the transition-table, start states, accepting states, and EOF-transition table) and the tables used for parsing (the shift-table, reduce-table, and goto-table), and prints a brief description of the other templates. This is the default parser-template.
  • -t really-just-tables (--template=really-just-tables): Gives the same output as the above option, but without the descriptions of other templates.
  • -t charbuffer (--template=charbuffer): Generates a parser function that will read len characters from the given (unsigned) character buffer, and returns the parse-tree. Compile with -D ZEBU_DEBUG to insert code to pretty-print the parse-tree.
  • -t piecewise-charbuffer (--template=piecewise-charbuffer): Generates several functions for maintaining and using a thread-safe parser that can save/restore its state between parser invocations.
  • -t fileio (--template=fileio): Generates a parser function that will read from the given FILE* stream until EOF is reached, and returns the parse-tree. Compile with -D ZEBU_DEBUG to insert code to pretty-print the parse-tree.
  • -t fileio-with-driver (--template=fileio-with-driver): Generates a program that will open and read the path given as the first argument on the command-line. The file is parsed, and the parse-tree is pretty-printed.
  • -t readline (--template=readline): Generates a parser function that uses the GNU Readline Library to read a line of text from the terminal. The function parses the line and returns the parse-tree. Remember to link the resultant program with -lreadline. Compile the generated function with -D ZEBU_DEBUG to insert code to pretty-print the parse-tree.
  • -t readline-with-driver (--template=readline-with-driver): Generates a program that will repeatedly invoke the GNU Readline Library to read lines from the terminal. Each line is parsed, and pretty-printed. Remember to link the resultant program with -lreadline.
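
For instance, assuming the parser was generated with -o parser using the readline template and is driven by a hand-written main.c (both file names hypothetical), the program might be built with:

cc -D ZEBU_DEBUG parser.c main.c -o demo -lreadline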

Parser Template File Format

TODO

  • {{PREFIX}}: TODO
  • {{SHIFT_TABLE}}: TODO
  • {{REDUCE_TABLE}}: TODO
  • {{GOTO_TABLE}}: TODO
  • {{LEXER_TABLE}}: TODO
  • {{LEXER_STARTS_TABLE}}: TODO
  • {{LEXER_ACCEPTS_TABLE}}: TODO
  • {{LEXER_EOF_TABLE}}: TODO
  • {{PARSE_TREE_STRUCTS}}: TODO
  • {{PARSE_TREE_PRINT_TREE_FUNCTIONS}}: TODO
  • {{PARSE_TREE_INC_FUNCTIONS}}: TODO
  • {{PARSE_TREE_FREE_FUNCTIONS}}: TODO
  • {{REDUCTION_RULE_SWITCH}}: TODO
  • {{START_GRAMMAR_ID}}: TODO
  • {{TOKEN_IDS_TO_SETS}}: Used for debugging.

Development & Implementation Details

Build Types

If you are a developer of zebu, the three build flags below may help you. They can be set as environment variables or as arguments to the make command.

  • buildtype=debug: 'Debug' builds insert print statements whenever entering or exiting functions marked with the ENTER or EXIT macros, and print values for debugging with the dpv() macro. All source files also include all (relevant) system headers and utility functions within zebu.
  • buildtype=test: 'Test' builds remove the debug print-statements described above, but will still keep the extra includes. This build type is intended for when a new feature has been debugged and needs to be tested before its implementation is set in stone.
  • buildtype=release: 'Release' builds remove anything extra: no debug statements, no extra includes. Compiler flags are added for optimizations and static linking. This is the default setting for buildtype.
  • verbose=on: Inserts the extra code needed for the -v (--verbose) command-line flag to function; otherwise, the flag is ignored. To disable this option, use verbose=off.
  • dotout=on: Enables the generation of Graphviz dot output for every automaton creation or modification. This feature is intended to aid in tracking down bugs. Files are put into a "dot/" directory; this directory is created if it does not exist. This option is off by default (dotout=off).
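
For example, a development build with all of the extra instrumentation enabled could be invoked as:

make -j `nproc` buildtype=debug verbose=on dotout=on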

Implementation Details: Tokenizer

TODO

Implementation Details: Grammar Rules

TODO

Future Features

  • A way of articulating in the input file how to handle shift-reduce conflicts, perhaps something not unlike the way GNU Bison or Yacc deals with them (tokens have a "precedence"; if the last token of a production rule has a higher precedence than the look-ahead token, reduce; otherwise shift).
  • Something to let the user insert custom C code into the tokenizer to change which token the tokenizer reports detecting; an application for this would be getting Zebu to differentiate between the use of a C variable and the use of a typedef'd type.