Zebu is an LL parser-generator written in C.
It does the work of both GNU Bison & Flex by allowing the user to put regular expressions directly in the grammar rules.
It builds on top of Bison, keeping the syntax, but adding features.
non-static
functions, globals, and structs that will be generated by zebu.
Currently, this program only builds & runs on Linux.
The parser and the tokenizer(s) are generated simultaneously, resulting in context-sensitive tokenization. A tokenizer is generated to recognize only the tokens that would be valid to read next for each parser state.
Talk about parser templates...
Mention that zebu generates C
-structs to build the parse-tree based on tags.
This enables reading multiple languages (with different keywords or comment-style) with the same parser. (CSS embedded in HTML, for instance)
-i <PATH>
(--input=<PATH>
): Specifies path to initial input file.-o <PATH>
(--output=<PATH>
): Specifies path for output files without the file extension. For instance, if one wanted zebu to produce "dir/foo.c" and "dir/foo.h", one would invoke zebu with-o dir/foo
.-p <PREFIX>
(--prefix=<PREFIX>
): Sets the prefix for the names used in the generated code. This is can be used for avoiding symbol conflicts between multiple generated parsers linking into the same program. The default prefix is "zebu".-m
(--minimize-lexer
): Tells zebu to combine and simplify all tokenizers generated across all parser-contexts. This operation's may take time depending on the language. By default it is not performed.-M
(--make-dependencies
): Tells zebu to print into a separate file a make rule suitable for describing the dependencies of the output file. Zebu outputs one make rule containing the output file name, a colon, and the names of all the included files. The default dependency-file path is output path (set with the-o
flag), with the.d
suffixed appended. This can be overwritten with:-F <PATH>
(--make-dependencies-file=<PATH>
): Sets the path zebu will output the dependency information into. This flag implies-M
.-t <TEMPLATE>
(--template=<TEMPLATE>
): Sets the parser-template for generating output. The default value is "just-tables".-T <PATH>
(--custom-template=<PATH>
): TODO-v
(--verbose
): Enables progress-percentage print-outs to the terminal for the most time-consuming algorithms: NFA-to-DFA, DFA simplification, lexer minimization (see above), and LL parser generation.-h
(--help
): Prints this help message.
This program has very few build-time dependencies (namely GNU Make, Glibc & GCC), and even fewer run-time dependencies (none at all). The release builds are statically linked in the standard C library.
Zebu builds with the command:
make
or
make -j `nproc`
After a build using default settings, the executable can be found at the path bin/release-build/yes-verbose/no-dotout/zebu
. Installation can be done with the command:
make install
or
make install PREFIX=<PREFIX>
PREFIX
sets the installation prefix; the default value is "${HOME}/bin
."
Zebu can be built for MS Windows on Linux using MinGW's GCC for Windows. On Ubuntu, this can be installed with sudo apt install -y gcc-mingw-w64-x86-64
.
The command to build zebu for windows is:
make platform=windows
or
make -j `nproc` platform=windows
After the build is complete, the executable can be found at bin/windows-platform/release-build/no-verbose/no-dotout/zebu.exe
.
An input file lists a set of directives, named characters-sets, named regular-expressions and named grammar rules; these are described below. For each of these items, names cannot be reused.
%start
: Defines the "start" grammar of the language. After this directive and a colon, a grammar rule follows, ending with a semicolon.- Example:
%start: "abc" | "def";
- Example:
%skip
: After a colon, a regular expression follows, ending with a semicolon. This regular expression describes the set of characters or strings that should be ignored by the tokenizer when scanning for tokens. Note that the specified "skip" directive will affect only the tokens that are described below it; previously-described tokens will not be affected. Common usages are listed below:%skip: ' ' | '\t' | '\n';
(Ignores space, tab and newline characters)%skip: ' ' | '\t' | '\n' | "#" [!'\n']* "\n";
(Ignores shell-style comments and space, tab and newline characters)%skip: ' ' | '\t' | '\n' | "//" [!'\n']* "\n";
(Ignores C++-style comments and space, tab and newline characters)%skip: [128 - 255];
(Ignores all bytes with their high-bits set)
%extra_field
: TODO: grammar-name, type (as string), field-name (as identifier), optional function name.%include
: Gives a path to read using either"path"
or<path>
syntax. The former will resolve the path relative to the current file, and the latter will resolve the path relative to the first file zebu was given on the command-line. Each file does not have its own namespace; all of the defined character sets, regular expressions, and grammars will be copied into the global scope. A file included more than once will not be read more than once.
Inside of either a regular-expression or grammar-rule context, one enters the character set-expression context using '['
and ']'
. One can name a character set using [name]: <character-set-expression>;
syntax in the root level of input files.
Remember that one can also refer to the value of a defined character set by name inside of other character-set expressions. The referenced character set must be defined before its usage.
C-style literals, both as characters in single quotes and as integers, can be used inside character sets. For example, the letter 'a' can be indicated by 'a'
, 0141
, 97
, or 0x61
. Any unquoted whitespace is ignored.
Parentheses ((
& )
) can be used to raise the precedence of operators.
!(...)
: Complement (unary operator). Results in a character set of all the characters not in the givencharacter set.(...) - (...)
: Range (binary operator, not associative). If given two literals, returns a character set containing those literals and any characters between them. If the first operand is a character set, it will use the minimum character contained in the set. If the second operand is a character set, it will use the maximum character contained in the set. Examples:
['a'-'c']
yields['a','b','c']
[('a','b') - ('d','e')]
yields['a','b','c','d','e']
(...) & (...)
: Intersection (binary operator, left-associative). Returns a character set of all the elements that are contained in both of the two given character sets. Examples:
[('a','b','c') & ('b','c','d')]
yields['b','c']
[('a','b','c') & !'a']
yields['b','c']
(...) ^ (...)
: Symmetric difference (binary operator, left-associative). Returns a character set of all the elements that are contained in only one of the two given character sets.
- Example:
[('a','b','c') ^ ('b','c','d')]
yields['a','d']
(...) | (...)
or(...), (...)
or(...) (...)
: Union (binary operator, left-associative). Results in a character set of all elements that are contained in either given character set. This is the default behavior if no operator is given between two character sets.
- Example:
[('a','b','c') | ('b','c','d')]
yields['a','b','c','d']
.
['a' 'b' 'c']
or['a','b','c']
or['a'|'b'|'c']
or['a'-'c']
or[97 - 99]
: Describes a character set containing the letters 'a', 'b' and 'c'.[('a','b') ^ ('b','c')]
: Describes a character set containing the letters 'a' and 'c'.['a'-'c' | 'b'-'d']
or[!(!('a'-'c') & !('b'-'d'))]
: Describes a character set containing 'a', 'b', 'c' and 'd'.['a'-'c' & 'b'-'d']
or[!(!('a'-'c') | !('b'-'d'))]
: Describes a character set containing 'b' and 'c'.['a'-'z' & !('a','e','i','o','u')]
: Describes a character set containing all English consonant letters.[0 - 127]
: Describes any ASCII character.
Inside of a grammar-rule context, one can enter the regular-expression context using a backtick (`
). One can also name a regular expression using `name`: value;
syntax on the root level of an input file, and subsequently refer to this expression by name within other regular expressions.
C-style character literals, integer literals, and string literals can be used in regular expressions. The dot (.
) literal can be used as a shorthand for [0 - 255]
, matching any valid next character. Any unquoted whitespace is ignored.
Square brackets ([
& ]
) can be used to enclose a character-set expression, described above.
One can refer to the value of a named regular expression inside other regular expressions. The expression to which one refers must be defined before it can be used.
Parentheses ((
& )
) can be used to raise the precedence of operators.
(...)?
,(...)*
,(...)+
,(...){N}
,(...){N,}
,(...){,N}
,(...){N,M}
: Repetition (unary operators). Each returns a regular expression to match strings that repeat 0 or 1 times, 0 or more times, 1 or more times,N
times,N
or more times, up throughN
times, and fromN
throughM
times, respectively. ('N' and 'M' must be numeric literals.) Examples:'a'?
matches "", "a"'a'*
matches "", "a", "aa", "aaa", etc.'a'+
matches "a", "aa", "aaa", etc.'a'{3}
matches "aaa"'a'{,3}
matches "", "a", "aa", "aaa"'a'{3,}
matches "aaa", "aaaa", "aaaaa", etc.'a'{2,4}
matches "aa", "aaa", "aaaa"
(...) (...)
: Concatenation/Juxtaposition (binary operator, left-associative). Returns a regular expression to match the concatenation of all matches of the first given regular expression with all matches of the second given regular expression.- Example:
'a'+ 'b'+
matches "ab", "aab", "abb", etc.
- Example:
(...) & (...)
or(...) & !(...)
: Intersection and Difference respectively (binary operators, left-associative). Intersection returns a regular expression to match both the first and the second given regular expressions. Difference returns a regular expression to match the first but not the second given regular expression. Examples:"aa"* & "aaa"*
matches "", "aaaaaa", "aaaaaaaaaaaa", etc."aa"* &! "aaa"*
matches "aa", "aaaa", "aaaaaaaa", etc.
(...) | (...)
: Union (binary operator, left-associative). Returns a regular expression to match either the first or second given regular expression. Examples:"a" | "b"
matches "a" or "b".("a" | "b")*
matches "", "a", "b", "ab", "ba", etc.
[!'b']* 'b' [!'b']* 'b' [!'b']*
: Matches any string containing exactly two 'b's.['0'-'9']+ ('.' ['0'-'9']*)?
: Matches any decimal figure with or without digits past the decimal point.'a'+'b'+'c'+ & (..)*
: Describes all even-length strings with one or more 'a's, one or more 'b's, and one or more 'c's, consecutively.'a'+'b'+'c'+ &! 'a''b'+'c'
: Describes all strings with one or more 'a's, one or more 'b's, and one or more 'c's, consecutively, in which 'a', 'c', or both occur more than once.'a'+'b'+'c'+ &! "abc"
: Describes all strings with one or more 'a's, one or more 'b's, and one or more 'c's, consecutively, in which at least one letter occurs more than once.'a'? 'b'? 'c'?
: Describes all the alphabetically-sorted strings made only from 'a', 'b' and 'c' where no letter can be repeated.[0x00-0x7F]|[0xC0-0xDF][0x80-0xBF]{1}|[0xE0-0xEF][0x80-0xBF]{2}|[0xF0-0xF7][0x80-0xBF]{3}|[0xF8-0xFB][0x80-0xBF]{4}|[0xFC-0xFD][0x80-0xBF]{5}
: Describes all UTF8-encoded multi-byte characters.
TODO: Briefly review what a grammar rule is exactly...
C-style character literals, integer literals, and string literals can be used in grammar rules. Square-brackets ([
& ]
) can be used to enclose a character-set expression, described above. Backticks (`
) can be used to enclose a regular expression, described above.
Parentheses ((
& )
) can be used to raise the precedence of operators.
Grammar-rule operators follow a similar syntax and behavior as the regular-expression operators above. Grammar rules do not (currently) support the intersection operator.
Parentheses ((
& )
) can be used to raise the precedence of operators.
(...)?
,(...)*
,(...)+
,(...){N}
,(...){N,}
,(...){,N}
,(...){N,M}
: Repetition (unary operators). Each describes a grammar rule that would accept the strings that repeat 0 or 1 times, 0 or more times, 1 or more times,N
times,N
or more times, up throughN
times, and fromN
throughM
times, respectively. ('N' and 'M' must be numeric literals.) Examples:A: 'a'; B:'b'; %start: A? B
matches "b", "ab".A: 'a'; B:'b'; %start: A* B
matches "b", "ab", "aab", "aaab", etc.A: 'a'; B:'b'; %start: A+ B
matches "ab", "aab", "aaab", etc.A: 'a'; B:'b'; %start: A{3} B
matches just "aaab".A: 'a'; B:'b'; %start: A{,3} B
matches "b", "ab", "aab", "aaab".A: 'a'; B:'b'; %start: A{3,} B
matches "aaab", "aaaab", "aaaaab", etc.A: 'a'; B:'b'; %start: A{2,4} B
matches just "aab", "aaab", "aaaab".
%<name>:<expression>, <name>:<expression>, ...: (...)
: Subdefinitions.(...) (...)
: Concatenation/Juxtaposition (binary operator, left-associative).(...) | (...)
: Union (binary operator, left-associative). Parses the grammars described by either subgrammar rule.
Tags provide a way to "name" the data that is getting parsed.
TODO: Mention that format strings can come before the end of a gravemark-context.
TODO: Talk about datatypes, and struct generation.
TODO
TODO
This section provides a small catalog of sample input files for zebu.
%skip: ' ' | '\t' | '\n';
%start: value #value;
`string`: '\"' ['a' - 'z']+ '\"';
keyvalue: `string` #key ":" value #value;
value
: "true" #boolean
| "false" #boolean
| `['0' - '9']+` #number
| `string` #string
| "[" value #elements[] ("," value #elements[])* "]"
| ("{" keyvalue #dict[] ("," keyvalue #dict[])* "}")
;
`cell`: [!',','\"','\n']* ([!',','\"','\n']* '\"' (('\"' '\"') | [!'\"','\n'])* '\"')* [!',','\"','\n']*;
row: `cell` #cells[] (',' `cell` #cells[])* `'\n'+`;
%start: (row #rows[])+;
%skip: ' ';
highest: `['0' - '9']+` #literal | '(' root #subexpression ')' ;
multiply: highest #base ('*' highest #multiplymes[] | '/' highest #dividemes[])*;
addition: multiply #base ('+' multiply #addmes[] | '-' multiply #subtractmes[])*;
root: addition #root;
%start: root #root;
TODO
TODO
-t just-tables
(--template=just-tables
): Outputs the tables used for tokenization: the transition-table, start states, accepting states, and EOF-transition table, and outputs tables used for parsing: the shift-table, reduce-table and goto-table, and prints a brief description of other templates. This is the default parser-template.-t really-just-tables
(--template=really-just-tables
): Gives the same output as the above option, but without the descriptions of other templates.-t charbuffer
(--template=charbuffer
): Generates a parser function that will readlen
number of characters from the given (unsigned) character buffer, and returns the parse-tree. Compile with-D ZEBU_DEBUG
to insert code to pretty-print the parse-tree.-t piecewise-charbuffer
(--template=piecewise-charbuffer
): Generates several functions for maintaining and using a thread-safe parser that can save/restore its state between parser invocations.-t fileio
(--template=fileio
): Generates a parser function that will read from the givenFILE*
stream until EOF is reached, and returns the parse-tree. Compile with-D ZEBU_DEBUG
to insert code to pretty-print the parse-tree.-t fileio-with-driver
(--template=fileio-with-driver
): Generates a program that will open and read the path given in the first argument to on the command-line. The file is parsed, and the parse-tree is pretty-printed.-t readline
(--template=readline
): Generates a parser function that will call use the GNU Readline Library to read a line of text from the terminal. The function parses the line and returns the parse-tree. Remember to link the resultant program with-lreadline
. Compile the generated function with-D ZEBU_DEBUG
to insert code to pretty-print the parse-tree.-t readline-with-driver
(--template=readline-with-driver
): Generates a program that will repeatedly invoke the GNU Readline Library to read lines from the terminal. Each line is parsed, and pretty-printed. Remember to link the resultant program with-lreadline
.
TODO
{{PREFIX}}
: TODO{{SHIFT_TABLE}}
: TODO{{REDUCE_TABLE}}
: TODO{{GOTO_TABLE}}
: TODO{{LEXER_TABLE}}
: TODO{{LEXER_STARTS_TABLE}}
: TODO{{LEXER_ACCEPTS_TABLE}}
: TODO{{LEXER_EOF_TABLE}}
: TODO{{PARSE_TREE_STRUCTS}}
: TODO{{PARSE_TREE_PRINT_TREE_FUNCTIONS}}
: TODO{{PARSE_TREE_INC_FUNCTIONS}}
: TODO{{PARSE_TREE_FREE_FUNCTIONS}}
: TODO{{REDUCTION_RULE_SWITCH}}
: TODO{{START_GRAMMAR_ID}}
: TODO{{TOKEN_IDS_TO_SETS}}
: Used for debugging.
If you are a developer of zebu the three build flags below may help you. They can be set as environment variables or as arguments to the make
command.
buildtype=debug
: 'Debug' builds insert print statements whenever entering or exiting functions marked with theENTER
orEXIT
macros, and print values for debugging with thedpv()
macro. All Source files also include all (relevant) system headers and utility functions within zebu.buildtype=test
: 'Test' builds remove the debug print-statements described above, but will still keep the extra includes. This build type is intended for when a new feature has been debugged and needs to be tested before its implementation is set in stone.buildtype=release
: 'Release' builds remove anything extra: no debug statements, no extra includes. Compiler flags are added for optimizations and static linking. This is the default setting forbuildtype
.verbose=on
: Inserts the extra functions to make the-v
(--verbose
) command-line flag function, otherwise, the flag is ignored. To disable this option, useverbose=off
.dotout=off
: This flag enables the generation of Graphviz'sdot
output for every automata creation or modification. This feature is intended to aid in tracking down bugs. Files are put into a "dot/
" directory; this directory is created if it does not exist. To enable this option, usedotout=on
.
TODO
TODO
- Way of articulating in the input file how to handle shift-reduce errors, perhaps something not unlike the way GNU Bison or Yacc deals with them: (Tokens having a "precedence", if the last token of a production rule has a higher precedence than the look-ahead token: reduce, otherwise shift.)
- Something to let you insert custom C code into the tokenizer to change which token the tokenizer reports detecting, an application for this would be to get Zebu to differentiate between C variable usage and a
typedef
-ed type usage.