Lexical analyzer generator.
./trans <filename.trans>
It generates two files, filename.h and filename.c.
filename.h declares a result type and three function prototypes:
typedef enum { LEX_ERROR = -1, LEX_SUCCESS = 0, LEX_EOF = 1 } lexer_res_t;
lexer_t* lexer_create(const char *filename);
lexer_res_t lexer_next_tok(lexer_t *lex, lexeme_t *m);
void lexer_free(lexer_t *lex);
If lexer_next_tok returns LEX_ERROR or LEX_EOF, the lexeme_t passed as m does not contain a valid value.
Usage example:
#include "filename.h"

int main() {
    lexer_t *lex = lexer_create("filename");
    if(!lex)
        return 1;
    for(;;) {
        lexeme_t m;
        lexer_res_t res = lexer_next_tok(lex, &m);
        if(res == LEX_ERROR) {
            // handle error; m is not valid here
        } else if(res == LEX_EOF) {
            break;
        } else {
            // valid lexeme value
        }
    }
    lexer_free(lex);
}
A .trans file describes the language for which the lexical analyzer is generated. It has the following structure:
[sectionname]
section content
[nextsection]
next section content
...
Every .trans file must contain two required sections:
- header — must contain the lexeme_t definition and all definitions related to it. This section is copied verbatim into the generated .h file.
- regexes — must contain the regular expressions that define the lexemes. The regular expression format is specified below.
Three further sections are optional:
- hinclude — headers to be included in the generated .h file.
- cinclude — headers to be included in the generated .c file.
- funcs — auxiliary functions used when parsing lexemes. This section is copied verbatim into the generated .c file.
The header section must contain a lexeme_t definition in the following format:
typedef struct {
    // required fields
    int class;
    char *str;
    size_t str_len;
    // optional fields
} lexeme_t;
Every lexeme_t definition must contain at least three fields:
- int class — lexeme class. For example: ID, IF, THEN, etc.
- char *str — string corresponding to the lexeme. To use this string after the lexeme has been parsed, allocate memory and copy the string into it.
- size_t str_len — length of string corresponding to lexeme.
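Keeping the lexeme string after parsing could look like the sketch below. The helper name lexeme_copy_str is hypothetical and not part of the generated API. The sketch also assumes that str may not be NUL-terminated, which is why it copies exactly str_len bytes and appends the terminator itself.

```c
#include <stdlib.h>
#include <string.h>

/* Minimal lexeme_t with only the three required fields. */
typedef struct {
    int class;
    char *str;
    size_t str_len;
} lexeme_t;

/* Hypothetical helper: returns a heap-allocated, NUL-terminated copy of
 * the lexeme string, or NULL if allocation fails.
 * The caller must free() the result. */
static char *lexeme_copy_str(const lexeme_t *m) {
    char *copy = malloc(m->str_len + 1);
    if (!copy)
        return NULL;
    memcpy(copy, m->str, m->str_len);  /* copy exactly str_len bytes */
    copy[m->str_len] = '\0';           /* terminate the copy ourselves */
    return copy;
}
```

A helper like this would typically be called in the valid-lexeme branch of the loop around lexer_next_tok, with the copy freed once it is no longer needed.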
The regexes section must contain at least one regular expression in the following format:
"regexp1" { /* some c code */ return integer_value_1; }
"regexp2" {
    /* some c code */
    return integer_value_2;
}
When the input matches a regular expression, the function associated with that expression is called. Each function is written to the generated .c file in the following format:
static int func(lexeme_t *lex) {
    // code written in .trans file
}
If the returned value is less than zero, it is treated as an error. If it equals zero, the matched string is treated as a delimiter and parsing continues. If it is greater than zero, it is used as the lexeme class.
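The three cases can be illustrated with a small sketch. The class constants TOK_ID and TOK_NUMBER, the function names, and the action_t type are all hypothetical; classify_result only mirrors the rule stated above and is not the generator's actual code.

```c
#include <stddef.h>

typedef struct {
    int class;
    char *str;
    size_t str_len;
} lexeme_t;

/* Hypothetical lexeme classes; any positive integers would do. */
enum { TOK_ID = 1, TOK_NUMBER = 2 };

/* Assumed shape of the functions generated for rules such as:
 *   "\w\w*" { return TOK_ID; }
 *   "\s"    { return 0; }
 */
static int on_identifier(lexeme_t *lex) { (void)lex; return TOK_ID; }
static int on_space(lexeme_t *lex)      { (void)lex; return 0; }

typedef enum { ACTION_ERROR, ACTION_SKIP, ACTION_EMIT } action_t;

/* Maps a rule function's return value to the behaviour described above. */
static action_t classify_result(int ret) {
    if (ret < 0)
        return ACTION_ERROR;  /* negative: error */
    if (ret == 0)
        return ACTION_SKIP;   /* zero: delimiter, keep parsing */
    return ACTION_EMIT;       /* positive: lexeme class */
}
```

Returning 0 from the whitespace rule is what lets delimiters be skipped without ever reaching the caller of lexer_next_tok.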
Supported special characters:
- * — zero or more.
- | — or.
- () — brackets.
- . — any character.
- \w — ascii letter. [A-Za-z].
- \W — non-letter character.
- \d — digit. [0-9].
- \D — non-digit character.
- \s — space, newline or tab character.
- \S — non-space character.
- \" — " character.
- \* — * character.
- \| — | character.
- \( — ( character.
- \) — ) character.
- \. — . character.
All other characters are treated as ordinary characters and match themselves.
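The character classes listed above can be read as the following predicate. This is a sketch of the documented semantics only, with a hypothetical function name; it is not taken from the generator's source. Note that, as documented, \w covers ASCII letters only, so digits and underscores do not match it.

```c
#include <stdbool.h>

/* Returns true if character c belongs to the class named by the letter
 * that follows the backslash (w, W, d, D, s, S).
 * Unknown class letters yield false. */
static bool escape_class_matches(char cls, char c) {
    bool is_letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    bool is_digit  = (c >= '0' && c <= '9');
    bool is_space  = (c == ' ' || c == '\n' || c == '\t');

    switch (cls) {
    case 'w': return is_letter;
    case 'W': return !is_letter;
    case 'd': return is_digit;
    case 'D': return !is_digit;
    case 's': return is_space;
    case 'S': return !is_space;
    default:  return false;
    }
}
```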