A question about matching keywords #210

ethindp · 2024-06-27T13:19:32Z

This is a pretty newbie-ish question, but I'm not very experienced with generated lexers, so... :)

I have a token enum (defined in my lexer class for convenience) that has keyword tokens. Then in my rules I have:

\s+ /* Do nothing, whitespace */
"//".*                          // ignore inline comment
"/*"(.|\n)*?"*/"  /* no action: ignore multiline comments */
"and" return Token::kw_and;
"abstract" return Token::kw_abstract;
"auto" return Token::kw_auto;
"bool" return Token::kw_bool;
"break" return Token::kw_break;
"case" return Token::kw_case;
...

The problem I'm facing is one that I think is pretty common with regex systems like this: when I tokenize (say) void main_loop(), I get the kw_void token, then a couple of identifier tokens, and then a kw_in token, which is obviously wrong. (I don't show all the keywords here, but they're all defined similarly.) My identifier rule is defined after the keywords, so that keywords are matched before identifiers and not everything becomes an identifier:

[A-Za-z_\p{UnicodeIdentifierStart}][0-9A-Za-z_\p{UnicodeIdentifierPart}]*? return Token::identifier;

I don't know if this rule is right or if I should use greedy matching here. What is the correct way to match keywords on their own but identifiers otherwise? The language is indent-unaware, so I've used option noindent. (I know one way is to do the matching in code and not in the generator part, but I'm wondering if I can do it in the generated lexer or if it would be more efficient to just write a LUT or something.)

genivia-inc · 2024-07-01T18:50:27Z

Maintainer

How to avoid clashes between keywords and identifier names is a common question. Lex, Flex, and RE/flex lexers always pick the rule with the longest match when several rules match. Therefore, identifiers can contain keywords as part of their name without causing a problem. If more than one rule matches input of the same length, the lexer picks the first of those rules in the specification. Therefore, make sure to put the identifier rule after the keyword rules in the lexer specification. Do not use a lazy repeat *? in the regex pattern to match identifiers. A lazy repeat at the end of a pattern matches as little as possible, so the identifier rule matches only one character at a time, which is why main is split into the identifiers m and a followed by the keyword in. Use a normal greedy * repeat instead.
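
To make the longest-match rule and the first-rule tie-break concrete, here is a minimal C++ sketch that simulates how such a lexer chooses between rules. It is not RE/flex code: the Rule struct and the match_identifier, match_rule, and tokenize functions are hypothetical names made up for illustration, and the identifier pattern is restricted to ASCII.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical rule description (not an RE/flex type): a token name plus
// a keyword literal. An empty literal stands for the identifier pattern.
struct Rule {
    std::string token;
    std::string literal;
};

// Length of a greedy match of [A-Za-z_][0-9A-Za-z_]* at pos (ASCII only),
// or 0 if there is no identifier at pos.
std::size_t match_identifier(const std::string& in, std::size_t pos) {
    if (pos >= in.size()) return 0;
    unsigned char c = static_cast<unsigned char>(in[pos]);
    if (!(std::isalpha(c) || c == '_')) return 0;
    std::size_t n = 1;
    while (pos + n < in.size()) {
        unsigned char d = static_cast<unsigned char>(in[pos + n]);
        if (!(std::isalnum(d) || d == '_')) break;
        ++n;
    }
    return n;
}

// Number of characters the rule matches at pos, or 0 for no match.
std::size_t match_rule(const Rule& r, const std::string& in, std::size_t pos) {
    if (r.literal.empty()) return match_identifier(in, pos);
    return in.compare(pos, r.literal.size(), r.literal) == 0 ? r.literal.size() : 0;
}

// Pick the longest match at each position; on equal length the earliest
// rule in the list wins. Unmatched characters are skipped.
std::vector<std::string> tokenize(const std::vector<Rule>& rules, const std::string& in) {
    std::vector<std::string> out;
    std::size_t pos = 0;
    while (pos < in.size()) {
        std::size_t best_len = 0;
        const Rule* best = nullptr;
        for (const Rule& r : rules) {
            std::size_t n = match_rule(r, in, pos);
            if (n > best_len) {  // strict '>' keeps the earlier rule on ties
                best_len = n;
                best = &r;
            }
        }
        if (best == nullptr) {  // whitespace, punctuation, etc.
            ++pos;
            continue;
        }
        out.push_back(best->token + ":" + in.substr(pos, best_len));
        pos += best_len;
    }
    return out;
}
```

With the rules kw_void, kw_in, kw_auto, and the identifier rule listed last, tokenize(rules, "void main_loop()") yields kw_void:void followed by identifier:main_loop. The keyword void wins the tie against the identifier rule because it is listed first, while main_loop is longer than any keyword match and so becomes a single identifier.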

If you use a Perl-regex-based lexer such as PCRE2 instead of the default RE/flex regex engine, then the above no longer holds true. A Perl-regex-based lexer always matches the first rule, not the longest, which may cause clashes between keywords and identifiers. For example, the identifier autoloop will match the keyword auto then identifier loop, which is not what we want. This can be resolved by defining keywords with word delimiters \< and \>, like in the pattern \<auto\> so that auto cannot be part of a name.
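
The first-match behavior can be reproduced outside a lexer generator. The sketch below uses std::regex with its default ECMAScript grammar as a stand-in for a Perl-style matcher, which is an assumption made for illustration: it is not PCRE2, and it has no \< and \> anchors, so the word boundary \b is used in their place. The function name first_match_tokens is hypothetical.

```cpp
#include <regex>
#include <string>
#include <vector>

// Tokenize input with one combined pattern whose first capture group is the
// keyword rule and whose second group is the identifier rule. Alternation is
// ordered here: the first alternative that matches wins, not the longest.
std::vector<std::string> first_match_tokens(const std::string& pattern,
                                            const std::string& input) {
    std::regex re(pattern);
    std::vector<std::string> out;
    for (std::sregex_iterator it(input.begin(), input.end(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        std::string prefix = m[1].matched ? "kw_auto:" : "identifier:";
        out.push_back(prefix + m.str());
    }
    return out;
}
```

Called with the pattern (auto)|([A-Za-z_][0-9A-Za-z_]*) on the input autoloop, this produces kw_auto:auto and then identifier:loop, which is the clash described above. With the keyword delimited as (\bauto\b)|([A-Za-z_][0-9A-Za-z_]*), the same input produces the single token identifier:autoloop.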

Another approach is to match identifier names as a regex pattern and then distinguish keywords later when looking them up in a table. This approach is used in the mini C compiler examples/minic.l that has a function ID() that calls keyword_token() to check if an identifier is a keyword to return a keyword token instead.
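
A table lookup of this kind might look roughly like the following sketch. It is not the code from examples/minic.l: the function name classify_word is hypothetical, and the Token enum only mirrors the few keyword names shown in the question. The idea is that the lexer has a single identifier rule whose action passes the matched text to the lookup.

```cpp
#include <string>
#include <unordered_map>

// Token names follow the question; only a handful of keywords are listed.
enum class Token { identifier, kw_and, kw_abstract, kw_auto, kw_bool, kw_break, kw_case };

// Hypothetical helper: return the keyword token for text if it is a keyword,
// otherwise Token::identifier.
Token classify_word(const std::string& text) {
    static const std::unordered_map<std::string, Token> keywords = {
        {"and", Token::kw_and},
        {"abstract", Token::kw_abstract},
        {"auto", Token::kw_auto},
        {"bool", Token::kw_bool},
        {"break", Token::kw_break},
        {"case", Token::kw_case},
    };
    auto it = keywords.find(text);
    return it != keywords.end() ? it->second : Token::identifier;
}
```

Because the whole word is matched before the lookup, autoloop is looked up as one string and comes back as an identifier, while auto comes back as kw_auto. This keeps the lexer specification down to one rule for words, at the cost of one hash lookup per identifier.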
