Replies: 1 comment
A common question is how to avoid clashes between token names and identifier names. A Lex, Flex, or RE/flex lexer always picks the rule with the longest match when several rules match. Identifiers can therefore contain keywords as part of their name without causing a problem.

If more than one rule matches input of the same length, the lexer picks the first of those rules in the specification. So make sure the identifier rule comes after the keyword rules in the lexer specification.

Do not use a lazy repeat.

If you use a Perl-regex-based lexer such as PCRE2 instead of the default RE/flex regex engine, the above no longer holds. A Perl-regex-based lexer always takes the first rule that matches, not the longest match, which may cause clashes between keywords and identifiers. For example, the identifier ...

Another approach is to match identifier names with a single regex pattern and then distinguish keywords afterwards by looking them up in a table. This approach is used in the mini C compiler examples/minic.l that has a function ...
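To make the longest-match rule and the first-rule tie-break concrete, here is a toy C++ simulation. It is only a sketch of the selection policy, not how RE/flex works internally, and every name in it (Kind, Tok, match_literal, match_identifier, tokenize) is made up for illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds, listed in "specification order":
// keyword rules first, identifier rule last.
enum class Kind { KwVoid, KwIn, Identifier };

struct Tok {
    Kind kind;
    std::string text;
};

// Length of a literal keyword match at pos, or 0 if it does not match.
static std::size_t match_literal(const std::string& s, std::size_t pos,
                                 const std::string& lit) {
    return s.compare(pos, lit.size(), lit) == 0 ? lit.size() : 0;
}

// Length of a greedy [A-Za-z_][A-Za-z0-9_]* match at pos, or 0.
static std::size_t match_identifier(const std::string& s, std::size_t pos) {
    auto is_start = [](unsigned char c) { return std::isalpha(c) || c == '_'; };
    auto is_part = [](unsigned char c) { return std::isalnum(c) || c == '_'; };
    if (pos >= s.size() || !is_start(s[pos])) return 0;
    std::size_t end = pos + 1;
    while (end < s.size() && is_part(s[end])) ++end;
    return end - pos;
}

std::vector<Tok> tokenize(const std::string& s) {
    std::vector<Tok> out;
    std::size_t pos = 0;
    while (pos < s.size()) {
        // Try every rule at the current position.
        const std::size_t len[] = {
            match_literal(s, pos, "void"),
            match_literal(s, pos, "in"),
            match_identifier(s, pos),
        };
        const Kind kinds[] = {Kind::KwVoid, Kind::KwIn, Kind::Identifier};
        // Pick the longest match; the strict '>' keeps the earliest rule on ties.
        std::size_t best = 0;
        for (std::size_t i = 1; i < 3; ++i)
            if (len[i] > len[best]) best = i;
        if (len[best] == 0) {  // no rule matches: skip one character
            ++pos;
            continue;
        }
        out.push_back({kinds[best], s.substr(pos, len[best])});
        pos += len[best];
    }
    return out;
}
```

With this policy, tokenize("void main_loop()") produces a keyword token for void followed by a single identifier token for main_loop. The in inside main never wins, because the identifier rule matches more characters at that position.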
This is a pretty newbie-ish question, but I'm not very experienced with generated lexers, so... :)
I have a token enum (defined in my lexer class for convenience) that has keyword tokens. Then in my rules I have:
The issue I'm facing is one that I think is pretty common with regex systems like this: when I tokenize (say) void main_loop(), I get the kw_void token, then a couple of identifier tokens, and then a kw_in token, which is obviously wrong. (I don't show all the keywords here, but they're all defined similarly.) I then have my identifier rule defined after the keywords, to try to match keywords before identifiers without everything becoming an identifier:
I don't know if this rule is right or if I should use greedy matching here, so I'm curious what the correct way of doing this is: how do I match keywords on their own but identifiers otherwise? The language is indent-unaware, so I've used option noindent. (I know one way is to do the matching in code and not in the generator part, but I'm wondering if I can do it in the generated lexer or if it would be more efficient to just write a LUT or something.)
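For reference, the "do the matching in code" alternative mentioned above could look roughly like the sketch below: a single identifier rule matches the whole word, and its action calls a lookup function to decide whether the word is a keyword. The Token enum, the keyword list, and the classify function are hypothetical names chosen for illustration; they are not taken from RE/flex or from any existing lexer.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical token kinds; a real lexer would have many more.
enum class Token { Identifier, KwVoid, KwIn, KwInt };

// Classify a lexeme that a single identifier rule has already matched.
// Anything not found in the keyword table is an ordinary identifier.
Token classify(const std::string& lexeme) {
    static const std::unordered_map<std::string, Token> keywords = {
        {"void", Token::KwVoid},
        {"in", Token::KwIn},
        {"int", Token::KwInt},
    };
    auto it = keywords.find(lexeme);
    return it != keywords.end() ? it->second : Token::Identifier;
}
```

Because the lookup only sees complete lexemes, a name like main_loop is looked up as a whole and comes back as an identifier. The trade-off is one hash lookup per identifier in exchange for a much smaller set of lexer rules.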