
Discussion About Separation Between Tokens and Parsers #36

Open
BenjaminHolland opened this issue Feb 8, 2021 · 2 comments

Comments

@BenjaminHolland

I've tried parser combinator libraries across multiple languages, and I've never seen the kind of hard distinction between tokens and parsers that this library has. Perhaps I just wasn't paying attention (this is actually the library I've been least frustrated with), but it's interesting and comes with a set of advantages and disadvantages. I'm interested in why this decision was made, and I'd also like feedback on my own understanding of the concepts. This might help you write good docs, or I'd be willing to write them and open a PR if my understanding is good enough. Feel free to close and ignore if neither of these discussions interests you.

The advantage is that it's very, very clear (after a bit of conceptual learning) what each piece of a grammar is for: tokens are specifically about character-sequence recognition, while parsers are about token-sequence recognition and mapping. Once you get the distinction, it's easy to write grammars.

I see two main disadvantages.

  1. Being forced to declare tokens separately from parsers feels redundant. Consider `val id by regexToken("\\w+") use { text }`. This creates a token and a parser, but only registers the parser. The workaround of `val idToken by regexToken(...)` followed by `val idParser by idToken use { text }` works, but feels very clunky.
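For point 1, a minimal sketch of the two-step declaration being described (the `IdGrammar` name and the package paths are my assumptions, based on better-parse's documented `Grammar`/`regexToken`/`use` API):

```kotlin
import com.github.h0tk3y.betterParse.combinators.use
import com.github.h0tk3y.betterParse.grammar.Grammar
import com.github.h0tk3y.betterParse.lexer.regexToken

object IdGrammar : Grammar<String>() {
    // Step 1: the token must be a `by`-delegated property so the
    // Grammar registers it with its tokenizer.
    val idToken by regexToken("\\w+")

    // Step 2: a separate parser that maps the token's matched text.
    // This is the clunky second declaration.
    val idParser by idToken use { text }

    override val rootParser by idParser
}
```

The redundancy is that `idToken` exists only to get registered; nothing else ever refers to it except `idParser`.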

  2. It's not very easy to combine grammars, or to reuse grammars as parsers inside a parent grammar, specifically because tokens are separate entities. Consider two grammars A and B. If I want a third grammar C that expresses "A or B", simply setting C's rootParser to that expression is insufficient, because C doesn't have the tokens defined in A and B; in fact, C has no tokens at all. This problem gets worse with more grammars and deeper nesting. It's also not clear from the docs how such a merge operation should work.
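A sketch of the composition problem in point 2 (the grammars are invented for illustration; the failure described in the comment is the behavior reported in this issue, not something I have verified against the library):

```kotlin
import com.github.h0tk3y.betterParse.combinators.or
import com.github.h0tk3y.betterParse.combinators.use
import com.github.h0tk3y.betterParse.grammar.Grammar
import com.github.h0tk3y.betterParse.lexer.regexToken

object A : Grammar<String>() {
    val word by regexToken("[a-z]+")
    override val rootParser by word use { text }
}

object B : Grammar<String>() {
    val num by regexToken("\\d+")
    override val rootParser by num use { text }
}

object C : Grammar<String>() {
    // This compiles, but C declares no tokens of its own, so its
    // tokenizer has nothing to match: the tokens live in A and B,
    // and C has no way to inherit them.
    override val rootParser by A.rootParser or B.rootParser
}
```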

Are these assessments fair? Am I missing something?
Thanks.

@pragmaticpandy

Thanks for bringing this up. I was stuck on this for a good half hour because my token declarations were anonymous; this issue revealed my problem. So while I agree with Benjamin's points, I'd also like to raise the related issue that the documentation didn't make it clear to me that registering tokens this way is required.
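To make the pitfall concrete, here is my understanding of what "anonymous" (unregistered) token declarations look like, sketched under the assumption that better-parse registers tokens through the `by` property delegate (grammar names invented):

```kotlin
import com.github.h0tk3y.betterParse.combinators.use
import com.github.h0tk3y.betterParse.grammar.Grammar
import com.github.h0tk3y.betterParse.lexer.regexToken

// Broken: plain assignment bypasses the `by` delegate, so the
// Grammar never adds this token to its tokenizer, and parsing
// fails at tokenization time.
object Broken : Grammar<String>() {
    val word = regexToken("\\w+")
    override val rootParser by word use { text }
}

// Fixed: `by` routes through the Grammar's delegate provider,
// which registers the token.
object Fixed : Grammar<String>() {
    val word by regexToken("\\w+")
    override val rootParser by word use { text }
}
```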

@kevinb9n

I've found this frequently vexing in my attempts to structure my code better.

At present I find myself resorting to this kind of thing:

    fun cacheLiteral(text: String, name: String) =
        map.computeIfAbsent(name to false) { literalToken(name, text) }
    fun cacheRegex(regex: Regex, name: String) =
        map.computeIfAbsent(name to true) { regexToken(name, regex) }

This feels weird, but... well, I had to store the tokens in something to pass them to DefaultTokenizer anyway, so....
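For completeness, a sketch of how a cache like this can feed the tokenizer. The `map` declaration is my guess at the elided backing store (keyed by `name to isRegex`, as in the snippet); `DefaultTokenizer` taking a list of tokens is from the better-parse API:

```kotlin
import com.github.h0tk3y.betterParse.lexer.DefaultTokenizer
import com.github.h0tk3y.betterParse.lexer.Token

// Assumed backing store for the cache functions above. A HashMap
// (rather than Kotlin's MutableMap interface) gives us
// computeIfAbsent.
val map = HashMap<Pair<String, Boolean>, Token>()

// After all tokens are cached, hand the full set to the tokenizer.
val tokenizer = DefaultTokenizer(map.values.toList())
```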
