Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TypeScript] Add methods for BufferedTokenStream #4569

Open
wants to merge 3 commits into
base: dev
Choose a base branch
from

Conversation

Phlosioneer
Copy link
Contributor

The typescript class was almost completely empty, so it was not possible to directly use the token stream in a type-safe way.

The typescript class was almost completely empty, so it was not possible
to directly use the token stream in a type-safe way.

Signed-off-by: Phlosioneer <mattmdrr2@gmail.com>
nextTokenOnChannel(i: number, channel: number): number;
previousTokenOnChannel(i: number, channel: number): number;

protected lazyInit(): void;
Copy link
Contributor

@ericvergnaud ericvergnaud Apr 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need these protected members ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adjustSeekIndex is required for subclassing BufferedTokenStream, and it only exists for subclasses to use. See CommonTokenStream::adjustSeekIndex for an example.

lazyInit is needed if the subclass tries to access any of the fields/properties safely, possibly before one of the BufferedTokenStream methods is called. Technically more protected functions are available, like setup, but these two were the minimum required to correctly subclass the stream. Technically all the fields should be marked protected but that felt too restrictive, because the missing underscore convention (_index, _tokens) indicates the fields are public in JS.

I've actually made a mistake, there should be an additional index: number line. I'll push that to the PR branch tomorrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why do you need to subclass BufferedTokenStream ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I needed a token stream that did extra processing/rewriting, between the lexer and parser steps. It's a normal/supported thing to do, the Parser accepts any TokenStream as input.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that normally done by a TokenStreamRewriter, designed just for that purpose ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TokenStreamRewriter is made with a C-style preprocessor in mind. Its output is text after rewrites have happened. It's not able to e.g. transform tokens, merge them together or split them apart, add context to tokens (technically subclass, but JS makes that very blurry), and generally fails to act as a stream transformer between the lexer and parser. You would need to run a file through a lexer, then the token stream rewriter, then through a different lexer, then through a parser. I'm not dealing with a preprocessor, I need a token stream transformer that outputs a stream of tokens.

I think this is a supported use case; otherwise there would be no need for a distinction between CommonTokenStream and BufferedTokenStream.

Copy link
Contributor Author

@Phlosioneer Phlosioneer Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To ground this discussion, we're discussing two protected annotations in a typescript definition file as a companion to the already-implemented, already-publicly-accessible javascript functions. The ship has already sailed about whether these functions were a "good" interface or if the buffer is a "good" extension point. These functions are already part of the public API on every other language runtime except typescript.

Also, it should be noted that other runtimes put lazyInit as protected, never private.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't take me wrong, I'm just trying to understand your use case, notably in the context of the antlr5 rewrite.

Copy link
Contributor Author

@Phlosioneer Phlosioneer Apr 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't know antlr was getting another rewrite. I'll detail my use case, perhaps it will help. However, yesterday I rewrote the whole parser & lexer from scratch and achieved much better results than antlr was able to provide. This is for a personal project. Read only if you're interested, nothing here directly bears on the current protected member discussion.

The DSL I made has identifiers / names of the form foo.bar.baz, dot-separated paths. It also allows a shorthand starting with a dot, .bar.baz, that meant the identifier was relative to some other identifier. There are block statements like with foo { ... .bar.baz ... } that should result in foo.bar.baz at the end of parsing.

OK, all fairly normal so far. The issue is that with blocks could be anywhere in the file. Inside any block or at the top level or nested within other with statements. I tried to spec this out in the parser's .g4 but my simple ~60 line parser ballooned into a few hundred lines, to account for all the possible places where with could happen; and the parser-listener became similarly complex. And that was still with a subset of the intended DSL.

What I really needed was something like with_generic<inner_parser>: 'with' (IDENT | REL_IDENT) '{' inner_parser* '}'; but that didn't seem possible. So I tried to move the handling of with into its own parser / transformer. The rule is extremely simple if stated as its own grammar; it's nearly find-and-replace. I tried to use the provided stream-rewriter, but the translation back to text destroyed metadata that I was tracking in the tokens, particularly the IDENT / REL_IDENT tokens, about the path they represent. I can see the utility of that rewriter for large complex transformations, but this task is extremely simple.

So I wrote a class that took in a token stream, located with blocks, and emitted corrected IDENT tokens. No new tokens were added to the stream, so the line, column, and character indexes of each token were still valid. The text was also still valid; the relative-to information was stored as a separate field. But then I ran into the above issue, where the Parser would not accept just any input stream, only a buffered input stream; and the buffered input stream wouldn't accept an arbitrary token stream, just a Lexer. So I made adjustments until it worked.

The DSL is something I made and could adjust, but it's being used as a game/puzzle element directly, so I couldn't just remove or dramatically change how the with blocks worked. What eventually drove me to rewrite the lexer & parser was frustration over the lexer not tokenizing some IDENT tokens (and not understanding how to properly handle error messages). Turns out, the grammar was actually ambiguous to lex, but I only understood that after rewriting. The lexer and parser ended up being less code, though; the full power and flexibility of antlr wasn't needed. The DSL is LL(1) with finite depth, so the state machine is extremely simple. And with access to generic functions, the with situation could be handled simply in the parser with a single rule-function (parseWith<T, U>(innerParser: (data: U) => T, data: U): void) invoked within the generic block rule-function (parseBlock<T, U>(statementParser: (data: U) => T, data: U): void which corresponds to antlr pseudocode parse_block<inner>: '{' (with_block<inner> | inner)* '}';). The most complex thing I had to do was implementing a prefix tree for lexing keywords. I probably could have just done a hash map with string keys but it was a fun challenge.


I took a look at the antlr5 repo. I like the concept. Though webassembly seems to be splitting in two with the WASI stuff, even two very similar runtimes is far better than 10 unrelated ones. I'll poke around a bit, maybe try to get involved.

Signed-off-by: Phlosioneer <mattmdrr2@gmail.com>
BufferedTokenStream's input was changed from Lexer to TokenSource, but
TokenSource was empty. BufferedTokenStream expects two functions to exist
on any TokenSource, getSourceName and nextToken.

Parser also expects a _factory field, though the TokenFactory type doesn't
exist yet in typescript. It also isn't used for anything within antlr4's
runtime, though it may be used by generated files?

These three entries are a subset of the Java runtime's TokenSource interface.

Signed-off-by: Phlosioneer <mattmdrr2@gmail.com>
@Phlosioneer
Copy link
Contributor Author

I've added the index field, and also added a minimal interface for TokenSource. This is needed because I switched the input field's type from Lexer to TokenSource without checking whether TokenSource had the needed methods.

@@ -1,3 +1,7 @@
export declare class TokenSource {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm... not sure how this can work... the JS TokenSource does not implement these fields and methods.
I'd suggest making it a TS interface implemented by Lexer.
please bring this to a dedicated PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will do. For this PR I'll put BufferedTokenStream back to "tokenSource: Lexer", that was type safe though limiting.

nextTokenOnChannel(i: number, channel: number): number;
previousTokenOnChannel(i: number, channel: number): number;

protected lazyInit(): void;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't take me wrong, I'm just trying to understand your use case, notably in the context of the antlr5 rewrite.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants