-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TypeScript] Add methods for BufferedTokenStream #4569
base: dev
Are you sure you want to change the base?
Conversation
The typescript class was almost completely empty, so it was not possible to directly use the token stream in a type-safe way. Signed-off-by: Phlosioneer <mattmdrr2@gmail.com>
nextTokenOnChannel(i: number, channel: number): number; | ||
previousTokenOnChannel(i: number, channel: number): number; | ||
|
||
protected lazyInit(): void; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need these protected members ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
adjustSeekIndex
is required for subclassing BufferedTokenStream
, and it only exists for subclasses to use. See CommonTokenStream::adjustSeekIndex
for an example.
lazyInit
is needed if the subclass tries to access any of the fields/properties safely, possibly before one of the BufferedTokenStream methods is called. Technically more protected
functions are available, like setup
, but these two were the minimum required to correctly subclass the stream. Technically all the fields should be marked protected
but that felt too restrictive, because the missing underscore convention (_index
, _tokens
) indicates the fields are public in JS.
I've actually made a mistake, there should be an additional index: number
line. I'll push that to the PR branch tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why do you need to subclass BufferedTokenStream
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I needed a token stream that did extra processing/rewriting, between the lexer and parser steps. It's a normal/supported thing to do, the Parser accepts any TokenStream as input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that normally done by a TokenStreamRewriter, designed just for that purpose ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TokenStreamRewriter is made with a C-style preprocessor in mind. Its output is text after rewrites have happened. It's not able to e.g. transform tokens, merge them together or split them apart, add context to tokens (technically subclass, but JS makes that very blurry), and generally fails to act as a stream transformer between the lexer and parser. You would need to run a file through a lexer, then the token stream rewriter, then through a different lexer, then through a parser. I'm not dealing with a preprocessor, I need a token stream transformer that outputs a stream of tokens.
I think this is a supported use case; otherwise there would be no need for a distinction between CommonTokenStream and BufferedTokenStream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To ground this discussion, we're discussing two protected
annotations in a typescript definition file as a companion to the already-implemented, already-publicly-accessible javascript functions. The ship has already sailed about whether these functions were a "good" interface or if the buffer is a "good" extension point. These functions are already part of the public API on every other language runtime except typescript.
Also, it should be noted that other runtimes put lazyInit
as protected
, never private
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't take me wrong, I'm just trying to understand your use case, notably in the context of the antlr5 rewrite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I didn't know antlr was getting another rewrite. I'll detail my use case, perhaps it will help. However, yesterday I rewrote the whole parser & lexer from scratch and achieved much better results than antlr was able to provide. This is for a personal project. Read only if you're interested, nothing here directly bears on the current protected member discussion.
The DSL I made has identifiers / names of the form foo.bar.baz
, dot-separated paths. It also allows a shorthand starting with a dot, .bar.baz
, that meant the identifier was relative to some other identifier. There are block statements like with foo { ... .bar.baz ... }
that should result in foo.bar.baz
at the end of parsing.
OK, all fairly normal so far. The issue is that with
blocks could be anywhere in the file. Inside any block or at the top level or nested within other with
statements. I tried to spec this out in the parser's .g4 but my simple ~60 line parser ballooned into a few hundred lines, to account for all the possible places where with
could happen; and the parser-listener became similarly complex. And that was still with a subset of the intended DSL.
What I really needed was something like with_generic<inner_parser>: 'with' (IDENT | REL_IDENT) '{' inner_parser* '}';
but that didn't seem possible. So I tried to move the handling of with
into its own parser / transformer. The rule is extremely simple if stated as its own grammar; it's nearly find-and-replace. I tried to use the provided stream-rewriter, but the translation back to text destroyed metadata that I was tracking in the tokens, particularly the IDENT / REL_IDENT tokens, about the path they represent. I can see the utility of that rewriter for large complex transformations, but this task is extremely simple.
So I wrote a class that took in a token stream, located with
blocks, and emitted corrected IDENT
tokens. No new tokens were added to the stream, so the line, column, and character indexes of each token were still valid. The text was also still valid; the relative-to information was stored as a separate field. But then I ran into the above issue, where the Parser would not accept just any input stream, only a buffered input stream; and the buffered input stream wouldn't accept an arbitrary token stream, just a Lexer. So I made adjustments until it worked.
The DSL is something I made and could adjust, but it's being used as a game/puzzle element directly, so I couldn't just remove or dramatically change how the with
blocks worked. What eventually drove me to rewrite the lexer & parser was frustration over the lexer not tokenizing some IDENT tokens (and not understanding how to properly handle error messages). Turns out, the grammar was actually ambiguous to lex, but I only understood that after rewriting. The lexer and parser ended up being less code, though; the full power and flexibility of antlr wasn't needed. The DSL is LL(1) with finite depth, so the state machine is extremely simple. And with access to generic functions, the with
situation could be handled simply in the parser with a single rule-function (parseWith<T, U>(innerParser: (data: U) => T, data: U): void
) invoked within the generic block rule-function (parseBlock<T, U>(statementParser: (data: U) => T, data: U): void
which corresponds to antlr pseudocode parse_block<inner>: '{' (with_block<inner> | inner)* '}';
). The most complex thing I had to do was implementing a prefix tree for lexing keywords. I probably could have just done a hash map with string keys but it was a fun challenge.
I took a look at the antlr5 repo. I like the concept. Though webassembly seems to be splitting in two with the WASI stuff, even two very similar runtimes is far better than 10 unrelated ones. I'll poke around a bit, maybe try to get involved.
Signed-off-by: Phlosioneer <mattmdrr2@gmail.com>
BufferedTokenStream's input was changed from Lexer to TokenSource, but TokenSource was empty. BufferedTokenStream expects two functions to exist on any TokenSource, getSourceName and nextToken. Parser also expects a _factory field, though the TokenFactory type doesn't exist yet in typescript. It also isn't used for anything within antlr4's runtime, though it may be used by generated files? These three entries are a subset of the Java runtime's TokenSource interface. Signed-off-by: Phlosioneer <mattmdrr2@gmail.com>
I've added the |
@@ -1,3 +1,7 @@ | |||
export declare class TokenSource { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmm... not sure how this can work... the JS TokenSource does not implement these fields and methods.
I'd suggest making it a TS interface implemented by Lexer.
please bring this to a dedicated PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, will do. For this PR I'll put BufferedTokenStream back to "tokenSource: Lexer", that was type safe though limiting.
nextTokenOnChannel(i: number, channel: number): number; | ||
previousTokenOnChannel(i: number, channel: number): number; | ||
|
||
protected lazyInit(): void; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't take me wrong, I'm just trying to understand your use case, notably in the context of the antlr5 rewrite.
The typescript class was almost completely empty, so it was not possible to directly use the token stream in a type-safe way.