[FEATURE]: IR to Listen and Generate Code Comments #869
Comments
Alternatives for the representation:
|
python corner cases:
and, for example, IR optimizer rules for transforming them:
override def apply(expr: ir.Expression): ir.Expression = expr transformUp {
  // option A: the comment travels as an extra (optional) field on the node itself
  case ir.And(left, right, Some(text)) => py.Comment(methodOf(left, "and", Seq(right)), text)
  // and we'll probably have to change the invariants for ~200 nodes
  // option B: the comment is a wrapper node around the commented child
  case ir.And(Comment(left, text), right) => py.Comment(methodOf(left, "and", Seq(right)), text)
  case ir.And(left, Comment(right, text)) => py.Comment(methodOf(left, "and", Seq(right)), text)
  // option C: the node exposes an optional comment that the guard inspects
  case x @ ir.And(left, right) if x.comment.isDefined =>
    py.Comment(methodOf(left, "and", Seq(right)), x.comment.get)
  // and we'll probably have to change the invariants for ~200 nodes
}
flowchart TD
antlr --> lib
lib --> commonlex.g4
lib --> grammars
grammars --> snowflake
snowflake --> SnowflakeLexer.g4
snowflake --> SnowflakeParser.g4
snowflake --> basesnowflake.g4
SnowflakeLexer.g4 --> commonlex.g4
SnowflakeLexer.g4 --> basesnowflake.g4
grammars --> tsql
tsql --> TSqlLexer.g4
tsql --> TSqlParser.g4
TSqlLexer.g4 --> commonlex.g4
TSqlLexer.g4 --> basetsql.g4
tsql --> basetsql.g4
Following a number of tentative PRs and discussions, I am proposing to discuss a lightweight spec here (as opposed to a full solution architecture spec, which would be overkill for the discussion we need).
Per my various experiments (that work), the following topics need to be discussed. I'll go into details in separate comments.
1 - Origin is not fit for purpose
The definition of Origin is as follows:
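For reference, a Catalyst-style Origin looks roughly like this (a sketch; the exact field set in our intermediate package may differ):
case class Origin(
    line: Option[Int] = None,          // start line only
    startPosition: Option[Int] = None) // start column only - no stop line or offset
In other words, it records only where a node starts, not the span it covers.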
This structure is not accurate enough:
Moreover, due to backwards-compatibility concerns, we can't refactor the origin field. Instead, I propose to introduce the following:
and a
StartLine is the human interpretation of how many \n there are. I think Origin is fit for purpose, information-wise, for resolving which IR node is deemed closest to a collection of comments. Though we can extract comments from the token stream, I think we could just build a collection of them directly in the common lexer. We can talk about it next week.
2 - A parsing location needs to be attached during building of the IR nodes
The question arises as to how we attach such a location to the node. There are various options:
I prefer 1 because it augments Catalyst with a crucial capability. The lack of guarantees can be mitigated by a simple test that traverses the tree and checks that each node has a non-default parsing location.
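A minimal sketch of such a test, assuming a TreeNode-style foreach and an origin field with a known default value (names are illustrative):
def assertAllNodesHaveLocation(plan: ir.Expression): Unit =
  plan foreach { node =>
    // fail if any node still carries the default (i.e. unset) parsing location
    assert(node.origin != Origin(), s"node $node has no parsing location")
  }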
Maybe something like this is more practical to implement: gather a collection of all the comments directly within lexcommon.g4, so they can then be skipped as tokens. To process comments:
Of course, this is not very Catalyst-like. But we can adorn the IR with Origin as per Valentin's idea as well, though I think at some point we need to resolve which comment is for which node anyway. I think the above is very simple at least.
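As a sketch of what collecting comments in the common lexer could look like: a lexer superclass intercepts comment tokens before they ever reach the token stream; the comments buffer and isCommentToken are assumptions, and the real type test would use the generated token constants.
import org.antlr.v4.runtime.{CharStream, Lexer, Token}
import scala.collection.mutable.ListBuffer

abstract class CommentCollectingLexer(input: CharStream) extends Lexer(input) {
  // comments collected here never reach the token stream or the parser
  val comments: ListBuffer[Token] = ListBuffer.empty

  // implemented against the generated token type constants (e.g. LINE_COMMENT)
  protected def isCommentToken(t: Token): Boolean

  override def nextToken(): Token = {
    var t = super.nextToken()
    while (isCommentToken(t)) {
      comments += t       // remember the comment together with its position
      t = super.nextToken()
    }
    t
  }
}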
3 - Our simple parsing pipeline does not allow a second pass
Context: grammars focus on meaningful content. Whitespace and comments are parsed but sent to hidden channels, and are thus skipped during the matching of grammar rules. Without this, since comments could be literally anywhere in the source code, every grammar rule would have to allow for them explicitly. Our current parsing pipeline makes a single pass only, so the comments left on the hidden channels are never revisited.
PRs #1200 and #1201 provide a simpler yet more flexible approach. We need to opine on this.
But I am saying we pick them up in the common lexer then skip them. They won't even be in the token stream.
But now we do not need two passes. Even if we did, we can modify the generic plan parser to take care of that in a similar way to how generate does.
// Here we take the comment map built by the lexer and put it somewhere accessible
commentMap.load(lexer.comments)
We don't, and should not, need the token stream at that point. IR creation should not rely on the token stream, but should use the tokens it needs from the ParserContext. However, the Result can carry information from previous phases should we need to do that, and we can find the token stream should we need it.
No - that is not where the replacement converter will go; that is far too low level. There will be some sort of program-level interface defined (say YAML that says 'here is the input location', 'here is the output location'), and a different converter would probably be an entirely different process/engine.
We can insert more phases in PlanParser.
I do not agree with this. We are not solving a problem that we have right now and at the moment I cannot see why we would need something like a token stream rewriter.
We do have a problem to solve. The fact that the provided solution also solves a future problem is a side effect, but things can't stay as they are.
4 - Attaching comments to nodes can be done in various ways
Once comments are discovered, and the node to which they should be attached is located, we need to technically attach them to that node. They need to be attached either before (for example, line comments that appear before a statement) or after (for example, line comments that appear after some code on the same line). There are at least 3 ways comments could be attached to nodes:
A - using
B - adding comments as
C - wrapping nodes in
Whatever we decide, nodes are supposedly immutable, so this is not a trivial task. We can either accept that it's OK to have mutable nodes during parsing, or accept that we need to run a full transformation on the entire tree, which affects performance.
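As a sketch of the full-transformation route (ir.CommentNode, the origin line accessor and the line-keyed comment map are all hypothetical here):
def attachComments(expr: ir.Expression, byLine: Map[Int, Seq[String]]): ir.Expression =
  expr transformUp {
    // wrap any node whose start line has pending comments; a real version would
    // also mark those comments as consumed so each one is attached only once
    case node if node.origin.line.exists(byLine.contains) =>
      ir.CommentNode(byLine(node.origin.line.get), node)
  }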
Overall we are extremely constrained by the software architecture decision to use Catalyst for our IR, and by the informal goal of contributing back to it. We could at minimum consider alternative options:
Is there an existing issue for this?
Category of feature request
Transpile
Problem statement
Currently the ANTLR parser moves all comments onto the hidden channel. We need to map them to concrete IR and introduce them at the appropriate place during code generation.
Proposed Solution
Resolving comments back into the output in a sensible manner is not a trivial task, and first we must establish some principles about the source and target languages.
Goal - Generate comments in the translated code and preserve their context wherever possible.
In our case the source and target languages can be considered equivalent, as the comments have common syntax, so there are no complications caused by the need to resolve comment structure; such complications would arise, for instance, if the source allowed multi-line in-syntax comments (/* xxx */) but the target only allowed line comments (-- xxx). Additionally, as we will pretty-print the resulting output, we can ignore trying to preserve the number of spaces and so on that precede or follow a particular token. So long as the correct comment style is placed in the correct place, the formatter will make it all look good.
Comment style
Using the Databricks documentation for some examples, and adding our own, we can create a sample input that challenges our system such that if the test input is translated correctly, then all input should be translated correctly. The starting point for such a test is in the Additional context section. Note that this is a starting point and has not been thought through exhaustively.
If we study the test we can establish some principles around comment association, knowing that, as we are not actually reading the comments, we may guess the associativity incorrectly; but by using these principles, we should get the intended output anyway.
The principles are therefore:
These principles will need to be tested in practice - these kinds of things always have strange edge-cases, where the source author thought they were being mighty clever.
Procedure
We have talked about tokens, but in fact our IR does not preserve tokens, so we need to somehow match our IR nodes to comments. We could, when building the IR, record each node we produce in a lookup table. However, as all nodes derive from TreeNode (or should), we can in fact adorn the nodes with the Origin information, although Origin may need to be extended to include a stop line and offset - at first glance it may or may not already have enough information to create a span. This means that any IR node we produce must be adorned with the origin information for a token, or a start/stop token pair, which is slog work but easily achieved. (NB: Don't couple the TreeNodes with ANTLR.)
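One way to keep ANTLR out of the TreeNodes is to translate token positions into our own origin type at build time. A sketch (the stop fields are the proposed extension, and the CurrentOrigin-style adornment is assumed to be available in our vendored tree code):
import org.antlr.v4.runtime.ParserRuleContext

// copy start/stop token positions into the IR's own origin type, so that
// TreeNode itself never references any ANTLR classes
def originOf(ctx: ParserRuleContext): Origin =
  Origin(
    line          = Some(ctx.getStart.getLine),
    startPosition = Some(ctx.getStart.getCharPositionInLine),
    stopLine      = Some(ctx.getStop.getLine),               // proposed extension
    stopPosition  = Some(ctx.getStop.getCharPositionInLine)) // proposed extension

// in a visitor, adorning a freshly built node could then follow the Catalyst pattern:
// CurrentOrigin.withOrigin(originOf(ctx)) { buildAnd(ctx) }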
Comments are stored out-of-band by the lexers, on the HIDDEN channel, which is channel 1. Hence they are not consumed by the parser, but are easily scanned by anything else with a reference to the token stream, by resetting it after parsing and then asking for all the tokens on the HIDDEN channel. While traversing this HIDDEN channel (or maybe a custom one for comments), we can build a comment map for each input line (note there may be more than one comment on a line).
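A sketch of building that per-line map after parsing (the set of comment token types passed in is an assumption):
import org.antlr.v4.runtime.{CommonTokenStream, Token}
import scala.jdk.CollectionConverters._

def commentsByLine(tokens: CommonTokenStream, commentTypes: Set[Int]): Map[Int, Seq[String]] = {
  tokens.fill() // make sure the whole stream has been buffered
  tokens.getTokens.asScala
    .filter(t => t.getChannel == Token.HIDDEN_CHANNEL && commentTypes.contains(t.getType))
    .groupBy(_.getLine)                     // there may be more than one comment per line
    .view.mapValues(_.map(_.getText).toSeq)
    .toMap
}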
The approaches here are then either to walk the tree with the comment map and assign the comments as preceding comments to the nearest node, or to keep the map and look comments up at code generation. In the first case the codegen just produces any comments before it produces its own text. We would have:
If we use the map and do not adorn the nodes, then at code generation every production needs something like:
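A sketch, where GeneratorContext, commentsBefore and the origin line accessor are all hypothetical names:
// a helper each producer could call: emit any comments recorded before this
// node's line, then the node's own generated text
def withLeadingComments(ctx: GeneratorContext, node: ir.Expression)(body: => String): String = {
  val leading = ctx.commentsBefore(node.origin.line.getOrElse(0))
  leading.map(c => s"$c\n").mkString + body
}

// usage inside a production:
// withLeadingComments(ctx, node) { generateAnd(ctx, node) }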
Assuming we used the generator context to store the comment map. Though of course the generate method itself could also do this, rather than each producer.
We should be careful to use any comment only once. So in the above example, maybe leading comments are left to the first element of the batch. The pattern should fall out from there. Again, note that this test is about the approach and may not cover all cases.
Adorning the TreeNodes seems the better approach.
Additional Context
Starting point for comment preservation test
Note that most dialects allow nested comments; our lexers may need to be adjusted.
Non-functional requirements:
Changes to the com.databricks.labs.remorph.intermediate package should be avoided if possible, and if we need to make a change to any structures in this package, we have to provide different alternatives.