feat: initial version of the textplan to binary plan parser #36
Conversation
EpsilonPrime commented Mar 17, 2023
- Includes ANTLR4 tool to compile the grammar.
- Computes anchor references for extension spaces and functions.
- Updates the data stored for extension spaces and functions in the symbol table to align parsers and converters.
Force-pushed from 8f54173 to 482d6a9
Ok, all of the PRs have been rebased. This one took a while because I had to make sure the generated parser files were integrated into the new clang-tidy checks.
LGTM. Seeing that you have a lot of commits related to formatting, I suggest using the make format or make tidy-fix commands to repair it automatically. Did you encounter any problems?
I did try the repair commands but encountered some issues related to my setup. When I have time I will look into why it didn't work for me.
Force-pushed from e850730 to e6db23f
std::any SubstraitPlanVisitor::visitOperation(
    SubstraitPlanParser::OperationContext* ctx) {
  // TODO -- Implement this in a second visitor as described below.
Same here?
This section, like most of the file, is not yet implemented. This comment is merely a reminder about how I plan to solve the issue. Adding support for functions and relations is part of the third PR relating to the parser technology, which I'm starting on next week. Since GitHub doesn't have issue dependencies, could we just consider this one to be part of #10?
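For context, a loose sketch of the two-pass visitor idea described above; every name in it is hypothetical and only illustrates the shape of the plan, not code in this PR:

#include <any>

// Pass one (hypothetical): walks the tree and registers extension spaces,
// functions, and relations in the symbol table so that forward references
// can be resolved later.
class SymbolCollectingVisitor {
 public:
  std::any visitOperation(/* SubstraitPlanParser::OperationContext* ctx */);
};

// Pass two (hypothetical): runs after the symbol table is complete, so
// visitOperation can resolve functions and relations declared later in
// the file.
class OperationResolvingVisitor {
 public:
  std::any visitOperation(/* SubstraitPlanParser::OperationContext* ctx */);
};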
void readText(const char* filename) {
  auto stream = io::substrait::textplan::loadTextFile(filename);
  if (!stream.has_value()) {
    std::cerr << "An error occurred while reading: " << filename << std::endl;
For future reference, should we introduce a macro or error logging mechanism? Definitely not for this PR, but for future work. Any thoughts?
The parser and converters have the capability to handle multiple errors along with their source locations. This particular file is a special case, similar only to the converter's Tool.cpp. Eventually these two will be merged and there will be only one place handling errors like this (everything else is going to rely on the caller to deal with the list of error messages).
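A minimal sketch of what that shared error list might look like; the type and method names below are assumptions for illustration, not the actual implementation:

#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: accumulates errors with their source locations so a
// caller can report them all at once instead of aborting on the first one.
struct ErrorMessage {
  int line;    // 1-based line in the text plan
  int column;  // 0-based position within the line
  std::string message;
};

class ErrorCollector {
 public:
  void addError(int line, int column, std::string message) {
    errors_.push_back({line, column, std::move(message)});
  }
  bool hasErrors() const { return !errors_.empty(); }
  const std::vector<ErrorMessage>& errors() const { return errors_; }

 private:
  std::vector<ErrorMessage> errors_;
};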
One question: why not consider using SQL as input to generate the Substrait plan, like substrait-java does with the help of Apache Calcite? For example, DuckDB has the ability to generate either a logical plan or a physical plan from a single SQL statement.
The purpose of the text plan format is to be a nearly identical form of the binary plan, just in a more readable form. While a SQL converter would be useful, that form would not necessarily have a 1:1 conversion with the binary plan format.
Finally managed to get a good look at this. I'll have to ponder the grammar some more as we go. I have an idea about relations that I will try to flesh out further soon.
@@ -0,0 +1,31 @@
# SPDX-License-Identifier: Apache-2.0

add_test_case(
These tests appear to be ending up in <BUILD_DIR>/src/debug/xyz_test instead of <BUILD_DIR>/debug/xyz_test.
Also, this test fails if I run it directly (as opposed to running it from ctest). Maybe that is ok? But it would be nice if it would work. It's easier to debug that way.
If you've built the binary with CMake, the test files should have been copied into the data directory as a post-build rule. Since the test data path is relative to the test binary, merely running the test from the directory it resides in should work. The only way I can think of to get the test to run from any directory would be to give it a path to somewhere on the system to get its data, which we could do by generating a C++ file with that path in it to refer to at runtime.
I put the CMake fixes into #49.
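To illustrate the generated-file idea from the comment above (everything below is hypothetical; the variable and constant names are made up), CMake could expand a small template at configure time and the test would read the baked-in path instead of relying on the CWD:

// test_data_path.h.in -- hypothetical template that CMake's configure_file
// command would expand at configure time; TEXTPLAN_TEST_DATA_DIR is a
// made-up variable name.
#pragma once
#include <string>

constexpr const char* kTestDataDir = "@TEXTPLAN_TEST_DATA_DIR@";

// Usage in a test: resolve data files against the generated constant so the
// test works regardless of the current working directory.
inline std::string testFilePath(const std::string& name) {
  return std::string(kTestDataDir) + "/" + name;
}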
cmake_path(GET CMAKE_CURRENT_SOURCE_DIR PARENT_PATH TEXTPLAN_SOURCE_DIR)

add_custom_command(
I'm a little confused (due to my own cmake ignorance). What is this doing? Are you adding a new custom command or are you extending the existing test with a post-build step?
This is supposed to modify the existing build process to copy the test data into the runtime output directory before testing happens. This is what makes the tests with external data work in ctest. It's also the same pattern we used for the converter (not that it's right, just letting you know about scope).
  return cases;
}

TEST(TextPlanParser, LoadFromFile) {
This is the test that fails unless run from ctest. Probably because it can't always locate data/provided_sample1.splan, as that path depends on the CWD.
This is being handled in CMake by copying the data files into a test-relative directory. Passing the test data directory as an argument could be an alternative, but that would both require CMake wrangling and would not work if the test was not run with CMake.
class TextPlanParserTestFixture : public ::testing::TestWithParam<TestCase> {};

std::vector<TestCase> getTestCases() {
Not for this PR but long term I think it would be good to start having some tests that go the full round trip:
text -> symbols -> proto -> symbols -> text
Text-to-proto tests are implemented two PRs from now. There are no full roundtrip tests yet, but I was planning for them to be more like an integration test that would be run on full plans.
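For the record, a rough sketch of what such a roundtrip test might look like once those pieces exist; the entry points below are placeholders, since the real ones land in later PRs:

#include <gtest/gtest.h>
#include <string>

// Hypothetical entry points standing in for functionality from later PRs;
// the names and signatures are assumptions for illustration only.
std::string readTestFile(const std::string& path);
std::string parseToBinaryPlan(const std::string& text);   // text -> symbols -> proto
std::string convertToTextPlan(const std::string& proto);  // proto -> symbols -> text

TEST(TextPlanRoundTrip, FullPlanSurvivesRoundTrip) {
  const std::string original = readTestFile("data/provided_sample1.splan");
  const std::string proto = parseToBinaryPlan(original);
  const std::string regenerated = convertToTextPlan(proto);
  // A roundtrip through the binary form should reproduce the original text.
  EXPECT_EQ(original, regenerated);
}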
relation_detail
    : COMMON SEMICOLON # relationCommon
    | BASE_SCHEMA id SEMICOLON # relationUsesSchema
    | FILTER operation SEMICOLON # relationFilter
    | PROJECTION SEMICOLON # relationProjection
    | EXPRESSION operation SEMICOLON # relationExpression
    | ADVANCED_EXTENSION SEMICOLON # relationAdvancedExtension
    | source_reference SEMICOLON # relationSourceReference
    ;
This is the most concerning part to me, I think. Most parts of the spec are going to be fairly consistent / limited. However, there might be hundreds of relations once we start to really invest in adding physical relations. While I'm sure we can update this file as we go, I worry about the mental overhead of knowing all these rules when authoring files. Is there any way to make this more consistent? I'll try to give a better example of what I'm thinking of later.
I have a table of all of the kinds of operations that can be done in relations. The terms aren't consistently used, but I've started merging the similar ones. For instance, in two PRs you'll see BESTEFFORT and POSTJOIN qualifiers for filters, which consolidate the syntax for read and join relations. I've also decided to use FILTER instead of CONDITION so the filter clause works for filter relations as well. The only new clause required for the existing physical relations (hash and merge join) would be a way of specifying the left keys and the right keys. There are typically one or two new clauses required for each new relation. (The only scary relations are the currently inaccessible write relations, which would require 12 new clauses, but fortunately all three share the same new clauses.)
How do you feel about something like this (please forgive my horrible ANTLR abilities):
relation_key_value
: id COLON relation_value
relation_value
: COMMON # relationCommon (ideally this is part of relation_detail and not relation_obj but I can't ANTLR it out at the moment)
| SCHEMA id # Schema reference
| RELATION id # Input relation reference
| EXPRESSION operation # An expression
| FUNCTION id # A function
| LITERAL constant # A basic configurable property
| [ relation_value* ] # Er...joined with COMMA, however you say that
| { relation_key_value* }
;
relation_detail
: relation_key_value* # I lost the SEMICOLON somewhere but we can put it back if desired
In other words, we end up with something very similar to JSON. However, instead of our base types being string, number, boolean, and null (all of which can be expressed with LITERAL), our base types also include schema, input, expression, and function.
Examples:
project relation project1 {
input: relation scan
expressions: [
expression o_orderkey,
expression o_custkey
]
}
join relation join1 {
left: relation project1
right: relation project2
expression: expression equal(o_customerid, c_id)
post_join_filter: expression equal(c_id, 7)
type: literal "JOIN_TYPE_INNER" # Well, it's an enum, but using string for now, we can figure out enums
}
custom_extension_relation relation cer1 {
left: relation project1
middle: relation project2
right: relation project3
fizziness: literal 73.4
approach: expression gt(inp1, 12)
}
(feel free to add back in semicolons or get rid of the colons or tweak however. I'm just trying to communicate the broader idea)
Pros
- Can represent any relation without advanced knowledge (this is the primary motivation)
- Naturally provides for advanced_extension, etc.
Cons
- Symbol table is more "abstract" (e.g. there won't be a "project" object in the symbol table anywhere. However, I think something like this could still exist for the common logical types, it would just be a separate representation)
- Serializing to protobuf will be tricky (will require reflection and use of descriptors. I'm only "mostly sure" this is possible)
- Slightly more verbose
For closure, @EpsilonPrime and I spoke about this offline. It's an interesting idea and may be something we need for extensions. However, sticking with the current approach for the logical / popular relations makes a lot of sense.
Force-pushed from 32f7ae3 to 1fbbfef
- Includes ANTLR4 tool to compile the grammar.
- Computes anchor references for extension spaces and functions.
- Updates the data stored for extension spaces and functions in the symbol table to align parsers and converters.
- Standardizes the format of the internal representation of pipelines between the converter and the parser.
Force-pushed from 6415be1 to 9a387f3
With the latest converter update there was some shared code that needed updating (since this PR was also modifying the in-memory representation). That update has happened and the PR has been rebased onto main.
Just one additional comment (finally managed to express the idea that had been percolating) but it is a big one 😆
…e instead of at the end of the type.
It's been a while and I haven't been able to get to this, so I apologize. I'm used to working on a project with many contributors, and I think my practices need to be different as these projects get up to speed. I'm going to start focusing reviews more on making sure I understand the high-level concept, the direction, the design, and that the unit testing seems sufficient. I'm probably not going to dig too much into style (we have checks for that now) or subtle bugs (these will be detected with time, and it's probably not the highest priority at the moment to find them). Let's move forward and merge this PR. I'd like to align on rules for …
…t-io#36)
- Includes ANTLR4 tool to compile the grammar.
- Computes anchor references for extension spaces and functions.
- Updates the data stored for extension spaces and functions in the symbol table to align parsers and converters.