
feat: initial version of the textplan to binary plan parser #36

Merged: 5 commits into substrait-io:main on Apr 27, 2023

Conversation

EpsilonPrime (Member)

  • Includes ANTLR4 tool to compile the grammar.
  • Computes anchor references for extension spaces and functions.
  • Updates the data stored for extension spaces and functions in the symbol table to align parsers and converters.

@EpsilonPrime (Member Author)

OK, all of the PRs have been rebased. This one took a while because I had to make sure the generated parser files were integrated into the new clang-tidy checks.

@chaojun-zhang (Contributor) commented Mar 31, 2023

LGTM

Seeing that you have a lot of commits related to formatting, I suggest using the make format or make tidy-fix commands to repair it automatically. Did you encounter any problems?

@EpsilonPrime (Member Author)

> LGTM
>
> Seeing that you have a lot of commits related to formatting, I suggest using the make format or make tidy-fix commands to repair it automatically. Did you encounter any problems?

I did try the repair commands but encountered some issues related to my setup. When I have time I will look into why it didn't work for me.


std::any SubstraitPlanVisitor::visitOperation(
    SubstraitPlanParser::OperationContext* ctx) {
  // TODO -- Implement this in a second visitor as described below.
Contributor

Same here?

Member Author

This section, like most of the file, is not yet implemented. This comment is merely a reminder about how I plan to solve the issue. Adding support for functions and relations is part of the third PR relating to the parser technology, which I'm starting on next week. Since GitHub doesn't have issue dependencies, could we just consider this one to be part of #10?
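For context, a minimal sketch of what a second-visitor pass could look like. The class name, base class, and SymbolTable type below follow ANTLR's generated-visitor pattern but are stand-ins, not this PR's actual declarations:

#include <any>

// Illustrative only: a second visitor that runs after the symbol-gathering
// pass has populated the symbol table.
class SecondPassVisitor : public SubstraitPlanParserBaseVisitor {
 public:
  explicit SecondPassVisitor(SymbolTable& symbolTable)
      : symbolTable_(symbolTable) {}

  std::any visitOperation(
      SubstraitPlanParser::OperationContext* ctx) override {
    // By this pass every function and relation has already been recorded,
    // so the references used inside an operation can be resolved here.
    return visitChildren(ctx);
  }

 private:
  SymbolTable& symbolTable_;
};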

void readText(const char* filename) {
  auto stream = io::substrait::textplan::loadTextFile(filename);
  if (!stream.has_value()) {
    std::cerr << "An error occurred while reading: " << filename << std::endl;
Contributor

For future reference, should we introduce a macro or error logging mechanism? Definitely not for this PR, but for future work. Any thoughts?

Member Author

The parser and converters have the capability to handle multiple errors along with their source locations. This particular file is a special case, similar only to the converter's Tool.cpp. Eventually these two will be merged and there will be only one place handling errors like this (everything else is going to rely on the caller to deal with the list of error messages).
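For illustration, a minimal sketch of that caller-facing pattern; the struct and function names here are assumptions, not this PR's actual API:

#include <iostream>
#include <string>
#include <vector>

// Each error carries its source location so callers can report precisely.
struct ParseError {
  int line;            // 1-based line in the text plan
  int column;          // 0-based column
  std::string message;
};

struct ParseResult {
  std::vector<ParseError> errors;
  bool ok() const { return errors.empty(); }
};

// Only a top-level tool prints; library code just returns the error list.
void reportErrors(const char* filename, const ParseResult& result) {
  for (const auto& err : result.errors) {
    std::cerr << filename << ":" << err.line << ":" << err.column << ": "
              << err.message << std::endl;
  }
}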

@chaojun-zhang (Contributor) commented Apr 4, 2023

One question: why not consider using SQL as input to generate the Substrait plan, like substrait-java does with the help of Apache Calcite? For example, DuckDB has the ability to generate either a logical plan or a physical plan from a single SQL statement.

@EpsilonPrime (Member Author)

> One question: why not consider using SQL as input to generate the Substrait plan, like substrait-java does with the help of Apache Calcite? For example, DuckDB has the ability to generate either a logical plan or a physical plan from a single SQL statement.

The purpose of the text plan format is to be a nearly identical form of the binary plan, just more readable. While a SQL converter would be useful, that form would not necessarily have a 1:1 conversion with the binary plan format.

@westonpace (Member) left a comment

Finally managed to get a good look at this. I'll have to ponder the grammar some more as we go. I have an idea about relations that I will try to flesh out further soon.

scripts/run-clang-tidy.sh (outdated; resolved)
@@ -0,0 +1,31 @@
# SPDX-License-Identifier: Apache-2.0

add_test_case(
Member

These tests appear to be ending up in <BUILD_DIR>/src/debug/xyz_test instead of <BUILD_DIR>/debug/xyz_test.

Member

Also, this test fails if I run it directly (as opposed to running it from ctest). Maybe that is ok? But it would be nice if it would work. It's easier to debug that way.

Member Author

If you've built the binary with CMake, the test files should have been copied as a post-build rule into the data directory. Since the test data is located relative to the test binary, merely running the test from the directory it resides in should work. The only way I can think of to get the test to run from any directory would be to give it a path to somewhere on the system to get its data, which we could do by generating a C++ file with that path in it to refer to at runtime.

I put the CMake fixes into #49.
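For what it's worth, a sketch of the generate-a-file idea mentioned above, assuming CMake's configure_file() stamps the data directory into a header; all names here are hypothetical:

// TestDataPath.h would be produced by configure_file() from a template
// containing:
//   constexpr const char* kTestDataDir = "@CMAKE_CURRENT_SOURCE_DIR@/data";
// The stand-in below shows the shape of the configured result.
#include <string>

constexpr const char* kTestDataDir = "/home/user/substrait-cpp/data";  // stand-in value

// Tests resolve data files against the absolute directory, so the CWD
// no longer matters.
std::string testDataFile(const std::string& name) {
  return std::string(kTestDataDir) + "/" + name;
}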


cmake_path(GET CMAKE_CURRENT_SOURCE_DIR PARENT_PATH TEXTPLAN_SOURCE_DIR)

add_custom_command(
Member

I'm a little confused (due to my own cmake ignorance). What is this doing? Are you adding a new custom command or are you extending the existing test with a post-build step?

Member Author

This is supposed to modify the existing build process to copy the test data into the runtime output directory before testing happens. This is what makes the tests with external data work under ctest. It's also the same pattern we used for the converter (not that it's right, just letting you know about the scope).

  return cases;
}

TEST(TextPlanParser, LoadFromFile) {
Member

This is the test that fails unless run from ctest, probably because it can't always locate data/provided_sample1.splan, as that path depends on the CWD.

Member Author

This is being handled in CMake by copying the data files into a test-relative directory. Passing the test data directory as an argument could be an alternative, but that would both require CMake wrangling and would not work if the test was not run with CMake.


class TextPlanParserTestFixture : public ::testing::TestWithParam<TestCase> {};

std::vector<TestCase> getTestCases() {
Member

Not for this PR but long term I think it would be good to start having some tests that go the full round trip:

text -> symbols -> proto -> symbols -> text

Member Author

Text-to-proto tests are implemented two PRs from now. There are no full round-trip tests yet, but I was planning for them to be more like an integration test, run on full plans.
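A sketch of what such a round-trip test might eventually look like; every helper below (readFile, parseTextPlan, symbolsToProto, protoToSymbols, symbolsToText, normalizeWhitespace) is a hypothetical name, not part of this PR:

#include <gtest/gtest.h>
#include <string>

TEST(TextPlanRoundTrip, FullPlan) {
  std::string original = readFile("data/provided_sample1.splan");
  auto symbols = parseTextPlan(original);                  // text -> symbols
  auto proto = symbolsToProto(symbols);                    // symbols -> proto
  auto symbolsAgain = protoToSymbols(proto);               // proto -> symbols
  std::string regenerated = symbolsToText(symbolsAgain);   // symbols -> text
  // The regenerated plan should match the input, modulo formatting.
  EXPECT_EQ(normalizeWhitespace(original), normalizeWhitespace(regenerated));
}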

Comment on lines 53 to 61
relation_detail
: COMMON SEMICOLON # relationCommon
| BASE_SCHEMA id SEMICOLON # relationUsesSchema
| FILTER operation SEMICOLON # relationFilter
| PROJECTION SEMICOLON # relationProjection
| EXPRESSION operation SEMICOLON # relationExpression
| ADVANCED_EXTENSION SEMICOLON # relationAdvancedExtension
| source_reference SEMICOLON # relationSourceReference
;
Member

This is the most concerning part to me, I think. Most parts of the spec are going to be fairly consistent / limited. However, there might be hundreds of relations once we start to really invest in adding physical relations. While I'm sure we can update this file as we go, I worry about the mental overhead of knowing all these rules when authoring files. Is there any way to make this more consistent? I'll try to give a better example of what I'm thinking of later.

Member Author

I have a table of all of the kinds of operations that can be done in relations. The terms aren't used consistently, but I've started merging the similar ones. For instance, in two PRs you'll see BESTEFFORT and POSTJOIN qualifiers for filters, which consolidate the syntax for read and join relations. I've also decided to use FILTER instead of CONDITION so the filter clause works for filter relations as well. The only new clause required for the existing physical relations (hash and merge join) would be a way of specifying the left keys and the right keys. There are typically one or two new clauses required for each new relation. (The only scary relations are the currently inaccessible write relations, which would require 12 new clauses, but fortunately all three share the same new clauses.)

Member

How do you feel about something like this (please forgive my horrible ANTLR abilities):

relation_key_value
   : id COLON relation_value

relation_value
   : COMMON # relationCommon (ideally this is part of relation_detail and not relation_obj but I can't ANTLR it out at the moment)
   | SCHEMA id # Schema reference
   | RELATION id # Input relation reference
   | EXPRESSION operation  # An expression
   | FUNCTION id # A function
   | LITERAL constant # A basic configurable property
   | [ relation_value* ] # Er...joined with COMMA, however you say that
   | { relation_key_value* }
   ;
relation_detail
   : relation_key_value* # I lost the SEMICOLON somewhere but we can put it back if desired

In other words, we end up with something very similar to JSON. However, instead of our base types being string, number, boolean, null (all of which can be expressed with LITERAL) our base types also include schema, input, expression, and function.

Examples:

project relation project1 {
  input: relation scan
  expressions: [
    expression o_orderkey,
    expression o_custkey
  ]
}
join relation join1 {
  left: relation project1
  right: relation project2
  expression: expression equal(o_customerid, c_id)
  post_join_filter: expression equal(c_id, 7)
  type: literal "JOIN_TYPE_INNER" # Well, it's an enum, but using string for now, we can figure out enums
}

custom_extension_relation relation cer1 {
  left: relation project1
  middle: relation project2
  right: relation project3
  fizziness: literal 73.4
  approach: expression gt(inp1, 12)
}

(feel free to add back in semicolons or get rid of the colons or tweak however. I'm just trying to communicate the broader idea)

Pros

  • Can represent any relation without advanced knowledge (this is the primary motivation)
  • Naturally provides for advanced_extension, etc.

Cons

  • Symbol table is more "abstract" (e.g. there won't be a "project" object in the symbol table anywhere. However, I think something like this could still exist for the common logical types, it would just be a separate representation)
  • Serializing to protobuf will be tricky (will require reflection and use of descriptors. I'm only "mostly sure" this is possible)
  • Slightly more verbose

Member

For closure, @EpsilonPrime and I spoke about this offline. It's an interesting idea and may be something we need for extensions. However, sticking with the current approach for the logical / popular relations makes a lot of sense.

src/substrait/textplan/parser/CMakeLists.txt (resolved)
src/substrait/textplan/parser/ParseText.cpp (outdated; resolved)
src/substrait/textplan/parser/ParseText.h (resolved)
* Includes ANTLR4 tool to compile the grammar.
* Computes anchor references for extension spaces and functions.
* Updates the data stored for extension spaces and functions in the symbol table to align parsers and converters.
* Standardizes the format of the internal representation of pipelines between the converter and the parser.
@EpsilonPrime (Member Author)

With the latest converter update there was some shared code that needed updating (since this PR was also modifying the in-memory representation). That update has happened and the PR has been rebased onto main.

@westonpace (Member) left a comment

Just one additional comment (finally managed to express the idea that had been percolating) but it is a big one 😆

Comment on lines 53 to 61 (the relation_key_value proposal quoted in full in the thread above).

@westonpace (Member)

It's been a while and I haven't been able to get to this, so I apologize. I'm used to working on a project with many contributors, and I think my practices need to change as these projects get up to speed. I'm going to start focusing reviews on making sure I understand the high-level concept, the direction, and the design, and that the unit testing seems sufficient. I'm probably not going to dig too much into style (we have checks for that now) or subtle bugs (these will be detected with time, and finding them is probably not the highest priority at the moment).

Let's move forward and merge this PR.

I'd like to align on rules for T, const T&, and T* (see comment), and I'm a little uncertain about the CMake for ANTLR, but I'll open separate issues for those.
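For reference, the three parameter-passing conventions that comment refers to, stated as generic C++ guidance (the project's actual rules had not been decided at this point):

#include <string>

void sink(std::string s);            // T: by value when the callee keeps its own copy, or T is cheap to copy
void inspect(const std::string& s);  // const T&: read-only access without a copy
void maybeFill(std::string* out);    // T*: the argument is optional (may be null) and/or mutated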

@westonpace westonpace merged commit f707840 into substrait-io:main Apr 27, 2023
EpsilonPrime added a commit to EpsilonPrime/substrait-cpp that referenced this pull request on May 6, 2023:

feat: initial version of the textplan to binary plan parser (substrait-io#36)

* Includes ANTLR4 tool to compile the grammar.
* Computes anchor references for extension spaces and functions.
* Updates the data stored for extension spaces and functions in the symbol table to align parsers and converters.