
feat: initial version of the textplan to binary plan parser #36

Merged: 5 commits into substrait-io:main on Apr 27, 2023

Conversation

EpsilonPrime (Member)

  • Includes ANTLR4 tool to compile the grammar.
  • Computes anchor references for extension spaces and functions.
  • Updates the data stored for extension spaces and functions in the symbol table to align parsers and converters.

@EpsilonPrime (Member Author)

OK, all of the PRs have been rebased. This one took a while because I had to make sure the generated parser files were integrated into the new clang-tidy checks.

@chaojun-zhang (Contributor) commented Mar 31, 2023

LGTM

Seeing that you have a lot of commits related to formatting, I suggest using the make format or make tidy-fix commands to repair it automatically. Did you encounter any problems?

@EpsilonPrime (Member Author)

> LGTM
>
> Seeing that you have a lot of commits related to formatting, I suggest using the make format or make tidy-fix commands to repair it automatically. Did you encounter any problems?

I did try the repair commands but encountered some issues related to my setup. When I have time I will look into why it didn't work for me.


std::any SubstraitPlanVisitor::visitOperation(
    SubstraitPlanParser::OperationContext* ctx) {
  // TODO -- Implement this in a second visitor as described below.
Contributor

Same here?

Member Author

This section, like most of the file, is not yet implemented. This comment is merely a reminder about how I plan to solve the issue. Adding support for functions and relations is part of the third PR relating to the parser technology, which I'm starting on next week. Since GitHub doesn't have issue dependencies, could we just consider this one to be part of #10?
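For context, a minimal sketch of what a second-visitor pass could look like. The class name, base class, and SymbolTable type below follow ANTLR's generated-visitor pattern but are stand-ins, not this PR's actual declarations:

#include <any>

// Illustrative only: a second visitor that runs after the symbol-gathering
// pass has populated the symbol table.
class SecondPassVisitor : public SubstraitPlanParserBaseVisitor {
 public:
  explicit SecondPassVisitor(SymbolTable& symbolTable)
      : symbolTable_(symbolTable) {}

  std::any visitOperation(
      SubstraitPlanParser::OperationContext* ctx) override {
    // By this pass every function and relation has already been recorded,
    // so the references used inside an operation can be resolved here.
    return visitChildren(ctx);
  }

 private:
  SymbolTable& symbolTable_;
};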

void readText(const char* filename) {
  auto stream = io::substrait::textplan::loadTextFile(filename);
  if (!stream.has_value()) {
    std::cerr << "An error occurred while reading: " << filename << std::endl;
Contributor

For future reference, should we introduce a macro or error logging mechanism? Definitely not for this PR, but for future work. Any thoughts?

Member Author

The parser and converters have the capability to handle multiple errors along with their source locations. This particular file is a special case, similar only to the converter's Tool.cpp. Eventually these two will be merged and there will be only one place handling errors like this (everything else is going to rely on the caller to deal with the list of error messages).
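For illustration, a minimal sketch of that caller-facing pattern; the struct and function names here are assumptions, not this PR's actual API:

#include <iostream>
#include <string>
#include <vector>

// Each error carries its source location so callers can report precisely.
struct ParseError {
  int line;            // 1-based line in the text plan
  int column;          // 0-based column
  std::string message;
};

struct ParseResult {
  std::vector<ParseError> errors;
  bool ok() const { return errors.empty(); }
};

// Only a top-level tool prints; library code just returns the error list.
void reportErrors(const char* filename, const ParseResult& result) {
  for (const auto& err : result.errors) {
    std::cerr << filename << ":" << err.line << ":" << err.column << ": "
              << err.message << std::endl;
  }
}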

@chaojun-zhang (Contributor) commented Apr 4, 2023

One question: why not consider using SQL as input to generate the Substrait plan, like substrait-java does with the help of Apache Calcite? For example, DuckDB has the ability to generate either a logical plan or a physical plan from a single SQL statement.

@EpsilonPrime (Member Author)

> One question: why not consider using SQL as input to generate the Substrait plan, like substrait-java does with the help of Apache Calcite? For example, DuckDB has the ability to generate either a logical plan or a physical plan from a single SQL statement.

The purpose of the text plan format is to be a nearly identical form of the binary plan, just more readable. While a SQL converter would be useful, that form would not necessarily have a 1:1 conversion with the binary plan format.

@westonpace (Member) left a comment

Finally managed to get a good look at this. I'll have to ponder the grammar some more as we go. I have an idea about relations that I will try to flesh out further soon.

scripts/run-clang-tidy.sh (outdated; resolved)
@@ -0,0 +1,31 @@
# SPDX-License-Identifier: Apache-2.0

add_test_case(
Member

These tests appear to be ending up in <BUILD_DIR>/src/debug/xyz_test instead of <BUILD_DIR>/debug/xyz_test.

Member

Also, this test fails if I run it directly (as opposed to running it from ctest). Maybe that is ok? But it would be nice if it would work. It's easier to debug that way.

Member Author

If you've built the binary with CMake, the test files should have been copied as a post-build rule into the data directory. Since the test data is located relative to the test binary, merely running the test from the directory it resides in should work. The only way I can think of to get the test to run from any directory would be to give it a path to somewhere on the system to get its data, which we could do by generating a C++ file with that path in it to refer to at runtime.

I put the CMake fixes into #49.
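For what it's worth, a sketch of the generate-a-file idea mentioned above, assuming CMake's configure_file() stamps the data directory into a header; all names here are hypothetical:

// TestDataPath.h would be produced by configure_file() from a template
// containing:
//   constexpr const char* kTestDataDir = "@CMAKE_CURRENT_SOURCE_DIR@/data";
// The stand-in below shows the shape of the configured result.
#include <string>

constexpr const char* kTestDataDir = "/home/user/substrait-cpp/data";  // stand-in value

// Tests resolve data files against the absolute directory, so the CWD
// no longer matters.
std::string testDataFile(const std::string& name) {
  return std::string(kTestDataDir) + "/" + name;
}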


cmake_path(GET CMAKE_CURRENT_SOURCE_DIR PARENT_PATH TEXTPLAN_SOURCE_DIR)

add_custom_command(
Member

I'm a little confused (due to my own cmake ignorance). What is this doing? Are you adding a new custom command or are you extending the existing test with a post-build step?

Member Author

This is supposed to modify the existing build process to copy the test data into the runtime output directory before testing happens. This is what makes the tests with external data work under ctest. It's also the same pattern we used for the converter (not that it's right, just letting you know about the scope).

  return cases;
}

TEST(TextPlanParser, LoadFromFile) {
Member

This is the test that fails unless run from ctest, probably because it can't always locate data/provided_sample1.splan, as that path depends on the CWD.

Member Author

This is being handled in CMake by copying the data files into a test-relative directory. Passing the test data directory as an argument could be an alternative, but that would both require CMake wrangling and would not work if the test was not run with CMake.


class TextPlanParserTestFixture : public ::testing::TestWithParam<TestCase> {};

std::vector<TestCase> getTestCases() {
Member

Not for this PR but long term I think it would be good to start having some tests that go the full round trip:

text -> symbols -> proto -> symbols -> text

Member Author

Text-to-proto tests are implemented two PRs from now. There are no full round-trip tests yet, but I was planning for them to be more like an integration test, run on full plans.
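A sketch of what such a round-trip test might eventually look like; every helper below (readFile, parseTextPlan, symbolsToProto, protoToSymbols, symbolsToText, normalizeWhitespace) is a hypothetical name, not part of this PR:

#include <gtest/gtest.h>
#include <string>

TEST(TextPlanRoundTrip, FullPlan) {
  std::string original = readFile("data/provided_sample1.splan");
  auto symbols = parseTextPlan(original);                  // text -> symbols
  auto proto = symbolsToProto(symbols);                    // symbols -> proto
  auto symbolsAgain = protoToSymbols(proto);               // proto -> symbols
  std::string regenerated = symbolsToText(symbolsAgain);   // symbols -> text
  // The regenerated plan should match the input, modulo formatting.
  EXPECT_EQ(normalizeWhitespace(original), normalizeWhitespace(regenerated));
}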

Comment on lines 53 to 61
relation_detail
: COMMON SEMICOLON # relationCommon
| BASE_SCHEMA id SEMICOLON # relationUsesSchema
| FILTER operation SEMICOLON # relationFilter
| PROJECTION SEMICOLON # relationProjection
| EXPRESSION operation SEMICOLON # relationExpression
| ADVANCED_EXTENSION SEMICOLON # relationAdvancedExtension
| source_reference SEMICOLON # relationSourceReference
;
Member

This is the most concerning part to me, I think. Most parts of the spec are going to be fairly consistent / limited. However, there might be hundreds of relations once we start to really invest in adding physical relations. While I'm sure we can update this file as we go, I worry about the mental overhead of knowing all these rules when authoring files. Is there any way to make this more consistent? I'll try to give a better example of what I'm thinking of later.

Member Author

I have a table of all of the kinds of operations that can be done in relations. The terms aren't used consistently, but I've started merging the similar ones. For instance, in two PRs you'll see BESTEFFORT and POSTJOIN qualifiers for filters, which consolidate the syntax for read and join relations. I've also decided to use FILTER instead of CONDITION so the filter clause works for filter relations as well. The only new clause required for the existing physical relations (hash and merge join) would be a way of specifying the left keys and the right keys. There are typically one or two new clauses required for each new relation. (The only scary relations are the currently inaccessible write relations, which would require 12 new clauses, but fortunately all three share the same new clauses.)

Member

How do you feel about something like this (please forgive my horrible ANTLR abilities):

relation_key_value
   : id COLON relation_value

relation_value
   : COMMON # relationCommon (ideally this is part of relation_detail and not relation_obj but I can't ANTLR it out at the moment)
   | SCHEMA id # Schema reference
   | RELATION id # Input relation reference
   | EXPRESSION operation  # An expression
   | FUNCTION id # A function
   | LITERAL constant # A basic configurable property
   | [ relation_value* ] # Er...joined with COMMA, however you say that
   | { relation_key_value* }
   ;
relation_detail
   : relation_key_value* # I lost the SEMICOLON somewhere but we can put it back if desired

In other words, we end up with something very similar to JSON. However, instead of our base types being string, number, boolean, null (all of which can be expressed with LITERAL) our base types also include schema, input, expression, and function.

Examples:

project relation project1 {
  input: relation scan
  expressions: [
    expression o_orderkey,
    expression o_custkey
  ]
}
join relation join1 {
  left: relation project1
  right: relation project2
  expression: expression equal(o_customerid, c_id)
  post_join_filter: expression equal(c_id, 7)
  type: literal "JOIN_TYPE_INNER" # Well, it's an enum, but using string for now, we can figure out enums
}

custom_extension_relation relation cer1 {
  left: relation project1
  middle: relation project2
  right: relation project3
  fizziness: literal 73.4
  approach: expression gt(inp1, 12)
}

(feel free to add back in semicolons or get rid of the colons or tweak however. I'm just trying to communicate the broader idea)

Pros

  • Can represent any relation without advanced knowledge (this is the primary motivation)
  • Naturally provides for advanced_extension, etc.

Cons

  • Symbol table is more "abstract" (e.g. there won't be a "project" object in the symbol table anywhere. However, I think something like this could still exist for the common logical types, it would just be a separate representation)
  • Serializing to protobuf will be tricky (will require reflection and use of descriptors. I'm only "mostly sure" this is possible)
  • Slightly more verbose

Member

For closure, @EpsilonPrime and I spoke about this offline. It's an interesting idea and may be something we need for extensions. However, sticking with the current approach for the logical / popular relations makes a lot of sense.

src/substrait/textplan/parser/CMakeLists.txt (resolved)
src/substrait/textplan/parser/ParseText.cpp (outdated; resolved)
src/substrait/textplan/parser/ParseText.h (resolved)
* Includes ANTLR4 tool to compile the grammar.
* Computes anchor references for extension spaces and functions.
* Updates the data stored for extension spaces and functions in the symbol table to align parsers and converters.
* Standardizes the format of the internal representation of pipelines between the converter and the parser.
@EpsilonPrime (Member Author)

With the latest converter update there was some shared code that needed updating (since this PR was also modifying the in-memory representation). That update has happened and the PR has been rebased onto main.

@westonpace (Member) left a comment

Just one additional comment (finally managed to express the idea that had been percolating) but it is a big one 😆

Comment on lines 53 to 61 (the relation_key_value proposal quoted in full in the thread above).

@westonpace (Member)

It's been a while and I haven't been able to get to this, so I apologize. I'm used to working on a project with many contributors, and I think my practices need to change as these projects get up to speed. I'm going to start focusing reviews on making sure I understand the high-level concept, the direction, and the design, and that the unit testing seems sufficient. I'm probably not going to dig too much into style (we have checks for that now) or subtle bugs (these will be detected with time, and finding them is probably not the highest priority at the moment).

Let's move forward and merge this PR.

I'd like to align on rules for T, const T&, and T* (see comment), and I'm a little uncertain about the CMake for ANTLR, but I'll open separate issues for those.
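For reference, the three parameter-passing conventions that comment refers to, stated as generic C++ guidance (the project's actual rules had not been decided at this point):

#include <string>

void sink(std::string s);            // T: by value when the callee keeps its own copy, or T is cheap to copy
void inspect(const std::string& s);  // const T&: read-only access without a copy
void maybeFill(std::string* out);    // T*: the argument is optional (may be null) and/or mutated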

@westonpace westonpace merged commit f707840 into substrait-io:main Apr 27, 2023
EpsilonPrime added a commit to EpsilonPrime/substrait-cpp that referenced this pull request on May 6, 2023:

feat: initial version of the textplan to binary plan parser (substrait-io#36)

* Includes ANTLR4 tool to compile the grammar.
* Computes anchor references for extension spaces and functions.
* Updates the data stored for extension spaces and functions in the symbol table to align parsers and converters.