[GLUTEN-6834][CORE] feat: add other join types from the official Substrait #6835
Run Gluten Clickhouse CI |
I'm not sure how much it actually helps if we make such changes. For example both Spark and Velox don't have a join type As commented in #6833 (comment), physical plans would vary more than logical plans in Gluten. We already have to follow both Spark's and Velox/CH's plan protocols. Strictly following yet another one in the middle layer could start to mess things up. |
There are also physical relations in Substrait. The approach I am taking at unforking is to slowly chip away at the differences so we can swap the main instance in seamlessly. A swap of everything at once seems unlikely so I'm tackling what I can -- especially if there are minor code changes related to each small change. |
Could we add a new type for existence join? At present, an existence join is transformed into a left semi join. |
We are in the process of adding new join types to the specification at the moment, so adding another is definitely a possibility. (Currently I have a PR out to add mark join.) |
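To make the distinction in the exchange above concrete, here is a minimal sketch of how a left semi join and an existence join differ in what they return. It assumes Spark-style existence-join semantics (every left row is kept and a boolean column records whether a match was found). The function names, the `exists` column name, and the toy rows are hypothetical illustrations, not Gluten, Velox, or Substrait code.

```python
def left_semi_join(left, right, key):
    """Keep only the left rows that have at least one match on `key`."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] in right_keys]


def existence_join(left, right, key, exists_col="exists"):
    """Keep every left row and add a boolean column flagging a match.

    `exists_col` is a hypothetical column name chosen for this sketch.
    """
    right_keys = {row[key] for row in right}
    return [{**row, exists_col: row[key] in right_keys} for row in left]


# Toy input: id 2 appears twice on the right, ids 1 and 3 have no match.
left = [{"id": 1}, {"id": 2}, {"id": 3}]
right = [{"id": 2}, {"id": 2}, {"id": 4}]

semi = left_semi_join(left, right, "id")
exist = existence_join(left, right, "id")

print(semi)
print(exist)
```

Under these assumptions the semi join drops unmatched left rows, while the existence join keeps them and marks them, which is why a dedicated join type may be worth considering.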
@EpsilonPrime Before we decide to put effort into migrating to mainstream Substrait, we should study how much it would help our next moves. The thing is, if we unfork Substrait, we should use it to support more backends in Gluten; otherwise the benefit is smaller than the cost. Do you happen to know how far Substrait integration has progressed in other projects, for example DuckDB, Arrow, and Datafusion? If Gluten decides to add support for these libraries, are their Substrait consumer implementations reliable enough for us to use? |
We should have the same protocol in the long term. Substrait should be able to fit all requirements from different frameworks, libraries, and accelerators; that is Substrait's goal. But for Gluten it's not urgent right now. |
@JkSelf @rui-mo Do you happen to have the list of our modifications in Gluten's Substrait? I remember we had several pending PRs to upstream Substrait, but there was no active review, so we paused there. @EpsilonPrime you could start by reviewing the pending PRs in Substrait, if there are any, and add the missing features Gluten needs upstream. |
Gluten modifications to Substrait: https://github.com/apache/incubator-gluten/blob/main/docs/developers/SubstraitModifications.md |
All of the pending changes were merged last year (about the time I became a reviewer). I have also been merging changes upstream (for instance, the equivalent of TextReadOptions was added last week). |
DuckDB has the best Substrait support of the three. Datafusion has a few issues which I'm hoping are addressed in their next release. Acero is in maintenance mode; it has a working implementation, but it's very strict about what it accepts. Other benefits include tools that run on Substrait (like the validator and the text plan format), which aren't really being used by Gluten at the moment. The Spark proposal to move the Gluten communication logic (which may or may not have included Substrait) there was the reason I started looking into this effort. |
It seems the CH CI is failing at
|
Adds the join types currently defined in Substrait to the Gluten copy. This is one of a large set of changes aimed at reducing the differences from the official version, in the hope that one day there will be no differences.