Should we modify the logical plan to insert pre/post-project? #4308
Replies: 6 comments 2 replies
-
@liujiayi771 Thanks for taking the effort to prototype this optimization. I see this work as two parts: one is code refactoring, and the other is performance optimization. As you described above, 1) modifying the logical plan makes the code simpler, but it also introduces possible disturbances and risks for the Spark plan; 2) modifying the logical plan brings an extra performance improvement because it reduces fallback.
-
I think pulling out the pre-project on the logical side sounds better. Besides reducing fallback, it is easier and more efficient. Take a plan with aggregate and distinct: we can save two expression evaluations and projects if we pull out the project on the logical side.
Take a plan with a shuffled hash join: we can also save one expression evaluation and project if we pull out the project on the logical side. BTW, if the join is a sort-merge join (SMJ), we can save two expression evaluations, the extra one being the project for the sort.
I also have not seen vanilla Spark make significant plan changes on the physical side. In general, physical rules are used to eliminate operators. One more reference: vanilla Spark already has a logical rule, PullOutGroupingExpressions, that pulls complex expressions out of aggregate grouping keys. I think it's good to follow it.
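To make the pull-out idea concrete, here is a minimal sketch on a toy plan model. It is not Spark's or Gluten's implementation: the classes `Col`, `Func`, `Scan`, `Project`, `Aggregate`, the function `pull_out_pre_project`, and the `_pre_N` alias scheme are all hypothetical names invented for illustration. It only shows the shape of the rewrite, in the spirit of PullOutGroupingExpressions: a complex grouping expression is computed once in a project below the aggregate, and the aggregate then groups by a plain column.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Col:
    """A plain column reference."""
    name: str


@dataclass(frozen=True)
class Func:
    """A non-trivial expression such as upper(name)."""
    fn: str
    args: Tuple["Expr", ...]


Expr = Union[Col, Func]


@dataclass(frozen=True)
class Scan:
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Project:
    # (output name, expression) pairs
    outputs: Tuple[Tuple[str, Expr], ...]
    child: "Plan"


@dataclass(frozen=True)
class Aggregate:
    grouping: Tuple[Expr, ...]
    child: "Plan"


Plan = Union[Scan, Project, Aggregate]


def output_names(plan: Plan) -> Tuple[str, ...]:
    """Column names produced by a plan node (toy model: Scan and Project only)."""
    if isinstance(plan, Scan):
        return plan.columns
    if isinstance(plan, Project):
        return tuple(name for name, _ in plan.outputs)
    raise NotImplementedError("toy model: only Scan and Project children")


def pull_out_pre_project(agg: Aggregate) -> Plan:
    """Move complex grouping expressions into a Project below the aggregate."""
    # Assign one hypothetical alias per distinct complex expression.
    aliases: Dict[Expr, str] = {}
    for expr in agg.grouping:
        if not isinstance(expr, Col) and expr not in aliases:
            aliases[expr] = f"_pre_{len(aliases)}"
    if not aliases:
        return agg  # every grouping key is already a plain column

    # The pre-project passes the child's columns through and adds the
    # pulled-out expressions under their aliases.
    passthrough = tuple((n, Col(n)) for n in output_names(agg.child))
    computed = tuple((alias, expr) for expr, alias in aliases.items())
    pre_project = Project(passthrough + computed, agg.child)

    # The aggregate now groups by plain column references only.
    new_grouping = tuple(
        e if isinstance(e, Col) else Col(aliases[e]) for e in agg.grouping
    )
    return Aggregate(new_grouping, pre_project)


plan = Aggregate(
    grouping=(Func("upper", (Col("name"),)), Col("id")),
    child=Scan(("id", "name")),
)
rewritten = pull_out_pre_project(plan)
print(rewritten)
```

In this toy model the aggregate no longer carries `upper(name)` itself, so an operator that cannot evaluate that function would only affect the project, which is the fallback argument made in this thread.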
-
The issue related to sort
We will handle the parts that are suitable for logical plan processing in the logical plan rule, and the parts that are suitable for physical plan processing in the physical plan rule.
-
Some Strategy in Spark will add function to Agg's
-
In #4245, we hope to insert pre/post-project through a Spark Rule, which could simplify the handling of pre/post-project in operators such as agg, sort, join, and window.
Recently, I've been attempting to add the Rule for the logical plan, and there are numerous advantages to this approach:
Some Strategy in Spark will add a func to Agg's aggExprs; if there is a func that Gluten does not support, we may only need a pre-project fallback. The agg can then use the result computed by the pre-project, eliminating the need for the agg to fall back.
During the process of fixing UTs, I found that the modifications to the logical plan can indeed cause many side effects. While these side effects usually do not pose correctness issues, they can lead to failures when Gluten executes Spark UTs.
For example:
ReusedSubqueryExec. Before the modification, some ReusedSubqueryExec nodes appear inside an agg. After being converted to a physical plan, both the partial and final agg will have ReusedSubqueryExec. However, if we move the subqueries from the agg to a pre-project, the count will not double. As a result, the number of ReusedSubqueryExec nodes will not match the number asserted in the Spark UTs.
InjectRuntimeFilterSuite. Before comparison, the ColumnPruning rule is executed. The output plan of our newly added InsertPreProject rule would be re-optimized by the ColumnPruning rule, which changes the plan and leads to a validation failure.
https://github.com/apache/spark/blob/018808236708bea7a78618abf750bea39be3c9f8/sql/core/src/test/scala/org/apache/spark/sql/InjectRuntimeFilterSuite.scala#L278-L290
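The ReusedSubqueryExec count mismatch can be illustrated with a small counting sketch. This is a toy model and not Spark's planner: `Op`, `plan_aggregate`, `count_subqueries`, the `subquery#` marker, and the `_pre_0` alias are hypothetical, and the model simply assumes, as described above, that the aggregate's expressions end up in both the partial and the final agg.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Op:
    """A physical operator reduced to its name and its expression strings."""
    name: str
    exprs: List[str]


def plan_aggregate(agg_exprs: List[str]) -> List[Op]:
    # Assumption of this toy model: the same expressions are carried by
    # both the partial and the final aggregate.
    return [Op("PartialAgg", list(agg_exprs)), Op("FinalAgg", list(agg_exprs))]


def count_subqueries(ops: List[Op], marker: str = "subquery#") -> int:
    """Count subquery references across all operators of a plan."""
    return sum(expr.count(marker) for op in ops for expr in op.exprs)


# Before: the subquery lives inside the aggregate expression.
before = plan_aggregate(["sum(x + subquery#1)"])

# After: a pre-project evaluates the subquery once; the aggregate only
# references the projected column.
after = [Op("Project", ["x + subquery#1 AS _pre_0"])] + plan_aggregate(["sum(_pre_0)"])

print(count_subqueries(before), count_subqueries(after))
```

A test that asserts the "before" count would fail on the "after" plan even though both plans compute the same result, which matches the kind of UT failure described above.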
Currently, in the PR, we have only dealt with agg and sort. I believe that if we were to add support for join and window, many more issues would come to light.
I've been considering whether gluten, as a columnar plan transformer, should be modifying the logical plan. Should we perhaps limit ourselves to operating at the columnar plan level? Alternatively, should we spend time investigating the causes of these unit test failures? If the comparison is against the vanilla plan, then we should consider excluding or overriding the new tests.
Alternatively, we could use a columnar plan Rule to modify the plan. This might lose some of the advantages mentioned above, but it can also bring some benefits.