
[VL] Fix NullPointerException when collect_list / collect_set are partially fallen back #5655

Merged
merged 21 commits into apache:main on May 11, 2024

Conversation

@zhztheplayer (Member) commented May 8, 2024

Edit: Fixes #5649. Added vanilla Spark implementations of velox_collect_list and velox_collect_set.

Edit (2024/12/13): Why the expression replacement is done on the logical plan: the replacement changes the intermediate data type, so applying it against the physical plan would change the partial aggregation's output schema. We therefore adjust the logical plan directly.

The Velox backend's collect_list / collect_set implementations require ARRAY intermediate data, whereas Spark uses BINARY. To work around this, we previously used some tricks to forcibly modify the physical plan, changing the output schema of the partial aggregate operator to align with Velox. But that approach keeps the actual behavior of the two functions in the Velox backend hidden from the query plan, which makes advanced optimizations and compatibility checks difficult during the planning phase.

This patch adds new functions velox_collect_list / velox_collect_set that map directly to the Velox backend's implementations of the two functions, and does essential code cleanup and refactoring:

  1. Add functions velox_collect_list / velox_collect_set.
  2. Remove the physical rules RewriteCollect / RewriteTypedImperativeAggregate and add a logical rule, CollectRewriteRule, that incorporates their functionality.
  3. CollectRewriteRule replaces collect_list / collect_set with velox_collect_list / velox_collect_set (see the sketch after this list).
  4. Since velox_collect_list / velox_collect_set become DeclarativeAggregate implementations, some UTs have to be disabled because they check plans for the existence of ObjectHashAggregateExec.
  5. Some optimizations to the UT facility code.
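
For illustration, here is a minimal sketch of what a logical rewrite rule of this kind could look like, using vanilla Catalyst APIs. This is not the PR's actual CollectRewriteRule: the object name is made up, and it assumes VeloxCollectList / VeloxCollectSet take a single child expression, as the review diff below shows for VeloxCollectSet.

```scala
import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, CollectList, CollectSet}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Sketch only: rewrite collect_list / collect_set on the *logical* plan,
// before aggregate planning splits the query into partial/final stages,
// so the ARRAY intermediate type is visible to the whole planner.
object CollectRewriteSketch extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
    // Swap only the aggregate function; keep the enclosing
    // AggregateExpression (mode, distinct flag, filter) intact.
    case ae @ AggregateExpression(CollectList(child, _, _), _, _, _, _) =>
      ae.copy(aggregateFunction = VeloxCollectList(child))
    case ae @ AggregateExpression(CollectSet(child, _, _), _, _, _, _) =>
      ae.copy(aggregateFunction = VeloxCollectSet(child))
  }
}
```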

github-actions bot commented May 8, 2024

Run Gluten Clickhouse CI

1 similar comment

@zhztheplayer zhztheplayer force-pushed the wip-rework-collect branch from adb07c7 to a199917 Compare May 8, 2024 06:47
github-actions bot commented May 8, 2024

Run Gluten Clickhouse CI

2 similar comments

@zhztheplayer (Member Author)

/Benchmark Velox

@GlutenPerfBot (Contributor)

===== Performance report for TPCH SF2000 with Velox backend, for reference only =====

| query | log/native_5655_time.csv | log/native_master_05_07_2024_254d62e72_time.csv | difference | percentage |
|-------|--------------------------|-------------------------------------------------|------------|------------|
| q1 | 34.16 | 34.69 | 0.528 | 101.55% |
| q2 | 22.17 | 23.52 | 1.349 | 106.08% |
| q3 | 36.72 | 39.12 | 2.402 | 106.54% |
| q4 | 38.01 | 38.99 | 0.982 | 102.58% |
| q5 | 69.14 | 70.47 | 1.331 | 101.93% |
| q6 | 5.85 | 7.99 | 2.145 | 136.68% |
| q7 | 82.92 | 82.58 | -0.335 | 99.60% |
| q8 | 84.23 | 85.88 | 1.654 | 101.96% |
| q9 | 124.80 | 125.69 | 0.890 | 100.71% |
| q10 | 46.97 | 44.17 | -2.799 | 94.04% |
| q11 | 20.58 | 19.52 | -1.062 | 94.84% |
| q12 | 27.60 | 25.07 | -2.524 | 90.85% |
| q13 | 53.40 | 54.78 | 1.379 | 102.58% |
| q14 | 18.46 | 16.46 | -1.993 | 89.20% |
| q15 | 31.18 | 29.15 | -2.027 | 93.50% |
| q16 | 13.60 | 14.27 | 0.674 | 104.96% |
| q17 | 102.39 | 104.34 | 1.951 | 101.91% |
| q18 | 148.53 | 144.89 | -3.634 | 97.55% |
| q19 | 13.42 | 15.16 | 1.737 | 112.95% |
| q20 | 27.16 | 26.39 | -0.771 | 97.16% |
| q21 | 289.52 | 283.25 | -6.273 | 97.83% |
| q22 | 16.11 | 14.47 | -1.643 | 89.80% |
| total | 1306.89 | 1300.85 | -6.038 | 99.54% |

github-actions bot commented May 8, 2024

Run Gluten Clickhouse CI

3 similar comments

@zhouyuan (Contributor) commented May 8, 2024

CC @zhli1142015

@zhztheplayer zhztheplayer force-pushed the wip-rework-collect branch from bd6849e to 76d3634 Compare May 9, 2024 00:30

github-actions bot commented May 9, 2024

Run Gluten Clickhouse CI

@zhztheplayer zhztheplayer force-pushed the wip-rework-collect branch from 76d3634 to 0eb763c Compare May 9, 2024 00:54

github-actions bot commented May 9, 2024

Run Gluten Clickhouse CI

@zhztheplayer zhztheplayer force-pushed the wip-rework-collect branch from 0eb763c to b9f6db3 Compare May 9, 2024 06:28

github-actions bot commented May 9, 2024

Run Gluten Clickhouse CI

@zhztheplayer (Member Author)

/Benchmark Velox

@apache apache deleted a comment from github-actions bot May 9, 2024
github-actions bot commented May 9, 2024

Run Gluten Clickhouse CI

2 similar comments

@zhztheplayer (Member Author)

> @zhztheplayer Do you think this approach is better, or is it better for us to convert it to binary in the project during row extraction after partial agg?

I am not 100% sure, but it seems the PR's approach is more portable. Row extraction would work, but if it's done at the operator level, the function still outputs ARRAY intermediate data, which doesn't match Spark's CollectList / CollectSet definition.

Run Gluten Clickhouse CI

1 similar comment

override def defaultResult: Option[Literal] = Option(Literal.create(Array(), dataType))
}

case class VeloxCollectSet(override val child: Expression) extends VeloxCollect {
Contributor


Use ArrayUnion for updateExpressions and mergeExpressions?

Member Author


It internally uses an Array. A single de-dup operation would cost ~O(n), so I would rather do a distinct once at the end of the aggregation.
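
To make "a distinct at the end of the aggregation" concrete, here is a minimal sketch using only vanilla Catalyst expressions. The class name and buffer layout are illustrative, not the PR's actual VeloxCollectSet.

```scala
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate
import org.apache.spark.sql.catalyst.trees.UnaryLike
import org.apache.spark.sql.types.{ArrayType, DataType}

// Sketch only: append cheaply during update/merge and de-duplicate once at
// evaluation, instead of paying ~O(n) for a de-dup on every update/merge.
case class SketchCollectSet(child: Expression)
  extends DeclarativeAggregate with UnaryLike[Expression] {
  override def dataType: DataType = ArrayType(child.dataType, containsNull = false)
  override def nullable: Boolean = false

  private lazy val buffer = AttributeReference("buffer", dataType)()
  override lazy val aggBufferAttributes: Seq[AttributeReference] = Seq(buffer)
  override lazy val initialValues: Seq[Expression] =
    Seq(Literal.create(Array(), dataType))

  // Plain append on the hot path; no per-row de-duplication.
  override lazy val updateExpressions: Seq[Expression] =
    Seq(If(IsNull(child), buffer, Concat(Seq(buffer, CreateArray(Seq(child))))))
  override lazy val mergeExpressions: Seq[Expression] =
    Seq(Concat(Seq(buffer.left, buffer.right)))

  // De-duplicate exactly once, when the final value is produced.
  override lazy val evaluateExpression: Expression = ArrayDistinct(buffer)

  override protected def withNewChildInternal(newChild: Expression): SketchCollectSet =
    copy(child = newChild)
}
```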

Contributor


It's mainly to reduce shuffle data when the partial-mode aggregate falls back.

@zhztheplayer (Member Author) May 10, 2024

Yes, it costs more space. Maybe we can use the Spark Map type for collect_set?

Contributor

Is it possible to fully copy the Collect code and change serialize / deserialize to adapt the data type? It seems we can re-point the binary to an UnsafeArrayData and then getArray.
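
If I read the suggestion right, the idea is to reinterpret the BINARY buffer as array data without copying. A hedged sketch of that, assuming the bytes hold a serialized UnsafeArrayData (the helper name is made up):

```scala
import org.apache.spark.sql.catalyst.expressions.UnsafeArrayData
import org.apache.spark.unsafe.Platform

// Sketch only: "re-point" the BINARY buffer to an UnsafeArrayData so the
// deserialize side can expose it as an ARRAY value without a copy.
def binaryToArray(bytes: Array[Byte]): UnsafeArrayData = {
  val array = new UnsafeArrayData
  array.pointTo(bytes, Platform.BYTE_ARRAY_OFFSET, bytes.length)
  array
}
```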

Member Author

Isn't the binary data only obtainable when it's an ImperativeAggregate? Would you elaborate on the suggestion?

case class VeloxCollectSet(override val child: Expression) extends VeloxCollect {
override def prettyName: String = "velox_collect_set"

override def nullable: Boolean = true
Contributor

Can we preserve the comment?

// We should mark attribute as withNullability since the collect_set and collect_set
// are not nullable but velox may return null. This is to avoid potential issue when
// the post project fallback to vanilla Spark.

Member Author

Added a comment to VeloxCollectSet.

For now we don't need the withNullability part of the comment.

case w: Window =>
w.transformExpressions {
case func @ WindowExpression(ToVeloxCollectSet(newAggFunc), _) =>
val out = ensureNonNull(func.copy(newAggFunc))
Contributor

Shall we ensureNonNull for newAggFunc rather than the window func?

Member Author

That doesn't work. Spark has some checks in checkAnalysis that enforce the "WindowExpression(WindowFunction, ...)" pattern.
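
A hedged illustration of that constraint, assuming ensureNonNull simply coalesces a possibly-null result to an empty array (the helper shown here is hypothetical, not the PR's code):

```scala
import org.apache.spark.sql.catalyst.expressions.{Coalesce, Expression, Literal}

// Hypothetical null-guard; assumes an empty array is a valid fallback.
def ensureNonNull(e: Expression): Expression =
  Coalesce(Seq(e, Literal.create(Array(), e.dataType)))

// Passes analysis: the aggregate stays the direct child of the
// WindowExpression, and the null-guard wraps the whole expression:
//   ensureNonNull(WindowExpression(aggExpr, windowSpec))
//
// Fails analysis: Coalesce is neither a window function nor an aggregate,
// so WindowExpression(ensureNonNull(aggExpr), windowSpec) breaks the
// WindowExpression(WindowFunction, ...) pattern that checkAnalysis enforces.
```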

Contributor

I'm not sure the ensureNonNull added around the window expression is actually evaluated. It seems the window operator only collects the window expressions to evaluate. Is there a test for window + collect_set with null input?

@zhztheplayer (Member Author) May 11, 2024

It's evaluated by the projection created in WindowExecBase#createResultProjection. Velox's window implementation doesn't support collect_set yet. I'd add a case for vanilla Spark + velox_collect_set if you are concerned about this part of the code.

agg.transformExpressions {
case ToVeloxCollectSet(newAggFunc) =>
val out = ensureNonNull(newAggFunc)
out
Contributor

Unnecessary variable.

@zhztheplayer (Member Author) May 10, 2024

I didn't find a proper way to place a breakpoint on a return value while debugging without doing this. Do you know of one?

If there isn't a good way, I would personally keep something like this; it doesn't increase code complexity.

Contributor

I usually dig into the internal method's returned value.

withSQLConf(
GlutenConfig.EXPRESSION_BLACK_LIST.key -> "collect_set"
) {
CollectRewriteRule.forceEnableAndRun {
Contributor

Why do we need to force-enable the rule? Shouldn't it already be enabled if we have collect_set?

Member Author

I'll remove this API and the two tests. They are not useful for now.

if (out.fastEquals(plan)) {
return plan
}
spark.sessionState.analyzer.checkAnalysis(out)
Contributor

What does this check for? I did not see Spark call it at the optimizer phase...

Member Author

I'll try removing this. It should not be required with the current approach either.

Run Gluten Clickhouse CI

1 similar comment

Run Gluten Clickhouse CI

3 similar comments

}
}

private def has[T <: Expression: ClassTag]: Boolean = {
Contributor

After dropping the test, do we still need this check? It seems this rule is only added by the Velox backend, and the same is true for VeloxCollectSet. So if we get to this rule, there must be a VeloxCollectSet?

Member Author

I agree with removing the check, although we could do that later in some version...

For example, VeloxCollectSet is still more expensive than vanilla Spark's, so removing this check could cause a performance regression for a user who has disabled collect_set.

BTW, I'll raise some optimizations on VeloxCollectSet after this patch is merged. Once I am confident that vanilla Spark + velox_collect_list / velox_collect_set is proven enough in production, I'll remove this check.

@JkSelf (Contributor) left a comment

LGTM. Thanks.

@zhztheplayer zhztheplayer merged commit 2efa2e6 into apache:main May 11, 2024
43 checks passed
@zhztheplayer (Member Author)

Thank you all for reviewing!
