
[Spark] Improve missing stats column message for unsupported data skipping types #3577

Merged
merged 2 commits into delta-io:master on Aug 23, 2024

Conversation

@dabao521 (Contributor) commented Aug 19, 2024

Which Delta project/connector is this regarding?

  • [x] Spark
  • [ ] Standalone
  • [ ] Flink
  • [ ] Kernel
  • [ ] Other (fill in here)

Description

Today, when either a clustering column's data type is unsupported for data skipping, or the clustering column is not among the first 32 columns, the following message is shown:

[DELTA_CLUSTERING_COLUMN_MISSING_STATS] Liquid clustering requires clustering columns to have stats. Couldn't find clustering column(s) 'current_version' in stats schema:

This is confusing when the column's data type is simply not supported for data skipping. To improve this scenario, this PR introduces a new error class, DELTA_CLUSTERING_COLUMNS_NOT_SUPPORTED_DATATYPE, raised when clustering on a data type that is not eligible for data skipping, such as clustering by a Boolean column.

How was this patch tested?

Existing unit tests.

Does this PR introduce any user-facing changes?

@@ -387,6 +387,12 @@
],
"sqlState" : "42P10"
},
"DELTA_CLUSTERING_COLUMNS_NOT_SUPPORTED_DATATYPE" : {
@chirag-s-db (Contributor) commented Aug 19, 2024:

Nit: can we change the wording here to either DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED or DELTA_CLUSTERING_COLUMNS_UNSUPPORTED_DATATYPE? I think it would read better and be more in line with existing error classes.

@dabao521 (Contributor, Author) replied:

Makes sense, and I reworded it to DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED.

// This assertion must hold since missingColumns are subset of clusteringColumnInfos.
assert(missingColumnInfos.length == missingColumns.length)
val (skippingEligibleMissingColumnInfos, nonSkippingEligibleMissingColumnInfos) =
missingColumnInfos.partition(info => SkippingEligibleDataType(info.dataType))
Contributor commented:

Do we need to partition here? We only use the skipping-eligible result when the non-skipping-eligible set is empty, so I think we can just filter for non-skipping-eligible columns and, if that set is empty, use the whole missing column set.

@dabao521 (Contributor, Author) replied:

Good point; I reworked it as suggested.
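The filter-based approach suggested above can be sketched as follows. This is a minimal, hypothetical illustration, not the actual Delta code: `ColumnInfo`, `skippingEligible`, and `pickErrorColumns` are stand-in names invented here, and the real eligibility check is Delta's `SkippingEligibleDataType`.

```scala
// Hypothetical stand-in for Delta's clustering column metadata.
case class ColumnInfo(name: String, dataType: String)

// Stand-in for SkippingEligibleDataType (assumption for illustration:
// boolean is the ineligible type).
def skippingEligible(dataType: String): Boolean = dataType != "boolean"

// If any missing clustering column has a non-skipping-eligible data type,
// report the new datatype error for just those columns. Otherwise every
// missing column is skipping-eligible, so the whole missing set is reported
// as missing from the stats schema -- no partition needed, a single filter
// suffices.
def pickErrorColumns(missing: Seq[ColumnInfo]): (String, Seq[ColumnInfo]) = {
  val nonEligible = missing.filter(info => !skippingEligible(info.dataType))
  if (nonEligible.nonEmpty)
    ("DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED", nonEligible)
  else
    ("DELTA_CLUSTERING_COLUMN_MISSING_STATS", missing)
}
```

Compared with `partition`, the filter avoids building the skipping-eligible sequence that would be discarded whenever any ineligible column exists.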

@dabao521 force-pushed the imporveMissingStatsColumnMessage branch from 2d5cd86 to c431455 on August 22, 2024 at 21:21
getClusteringColumnsNotInStatsSchema(statsCollection, clusteringColumnInfos)
if (missingColumns.nonEmpty) {
// Check DataType eligibility.
val missingColumnSet = missingColumns.toSet
Collaborator commented:

Nit: it doesn't look like a Set is needed, since Seq also supports contains. Also, there will be at most 4 clustering columns.

@dabao521 (Contributor, Author) replied:

Fixed.

@dabao521 force-pushed the imporveMissingStatsColumnMessage branch from c431455 to c2be485 on August 22, 2024 at 21:52
@vkorukanti vkorukanti merged commit eb00b0d into delta-io:master Aug 23, 2024
13 of 16 checks passed
4 participants