
[Spark] Improve missing stats column message for unsupported data skipping types #3577

Merged
merged 2 commits into delta-io:master on Aug 23, 2024

Conversation

@dabao521 (Contributor) commented Aug 19, 2024

Which Delta project/connector is this regarding?

  • [x] Spark
  • [ ] Standalone
  • [ ] Flink
  • [ ] Kernel
  • [ ] Other (fill in here)

Description

Today, when either a clustering column's data type is unsupported for data skipping, or the clustering column is not among the first 32 columns, the following message is shown:

[DELTA_CLUSTERING_COLUMN_MISSING_STATS] Liquid clustering requires clustering columns to have stats. Couldn't find clustering column(s) 'current_version' in stats schema:

This is confusing when the column's data type is simply not supported for data skipping. To improve this scenario, this PR introduces a new error class, DELTA_CLUSTERING_COLUMNS_NOT_SUPPORTED_DATATYPE, raised when clustering on a data type that is not eligible for data skipping, such as clustering by a Boolean column.

How was this patch tested?

Existing unit tests.

Does this PR introduce any user-facing changes?

@@ -387,6 +387,12 @@
],
"sqlState" : "42P10"
},
"DELTA_CLUSTERING_COLUMNS_NOT_SUPPORTED_DATATYPE" : {
@chirag-s-db (Contributor) commented Aug 19, 2024:

Nit: can we change the wording here to either DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED or DELTA_CLUSTERING_COLUMNS_UNSUPPORTED_DATATYPE? I think it would read better and be more in line with existing error classes.

@dabao521 (Contributor, Author) replied:

Makes sense, and I reworded it to DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED.

// This assertion must hold since missingColumns are subset of clusteringColumnInfos.
assert(missingColumnInfos.length == missingColumns.length)
val (skippingEligibleMissingColumnInfos, nonSkippingEligibleMissingColumnInfos) =
missingColumnInfos.partition(info => SkippingEligibleDataType(info.dataType))
Contributor commented:

Do we need to partition here? We only use the skipping-eligible result when the non-skipping-eligible set is empty, so I think we can just filter for non-skipping-eligible columns and, if that set is empty, use the whole missing column set.

@dabao521 (Contributor, Author) replied:

Good point; I reworked it as suggested.
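The filter-based approach suggested above can be sketched as follows. This is a minimal, hypothetical illustration, not the actual Delta code: `ColumnInfo`, `skippingEligible`, and `pickErrorColumns` are stand-in names invented here, and the real eligibility check is Delta's `SkippingEligibleDataType`.

```scala
// Hypothetical stand-in for Delta's clustering column metadata.
case class ColumnInfo(name: String, dataType: String)

// Stand-in for SkippingEligibleDataType (assumption for illustration:
// boolean is the ineligible type).
def skippingEligible(dataType: String): Boolean = dataType != "boolean"

// If any missing clustering column has a non-skipping-eligible data type,
// report the new datatype error for just those columns. Otherwise every
// missing column is skipping-eligible, so the whole missing set is reported
// as missing from the stats schema -- no partition needed, a single filter
// suffices.
def pickErrorColumns(missing: Seq[ColumnInfo]): (String, Seq[ColumnInfo]) = {
  val nonEligible = missing.filter(info => !skippingEligible(info.dataType))
  if (nonEligible.nonEmpty)
    ("DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED", nonEligible)
  else
    ("DELTA_CLUSTERING_COLUMN_MISSING_STATS", missing)
}
```

Compared with `partition`, the filter avoids building the skipping-eligible sequence that would be discarded whenever any ineligible column exists.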

@dabao521 force-pushed the imporveMissingStatsColumnMessage branch from 2d5cd86 to c431455 on August 22, 2024 at 21:21
getClusteringColumnsNotInStatsSchema(statsCollection, clusteringColumnInfos)
if (missingColumns.nonEmpty) {
// Check DataType eligibility.
val missingColumnSet = missingColumns.toSet
Collaborator commented:

Nit: it doesn't look like a Set is needed, since Seq also supports contains. Also, there will be at most 4 clustering columns.

@dabao521 (Contributor, Author) replied:

Fixed.

@dabao521 force-pushed the imporveMissingStatsColumnMessage branch from c431455 to c2be485 on August 22, 2024 at 21:52
@vkorukanti vkorukanti merged commit eb00b0d into delta-io:master Aug 23, 2024
13 of 16 checks passed
4 participants