-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spark] Improve missing stats column message for unsupported data skipping types #3577
[Spark] Improve missing stats column message for unsupported data skipping types #3577
Conversation
@@ -387,6 +387,12 @@ | |||
], | |||
"sqlState" : "42P10" | |||
}, | |||
"DELTA_CLUSTERING_COLUMNS_NOT_SUPPORTED_DATATYPE" : { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can we change the wording here to eitherDELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED
orDELTA_CLUSTERING_COLUMNS_UNSUPPORTED_DATATYPE
? I think it would read better and be more in line with existing error classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense and I reworded it to DELTA_CLUSTERING_COLUMNS_DATATYPE_NOT_SUPPORTED
// This assertion must hold since missingColumns are subset of clusteringColumnInfos. | ||
assert(missingColumnInfos.length == missingColumns.length) | ||
val (skippingEligibleMissingColumnInfos, nonSkippingEligibleMissingColumnInfos) = | ||
missingColumnInfos.partition(info => SkippingEligibleDataType(info.dataType)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to partition here? We only use the skipping eligible result if the non-skipping eligible is empty here, so I think we can just do filter
here and use the whole missing column set if the non-skipping eligible set is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point and I reworked as suggested
2d5cd86
to
c431455
Compare
getClusteringColumnsNotInStatsSchema(statsCollection, clusteringColumnInfos) | ||
if (missingColumns.nonEmpty) { | ||
// Check DataType eligibility. | ||
val missingColumnSet = missingColumns.toSet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: doesn't look like set is needed as Seq also supports contains. Also there'll be at most 4 clustering columns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
c431455
to
c2be485
Compare
Which Delta project/connector is this regarding?
Description
Today when either the clustering column data type is unsupported for data skipping or clustering column is not in the first 32 columns, it shows the following message
[DELTA_CLUSTERING_COLUMN_MISSING_STATS] Liquid clustering requires clustering columns to have stats. Couldn't find clustering column(s) 'current_version' in stats schema:
This is confusing to users when the column data type is not supported for data skipping. To improve this scenario this PR introduces a new error class DELTA_CLUSTERING_COLUMNS_NOT_SUPPORTED_DATATYPE when cluster by non data skipping data type such as cluster by Boolean column
How was this patch tested?
Existing unit tests.
Does this PR introduce any user-facing changes?