[SPARK-48898][SQL] Set nullability correctly in the Variant schema #49118
// metadata is always non-nullable.
assert(SparkShreddingUtils.variantShreddingSchema(IntegerType) ==
StructType(Seq(StructField("metadata", BinaryType, nullable = false),
StructField("value", BinaryType, nullable = true),
can we have double spaced indentation?
Done.
Thanks, merging to master!
Interestingly, it seems to fail in
It seems that the following PR landed faster, and this PR wasn't tested against the newly added test case (which currently fails).
@@ -61,8 +61,11 @@ case object SparkShreddingUtils {
StructField(TypedValueFieldName, arrayShreddingSchema, nullable = true)
)
case StructType(fields) =>
// The field name level is always non-nullable: Variant null values are represented in the
// "value" columna as "00", and missing values are represented by setting both "value" and
columna -> column?
Sorry, but let me revert this to recover the CI and unblock other PRs. Although this looks like a trivial malformed-error test case, could you re-submit this PR and make sure it passes all tests?
Hi @dongjoon-hyun, I opened #49151, which is exactly the same, but updates the broken test. Can you please take a look, and merge if it looks okay?
Thank you so much, @cashmand. If CI passes, I'll merge back ASAP.
What changes were proposed in this pull request?
The variantShreddingSchema method converts a human-readable schema for Variant to one that's a valid shredding schema. According to the shredding schema in apache/parquet-format#461, each shredded field in an object should be a required group - i.e. a non-nullable struct. This PR fixes variantShreddingSchema to mark that struct as non-nullable.
Why are the changes needed?
If we use variantShreddingSchema to construct a schema for Parquet, the schema would be technically non-conformant with the spec by setting the group as optional. I don't think this should really matter to readers, but it would waste a bit of space in the Parquet file by adding an extra definition level.
Does this PR introduce any user-facing change?
No, this code is not used yet.
How was this patch tested?
Added a test to do some minimal validation of the variantShreddingSchema function.
Was this patch authored or co-authored using generative AI tooling?
No.
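For readers who want to see the nullability rule from the description in isolation, here is a minimal sketch. It is not the Spark implementation: the DataType, StructField and StructType definitions below are toy stand-ins for the real org.apache.spark.sql.types classes, ShreddingSketch.shreddingSchema is a hypothetical simplification that handles only scalars and objects, and the "typed_value" name is assumed to be what TypedValueFieldName refers to. It only illustrates that "metadata" and each shredded object field are non-nullable, while "value" and "typed_value" stay nullable.

```scala
// Toy stand-ins for Spark's schema types (NOT the real org.apache.spark.sql.types classes).
sealed trait DataType
case object BinaryType extends DataType
case object IntegerType extends DataType
case class StructField(name: String, dataType: DataType, nullable: Boolean)
case class StructType(fields: Seq[StructField]) extends DataType

object ShreddingSketch {
  // Hypothetical simplification of variantShreddingSchema: scalars and objects only.
  def shreddingSchema(dataType: DataType, isTopLevel: Boolean = true): StructType = {
    val typed: DataType = dataType match {
      case StructType(fields) =>
        // Each shredded object field becomes a required group: nullable = false.
        StructType(fields.map { f =>
          f.copy(
            dataType = shreddingSchema(f.dataType, isTopLevel = false),
            nullable = false)
        })
      case scalar => scalar
    }
    val valueFields = Seq(
      StructField("value", BinaryType, nullable = true),
      StructField("typed_value", typed, nullable = true))
    // "metadata" appears only at the top level and is always non-nullable.
    if (isTopLevel) {
      StructType(StructField("metadata", BinaryType, nullable = false) +: valueFields)
    } else {
      StructType(valueFields)
    }
  }
}
```

Calling ShreddingSketch.shreddingSchema on a one-field object such as StructType(Seq(StructField("a", IntegerType, nullable = true))) yields a "typed_value" struct whose field "a" is non-nullable, even though the input field was declared nullable. In Parquet terms that makes the field a required group, so it does not contribute an extra definition level.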