Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48898][SQL] Set nullability correctly in the Variant schema #49118

Closed
wants to merge 2 commits into from

Conversation

cashmand
Copy link
Contributor

@cashmand cashmand commented Dec 9, 2024

What changes were proposed in this pull request?

The variantShreddingSchema method converts a human-readable schema for Variant to one that's a valid shredding schema. According to the shredding schema in apache/parquet-format#461, each shredded field in an object should be a required group - i.e. a non-nullable struct. This PR fixes the variantShreddingSchema to mark that struct as non-nullable.

Why are the changes needed?

If we use variantShreddingSchema to construct a schema for Parquet, the schema would be technically non-conformant with the spec by setting the group as optional. I don't think this should really matter to readers, but it would waste a bit of space in the Parquet file by adding an extra definition level.

Does this PR introduce any user-facing change?

No, this code is not used yet.

How was this patch tested?

Added a test to do some minimal validation of the variantShreddingSchema function.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Dec 9, 2024
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-48898] Set nullability correctly in the Variant schema [SPARK-48898][SQL] Set nullability correctly in the Variant schema Dec 9, 2024
// metadata is always non-nullable.
assert(SparkShreddingUtils.variantShreddingSchema(IntegerType) ==
StructType(Seq(StructField("metadata", BinaryType, nullable = false),
StructField("value", BinaryType, nullable = true),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have double spaced indentation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@cashmand cashmand requested a review from HyukjinKwon December 10, 2024 21:16
@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 9394b35 Dec 11, 2024
@dongjoon-hyun
Copy link
Member

Interestingly, it seems to fail in master branch somehow. Could you take a look at the failures, please, @cashmand and all?

Screenshot 2024-12-11 at 10 23 41

[info] *** 1 TEST FAILED ***
[error] Failed: Total 12430, Failed 1, Errors 0, Passed 12429, Ignored 33, Canceled 1
[error] Failed tests:
[error] 	org.apache.spark.sql.VariantShreddingSuite
[error] (sql / Test / test) sbt.TestsFailedException: Tests unsuccessful

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Dec 11, 2024

It seems that the following PR landed faster and this PR didn't test the newly added test case (which fail currently).

@@ -61,8 +61,11 @@ case object SparkShreddingUtils {
StructField(TypedValueFieldName, arrayShreddingSchema, nullable = true)
)
case StructType(fields) =>
// The field name level is always non-nullable: Variant null values are represented in the
// "value" columna as "00", and missing values are represented by setting both "value" and
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

columna -> column?

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Dec 11, 2024

Sorry but let me revert this to recover the CI and unblock other PRs. Although it looks like a trivial Mal-formed error case testing, could you re-submit this PR to make it sure to pass all?

@cashmand
Copy link
Contributor Author

Sorry but let me revert this to recover the CI and unblock other PRs. Although it looks like a trivial Mal-formed error case testing, could you re-submit this PR to make it sure to pass all?

Hi @dongjoon-hyun, I opened #49151, which is exactly the same, but updates the broken test. Can you please take a look, and merge if it looks okay?

@dongjoon-hyun
Copy link
Member

Thank you so much, @cashmand . If CI passes, I'll merge back ASAP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants