[PROTOCOL] Per-file statistics documentation in protocol is ambiguous #3287

Open
1 of 3 tasks
TaylorHodan opened this issue Jun 20, 2024 · 2 comments
Assignees
Labels
bug Something isn't working

Comments

@TaylorHodan

TaylorHodan commented Jun 20, 2024

Bug

Describe the problem

The Per-file Statistics section of the protocol specification is ambiguous: it does not say which per-file statistics are available for columns of specific data types, or how those statistics are formatted. For instance, it does not specify whether min and max statistics should be provided for columns of array type, so whether they are written seems to be at the discretion of the engine. Using DBR 15.2 with Spark 3.5.0 and Scala 2.12, the min and max statistics for arrays (nested or otherwise) are not provided, only the nullCount. Further, for nested arrays, only the nullCount of the nested array itself is given, not that of any of its fields, under the same settings. The _delta_log was generated by creating a Spark DataFrame and then calling df.write.format("delta").mode("append").save(storagePathway).
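To make the observation above easier to reproduce, here is a minimal sketch of how one might check which statistics an engine wrote for each column. It assumes the commit file is newline-delimited JSON and that the `stats` field of an `add` action is a JSON-encoded string with `minValues`, `maxValues` and `nullCount` keys. The sample line, the column names `id` and `tags`, and the helper `stats_coverage` are all hypothetical and only mirror the behavior described in this report.

```python
import json

# Hypothetical line from a _delta_log/*.json commit file. The values are
# invented for illustration: "tags" stands in for an array column that only
# received a nullCount, as described in the report above.
SAMPLE_LOG_LINE = json.dumps({
    "add": {
        "path": "part-00000.snappy.parquet",
        "stats": json.dumps({
            "numRecords": 3,
            "minValues": {"id": 1},
            "maxValues": {"id": 3},
            "nullCount": {"id": 0, "tags": 0},
        }),
    }
})


def stats_coverage(log_line):
    """Map each top-level column to the set of stat kinds present for it."""
    action = json.loads(log_line)
    add = action.get("add")
    if add is None or not add.get("stats"):
        return {}  # not an add action, or no stats were collected
    stats = json.loads(add["stats"])  # assumed: stats is a JSON string
    coverage = {}
    for kind in ("minValues", "maxValues", "nullCount"):
        for column in stats.get(kind, {}):
            coverage.setdefault(column, set()).add(kind)
    return coverage


coverage = stats_coverage(SAMPLE_LOG_LINE)
for column, kinds in sorted(coverage.items()):
    print(column, sorted(kinds))
```

Running this against real commit files from different engines would show whether array columns consistently lack `minValues` and `maxValues`, which is the kind of detail the spec could state explicitly.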

Moreover, the format of the min and max statistics for DateTime columns also seems to be left to the engine: for example, whether HH:MM:SS is included, or whether the values are truncated to their "short" form of YYYY-MM-DD. Using DBR 15.2 with Spark 3.5.0 and Scala 2.12, the statistics were truncated to YYYY-MM-DD in the _delta_log.
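A reader consuming these statistics currently has to guess which form a value takes. The sketch below shows one defensive way to tell the two forms apart. The regular expressions, the helper name `stat_precision`, and the sample strings are assumptions made for illustration; the exact format an engine writes is precisely what this issue asks the spec to pin down.

```python
import re

# Assumed shapes only: a bare calendar date, or a date followed by a time
# component (separator "T" or a space, optional fraction or zone suffix).
DATE_ONLY = re.compile(r"\d{4}-\d{2}-\d{2}")
WITH_TIME = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}.*")


def stat_precision(value):
    """Classify a min/max stat string by the precision it appears to carry."""
    if DATE_ONLY.fullmatch(value):
        return "date-only"
    if WITH_TIME.fullmatch(value):
        return "date-and-time"
    return "unknown"


# Invented sample values, not taken from a real _delta_log.
for sample in ("2024-06-20", "2024-06-20T12:34:56.000Z", "not-a-date"):
    print(sample, "->", stat_precision(sample))
```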

The only example included in this section of the protocol specification seems to show how the min and max values in stats would be formatted in one case; it gives no further detail on how these stats should be formatted for each data type.

Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

  • Yes. I can contribute a fix for this bug independently.
  • Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
  • No. I cannot contribute a bug fix at this time.
TaylorHodan added the bug label Jun 20, 2024
@vkorukanti
Collaborator

Thanks for reporting this. Some of this information is in the Delta docs and in the code around the configs that influence stats collection, but some of it is missing. We will add it to the spec soon.

@LukasRupprecht
Contributor

@vkorukanti I would be interested in covering this. If that's ok, please assign the ticket to me. Thanks!

3 participants