Safely Write IntervalMonthDayNanoArray to parquet or Throw error #6299
Supports writing `IntervalMonthDayNanoArray` to parquet only if its `nanoseconds` part does not exceed `i32::MAX` (the parquet interval logical type is represented as 4 bytes of months + 4 bytes of days + 4 bytes of milliseconds = 12 bytes). When the `nanoseconds` part does not exceed `i32::MAX`, it is safe to write it to parquet after truncating it to 4 bytes. This unblocks those who need to write arrow `IntervalMonthDayNano` with millisecond precision to parquet. It currently always throws an error, even when it is safe to write the value to parquet.
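A minimal sketch of the check described above, using only the standard library rather than the arrow-rs writer. The function name `encode_interval_as_described` is hypothetical, and rejecting values below `i32::MIN` is an assumption (the description only mentions `i32::MAX`).

```rust
/// Hypothetical helper (not the arrow-rs API): packs an Arrow
/// (months, days, nanoseconds) triple into 12 little-endian bytes,
/// but only when the nanoseconds part fits in 4 bytes.
fn encode_interval_as_described(
    months: i32,
    days: i32,
    nanoseconds: i64,
) -> Result<[u8; 12], String> {
    // Assumption: values below i32::MIN are rejected as well.
    if nanoseconds > i32::MAX as i64 || nanoseconds < i32::MIN as i64 {
        return Err(format!(
            "nanoseconds value {} does not fit in 4 bytes",
            nanoseconds
        ));
    }

    let mut out = [0u8; 12];
    out[0..4].copy_from_slice(&months.to_le_bytes());
    out[4..8].copy_from_slice(&days.to_le_bytes());
    // The range check above guarantees this narrowing keeps the value.
    out[8..12].copy_from_slice(&(nanoseconds as i32).to_le_bytes());
    Ok(out)
}
```

Note that the last 4 bytes still hold a nanosecond count: `encode_interval_as_described(1, 2, 3)` stores `3` in the slot that parquet defines as milliseconds.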
This appears to be writing nanoseconds as milliseconds, which is incorrect?
Physically it will not lose precision, since we write it only if it is <= i32::MAX. But yeah, logically it seems incorrect since we write a [...] I think it could be an option to check a metadata key like [...] There is currently no way to write an interval with millis to parquet via arrow, and it seems that we can get rid of the current limitation.
Right, but this is not compatible with the parquet logical type definition, so it is simply incorrect...
The broader issue here is that parquet doesn't support nanosecond precision intervals, and we're constrained by what the format itself supports - apache/parquet-format#313
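To make the unit mismatch concrete, here is a hedged sketch of a conversion that respects the parquet INTERVAL layout (three little-endian unsigned 32-bit integers: months, days, milliseconds). It is not something proposed in this thread; the name `to_parquet_interval` and the choice to reject sub-millisecond or negative values instead of rounding are assumptions made for illustration.

```rust
const NANOS_PER_MILLI: i64 = 1_000_000;

/// Hypothetical helper (not the arrow-rs API): converts an Arrow
/// (months, days, nanoseconds) triple into the 12-byte parquet
/// INTERVAL layout, refusing anything that cannot be represented.
fn to_parquet_interval(
    months: i32,
    days: i32,
    nanoseconds: i64,
) -> Result<[u8; 12], String> {
    // The parquet INTERVAL fields are unsigned.
    if months < 0 || days < 0 || nanoseconds < 0 {
        return Err("negative components are not representable".to_string());
    }
    // Assumption: reject rather than round sub-millisecond precision.
    if nanoseconds % NANOS_PER_MILLI != 0 {
        return Err(format!(
            "{} ns is not a whole number of milliseconds",
            nanoseconds
        ));
    }
    let millis = nanoseconds / NANOS_PER_MILLI;
    if millis > u32::MAX as i64 {
        return Err(format!("{} ms does not fit in 4 bytes", millis));
    }

    let mut out = [0u8; 12];
    out[0..4].copy_from_slice(&(months as u32).to_le_bytes());
    out[4..8].copy_from_slice(&(days as u32).to_le_bytes());
    out[8..12].copy_from_slice(&(millis as u32).to_le_bytes());
    Ok(out)
}
```

With this sketch, `to_parquet_interval(1, 2, 3_000_000)` stores `3` in the milliseconds slot, while a value such as `1_500_000` ns is rejected because the format has no room for the remaining half millisecond.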
Yes, I totally understand the point. But do you think the approach below is broken or fragile in the context of the arrow-to-parquet reader/writer?
On [...]
On [...]
Perhaps #1938 might work for you?
It looks like it would work, but AFAIU it is not in progress for the interval type yet, right?
Right, I was suggesting it as a potential path forward - I don't think we can merge something that either writes non-spec compliant parquet, or relies on creating non-spec compliant arrow arrays
Supports writing `IntervalMonthDayNanoArray` to parquet only if its `nanoseconds` part does not exceed `i32::MAX` (the parquet interval logical type is represented as 4 bytes of months + 4 bytes of days + 4 bytes of milliseconds = 12 bytes).
When the `nanoseconds` part does not exceed `i32::MAX`, it is safe to write it to parquet after truncating it to 4 bytes. Otherwise, we throw an error, as we would lose precision.
Which issue does this PR close?
Closes #6298.
Rationale for this change
This unblocks those who need to write arrow `IntervalMonthDayNano` with millisecond precision to parquet. It currently always throws an error, even when it is safe to write the value to parquet.
What changes are included in this PR?
When the `nanoseconds` part does not exceed `i32::MAX`, it is safe to write it to parquet after truncating it to 4 bytes.
When the `nanoseconds` part exceeds `i32::MAX`, we throw an error.
Are there any user-facing changes?
No.