-
Notifications
You must be signed in to change notification settings - Fork 838
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
arrow_json: support decimal 128 and 256 types in json writer #5197
arrow_json: support decimal 128 and 256 types in json writer #5197
Conversation
# arrow-data = { workspace = true } | ||
# arrow-schema = { workspace = true } | ||
|
||
arrow-buffer = { version = "49" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
temporary change while testing in another system
explicit_nulls: bool, | ||
) -> Result<(), ArrowError> { | ||
let options = FormatOptions::default(); | ||
let formatter = ArrayFormatter::try_new(array.as_ref(), &options)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this formatter approach was lifted from the date-type writer logic above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will result in precision loss, decimals need to use the formatting logic in arrow-cast. I'm not entirely sure how to support this within the constraint of a serde_json value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey @tustvold, i'm confused - the ArrayFormatter
is from arrow_cast
and, from what I can tell, uses the format_decimal logic in this impl (from arrow_cast
)
arrow-rs/arrow-cast/src/display.rs
Line 441 in bc96ca3
macro_rules! decimal_display { |
Let me know what I'm missing! Thank you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tustvold did you mean to reference this method in the arrow_array
crate?
arrow-rs/arrow-cast/src/display.rs
Line 441 in bc96ca3
macro_rules! decimal_display { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is you are then parsing to a f64 which will result in precision loss. As serde_json value lacks a decimal representation you may be forced to encode decimals as JSON strings to avoid this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Writing non-native numeric quantities as string in JSON is a fairly standard approach to workaround the limited precision many readers have by default, serde_json included. Protobuf even serializes u64 to strings as part of its JSON specification. This is not just because the JSON specification only mandated double precision floating point, but because there are real performance disadvantages to doing otherwise.
I agree using a string is not a good thing, ideally we wouldn't be relying on serde_json to serialize, but aside from a fairly major rewrite of this code to not usw serde_json value, it seems to me like a pragmatic way to get something, which is better than not supporting anything?
Edit: FWIW I filed the ticket for actually de-serdifying the writer - #5314
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the context. 5314 which builds on top of the impressive #3479 looks like a great step forward. In the meantime, I appreciate that writing to string is the correct thing to do, but I recommend that we balance the ideal technical solution with how it will be used in the real world and that user experience.
For practical purposes, I strongly suggest that we provide the user an option to write decimals as json numbers. We could do this in a non-breaking way by adding a pub JsonWriter
that provides a method to set the decimal decoding as number (the default being string). This would ensure that by default, the correct thing is performed but the user can opt in for the practical parsing to number, saving them a ton of headache downstream.
For example, our downstream use case would require this feature; otherwise we'd have to maintain a fork of the crate that lets us serialize decimals as number; otherwise we'd have to layer in additional parsing on each field to see if string fields could actually be parsed to numbers (which would be commonly the case since we're dealing with small decimal-type numbers at the source).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, will that implementation enable us to write decimal types to json as a number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
Hi @matthewgapp , I was wondering if you'd have any plans to continue working on this PR, please? |
Hey @saeedzareian apologies about ghosting this. I'll pick this up again this week. It's blocking our downstream project. |
#5318 has now been merged and should give us flexibility to format decimals directly |
Hey, I would also be interested in this feature, are the next steps here known? Can I help? FWIW, I think I'd prefer decimals to be printed as strings, since otherwise it's easier for the reader of the json to screw things up when reading (if you print a decimal like "10.3" as |
Closing as this PR has been inactive for a while |
Implemented in #6606 |
Which issue does this PR close?
Closes #5198
Rationale for this change
This PR will enable support for Decimal type columns to json as floats
What changes are included in this PR?
This PR uses similar logic that's used to write arrow date types to write the decimal types to json.
Are there any user-facing changes?