-
Notifications
You must be signed in to change notification settings - Fork 838
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Raw JSON writer (~10x faster) (#5314) #5318
Conversation
@@ -20,28 +20,6 @@ | |||
//! This JSON writer converts Arrow [`RecordBatch`]es into arrays of | |||
//! JSON objects or JSON formatted byte streams. | |||
//! | |||
//! ## Writing JSON Objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This functionality isn't removed, yet, but it is deprecated as I can't think of any reasonable use-cases for this. If you're wanting to embed arrow data in another JSON document, serde_json's raw value mechanism is an objectively better way to go about doing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but it is deprecated as I can't think of any reasonable use-cases for this.
Looks like @houqp added it in d868cff many 🌔 's ago - perhaps he has some additional context.
I agree I can't really think of why this would be useful - it seems like it may be similar to wanting to convert RecordBatches into actual Rust struct
s via serde but I can't remember how far we got with that
Given I am not familiar with serde_json's raw value mechanism
I suspect others may not be either
Perhaps you can add a note here about writing JSON objects using serde and leave a link for readers to follow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @tustvold, could you clarify on what the serde_json's raw value mechanism is you're thinking of?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this isn't clear to me either as I mentioned in the original review -- I made a PR to add an example showing how to use this: #5364
@@ -1564,9 +1575,9 @@ mod tests { | |||
r#"{"a":{"list":[1,2]},"b":{"list":[1,2]}} | |||
{"a":{"list":[null]},"b":{"list":[null]}} | |||
{"a":{"list":[]},"b":{"list":[]}} | |||
{"a":null,"b":{"list":[3,null]}} | |||
{"b":{"list":[3,null]}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The prior behaviour feels like a bug to me, without explicit nulls set I would expect consistent use of implicit nulls. The fact that null objects happen to be treated differently to null primitives seems at best confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does seem like it was a bug previously, I'm just racking my brain to remember if I was aware of this before or not, if there was a reason for this 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think when I worked on #5133 I just forgot to consider my previous work for writing explicit nulls in #5065.
This fix makes sense; the only case where we should write nulls if explicit_nulls
is set to false
(i.e. the default) is for list values, and nothing else, I believe. This falls in line with that 👍
8898d2e
to
552828e
Compare
arrow-json/test/data/basic.json
Outdated
@@ -1,5 +1,5 @@ | |||
{"a":1, "b":2.0, "c":false, "d":"4", "e":"1970-1-2", "f": "1.02", "g": "2012-04-23T18:25:43.511", "h": 1.1} | |||
{"a":-10, "b":-3.5, "c":true, "d":"4", "e": "1969-12-31", "f": "-0.3", "g": "2016-04-23T18:25:43.511", "h": 3.141} | |||
{"a":1, "b":2.0, "c":false, "d":"4", "e":"1970-1-2", "f": "1.02", "g": "2012-04-23T18:25:43.511", "h": 1.2802734375} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous writer had some questionable logic to truncate the precision of its output. We no longer do this, and so we need to use a float that can roundtrip be exactly represented in a f16 in order for it to roundtrip precisely.
259163b
to
49a0357
Compare
49a0357
to
e8f914c
Compare
I'm going to label this as an API change, as whilst it technically isn't a breaking change, there is a high risk of there being subtle behaviour changes, especially around the encoding of nulls |
I will re-run the benchmarks tomorrow |
arrow-json/src/writer.rs
Outdated
@@ -703,7 +682,7 @@ where | |||
format: F, | |||
|
|||
/// Whether keys with null values should be written or skipped |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Whether keys with null values should be written or skipped | |
/// Controls how JSON should be encoded, e.g. whether to write explicit nulls or skip them |
fn encode(&mut self, idx: usize, out: &mut Vec<u8>) { | ||
out.push(b'"'); | ||
// Should be infallible | ||
// Note: We are making an assumption that the formatter does not produce characters that require escaping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you expand on this a little? I'm not sure I follow 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, basically if users can provide format specifications containing "
we need to escape them when serializing to JSON
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw some comments to this effect elsewhere. I wonder if it is possible to add a test that would fail if the invariant was broken in the future. I suspect the answer is no given it is not possible to specify format specifiers now 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it isn't currently possible to hit this, I am just documenting it here for future readers who may not realise this detail
@@ -1564,9 +1575,9 @@ mod tests { | |||
r#"{"a":{"list":[1,2]},"b":{"list":[1,2]}} | |||
{"a":{"list":[null]},"b":{"list":[null]}} | |||
{"a":{"list":[]},"b":{"list":[]}} | |||
{"a":null,"b":{"list":[3,null]}} | |||
{"b":{"list":[3,null]}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does seem like it was a bug previously, I'm just racking my brain to remember if I was aware of this before or not, if there was a reason for this 🤔
Most recent numbers
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks really nice to me -- thank you @tustvold
I had some comment quibbles but nothing that is required from my perspective.
Basically I would summarize this PR as "converting to JSON and then writing Value
s to bytes is very slow"
@@ -20,28 +20,6 @@ | |||
//! This JSON writer converts Arrow [`RecordBatch`]es into arrays of | |||
//! JSON objects or JSON formatted byte streams. | |||
//! | |||
//! ## Writing JSON Objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but it is deprecated as I can't think of any reasonable use-cases for this.
Looks like @houqp added it in d868cff many 🌔 's ago - perhaps he has some additional context.
I agree I can't really think of why this would be useful - it seems like it may be similar to wanting to convert RecordBatches into actual Rust struct
s via serde but I can't remember how far we got with that
Given I am not familiar with serde_json's raw value mechanism
I suspect others may not be either
Perhaps you can add a note here about writing JSON objects using serde and leave a link for readers to follow
@@ -481,6 +463,7 @@ fn set_column_for_json_rows( | |||
|
|||
/// Converts an arrow [`RecordBatch`] into a `Vec` of Serde JSON | |||
/// [`JsonMap`]s (objects) | |||
#[deprecated(note = "Use Writer")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't figure out if the deprecation is needed for the new json writer, or did you just include it in the same PR for convenience?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I lumped the deprecation into this PR as moving the writer over to mainly not use this functionality means a reduction in our test coverage of it
arrow-json/src/writer/encoder.rs
Outdated
float_encode!(f32, f64); | ||
|
||
impl PrimitiveEncode for f16 { | ||
type Buffer = <f64 as PrimitiveEncode>::Buffer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't we just use the PrimitiveEncode
directly for f16
? I doubt the performance of f16 encoding is particular critical but I am curious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the formulation of PrimitiveEncoder expects fixed size buffers... Having peeked at f16's display impl, it converts to f32 in order to print and to parse, so will update this to likewise
// Workaround https://github.com/rust-lang/rust/issues/61415 | ||
fn init_buffer() -> Self::Buffer; | ||
|
||
fn encode(self, buf: &mut Self::Buffer) -> &[u8]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would help to document what encode does here
fn encode(self, buf: &mut Self::Buffer) -> &[u8]; | |
/// Encode the primitive value as bytes, returning a reference to that slice. | |
/// `buf` is temporary space that may be used | |
fn encode(self, buf: &mut Self::Buffer) -> &[u8]; |
options: &EncoderOptions, | ||
) -> Result<Box<dyn Encoder + 'a>, ArrowError> { | ||
let (encoder, nulls) = make_encoder_impl(array, options)?; | ||
assert!(nulls.is_none(), "root cannot be nullable"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this -- isn't it possible to try to encode a BooleanArray
as the root with null values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The root is called with a StructArray derived from a RecordBatch, and therefore cannot be nullable
pub explicit_nulls: bool, | ||
} | ||
|
||
pub trait Encoder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please document the expectations on nullability here? Specifically, it seems like this code assumes that this is invoked with idx
for non-null entries, which was not clear to me on my first read of this code
|
||
impl Encoder for BooleanEncoder { | ||
fn encode(&mut self, idx: usize, out: &mut Vec<u8>) { | ||
match self.0.value(idx) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was pretty confused at first trying to figure out why this doesn't check for null, but then I saw the null check is handled in the outer loop
fn encode(&mut self, idx: usize, out: &mut Vec<u8>) { | ||
out.push(b'"'); | ||
// Should be infallible | ||
// Note: We are making an assumption that the formatter does not produce characters that require escaping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw some comments to this effect elsewhere. I wonder if it is possible to add a test that would fail if the invariant was broken in the future. I suspect the answer is no given it is not possible to specify format specifiers now 🤔
pub trait Encoder { | ||
/// Encode the non-null value at index `idx` to `out` | ||
/// | ||
/// The behaviour is unspecified if `idx` corresponds to a null index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Which issue does this PR close?
Closes #5314
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?