-
Notifications
You must be signed in to change notification settings - Fork 838
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix arrow-json encoding with dictionary including nulls #6503
fix arrow-json encoding with dictionary including nulls #6503
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can revert this move if you prefer, but I think the previous naming was weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like a nice improvement to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It actually seems mod.rs
naming is no longer the preferred convention: https://doc.rust-lang.org/reference/items/modules.html#module-source-filenames
Note: Prior to
rustc
1.30, usingmod.rs
files was the way to load a module with nested children. It is encouraged to use the new naming convention as it is more consistent, and avoids having many files namedmod.rs
within a project.
I don't have any particular strong feelings about this, as it would be nice to keep consistent (as reader already has mod.rs
) 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree keeping consistency is a good thing
See #5318 (comment) for discussion of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @samuelcolvin and @mbrobbel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like a nice improvement to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It actually seems mod.rs
naming is no longer the preferred convention: https://doc.rust-lang.org/reference/items/modules.html#module-source-filenames
Note: Prior to
rustc
1.30, usingmod.rs
files was the way to load a module with nested children. It is encouraged to use the new naming convention as it is more consistent, and avoids having many files namedmod.rs
within a project.
I don't have any particular strong feelings about this, as it would be nice to keep consistent (as reader already has mod.rs
) 👍
#[test] | ||
fn test_writer_null_dict() { | ||
let keys = Int32Array::from_iter(vec![Some(0), None, Some(1)]); | ||
let values = Arc::new(StringArray::from_iter(vec![Some("a"), None])); | ||
let dict = DictionaryArray::new(keys, values); | ||
|
||
let schema = SchemaRef::new(Schema::new(vec![Field::new( | ||
"my_dict", | ||
DataType::Dictionary(DataType::Int32.into(), DataType::Utf8.into()), | ||
true, | ||
)])); | ||
|
||
let array = Arc::new(dict) as ArrayRef; | ||
let batch = RecordBatch::try_new(schema, vec![array]).unwrap(); | ||
|
||
let mut json = Vec::new(); | ||
let write_builder = WriterBuilder::new().with_explicit_nulls(true); | ||
let mut writer = write_builder.build::<_, JsonArray>(&mut json); | ||
writer.write(&batch).unwrap(); | ||
writer.close().unwrap(); | ||
|
||
let json_str = str::from_utf8(&json).unwrap(); | ||
assert_eq!( | ||
json_str, | ||
r#"[{"my_dict":"a"},{"my_dict":null},{"my_dict":null}]"# | ||
) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thanks @samuelcolvin and @Jefffrey and @mbrobbel |
Rationale for this change
While dictionary values are generally not null (it's more efficient to have a
null
in the keys), there's nothing to stop them from being nulls, and currentlydatafusion-functions-json
generates dictionaries with null values.Until now they caused
arrow-json
to panic, that is now fixed.What changes are included in this PR?
Use
make_encoder_impl
, notmake_encoder
inDictionaryEncoder
to avoid unnecessary and incorrectnulls.is_none()
assertion.Add a test.
Are there any user-facing changes?
no.
Related: we've copied arrow-json and customised it to support the unions generated by
datafusion-functions-json
, and also to inline JSON data within JSON based no field metadata, would you be inerested in accepting logic to allow customising of JSON encoding?