-
Notifications
You must be signed in to change notification settings - Fork 833
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support zero column RecordBatch
es in pyarrow integration (use RecordBatchOptions when converting a pyarrow RecordBatch)
#6320
Support zero column RecordBatch
es in pyarrow integration (use RecordBatchOptions when converting a pyarrow RecordBatch)
#6320
Conversation
arrow/src/pyarrow.rs
Outdated
// Technically `num_rows` is an attribute on `pyarrow.RecordBatch` | ||
// If other python classes can use the PyCapsule interface and do not have this attribute, | ||
// then this will have no effect. | ||
let row_count = value | ||
.getattr("num_rows") | ||
.ok() | ||
.and_then(|x| x.extract().ok()); | ||
let options = RecordBatchOptions::default().with_row_count(row_count); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My initial thought is that the PyCapsule interface should handle this, and so this should not be before checking for the pycapsule dunder. If this breaks via the C data interface, I'd like to look for a fix to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd strongly prefer a non-pyarrow-specific solution to this, or else we'll get the same failure from other Arrow producers.
In kylebarron/arro3#177 I added some tests to arro3 to make sure my (arrow-rs derived) FFI can handle this. It's a bit annoying: the ArrayData
will have positive length but then once you import that with makeData
, you'll have a StructArray
with length 0
. I think your most recent commit fixes this.
@@ -476,6 +476,27 @@ def test_tensor_array(): | |||
|
|||
del b | |||
|
|||
|
|||
def test_empty_recordbatch_with_row_count(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose CI is likely always testing with the most recent version of pyarrow, and thus we only really test with the PyCapsule Interface, not with the pyarrow-specific FFI. If you wanted to ensure you're testing the PyCapsule Interface, you can create a wrapper class around a pa.RecordBatch
that only exposes the PyCapsule dunder method:
Then you can be assured that
rust.round_trip_record_batch(PyCapsuleArrayHolder(batch))
is testing the PyCapsule Interface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like CI runs with at least both pyarrow 13 (last release before capsules) and 14
https://github.com/apache/arrow-rs/actions/runs/10603372118?pr=6320
RecordBatch
es in pyarrow integration (use RecordBatchOptions when converting a pyarrow RecordBatch)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Michael-J-Ward and @kylebarron. This change looks good to me ❤️
Which issue does this PR close?
Closes #6318.
Rationale for this change
arrow-rs
already has the capability of handling RecordBatches with no columns or data but non-zero row counts #1552.However,
from_pyarrow_bound
forRecordBatch
does not currently take advantage of it, which causes an error when runningselect count(*)
from pyarrow datasets indatafusion-python
apache/datafusion-python#800.What changes are included in this PR?
This updates the
impl FromPyarrowBound
forRecordBatch
to usetry_new_with_options
instead of the defaulttry_new
.Are there any user-facing changes?
Yes,
from_pyarrow_bound
will now succeed for a RecordBatch with no columns but a row-count set.