Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support zero column RecordBatches in pyarrow integration (use RecordBatchOptions when converting a pyarrow RecordBatch) #6320

Merged
merged 6 commits into from
Aug 31, 2024

Conversation

Michael-J-Ward
Copy link
Contributor

@Michael-J-Ward Michael-J-Ward commented Aug 28, 2024

Which issue does this PR close?

Closes #6318.

Rationale for this change

arrow-rs already has the capability of handling RecordBatches with no columns or data but non-zero row counts #1552.

However, from_pyarrow_bound for RecordBatch does not currently take advantage of it, which causes an error when running select count(*) from pyarrow datasets in datafusion-python apache/datafusion-python#800.

What changes are included in this PR?

This updates the impl FromPyarrowBound for RecordBatch to use try_new_with_options instead of the default try_new.

Are there any user-facing changes?

Yes, from_pyarrow_bound will now succeed for a RecordBatch with no columns but a row-count set.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 28, 2024
Comment on lines 336 to 344
// Technically `num_rows` is an attribute on `pyarrow.RecordBatch`
// If other python classes can use the PyCapsule interface and do not have this attribute,
// then this will have no effect.
let row_count = value
.getattr("num_rows")
.ok()
.and_then(|x| x.extract().ok());
let options = RecordBatchOptions::default().with_row_count(row_count);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My initial thought is that the PyCapsule interface should handle this, and so this should not be before checking for the pycapsule dunder. If this breaks via the C data interface, I'd like to look for a fix to that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd strongly prefer a non-pyarrow-specific solution to this, or else we'll get the same failure from other Arrow producers.

In kylebarron/arro3#177 I added some tests to arro3 to make sure my (arrow-rs derived) FFI can handle this. It's a bit annoying: the ArrayData will have positive length but then once you import that with makeData, you'll have a StructArray with length 0. I think your most recent commit fixes this.

@@ -476,6 +476,27 @@ def test_tensor_array():

del b


def test_empty_recordbatch_with_row_count():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose CI is likely always testing with the most recent version of pyarrow, and thus we only really test with the PyCapsule Interface, not with the pyarrow-specific FFI. If you wanted to ensure you're testing the PyCapsule Interface, you can create a wrapper class around a pa.RecordBatch that only exposes the PyCapsule dunder method:

https://github.com/pola-rs/polars/blob/b2550a092e34aa40f8786f45ff67cab96c93695d/py-polars/tests/unit/constructors/test_constructors.py#L1661-L1676

Then you can be assured that

rust.round_trip_record_batch(PyCapsuleArrayHolder(batch))

is testing the PyCapsule Interface

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like CI runs with at least both pyarrow 13 (last release before capsules) and 14

https://github.com/apache/arrow-rs/actions/runs/10603372118?pr=6320

@alamb alamb changed the title use RecordBatchOptions when converting a pyarrow RecordBatch Support zero column RecordBatches in pyarrow integration (use RecordBatchOptions when converting a pyarrow RecordBatch) Aug 31, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Michael-J-Ward and @kylebarron. This change looks good to me ❤️

@alamb alamb merged commit 0c15191 into apache:master Aug 31, 2024
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow converting empty pyarrow.RecordBatch to arrow::RecordBatch
3 participants