Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add pyarrow.Table support #487

Merged
merged 16 commits into from
Nov 25, 2024
Merged

feat: add pyarrow.Table support #487

merged 16 commits into from
Nov 25, 2024

Conversation

amol-
Copy link
Contributor

@amol- amol- commented Oct 5, 2024

Summary

Allows to directly use pyarrow.Table objects without having to cast them to pandas (which is an optional dependency of pyarrow). This is also convenient as more and more libraries are supporting a to_arrow() method and thus would allow to use data from those libraries too.

Checklist

@amol-
Copy link
Contributor Author

amol- commented Oct 5, 2024

While I added tests for tbl_data it wasn't immediately clear to me if e2e tests for each table library exist and how to add one.  I updated the tests where @parametrize was used to test across multiple backends

Copy link

codecov bot commented Oct 7, 2024

Codecov Report

Attention: Patch coverage is 88.37209% with 15 lines in your changes missing coverage. Please review.

Project coverage is 89.79%. Comparing base (4771178) to head (ef41328).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
great_tables/_tbl_data.py 88.00% 15 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #487      +/-   ##
==========================================
+ Coverage   89.71%   89.79%   +0.07%     
==========================================
  Files          45       45              
  Lines        5202     5321     +119     
==========================================
+ Hits         4667     4778     +111     
- Misses        535      543       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

great_tables/_tbl_data.py Outdated Show resolved Hide resolved
@machow
Copy link
Collaborator

machow commented Oct 9, 2024

Thanks for opening this PR! It seems like adding pyarrow support is a good chance for us to clean up our tests of different backends a bit, and make sure the tests in tbl_data cover the range of use cases.

If we could add a snapshot test (would be okay to put at the bottom of test_tbl_data.py I think), I'd be on board with merging this. We can note that it's in development, while tbl_data tests get cleaned up. (I'm happy to add the snapshot to this PR)

@amol-
Copy link
Contributor Author

amol- commented Oct 9, 2024

If we could add a snapshot test (would be okay to put at the bottom of test_tbl_data.py I think), I'd be on board with merging this. We can note that it's in development, while tbl_data tests get cleaned up. (I'm happy to add the snapshot to this PR)

Happy to add a test, not a problem 👍

.pre-commit-config.yaml Outdated Show resolved Hide resolved
@amol-
Copy link
Contributor Author

amol- commented Oct 9, 2024

If we could add a snapshot test (would be okay to put at the bottom of test_tbl_data.py I think), I'd be on board with merging this. We can note that it's in development, while tbl_data tests get cleaned up. (I'm happy to add the snapshot to this PR)

Happy to add a test, not a problem 👍

Test added in d3cdd1c

But I couldn't find a way to avoid duplicating the snapshot that didn't involve writing a custom AmberSnapshotExtension

@@ -136,7 +136,7 @@ def _object_as_dict(v: Any) -> Any:
return list(_object_as_dict(i) for i in v)
if isinstance(v, dict):
return dict((k, _object_as_dict(val)) for (k, val) in v.items())
if type(v) == type(_object_as_dict): # FIXME figure out how to get "function"
if type(v) is type(_object_as_dict): # FIXME figure out how to get "function"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to satisfy newer flake8: https://github.com/amol-/great-tables/actions/runs/11259168945/job/31307498137

great_tables/_utils.py:139:8: E721 do not compare types, for exact checks use `is` / `is not`, for instance checks use `isinstance()`

but indeed it was a good idea anyway

Copy link
Collaborator

@machow machow Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shoot -- it seems like maybe this check was originally trying to do either if callable(v): or maybe inspect.isfunction(v) 😓

(def seems okay to leave as is, and we can fix it up later if flake8 is happy again!)

Copy link
Collaborator

@machow machow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for this -- I'd be on board merging this with an experimental warning! The one thing I think that might be important is if we could revert back the changes to the pandas and polars implementation of _set_cell, just to keep it scoped onto PyArrow.

I should be able to tweak the use of _set_cell to be column focused down the road, which should hopefully work better with frame libraries 😓

(Happy to make any of these changes and merge if useful!)

@@ -173,7 +174,7 @@ def render_formats(self, data_tbl: TblData, formats: list[FormatInfo], context:
# TODO: I think that this is very inefficient with polars, so
# we could either accumulate results and set them per column, or
# could always use a pandas DataFrame inside Body?
_set_cell(self.body, row, col, result)
self.body = _set_cell(self.body, row, col, result)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for spending time working through this quirk in the formatter 😬

What if we did this for now?:

  • if _set_cell returns something that is not None, we set it to self.body
  • We restore the original behavior of _set_cell returning None for pandas and polars, and mutating them (just to ensure this PR stays scoped to pyarrow).
  • After merging, I can open an issue to refactor this bit of code so it works columnwise (rather than on each individual row, col cell)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds a reasonable approach

great_tables/_tbl_data.py Show resolved Hide resolved
@amol-
Copy link
Contributor Author

amol- commented Oct 15, 2024

@machow I think all requested changes should have been addressed

@machow
Copy link
Collaborator

machow commented Oct 25, 2024

Hey, thanks for putting all this work in on the PR! I'm just getting back from being OOO, and can work on merging next week!

@machow machow self-requested a review November 12, 2024 21:22
@machow machow self-assigned this Nov 13, 2024
@machow
Copy link
Collaborator

machow commented Nov 22, 2024

Ok, thanks for infinite patience here -- I'm planning to take a look on Monday

Copy link
Collaborator

@machow machow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @amol- for pushing this through -- it's really neat to see how things wire up to pyarrow! (And for us to have to think about our _set_cell behaviors...)

@machow machow assigned rich-iannone and unassigned machow Nov 25, 2024
@machow
Copy link
Collaborator

machow commented Nov 25, 2024

@rich-iannone do you want to merge?

@rich-iannone rich-iannone merged commit 925fa41 into posit-dev:main Nov 25, 2024
13 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants