-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add pyarrow.Table support #487
Conversation
While I added tests for |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #487 +/- ##
==========================================
+ Coverage 89.71% 89.79% +0.07%
==========================================
Files 45 45
Lines 5202 5321 +119
==========================================
+ Hits 4667 4778 +111
- Misses 535 543 +8 ☔ View full report in Codecov by Sentry. |
Thanks for opening this PR! It seems like adding pyarrow support is a good chance for us to clean up our tests of different backends a bit, and make sure the tests in tbl_data cover the range of use cases. If we could add a snapshot test (would be okay to put at the bottom of |
Happy to add a test, not a problem 👍 |
Test added in d3cdd1c But I couldn't find a way to avoid duplicating the snapshot that didn't involve writing a custom AmberSnapshotExtension |
great_tables/_utils.py
Outdated
@@ -136,7 +136,7 @@ def _object_as_dict(v: Any) -> Any: | |||
return list(_object_as_dict(i) for i in v) | |||
if isinstance(v, dict): | |||
return dict((k, _object_as_dict(val)) for (k, val) in v.items()) | |||
if type(v) == type(_object_as_dict): # FIXME figure out how to get "function" | |||
if type(v) is type(_object_as_dict): # FIXME figure out how to get "function" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to satisfy newer flake8: https://github.com/amol-/great-tables/actions/runs/11259168945/job/31307498137
great_tables/_utils.py:139:8: E721 do not compare types, for exact checks use `is` / `is not`, for instance checks use `isinstance()`
but indeed it was a good idea anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shoot -- it seems like maybe this check was originally trying to do either if callable(v):
or maybe inspect.isfunction(v)
😓
(def seems okay to leave as is, and we can fix it up later if flake8 is happy again!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for this -- I'd be on board merging this with an experimental warning! The one thing I think that might be important is if we could revert back the changes to the pandas and polars implementation of _set_cell
, just to keep it scoped onto PyArrow.
I should be able to tweak the use of _set_cell
to be column focused down the road, which should hopefully work better with frame libraries 😓
(Happy to make any of these changes and merge if useful!)
great_tables/_gt_data.py
Outdated
@@ -173,7 +174,7 @@ def render_formats(self, data_tbl: TblData, formats: list[FormatInfo], context: | |||
# TODO: I think that this is very inefficient with polars, so | |||
# we could either accumulate results and set them per column, or | |||
# could always use a pandas DataFrame inside Body? | |||
_set_cell(self.body, row, col, result) | |||
self.body = _set_cell(self.body, row, col, result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for spending time working through this quirk in the formatter 😬
What if we did this for now?:
- if
_set_cell
returns something that is not None, we set it to self.body - We restore the original behavior of
_set_cell
returning None for pandas and polars, and mutating them (just to ensure this PR stays scoped to pyarrow). - After merging, I can open an issue to refactor this bit of code so it works columnwise (rather than on each individual row, col cell)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds a reasonable approach
@machow I think all requested changes should have been addressed |
Hey, thanks for putting all this work in on the PR! I'm just getting back from being OOO, and can work on merging next week! |
Ok, thanks for infinite patience here -- I'm planning to take a look on Monday |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @amol- for pushing this through -- it's really neat to see how things wire up to pyarrow! (And for us to have to think about our _set_cell
behaviors...)
@rich-iannone do you want to merge? |
Summary
Allows to directly use
pyarrow.Table
objects without having to cast them to pandas (which is an optional dependency of pyarrow). This is also convenient as more and more libraries are supporting ato_arrow()
method and thus would allow to use data from those libraries too.Checklist