Support Polars dataframes across the library #769
Comments
I'm working on testing for polars inputs in test_deduplicate.py.
I wonder if, instead of creating separate tests to compare polars to pandas, we should parametrize the existing tests to run them once on pandas dataframes and once on polars dataframes, as is done in this test for the agg joiner, for example?
> I wonder if instead of creating separate tests to compare polars to pandas, we should parametrize the existing tests to run them once on pandas dataframes and once on polars dataframes?

Fine with me. Whatever makes the code more natural and readable.
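
For illustration, the parametrization suggested above could look roughly like the sketch below. It is not taken from the skrub test suite: `make_frame`, `column_names` and the test name are hypothetical helpers, and the list of dataframe libraries is built from whatever is installed so that the same test body runs once per library.

```python
import importlib.util

import pytest

# Only parametrize over the dataframe libraries installed in this environment.
DATAFRAME_MODULES = [
    name
    for name in ("pandas", "polars")
    if importlib.util.find_spec(name) is not None
]


def make_frame(module_name, data):
    """Hypothetical helper: build a DataFrame with the requested library."""
    module = importlib.import_module(module_name)
    return module.DataFrame(data)


def column_names(df):
    """Hypothetical function under test; works on pandas and polars alike."""
    return list(df.columns)


@pytest.mark.parametrize("module_name", DATAFRAME_MODULES)
def test_column_names_are_preserved(module_name):
    # The same assertions run once per dataframe library.
    df = make_frame(module_name, {"a": [1, 2], "b": ["x", "y"]})
    assert column_names(df) == ["a", "b"]
```

With this layout, a regression specific to one library shows up as a failure of a single parametrized case rather than requiring a separate pandas-versus-polars comparison test.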
All done; the last item was completed in #945.
Congratulations, this is great!
Maybe a line in the CHANGES.rst to say that support of polars is now complete?
Currently, we only partially support Polars dataframes, in most cases thanks to `skrub._utils.check_input`, which converts dataframes into numpy arrays via `sklearn.utils.validation.check_array`.

Moreover, #733 introduced Pandas and Polars operations like aggregation and join. Note that this duplicated logic will be replaced in the mid-term by the dataframe consortium standard, as discussed in #719.

The following methods need to be fixed to enable Polars dataframes:
TableVectorizer.get_feature_names_out()
fuzzy_join()
The following tests need to at least check for polars dataframe inputs:
We also need to enable Polars output with our `TableVectorizer`, by running:

[…]

Having Polars output in `ColumnTransformer` is currently under discussion at scikit-learn/scikit-learn#25896. When made available in `ColumnTransformer`, this feature will also be available in `TableVectorizer` directly.

In the meantime, we could create a minimalistic workaround to enable Polars outputs.
This will require:
- `TableVectorizer.get_feature_names_out()` (mentioned above) to be fixed

To accomplish this, I suggest to:
- […] `TableVectorizer` the `set_output` function, initially defined in `TransformerMixin`'s parent class, `_SetOutputMixin`:
  - […] `super().set_output(transform="pandas")`
- […] `self.column_transformer.set_output(transform="pandas")`, and use the flag again after `self.column_transformer.fit_transform(X)` to convert the output to a Polars dataframe.
- […] `transform` and apply the same logic.
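
The control flow of this workaround can be sketched as follows. This is only an illustration under stated assumptions, not skrub's implementation: `PolarsOutputWrapper` and `FakeInner` are hypothetical stand-ins, composition is used instead of overriding `set_output` in a subclass so that the block stays self-contained, and the conversion step is passed in as a callable (with real dataframes it could be something like `polars.from_pandas`).

```python
class FakeInner:
    """Stand-in for an inner transformer that only supports pandas output."""

    def __init__(self):
        self.output = "default"

    def set_output(self, *, transform=None):
        if transform == "polars":
            raise ValueError("polars output is not supported by the inner step")
        if transform is not None:
            self.output = transform
        return self

    def fit_transform(self, X, y=None):
        # Tag the result with the container that would have been produced.
        return (self.output, list(X))

    def transform(self, X):
        return (self.output, list(X))


class PolarsOutputWrapper:
    """Hypothetical wrapper: remember the request, delegate as pandas, convert."""

    def __init__(self, inner, to_polars):
        self.inner = inner
        self.to_polars = to_polars  # callable converting a pandas-like result
        self._polars_requested = False

    def set_output(self, *, transform=None):
        if transform is None:
            return self  # None means "leave the configuration unchanged"
        # Store the flag, but only ever ask the inner step for pandas output.
        self._polars_requested = transform == "polars"
        self.inner.set_output(
            transform="pandas" if self._polars_requested else transform
        )
        return self

    def _wrap(self, result):
        # Use the flag again after the inner call to convert the output.
        return self.to_polars(result) if self._polars_requested else result

    def fit_transform(self, X, y=None):
        return self._wrap(self.inner.fit_transform(X, y))

    def transform(self, X):
        return self._wrap(self.inner.transform(X))


# Usage with a toy converter that simply relabels the container.
wrapper = PolarsOutputWrapper(FakeInner(), to_polars=lambda res: ("polars", res[1]))
wrapper.set_output(transform="polars")
result = wrapper.fit_transform([1, 2, 3])
```

The inner step never sees the `"polars"` request, so the workaround would keep working until native support lands upstream, at which point the wrapper logic could be dropped.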