-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Polars support in Joiner #945
Conversation
Co-authored-by: Jérôme Dockès <jerome@dockes.org>
Co-authored-by: Jérôme Dockès <jerome@dockes.org>
…put column has a different name
Should we rename |
Should we rename `self._matching` into `self._matcher`, as well as the associated files `_matching` into `_matcher` ? Or do we keep this for an upcoming PR ?
I would prefer to leave it for another PR
|
the CI failure is due to the fact polars support in the column transformer was only added in scikit-learn 1.4, when polars was added to the set_output API: https://github.com/scikit-learn/scikit-learn/pull/27315/files#diff-66316a90c42bbe375eef74f1b9dcbd6a1b5b99f60feb692f676067a633d30f60R226 |
I guess a workaround would be to provide the columns directly as integer indices to the columntransformer instead of strings |
if we do this we get another error when scikit-learn tries to access diff --git a/skrub/_joiner.py b/skrub/_joiner.py
index 0c325a09..795f71c2 100644
--- a/skrub/_joiner.py
+++ b/skrub/_joiner.py
@@ -5,11 +5,13 @@ The Joiner provides fuzzy joining as a scikit-learn transformer.
from functools import partial
import numpy as np
+import sklearn
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
+from sklearn.utils.fixes import parse_version
from sklearn.utils.validation import check_is_fitted
from . import _dataframe as sbd
@@ -39,6 +41,12 @@ _MATCHERS = {
DEFAULT_REF_DIST = "random_pairs"
+def _compat_df(df):
+ if parse_version(sklearn.__version__) < parse_version("1.4"):
+ return sbd.to_pandas(df)
+ return df
+
+
def _make_vectorizer(table, string_encoder, rescale):
"""Construct the transformer used to vectorize joining columns.
@@ -299,7 +307,7 @@ class Joiner(TransformerMixin, BaseEstimator):
rescale=self.ref_dist != "no_rescaling",
)
aux = self.vectorizer_.fit_transform(
- s.select(self._aux_table, s.cols(*self._aux_key))
+ _compat_df(s.select(self._aux_table, s.cols(*self._aux_key)))
)
self._matching.fit(aux)
return self
@@ -327,7 +335,11 @@ class Joiner(TransformerMixin, BaseEstimator):
X, self._aux_table, self.suffix, main_table_name="X"
)
main = self.vectorizer_.transform(
- sbd.set_column_names(s.select(X, s.cols(*self._main_key)), self._aux_key)
+ _compat_df(
+ sbd.set_column_names(
+ s.select(X, s.cols(*self._main_key)), self._aux_key
+ )
+ )
)
match_result = self._matching.match(main, self.max_dist_)
matching_col = match_result["index"].copy() this fixes the issue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @TheooJ ! a few more comments :)
Co-authored-by: Jérôme Dockès <jerome@dockes.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great! thanks a ton @TheooJ this is a big one!!
Refactoring the
Joiner
:Joiner
&fuzzy_join()
left_join()
to have a single join utilsagg_joiner
to remove_pandas/_polars.join()
with_columns