Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Polars support in Joiner #945

Merged
merged 64 commits into from
Jun 21, 2024
Merged

Conversation

TheooJ
Copy link
Contributor

@TheooJ TheooJ commented Jun 13, 2024

Refactoring the Joiner:

  • Support for polars dataframes, use selectors + the dispatch mechanism in Joiner & fuzzy_join()
  • Better handling of duplicate column names, and input dataframe checks
  • Dispatch left_join() to have a single join utils
    • Cleanup in agg_joiner to remove _pandas/_polars.join()
  • Dispatch with_columns

@jeromedockes jeromedockes added this to the 0.1.2 milestone Jun 13, 2024
skrub/_join_utils.py Outdated Show resolved Hide resolved
TheooJ and others added 2 commits June 16, 2024 23:45
Co-authored-by: Jérôme Dockès <jerome@dockes.org>
skrub/_join_utils.py Outdated Show resolved Hide resolved
skrub/_join_utils.py Outdated Show resolved Hide resolved
skrub/_join_utils.py Outdated Show resolved Hide resolved
skrub/_join_utils.py Outdated Show resolved Hide resolved
skrub/_join_utils.py Outdated Show resolved Hide resolved
skrub/_join_utils.py Show resolved Hide resolved
skrub/_joiner.py Outdated Show resolved Hide resolved
skrub/_joiner.py Outdated Show resolved Hide resolved
skrub/_joiner.py Outdated Show resolved Hide resolved
@TheooJ
Copy link
Contributor Author

TheooJ commented Jun 17, 2024

Should we rename self._matching into self._matcher, as well as the associated files _matching into _matcher ? Or do we keep this for an upcoming PR ?

@jeromedockes
Copy link
Member

jeromedockes commented Jun 18, 2024 via email

@TheooJ TheooJ marked this pull request as ready for review June 19, 2024 14:45
@jeromedockes
Copy link
Member

the CI failure is due to the fact polars support in the column transformer was only added in scikit-learn 1.4, when polars was added to the set_output API: https://github.com/scikit-learn/scikit-learn/pull/27315/files#diff-66316a90c42bbe375eef74f1b9dcbd6a1b5b99f60feb692f676067a633d30f60R226

@jeromedockes
Copy link
Member

I guess a workaround would be to provide the columns directly as integer indices to the columntransformer instead of strings

@jeromedockes
Copy link
Member

I guess a workaround would be to provide the columns directly as integer indices to the columntransformer instead of strings

if we do this we get another error when scikit-learn tries to access .ndims. but actually here we are just vectorizing the key columns and we want the output as numpy or a scipy sparse matrix anyway, I think we can just convert to pandas before vectorizing.

diff --git a/skrub/_joiner.py b/skrub/_joiner.py
index 0c325a09..795f71c2 100644
--- a/skrub/_joiner.py
+++ b/skrub/_joiner.py
@@ -5,11 +5,13 @@ The Joiner provides fuzzy joining as a scikit-learn transformer.
 from functools import partial
 
 import numpy as np
+import sklearn
 from sklearn.base import BaseEstimator, TransformerMixin, clone
 from sklearn.compose import make_column_transformer
 from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
 from sklearn.pipeline import make_pipeline
 from sklearn.preprocessing import FunctionTransformer, StandardScaler
+from sklearn.utils.fixes import parse_version
 from sklearn.utils.validation import check_is_fitted
 
 from . import _dataframe as sbd
@@ -39,6 +41,12 @@ _MATCHERS = {
 DEFAULT_REF_DIST = "random_pairs"
 
 
+def _compat_df(df):
+    if parse_version(sklearn.__version__) < parse_version("1.4"):
+        return sbd.to_pandas(df)
+    return df
+
+
 def _make_vectorizer(table, string_encoder, rescale):
     """Construct the transformer used to vectorize joining columns.
 
@@ -299,7 +307,7 @@ class Joiner(TransformerMixin, BaseEstimator):
             rescale=self.ref_dist != "no_rescaling",
         )
         aux = self.vectorizer_.fit_transform(
-            s.select(self._aux_table, s.cols(*self._aux_key))
+            _compat_df(s.select(self._aux_table, s.cols(*self._aux_key)))
         )
         self._matching.fit(aux)
         return self
@@ -327,7 +335,11 @@ class Joiner(TransformerMixin, BaseEstimator):
             X, self._aux_table, self.suffix, main_table_name="X"
         )
         main = self.vectorizer_.transform(
-            sbd.set_column_names(s.select(X, s.cols(*self._main_key)), self._aux_key)
+            _compat_df(
+                sbd.set_column_names(
+                    s.select(X, s.cols(*self._main_key)), self._aux_key
+                )
+            )
         )
         match_result = self._matching.match(main, self.max_dist_)
         matching_col = match_result["index"].copy()

this fixes the issue

@jeromedockes
Copy link
Member

@TheooJ I opened a pr on your branch to fix the issue for old scikit-learn versions: TheooJ#2

@TheooJ TheooJ changed the title [WIP] Refactor Joiner, dispatch left_join ENH Polars support in Joiner Jun 20, 2024
@TheooJ TheooJ force-pushed the refactor_joiner branch from e600945 to d139fd8 Compare June 20, 2024 15:47
Copy link
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @TheooJ ! a few more comments :)

skrub/tests/test_fuzzy_join.py Outdated Show resolved Hide resolved
skrub/tests/test_fuzzy_join.py Outdated Show resolved Hide resolved
skrub/tests/test_fuzzy_join.py Show resolved Hide resolved
skrub/tests/test_join_utils.py Outdated Show resolved Hide resolved
skrub/tests/test_join_utils.py Outdated Show resolved Hide resolved
skrub/tests/test_join_utils.py Show resolved Hide resolved
skrub/_joiner.py Show resolved Hide resolved
skrub/_joiner.py Show resolved Hide resolved
skrub/tests/test_joiner.py Outdated Show resolved Hide resolved
skrub/tests/test_joiner.py Outdated Show resolved Hide resolved
skrub/_joiner.py Outdated Show resolved Hide resolved
Copy link
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great! thanks a ton @TheooJ this is a big one!!

@jeromedockes jeromedockes merged commit 7fe0f27 into skrub-data:main Jun 21, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants