ENH Polars support in Joiner #945

TheooJ · 2024-06-13T14:29:13Z

Refactoring the Joiner:

- Support for polars dataframes, using selectors and the dispatch mechanism in `Joiner` and `fuzzy_join()`
- Better handling of duplicate column names, and input dataframe checks
- Dispatch `left_join()` to have a single join utility
  - Cleanup in `agg_joiner` to remove `_pandas/_polars.join()`
- Dispatch `with_columns`
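
For readers unfamiliar with the dispatch idea mentioned in the description, here is a minimal sketch of how a `with_columns` helper can be dispatched on the table type. It is an illustration under loud assumptions, not skrub's implementation: it uses the standard library's `functools.singledispatch` as a stand-in for skrub's own dispatch mechanism, and `RowTable`, `ColumnTable` and `add_row_id` are hypothetical toy names standing in for two dataframe backends and a backend-agnostic caller.

```python
from functools import singledispatch


class RowTable:
    """Toy backend storing a table as a list of row dicts (hypothetical)."""

    def __init__(self, rows):
        self.rows = [dict(row) for row in rows]


class ColumnTable:
    """Toy backend storing a table as a dict of column lists (hypothetical)."""

    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}


@singledispatch
def with_columns(table, **new_columns):
    """Return a copy of `table` with columns added or replaced."""
    raise TypeError(f"unsupported table type: {type(table).__name__}")


@with_columns.register(RowTable)
def _with_columns_rows(table, **new_columns):
    # Copy the rows, then write each new value into its row.
    rows = [dict(row) for row in table.rows]
    for name, values in new_columns.items():
        for row, value in zip(rows, values):
            row[name] = value
    return RowTable(rows)


@with_columns.register(ColumnTable)
def _with_columns_columns(table, **new_columns):
    # Column-oriented storage only needs a merged mapping.
    return ColumnTable({**table.columns, **new_columns})


def add_row_id(table, n_rows):
    """Backend-agnostic caller: it never checks the table type itself."""
    return with_columns(table, row_id=range(n_rows))


row_result = add_row_id(RowTable([{"city": "Paris"}, {"city": "Rome"}]), 2)
col_result = add_row_id(ColumnTable({"city": ["Paris", "Rome"]}), 2)
```

The point of the pattern is that callers such as `add_row_id` contain no backend-specific branches; supporting another backend only requires registering one more implementation.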


TheooJ · 2024-06-17T23:23:31Z

Should we rename `self._matching` to `self._matcher`, and the associated `_matching` files to `_matcher`? Or do we keep this for an upcoming PR?

jeromedockes · 2024-06-18T05:00:17Z

> Should we rename `self._matching` into `self._matcher`, as well as the associated files `_matching` into `_matcher`? Or do we keep this for an upcoming PR?

I would prefer to leave it for another PR.

jeromedockes · 2024-06-19T19:12:59Z

The CI failure happens because polars support in the ColumnTransformer was only added in scikit-learn 1.4, when polars was added to the set_output API: https://github.com/scikit-learn/scikit-learn/pull/27315/files#diff-66316a90c42bbe375eef74f1b9dcbd6a1b5b99f60feb692f676067a633d30f60R226

jeromedockes · 2024-06-19T19:21:14Z

I guess a workaround would be to pass the columns to the ColumnTransformer directly as integer indices instead of strings.

jeromedockes · 2024-06-19T19:46:21Z

> I guess a workaround would be to provide the columns directly as integer indices to the columntransformer instead of strings

If we do this we get another error when scikit-learn tries to access `.ndims`. But here we are just vectorizing the key columns, and we want the output as a numpy array or a scipy sparse matrix anyway, so I think we can just convert to pandas before vectorizing.

diff --git a/skrub/_joiner.py b/skrub/_joiner.py
index 0c325a09..795f71c2 100644
--- a/skrub/_joiner.py
+++ b/skrub/_joiner.py
@@ -5,11 +5,13 @@ The Joiner provides fuzzy joining as a scikit-learn transformer.
 from functools import partial
 
 import numpy as np
+import sklearn
 from sklearn.base import BaseEstimator, TransformerMixin, clone
 from sklearn.compose import make_column_transformer
 from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
 from sklearn.pipeline import make_pipeline
 from sklearn.preprocessing import FunctionTransformer, StandardScaler
+from sklearn.utils.fixes import parse_version
 from sklearn.utils.validation import check_is_fitted
 
 from . import _dataframe as sbd
@@ -39,6 +41,12 @@ _MATCHERS = {
 DEFAULT_REF_DIST = "random_pairs"
 
 
+def _compat_df(df):
+    if parse_version(sklearn.__version__) < parse_version("1.4"):
+        return sbd.to_pandas(df)
+    return df
+
+
 def _make_vectorizer(table, string_encoder, rescale):
     """Construct the transformer used to vectorize joining columns.
 
@@ -299,7 +307,7 @@ class Joiner(TransformerMixin, BaseEstimator):
             rescale=self.ref_dist != "no_rescaling",
         )
         aux = self.vectorizer_.fit_transform(
-            s.select(self._aux_table, s.cols(*self._aux_key))
+            _compat_df(s.select(self._aux_table, s.cols(*self._aux_key)))
         )
         self._matching.fit(aux)
         return self
@@ -327,7 +335,11 @@ class Joiner(TransformerMixin, BaseEstimator):
             X, self._aux_table, self.suffix, main_table_name="X"
         )
         main = self.vectorizer_.transform(
-            sbd.set_column_names(s.select(X, s.cols(*self._main_key)), self._aux_key)
+            _compat_df(
+                sbd.set_column_names(
+                    s.select(X, s.cols(*self._main_key)), self._aux_key
+                )
+            )
         )
         match_result = self._matching.match(main, self.max_dist_)
         matching_col = match_result["index"].copy()

this fixes the issue
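
As a side note on the version gate in `_compat_df`: the diff compares parsed versions rather than raw strings, and the sketch below illustrates why that matters. It is a simplified standard-library stand-in, not scikit-learn's `parse_version`; the helpers `version_tuple` and `needs_pandas_fallback` are hypothetical names, and pre-release tags are deliberately ignored.

```python
import re


def version_tuple(version):
    """Hypothetical helper: (major, minor, patch) parsed from a version string.

    Simplification: pre-release tags such as ".dev0" or "rc1" are ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"not a version string: {version!r}")
    parts = [int(part) for part in match.group().split(".")]
    parts += [0] * (3 - len(parts))  # pad "1.4" to (1, 4, 0)
    return tuple(parts[:3])


def needs_pandas_fallback(sklearn_version, minimum="1.4"):
    """True when the given version predates the `minimum` release."""
    return version_tuple(sklearn_version) < version_tuple(minimum)


# Plain string comparison orders "1.10.0" before "1.4", which is wrong.
string_says_old = "1.10.0" < "1.4"
parsed_says_old = needs_pandas_fallback("1.10.0")
```

With string comparison a hypothetical 1.10 release would be treated as older than 1.4 and would take the pandas conversion path unnecessarily; comparing parsed components avoids that.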

jeromedockes · 2024-06-19T19:50:16Z

@TheooJ I opened a PR on your branch to fix the issue for old scikit-learn versions: TheooJ#2 ("fix joiner for old sklearn")

jeromedockes

thanks @TheooJ ! a few more comments :)


jeromedockes

great! thanks a ton @TheooJ this is a big one!!

TheooJ added 9 commits June 12, 2024 20:37

Plan out future changes, dispatch with_columns

849a701

Dispatch left_join in _join_utils

66042e7

Iter tests with_columns

6ed3203

Iter left_join

72cad9b

Use left_join in AggJoiner & AggTarget

aab6390

.

d75e572

Only drop right_on col when it is not equal to left_on

e047b7e

Switch to default implem for with_columns

6c4c17b

Merge branch 'main' into refactor_joiner

b39f21a

jeromedockes added this to the 0.1.2 milestone Jun 13, 2024

Format

595aab4

jeromedockes reviewed Jun 13, 2024

skrub/_dataframe/_common.py (outdated)

TheooJ added 8 commits June 13, 2024 17:08

Iter dispatch Joiner

2caa227

Remove old test in pandas and polars

edac2d5

Test with_columns

ec13d35

More left_join tests

c3e290e

Simplify test

d09359e

Test make_column_like on col is col

2a04e99

TODO

be7ec26

Make Joiner work for Polars

686de71

jeromedockes reviewed Jun 13, 2024

skrub/_join_utils.py (outdated)

TheooJ and others added 2 commits June 16, 2024 23:45

Apply suggestions from code review

8d6e624

Co-authored-by: Jérôme Dockès <jerome@dockes.org>

Test & comment left_join

ccc67ba

jeromedockes reviewed Jun 17, 2024

TheooJ and others added 4 commits June 18, 2024 01:03

Merge branch 'main' into refactor_joiner

1f0f2f6

Apply suggestion from code review

d5031d9

Co-authored-by: Jérôme Dockès <jerome@dockes.org>

Address more review comments

d0796ec

Check that make_column_like name is the requested name even if the input column has a different name

e21fb9a

jeromedockes mentioned this pull request Jun 19, 2024

add _dataframe.filter function #965

Merged

TheooJ added 3 commits June 19, 2024 16:18

Merge branch 'main' into refactor_joiner

6e4420c

Dispatch filter in fuzzy_join

5f2fd60

Fix pandas aggregate testing

5accc20

TheooJ marked this pull request as ready for review June 19, 2024 14:45

fix joiner for old sklearn

1f4d49d

jeromedockes and others added 5 commits June 20, 2024 07:23

specify right key dtype

86efa01

Merge branch 'main' into refactor_joiner

54d49f9

Merge pull request #2 from jeromedockes/fix-old-column-transformer

5ebbe9a

fix joiner for old sklearn

Dispatch fuzzy_join tests

206fe54

Next steps

3e4a597

TheooJ changed the title ~~[WIP] Refactor Joiner, dispatch left_join~~ ENH Polars support in Joiner Jun 20, 2024

Format tests

d139fd8

TheooJ force-pushed the refactor_joiner branch from e600945 to d139fd8 Compare June 20, 2024 15:47

jeromedockes reviewed Jun 21, 2024

skrub/_joiner.py (outdated)

TheooJ and others added 6 commits June 21, 2024 16:21

Apply suggestions from code review

3f4d4bb

Co-authored-by: Jérôme Dockès <jerome@dockes.org>

More suggestions from code review

79f1192

Dispatch test_join_utils

36d81ff

Merge branch 'main' into refactor_joiner

9517280

More suggestions from code review

f2717c2

Test duplicated key and col names

557a326

jeromedockes approved these changes Jun 21, 2024

jeromedockes merged commit 7fe0f27 into skrub-data:main Jun 21, 2024
19 checks passed

TheooJ mentioned this pull request Jul 25, 2024

Support Polars dataframes across the library #769

Closed

12 tasks

jeromedockes mentioned this pull request Jul 29, 2024

no "index" column in aggtarget output #1020

Merged
