ENH Polars support in Joiner #945

Merged: 64 commits, Jun 21, 2024
Commits
849a701
Plan out future changes, dispatch with_columns
TheooJ Jun 12, 2024
66042e7
Dispatch left_join in _join_utils
TheooJ Jun 13, 2024
6ed3203
Iter tests with_columns
TheooJ Jun 13, 2024
72cad9b
Iter left_join
TheooJ Jun 13, 2024
aab6390
Use left_join in AggJoiner & AggTarget
TheooJ Jun 13, 2024
d75e572
.
TheooJ Jun 13, 2024
e047b7e
Only drop right_on col when it is not equal to left_on
TheooJ Jun 13, 2024
6c4c17b
Switch to default implem for with_columns
TheooJ Jun 13, 2024
b39f21a
Merge branch 'main' into refactor_joiner
TheooJ Jun 13, 2024
595aab4
Format
TheooJ Jun 13, 2024
2caa227
Iter dispatch Joiner
TheooJ Jun 13, 2024
edac2d5
Remove old test in pandas and polars
TheooJ Jun 13, 2024
ec13d35
Test with_columns
TheooJ Jun 13, 2024
c3e290e
More left_join tests
TheooJ Jun 13, 2024
d09359e
Simplify test
TheooJ Jun 13, 2024
2a04e99
Test make_column_like on col is col
TheooJ Jun 13, 2024
be7ec26
TODO
TheooJ Jun 13, 2024
686de71
Make Joiner work for Polars
TheooJ Jun 13, 2024
8d6e624
Apply suggestions from code review
TheooJ Jun 16, 2024
ccc67ba
Test & comment left_join
TheooJ Jun 16, 2024
1f0f2f6
Merge branch 'main' into refactor_joiner
TheooJ Jun 17, 2024
d5031d9
Apply suggestion from code review
TheooJ Jun 17, 2024
d0796ec
Address more review comments
TheooJ Jun 17, 2024
e21fb9a
Check that make_column_like name is the requested name even if the in…
TheooJ Jun 17, 2024
b461c78
Merge branch 'main' into refactor_joiner
TheooJ Jun 18, 2024
b9688c8
CHANGES.rst
TheooJ Jun 18, 2024
41e385c
Change iterable of str to list of str
TheooJ Jun 18, 2024
283ddb0
Use df_module.assert_frame_equal
TheooJ Jun 18, 2024
d59c916
Docstring
TheooJ Jun 18, 2024
6b7e379
Iter test_joiner
TheooJ Jun 18, 2024
e467994
Add useful error msg for max_dist
TheooJ Jun 18, 2024
4be4bdc
Add useful error msg for max_dist
TheooJ Jun 18, 2024
e84de29
Merge branch 'main' into refactor_joiner
TheooJ Jun 18, 2024
c250d26
Merge branch 'refactor_joiner' of https://github.com/TheooJ/skrub int…
TheooJ Jun 18, 2024
1ade2b0
.
TheooJ Jun 18, 2024
14a3d97
Handle case where main_key and aux_key are lists
TheooJ Jun 18, 2024
5ab616a
Test missing values, numeric, datetimes
TheooJ Jun 18, 2024
92f78b0
Test missing values, numeric, datetimes
TheooJ Jun 18, 2024
72bea25
Merge branch 'main' into refactor_joiner
TheooJ Jun 19, 2024
101d788
Apply suggestions from code review
TheooJ Jun 19, 2024
095d121
Merge branch 'refactor_joiner' of https://github.com/TheooJ/skrub int…
TheooJ Jun 19, 2024
8acbd92
Apply more suggestions from code review
TheooJ Jun 19, 2024
4365d9c
More suggestions from code review
TheooJ Jun 19, 2024
aa7015d
More suggestions from code review
TheooJ Jun 19, 2024
da74a0c
More suggestions from code review
TheooJ Jun 19, 2024
c393088
Add reset_index to _pandas.aggregate
TheooJ Jun 19, 2024
a7f1ab7
Dispatch fuzzy_join
TheooJ Jun 19, 2024
af684c3
Format fuzzy_join
TheooJ Jun 19, 2024
6e4420c
Merge branch 'main' into refactor_joiner
TheooJ Jun 19, 2024
5f2fd60
Dispatch filter in fuzzy_join
TheooJ Jun 19, 2024
5accc20
Fix pandas aggregate testing
TheooJ Jun 19, 2024
1f4d49d
fix joiner for old sklearn
jeromedockes Jun 19, 2024
86efa01
specify right key dtype
jeromedockes Jun 20, 2024
54d49f9
Merge branch 'main' into refactor_joiner
TheooJ Jun 20, 2024
5ebbe9a
Merge pull request #2 from jeromedockes/fix-old-column-transformer
TheooJ Jun 20, 2024
206fe54
Dispatch fuzzy_join tests
TheooJ Jun 20, 2024
3e4a597
Next steps
TheooJ Jun 20, 2024
d139fd8
Format tests
TheooJ Jun 20, 2024
3f4d4bb
Apply suggestions from code review
TheooJ Jun 21, 2024
79f1192
More suggestions from code review
TheooJ Jun 21, 2024
36d81ff
Dispatch test_join_utils
TheooJ Jun 21, 2024
9517280
Merge branch 'main' into refactor_joiner
TheooJ Jun 21, 2024
f2717c2
More suggestions from code review
TheooJ Jun 21, 2024
557a326
Test duplicated key and col names
TheooJ Jun 21, 2024
2 changes: 2 additions & 0 deletions CHANGES.rst
@@ -14,6 +14,8 @@ It is currently undergoing fast development and backward compatibility is not en

 Major changes
 -------------
+* The :class:`Joiner` has been adapted to support polars dataframes. :pr:`945` by :user:`Théo Jolivet <TheooJ>`.
+
 * The :class:`TableVectorizer` now consistently applies the same transformation
   across different calls to `transform`. There also have been some breaking
   changes to its functionality: (i) all transformations are now applied
10 changes: 4 additions & 6 deletions skrub/_agg_joiner.py
@@ -289,9 +289,8 @@ def transform(self, X):
         X, _ = self._check_dataframes(X, self.aux_table_)
         _join_utils.check_missing_columns(X, self._main_key, "'X' (the main table)")

-        skrub_px, _ = get_df_namespace(self.aux_table_)
-        X = skrub_px.join(
-            left=X,
+        X = _join_utils.left_join(
+            X,
             right=self.aux_table_,
             left_on=self._main_key,
             right_on=self._aux_key,
@@ -439,10 +438,9 @@ def transform(self, X):
             The augmented input.
         """
         check_is_fitted(self, "y_")
-        skrub_px, _ = get_df_namespace(X)

-        return skrub_px.join(
-            left=X,
+        return _join_utils.left_join(
+            X,
             right=self.y_,
             left_on=self.main_key_,
             right_on=self.main_key_,
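The body of `_join_utils.left_join` is not part of this diff, so the following is only a pandas-only sketch of the behaviour hinted at by the commit message "Only drop right_on col when it is not equal to left_on". The table contents and the `id` key name are made up for illustration, and the real helper may be implemented quite differently.

```python
import pandas as pd

# Hypothetical example tables (not taken from the PR).
main = pd.DataFrame({"movieId": [1, 2, 3], "rating": [4.0, 3.5, 5.0]})
aux = pd.DataFrame({"id": [1, 3], "genre": ["drama", "comedy"]})

left_on, right_on = "movieId", "id"

# A pandas left join keeps every row of `main`, in order.
joined = main.merge(aux, how="left", left_on=left_on, right_on=right_on)

# With differently named keys, pandas keeps both key columns, so the
# redundant right key is dropped here. When both names are equal, pandas
# already returns a single key column and nothing must be dropped.
if right_on != left_on:
    joined = joined.drop(columns=[right_on])

print(joined)
```

Dropping the right key only when its name differs from the left key is one way to get the same output columns from backends that treat join keys differently.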
7 changes: 7 additions & 0 deletions skrub/_dataframe/_common.py
@@ -85,6 +85,7 @@
     "sample",
     "head",
     "replace",
+    "with_columns",
 ]

 #
@@ -1007,3 +1008,9 @@ def _replace_pandas(col, old, new):
 @replace.specialize("polars", argument_type="Column")
 def _replace_polars(col, old, new):
     return col.replace(old, new)
+
+
+def with_columns(df, **new_cols):
+    cols = {col_name: col(df, col_name) for col_name in column_names(df)}
+    cols.update({n: make_column_like(df, c, n) for n, c in new_cols.items()})
+    return make_dataframe_like(df, cols)
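The default `with_columns` above builds a dict of the existing columns and then calls `dict.update` with the new ones. The toy model below uses plain dicts of lists instead of dataframes to show what that implies for column order. `with_columns_sketch` is a hypothetical stand-in, not skrub code.

```python
def with_columns_sketch(table, **new_cols):
    """Toy model of with_columns: `table` maps column name -> list of values."""
    # Copy the existing columns; dicts preserve insertion order.
    cols = {name: list(values) for name, values in table.items()}
    # dict.update keeps the position of keys that already exist and
    # appends keys that are new.
    cols.update({name: list(values) for name, values in new_cols.items()})
    return cols


df = {"a": [1, 2], "b": [3, 4]}
added = with_columns_sketch(df, c=[5, 6])     # new column goes last
replaced = with_columns_sketch(df, a=[5, 6])  # "a" stays in first position

print(list(added), list(replaced))
```

The input is left untouched, which matches the non-mutating style of the test in `test_common.py`, where `df` is reused across several calls.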
44 changes: 1 addition & 43 deletions skrub/_dataframe/_pandas.py
@@ -104,49 +104,7 @@ def aggregate(
     ]
     sorted_cols = sorted(base_group.columns)

-    return base_group[sorted_cols]
-
-
-def join(
-    left,
-    right,
-    left_on,
-    right_on,
-):
-    """Left join two :obj:`pandas.DataFrame`.
-
-    This function uses the ``dataframe.merge`` method from Pandas.
-
-    Parameters
-    ----------
-    left : pd.DataFrame
-        The left dataframe to left-join.
-
-    right : pd.DataFrame
-        The right dataframe to left-join.
-
-    left_on : str or Iterable[str]
-        Left keys to merge on.
-
-    right_on : str or Iterable[str]
-        Right keys to merge on.
-
-    Returns
-    -------
-    merged : pd.DataFrame,
-        The merged output.
-    """
-    if not (isinstance(left, pd.DataFrame) and isinstance(right, pd.DataFrame)):
-        raise TypeError(
-            "'left' and 'right' must be pandas dataframes, "
-            f"got {type(left)!r} and {type(right)!r}."
-        )
-    return left.merge(
-        right,
-        how="left",
-        left_on=left_on,
-        right_on=right_on,
-    )
+    return base_group[sorted_cols].reset_index(drop=False)


 def get_named_agg(table, cols, operations):
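The `reset_index(drop=False)` added to the pandas `aggregate` matters because a pandas group-by aggregation stores the group keys in the index rather than in a column. A small sketch with made-up data shows the difference; it illustrates pandas behaviour only, not the rest of skrub's `aggregate`.

```python
import pandas as pd

# Hypothetical ratings table (not taken from the PR).
ratings = pd.DataFrame({"movieId": [1, 1, 2], "rating": [4.0, 2.0, 5.0]})

# After groupby().agg(), the key "movieId" lives in the index.
grouped = ratings.groupby("movieId").agg(rating_mean=("rating", "mean"))

# reset_index(drop=False) moves the key back into a regular column,
# so a later join can refer to it by column name.
flat = grouped.reset_index(drop=False)

print(list(grouped.columns), list(flat.columns))
```

Returning the key as a column presumably lets callers look it up by name without relying on the pandas index, which has no counterpart in polars.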
46 changes: 0 additions & 46 deletions skrub/_dataframe/_polars.py
@@ -1,8 +1,6 @@
 """
 Polars specialization of the aggregate and join operations.
 """
-import inspect
-
 try:
     import polars as pl
     import polars.selectors as cs
@@ -91,50 +89,6 @@ def aggregate(
     return table.select(sorted_cols)


-def join(left, right, left_on, right_on):
-    """Left join two :obj:`polars.DataFrame` or :obj:`polars.LazyFrame`.
-    This function uses the ``dataframe.join`` method from Polars.
-    Note that the input dataframes type must agree: either both
-    Polars dataframes or both Polars lazyframes.
-    Mixing polars dataframe with lazyframe will raise an error.
-    Parameters
-    ----------
-    left : pl.DataFrame or pl.LazyFrame
-        The left dataframe of the left-join.
-    right : pl.DataFrame or pl.LazyFrame
-        The right dataframe of the left-join.
-    left_on : str or Iterable[str]
-        Left keys to merge on.
-    right_on : str or Iterable[str]
-        Right keys to merge on.
-    Returns
-    -------
-    merged : pl.DataFrame or pl.LazyFrame
-        The merged output.
-    """
-    is_dataframe = isinstance(left, pl.DataFrame) and isinstance(right, pl.DataFrame)
-    is_lazyframe = isinstance(left, pl.LazyFrame) and isinstance(right, pl.LazyFrame)
-    if is_dataframe or is_lazyframe:
-        if "coalesce" in inspect.signature(left.join).parameters:
-            kw = {"coalesce": True}
-        else:
-            kw = {}
-        return left.join(right, how="left", left_on=left_on, right_on=right_on, **kw)
-    else:
-        raise TypeError(
-            "'left' and 'right' must be polars dataframes or lazyframes, "
-            f"got {type(left)!r} and {type(right)!r}."
-        )
-
-
 def get_aggfuncs(cols, operations):
     """List Polars aggregation functions.
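The removed Polars `join` passed `coalesce=True` only when the installed `join` method accepted that parameter. The same feature-detection pattern can be sketched with the standard library alone. `join_old`, `join_new` and `call_left_join` are hypothetical stand-ins for two versions of a library function and are not Polars APIs.

```python
import inspect


def join_old(other, how="inner"):
    """Stand-in for an older API without a `coalesce` parameter."""
    return {"how": how}


def join_new(other, how="inner", coalesce=None):
    """Stand-in for a newer API that accepts `coalesce`."""
    return {"how": how, "coalesce": coalesce}


def call_left_join(join_func, other):
    # Pass `coalesce` only if the callable's signature declares it.
    kwargs = {}
    if "coalesce" in inspect.signature(join_func).parameters:
        kwargs["coalesce"] = True
    return join_func(other, how="left", **kwargs)


print(call_left_join(join_old, None))
print(call_left_join(join_new, None))
```

Inspecting the signature avoids pinning a minimum library version, at the cost of a small introspection call per join.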
40 changes: 40 additions & 0 deletions skrub/_dataframe/tests/test_common.py
@@ -31,6 +31,7 @@ def test_not_implemented():
         "reset_index",
         "copy_index",
         "index",
+        "with_columns",
     }
     for func_name in sorted(set(ns.__all__) - has_default_impl):
         func = getattr(ns, func_name)
@@ -147,6 +148,10 @@ def test_make_column_like(df_module, example_data_dict):
     )
     assert ns.dataframe_module_name(col) == df_module.name

+    col = df_module.make_column("old_name", [1, 2, 3])
+    expected = df_module.make_column("new_name", [1, 2, 3])
+    df_module.assert_column_equal(ns.make_column_like(col, col, "new_name"), expected)
+

 def test_null_value_for(df_module):
     assert ns.null_value_for(df_module.example_dataframe) is None
@@ -645,3 +650,38 @@ def same(c1, c2):

     same(ns.drop_nulls(s), col([1.1, 2.2, float("inf")]))
     same(ns.fill_nulls(s, -1.0), col([1.1, -1.0, 2.2, -1.0, float("inf")]))
+
+
+def test_with_columns(df_module):
+    df = df_module.make_dataframe({"a": [1, 2], "b": [3, 4]})
+
+    # Add one new col
+    out = ns.with_columns(df, **{"c": [5, 6]})
+    if df_module.description == "pandas-nullable-dtypes":
+        # for pandas, make_column_like will return an old-style / numpy dtypes Series
+        out = ns.pandas_convert_dtypes(out)
+    expected = df_module.make_dataframe({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
+    df_module.assert_frame_equal(out, expected)
+
+    # Add multiple new cols
+    out = ns.with_columns(df, **{"c": [5, 6], "d": [7, 8]})
+    if df_module.description == "pandas-nullable-dtypes":
+        out = ns.pandas_convert_dtypes(out)
+    expected = df_module.make_dataframe(
+        {"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]}
+    )
+    df_module.assert_frame_equal(out, expected)
+
+    # Pass a col instead of an array
+    out = ns.with_columns(df, **{"c": df_module.make_column("c", [5, 6])})
+    if df_module.description == "pandas-nullable-dtypes":
+        out = ns.pandas_convert_dtypes(out)
+    expected = df_module.make_dataframe({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
+    df_module.assert_frame_equal(out, expected)
+
+    # Replace col
+    out = ns.with_columns(df, **{"a": [5, 6]})
+    if df_module.description == "pandas-nullable-dtypes":
+        out = ns.pandas_convert_dtypes(out)
+    expected = df_module.make_dataframe({"a": [5, 6], "b": [3, 4]})
+    df_module.assert_frame_equal(out, expected)
16 changes: 3 additions & 13 deletions skrub/_dataframe/tests/test_pandas.py
@@ -4,7 +4,6 @@

 from skrub._dataframe._pandas import (
     aggregate,
-    join,
     rename_columns,
 )

@@ -18,12 +17,6 @@
 )


-def test_join():
-    joined = join(left=main, right=main, left_on="movieId", right_on="movieId")
-    expected = main.merge(main, on="movieId", how="left")
-    assert_frame_equal(joined, expected)
-
-
 def test_simple_agg():
     aggregated = aggregate(
         table=main,
@@ -36,7 +29,7 @@ def test_simple_agg():
         "genre_mode": ("genre", pd.Series.mode),
         "rating_mean": ("rating", "mean"),
     }
-    expected = main.groupby("movieId").agg(**aggfunc)
+    expected = main.groupby("movieId").agg(**aggfunc).reset_index()
     assert_frame_equal(aggregated, expected)


@@ -56,7 +49,7 @@ def test_value_counts_agg():
             "rating_4.0_user": [3.0, 1.0],
             "userId": [1, 2],
         }
-    )
+    ).reset_index(drop=False)
     assert_frame_equal(aggregated, expected)

     aggregated = aggregate(
@@ -73,14 +66,11 @@ def test_value_counts_agg():
             "rating_(3.0, 4.0]_user": [3, 1],
             "userId": [1, 2],
         }
-    )
+    ).reset_index(drop=False)
     assert_frame_equal(aggregated, expected)


 def test_incorrect_dataframe_inputs():
-    with pytest.raises(TypeError, match=r"(?=.*pandas dataframes)(?=.*array)"):
-        join(left=main.values, right=main, left_on="movieId", right_on="movieId")
-
     with pytest.raises(TypeError, match=r"(?=.*pandas dataframe)(?=.*array)"):
         aggregate(
             table=main.values,
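The `match=` patterns in these tests chain two lookaheads, which requires both phrases to appear in the error message without fixing their order. `pytest.raises(match=...)` applies the pattern with `re.search`, so the behaviour can be checked with the standard library alone. The messages below are invented for illustration and are not skrub's actual error text.

```python
import re

pattern = r"(?=.*pandas dataframe)(?=.*array)"

# Hypothetical error messages (not copied from skrub).
both = "'table' must be a pandas dataframe, got <class 'numpy.ndarray'>."
reordered = "got an array where a pandas dataframe was expected"
only_one = "'table' must be a pandas dataframe."

# Each lookahead scans ahead from the same position without consuming
# text, so both phrases are required but may come in either order.
matches = [bool(re.search(pattern, msg)) for msg in (both, reordered, only_one)]
print(matches)
```

Since `.` does not match a newline by default, this style of pattern assumes the relevant phrases sit on one line of the message.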
16 changes: 0 additions & 16 deletions skrub/_dataframe/tests/test_polars.py
@@ -1,11 +1,8 @@
-import inspect
-
 import pandas as pd
 import pytest

 from skrub._dataframe._polars import (
     aggregate,
-    join,
     rename_columns,
 )
 from skrub.conftest import _POLARS_INSTALLED
@@ -27,16 +24,6 @@
     pytest.skip(reason=POLARS_MISSING_MSG, allow_module_level=True)


-def test_join():
-    joined = join(left=main, right=main, left_on="movieId", right_on="movieId")
-    if "coalesce" in inspect.signature(main.join).parameters:
-        kw = {"coalesce": True}
-    else:
-        kw = {}
-    expected = main.join(main, on="movieId", how="left", **kw)
-    assert_frame_equal(joined, expected)
-
-
 def test_simple_agg():
     aggregated = aggregate(
         table=main,
@@ -68,9 +55,6 @@ def test_mode_agg():


 def test_incorrect_dataframe_inputs():
-    with pytest.raises(TypeError, match=r"(?=.*polars dataframes)(?=.*pandas)"):
-        join(left=pd.DataFrame(main), right=main, left_on="movieId", right_on="movieId")
-
     with pytest.raises(TypeError, match=r"(?=.*polars dataframe)(?=.*pandas)"):
         aggregate(
             table=pd.DataFrame(main),
10 changes: 6 additions & 4 deletions skrub/_fuzzy_join.py
@@ -3,8 +3,10 @@
 """
 import numpy as np

-from skrub import _join_utils
-from skrub._joiner import DEFAULT_REF_DIST, DEFAULT_STRING_ENCODER, Joiner
+from . import _dataframe as sbd
+from . import _join_utils
+from . import _selectors as s
+from ._joiner import DEFAULT_REF_DIST, DEFAULT_STRING_ENCODER, Joiner


 def fuzzy_join(
@@ -210,7 +212,7 @@ def fuzzy_join(
         add_match_info=True,
     ).fit_transform(left)
     if drop_unmatched:
-        join = join[join["skrub_Joiner_match_accepted"]]
+        join = sbd.filter(join, sbd.col(join, "skrub_Joiner_match_accepted"))
     if not add_match_info:
-        join = join.drop(Joiner.match_info_columns, axis=1)
+        join = s.select(join, ~s.cols(*Joiner.match_info_columns))
     return join
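For reference, the two pandas-only operations that the dispatched `sbd.filter` and `s.select` calls stand in for are row filtering with a boolean column and dropping a set of columns. The sketch below reproduces them on a made-up frame. Only `skrub_Joiner_match_accepted` comes from the diff; the `match_score` column and the `info_cols` list are hypothetical placeholders for `Joiner.match_info_columns`.

```python
import pandas as pd

# Hypothetical join output (not produced by skrub).
join = pd.DataFrame(
    {
        "city": ["Paris", "Rome", "Oslo"],
        "skrub_Joiner_match_accepted": [True, False, True],
        "match_score": [0.1, 0.9, 0.2],  # placeholder match-info column
    }
)
info_cols = ["skrub_Joiner_match_accepted", "match_score"]

# drop_unmatched=True: keep only the rows whose match was accepted.
kept = join[join["skrub_Joiner_match_accepted"]]

# add_match_info=False: remove the match-info columns from the result.
kept = kept.drop(columns=info_cols)

print(kept)
```

Both steps rely on pandas-specific indexing, which is presumably why the PR routes them through backend-agnostic helpers.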