-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
no "index" column in aggtarget output #1020
no "index" column in aggtarget output #1020
Conversation
Hey @jeromedockes, actually, I don't remember why we initially needed to use import pandas as pd
from skrub import AggJoiner
main = pd.DataFrame({
"airportId": [1, 2],
"airportName": ["Paris CDG", "NY JFK"],
}, index=[2, 3])
aux = pd.DataFrame({
"flightId": range(1, 7),
"from_airport": [1, 1, 1, 2, 2, 2],
"total_passengers": [90, 120, 100, 70, 80, 90],
"company": ["DL", "AF", "AF", "DL", "DL", "TR"],
}, index=[10, 11, 12, 13, 14, 15])
AggJoiner(
aux_table=aux,
main_key="airportId",
aux_key="from_airport",
cols=["total_passengers", "company"],
operations=["hist(4)", "mode"],
).fit_transform(main).index.tolist()
# >>> [0, 1], instead of [2, 3] |
As discussed IRL with @Vincent-Maladiere his comment above is correct but we'll open a separate issue about it: this PR is about the index of the aux table, whereas the comment is about preserving the index of the main table |
By the way, do we need to reset the index of the aux table? As we join on a key, the index shouldn't matter, should it? |
By the way, do we need to reset the index of the aux table? As we join on a key, the index shouldn't matter, should it?
after the groupby the grouping column ends up as the index, so we reset it to put it back as a column
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, LGTM then
thanks @Vincent-Maladiere |
ATM the AggJoiner and AggTarget when using value counts or histogram include an unwanted "index" column in their output.
After performing a groupby, the grouping column is in the index of the pandas result. to have it as a column and be able to join on it, we need to do a
reset_index
.skrub._dataframe._pandas.aggregate
used to do it for histogram and value counts, but not for other aggregation functions. This went unnoticed/was not a problem because the aggjoiner and aggtarget used pandas.merge which then performed the merge on the index rather than a column, because it allows "right_on" to be either a column index or an index level index.in #945 a
reset_index
was added, but in both for mean() and for value_counts(). So before the reset_index was done 0 times for mean(), 1 times for value_counts(); after it was done 1 time for mean() and 2 times for value_counts() -- the second one resulting in the "index" column.what we want is to do reset_index once in all cases.
At least that's what I understand from a quick look, it would be great if @TheooJ and @Vincent-Maladiere can confirm