no "index" column in aggtarget output #1020

Merged

jeromedockes
Member

@jeromedockes jeromedockes commented Jul 29, 2024

ATM, the AggJoiner and AggTarget include an unwanted "index" column in their output when using value counts or histogram.

After performing a groupby, the grouping column is in the index of the pandas result. To have it as a column and be able to join on it, we need to do a reset_index. skrub._dataframe._pandas.aggregate used to do it for histogram and value counts, but not for other aggregation functions. This went unnoticed/was not a problem because the AggJoiner and AggTarget used pandas.merge, which then performed the merge on the index rather than on a column, because it allows "right_on" to be either a column label or an index level name.
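To illustrate the behavior described above, here is a minimal pandas-only sketch. It does not reproduce skrub's internal code; the tables and the `passengers_mean` column name are made up for the example. It shows that the grouping key ends up in the index after a groupby, and that `merge` still accepts it in `right_on` as an index level name.

```python
import pandas as pd

# Illustrative tables (not taken from skrub's tests).
main = pd.DataFrame({"airportId": [1, 2]})
aux = pd.DataFrame({
    "from_airport": [1, 1, 1, 2, 2, 2],
    "total_passengers": [90, 120, 100, 70, 80, 90],
})

# After a groupby, the grouping key lives in the index, not in the columns.
agg = aux.groupby("from_airport").agg(
    passengers_mean=("total_passengers", "mean")
)
print(agg.index.name)
print(list(agg.columns))

# merge resolves "from_airport" as an index level name because no column
# with that name exists, so the join works without a reset_index.
merged = main.merge(
    agg, left_on="airportId", right_on="from_airport", how="left"
)
print(merged["passengers_mean"].tolist())
```

This is why a missing reset_index for mean() could go unnoticed: the join silently falls back to the index level.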

In #945 a reset_index was added, for both mean() and value_counts(). So before, the reset_index was done 0 times for mean() and 1 time for value_counts(); after, it was done 1 time for mean() and 2 times for value_counts(), the second one resulting in the "index" column.
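The effect of the double reset can be reproduced with plain pandas. This is a sketch of the mechanism only, on a made-up table, not the actual skrub code path: the first reset_index turns the grouping key into a column, and the second one moves the now-default, unnamed RangeIndex into a column called "index".

```python
import pandas as pd

# Illustrative aux table.
aux = pd.DataFrame({
    "from_airport": [1, 1, 2],
    "total_passengers": [90, 120, 70],
})

grouped = aux.groupby("from_airport")["total_passengers"].mean()

# First reset: the grouping key becomes a regular column.
once = grouped.reset_index()
print(list(once.columns))   # ['from_airport', 'total_passengers']

# Second reset: the unnamed default index is inserted as an "index" column.
twice = once.reset_index()
print(list(twice.columns))  # ['index', 'from_airport', 'total_passengers']
```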

What we want is to do reset_index exactly once in all cases.

At least that's what I understand from a quick look; it would be great if @TheooJ and @Vincent-Maladiere could confirm.

@Vincent-Maladiere
Member

Hey @jeromedockes, actually, I don't remember why we initially needed to use reset_index. If my main table has some custom indices (e.g., due to a split), I might want to keep these. For example, losing them could create silent errors in data-wrangling use cases outside of a pipeline. WDYT?

import pandas as pd
from skrub import AggJoiner

main = pd.DataFrame({
    "airportId": [1, 2],
    "airportName": ["Paris CDG", "NY JFK"],
}, index=[2, 3])

aux = pd.DataFrame({
    "flightId": range(1, 7),
    "from_airport": [1, 1, 1, 2, 2, 2],
    "total_passengers": [90, 120, 100, 70, 80, 90],
    "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
}, index=[10, 11, 12, 13, 14, 15])

AggJoiner(
    aux_table=aux,
    main_key="airportId",
    aux_key="from_airport",
    cols=["total_passengers", "company"],
    operations=["hist(4)", "mode"],
).fit_transform(main).index.tolist()

# >>> [0, 1], instead of [2, 3]
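For reference, the same index loss happens with a plain pandas merge on columns, independently of skrub. The sketch below uses a hypothetical pre-aggregated table (`aux_agg` and its `total_passengers_sum` column are assumptions for illustration) and shows one possible way to restore the main table's index afterwards. It relies on a left join against unique keys, so the row count and order match the main table; it is not what AggJoiner does.

```python
import pandas as pd

main = pd.DataFrame({
    "airportId": [1, 2],
    "airportName": ["Paris CDG", "NY JFK"],
}, index=[2, 3])

# Hypothetical aggregated aux table with one row per airport.
aux_agg = pd.DataFrame({
    "from_airport": [1, 2],
    "total_passengers_sum": [310, 240],
})

# Merging on columns ignores both indexes and returns a fresh RangeIndex.
merged = main.merge(
    aux_agg, left_on="airportId", right_on="from_airport", how="left"
)
print(merged.index.tolist())    # [0, 1]

# With a left join on unique right keys, rows stay aligned with `main`,
# so the original index can be put back.
restored = merged.set_axis(main.index, axis=0)
print(restored.index.tolist())  # [2, 3]
```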

@jeromedockes
Member Author

As discussed IRL with @Vincent-Maladiere, his comment above is correct, but we'll open a separate issue about it: this PR is about the index of the aux table, whereas the comment is about preserving the index of the main table.

@Vincent-Maladiere
Member

By the way, do we need to reset the index of the aux table? As we join on a key, the index shouldn't matter, should it?

@jeromedockes
Member Author

jeromedockes commented Jul 31, 2024 via email

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Right, LGTM then

@jeromedockes
Member Author

thanks @Vincent-Maladiere

@jeromedockes jeromedockes merged commit b78a5f2 into skrub-data:main Aug 1, 2024
22 checks passed
@jeromedockes jeromedockes deleted the fix-index-in-aggtarget-output branch August 1, 2024 13:28