no "index" column in aggtarget output #1020

jeromedockes · 2024-07-29T16:08:50Z

ATM the AggJoiner and AggTarget include an unwanted "index" column in their output when using the value counts or histogram operations.

After performing a groupby, the grouping column is in the index of the pandas result. To have it as a column and be able to join on it, we need to do a reset_index. skrub._dataframe._pandas.aggregate used to do it for histogram and value counts, but not for other aggregation functions. This went unnoticed (and was not a problem) because the AggJoiner and AggTarget used pandas.merge, which then performed the merge on the index rather than on a column: pandas.merge allows "right_on" to be either a column name or an index level name.
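To illustrate the point about pandas.merge, here is a minimal pandas-only sketch (not skrub's actual code; the table and column names are borrowed from the example further down in this thread). It shows that the grouping column ends up in the index after a groupby, and that pandas.merge still accepts its name in "right_on":

```python
import pandas as pd

main = pd.DataFrame({"airportId": [1, 2]})
aux = pd.DataFrame({
    "from_airport": [1, 1, 1, 2, 2, 2],
    "total_passengers": [90, 120, 100, 70, 80, 90],
})

# After the groupby, "from_airport" is the index name, not a column.
agg = aux.groupby("from_airport")["total_passengers"].mean().to_frame()
print(agg.index.name)     # from_airport
print(list(agg.columns))  # ['total_passengers']

# right_on refers to an index level name here, and the merge still works.
merged = pd.merge(main, agg, left_on="airportId", right_on="from_airport")
print(merged["total_passengers"].tolist())
```

This is why the missing reset_index for mean() had no visible effect before.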

In #945 a reset_index was added for both mean() and value_counts(). So before, the reset_index was done 0 times for mean() and 1 time for value_counts(); after, it was done 1 time for mean() and 2 times for value_counts(), the second one resulting in the "index" column.

What we want is to do reset_index exactly once in all cases.
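The effect of the extra reset_index can be reproduced in plain pandas. This is a small sketch with made-up data (the "key" and "value" names are illustrative only), not the code path in skrub:

```python
import pandas as pd

df = pd.DataFrame({"key": [1, 1, 2], "value": [10, 20, 30]})

# One reset_index: the grouping column moves from the index back to a column.
once = df.groupby("key")["value"].mean().reset_index()
print(list(once.columns))   # ['key', 'value']

# A second reset_index: the default, unnamed RangeIndex is now turned into
# a column, which pandas names "index".
twice = once.reset_index()
print(list(twice.columns))  # ['index', 'key', 'value']
```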

At least that's what I understand from a quick look; it would be great if @TheooJ and @Vincent-Maladiere could confirm.

Vincent-Maladiere · 2024-07-30T15:33:40Z

Hey @jeromedockes, actually, I don't remember why we initially needed to use reset_index. If my main table has some custom indices (e.g., due to a split), I might want to keep these. For example, losing them could create silent errors in data wrangling use-cases outside of a pipeline. WDYT?

import pandas as pd
from skrub import AggJoiner

main = pd.DataFrame({
    "airportId": [1, 2],
    "airportName": ["Paris CDG", "NY JFK"],
}, index=[2, 3])

aux = pd.DataFrame({
    "flightId": range(1, 7),
    "from_airport": [1, 1, 1, 2, 2, 2],
    "total_passengers": [90, 120, 100, 70, 80, 90],
    "company": ["DL", "AF", "AF", "DL", "DL", "TR"],
}, index=[10, 11, 12, 13, 14, 15])

AggJoiner(
    aux_table=aux,
    main_key="airportId",
    aux_key="from_airport",
    cols=["total_passengers", "company"],
    operations=["hist(4)", "mode"],
).fit_transform(main).index.tolist()

# >>> [0, 1], instead of [2, 3]
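
The same index loss can be seen with pandas alone, since a merge on columns returns a fresh RangeIndex. The sketch below is only an illustration with the tables above reduced to the relevant columns. The last step, which restores the main table's index, is an assumption about one possible fix (valid here because a left merge on unique aux keys keeps the row count and order of the main table), not what skrub does:

```python
import pandas as pd

main = pd.DataFrame({
    "airportId": [1, 2],
    "airportName": ["Paris CDG", "NY JFK"],
}, index=[2, 3])

aux = pd.DataFrame({
    "from_airport": [1, 1, 1, 2, 2, 2],
    "total_passengers": [90, 120, 100, 70, 80, 90],
})

# Aggregate, then put the grouping column back as a regular column.
agg = aux.groupby("from_airport")["total_passengers"].mean().reset_index()

# Merging on columns discards the index of the main table.
merged = main.merge(agg, how="left", left_on="airportId", right_on="from_airport")
print(merged.index.tolist())    # [0, 1]

# Hypothetical fix: reattach the original index after the left merge.
restored = merged.copy()
restored.index = main.index
print(restored.index.tolist())  # [2, 3]
```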

jeromedockes · 2024-07-31T12:16:56Z

As discussed IRL with @Vincent-Maladiere, his comment above is correct, but we'll open a separate issue about it: this PR is about the index of the aux table, whereas the comment is about preserving the index of the main table.

Vincent-Maladiere · 2024-07-31T16:15:37Z

By the way, do we need to reset the index of the aux table? As we join on a key, the index shouldn't matter, should it?

jeromedockes · 2024-07-31T16:19:30Z

> By the way, do we need to reset the index of the aux table? As we join on a key, the index shouldn't matter, should it?

After the groupby, the grouping column ends up as the index, so we reset it to put it back as a column that we can join on.

Vincent-Maladiere

Right, LGTM then

jeromedockes · 2024-08-01T13:28:11Z

thanks @Vincent-Maladiere

jeromedockes added 3 commits July 29, 2024 17:58

no "index" column in aggtarget output

e9d6e96

update tests

46144d7

[doc build]

89ed893

jeromedockes added this to the 0.3.0 milestone Jul 31, 2024

jeromedockes mentioned this pull request Jul 31, 2024

systematically handling column names and indexes of transformed dataframes #1021

Open

jeromedockes added 2 commits July 31, 2024 16:55

Merge remote-tracking branch 'upstream/main' into fix-index-in-aggtar…

26e272f

…get-output

changelog

f86dc7f

Vincent-Maladiere approved these changes Aug 1, 2024

jeromedockes merged commit b78a5f2 into skrub-data:main Aug 1, 2024
22 checks passed

jeromedockes deleted the fix-index-in-aggtarget-output branch August 1, 2024 13:28
