Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Pearsons correlation to _column_associations #1203

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

reshamas
Copy link
Contributor

Towards #1176

Add Pearsons correlation to column associations.

@reshamas
Copy link
Contributor Author

reshamas commented Dec 15, 2024

This is the output, which I excluded for now, due to errors:

       left_column_name  left_column_idx right_column_name  right_column_idx  cramer_v  pearson
    0               c_3                3             c_str                 5    0.8215      NaN
    1               c_1                1               c_4                 4    0.8215   0.1597
    2               c_0                0               c_1                 1    0.8215   0.1123
    3               c_2                2             c_str                 5    0.7551      NaN
    4               c_0                0             c_str                 5    0.7551      NaN
    5               c_0                0               c_3                 3    0.7551   0.3212
    6               c_1                1               c_3                 3    0.6837  -0.1887
    7               c_0                0               c_4                 4    0.6837  -0.3202
    8               c_4                4             c_str                 5    0.6837      NaN
    9               c_3                3               c_4                 4    0.6053  -0.0150
    10              c_2                2               c_3                 3    0.6053   0.1757
    11              c_1                1             c_str                 5    0.6053      NaN
    12              c_0                0               c_2                 2    0.6053  -0.0578
    13              c_2                2               c_4                 4    0.5169  -0.2885
    14              c_1                1               c_2                 2    0.4122  -0.4986

Not sure why Pearson is showing up as NaN for the diagonal correlations, they should be 1.000

       left right  pearson
    0   c_0   c_0   1.0000
    1   c_0   c_1   0.1123
    2   c_0   c_2  -0.0578
    3   c_0   c_3   0.3212
    4   c_0   c_4  -0.3202
    5   c_1   c_0   0.1123
    6   c_1   c_1   1.0000

Copy link
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @reshamas, thank you for starting this!

In skrub, we use methods defined in skrub._dataframe._common.py to dispatch operations on polars or pandas dataframes. We do this to avoid copying dataframes from one backend to the other, and later to enable features like lazy evaluation. Therefore, the input of the function skrub._column_associations.column_associations() could be either a pandas or a polars DataFrame.

So, here, we would like to use something like sbd.corr() ("sbd" stands for "skrub dataframe"), instead of pandas corr. Since this method doesn't exist yet, you'll have to add it to skrub._dataframe._common.py 😁

Alternatively, you could turn your dataframe into a numpy array with skb.to_numpy() and then use np.corrcoef. I slightly prefer the former suggestion because it adds corr() in skrub, which I think would be also useful later.

If you go with the first option, I suggest converting the output to a numpy 2d array, so that you can directly use the _stack_symmetric_associations() function to melt the symmetric matrix into a long format. You need to select only numeric columns when passing your dataframe to _stack_symmetric_associations().

Finally, you can apply the merge operation with skrub._join_utils.left_join.

We know that developing skrub is fairly involved at the moment, and we plan on making a simple documentation for all our internals. In the meantime, I'm here to help you to move this PR forward!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants