
Remove sorted_highlow from form_discrete_distribution call in ppi_distribution_label_shift_ci. #15

Merged
2 commits merged into aangelopoulos:main on Oct 27, 2024

Conversation

justinkay
Contributor

Addresses #14.
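For context, the fix is a one-line change inside ppi_distribution_label_shift_ci, roughly the following (a sketch only; the variable name dist and the labels argument are placeholders, not the exact library code):

# before: the empirical label distribution was re-ordered from most to least frequent class
dist = form_discrete_distribution(labels, sorted_highlow=True)

# after: keep the natural class ordering, so index k lines up with class k
dist = form_discrete_distribution(labels)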

@aangelopoulos
Owner

Just checking - have you tested this for the plankton example, and also on an example where the classes are not pre-sorted?

@justinkay
Contributor Author

I've checked this on the plankton example as well as a 'reversed' version of the plankton example where I swap the frequencies of the 0 and 1 classes:

Y = ~Y
Yhat = ~Yhat
Y_unlabeled = ~Y_unlabeled
Yhat_unlabeled = ~Yhat_unlabeled
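(For boolean arrays, ~ flips every label, which swaps the two class frequencies; if the labels were stored as 0/1 integers instead, the equivalent swap would be:)

Y = 1 - Y
Yhat = 1 - Yhat
Y_unlabeled = 1 - Y_unlabeled
Yhat_unlabeled = 1 - Yhat_unlabeled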

The resulting plots are:

Original plankton problem, no code change (sorted_highlow=True)
[figure: ppi_plankton_sorted=True]

Original plankton problem, with proposed code change (sorted_highlow=False)
[figure: ppi_plankton_sorted=False]

So, the original results are unaffected by the change in this PR.

Reversed distribution, no code change (sorted_highlow=True)
[figure: ppi_plankton-reverse_sorted=True]

This estimate fails given the unchanged code.

Reversed distribution, with proposed code change (sorted_highlow=False)
[figure: ppi_plankton-reverse_sorted=False]

The change in this PR results in a reasonable estimate.

However, I have not tested this for other cases, for example where the number of classes is > 2. I want to make sure I understand the original intention of sorting the discrete distribution by class frequency, so that this change doesn't break something else in the future. Do you remember why this was originally done? Are there any other tests you can think of that I should try? Happy to add to the pytests as appropriate.
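For example, something along these lines, with class frequencies deliberately not sorted high-to-low by class index (a sketch only -- the exact call signature of ppi_distribution_label_shift_ci is assumed here and would need to be matched to the real API, e.g. by following the plankton example):

import numpy as np
from ppi_py import ppi_distribution_label_shift_ci


def test_label_shift_ci_with_unsorted_classes():
    rng = np.random.default_rng(0)
    K = 3
    p = np.array([0.2, 0.5, 0.3])  # class 1 is most common, class 0 least
    n, N = 2000, 20000

    # Labeled and unlabeled data drawn from the same label distribution,
    # with a noisy "classifier" that is correct 80% of the time.
    Y = rng.choice(K, size=n, p=p)
    Yhat = np.where(rng.random(n) < 0.8, Y, rng.integers(0, K, size=n))
    Y_unlabeled = rng.choice(K, size=N, p=p)
    Yhat_unlabeled = np.where(
        rng.random(N) < 0.8, Y_unlabeled, rng.integers(0, K, size=N)
    )

    # nu picks out the frequency of class 1.
    nu = np.zeros(K)
    nu[1] = 1.0

    # NOTE: signature assumed -- adapt to the actual ppi_py API.
    ci = ppi_distribution_label_shift_ci(Y, Yhat, Yhat_unlabeled, K, nu, alpha=0.1)

    # The interval should cover the true frequency of class 1.
    assert ci[0] <= p[1] <= ci[1]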

@aangelopoulos
Owner

I think it's probably a relic from the way I used to run this experiment. It was in a notebook before it was in ppi_py, and in that notebook, there was some code for truncating the matrix to the first [K:,K:] submatrix or something (these are the K most common classes).
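(For anyone reading later, the kind of notebook-era trick being described is roughly the following; purely illustrative, not the original code:)

import numpy as np

rng = np.random.default_rng(0)
num_classes, K = 5, 3
Y = rng.integers(0, num_classes, size=1000)     # true labels
Yhat = rng.integers(0, num_classes, size=1000)  # predicted labels

# Order classes from most to least common, re-index the confusion matrix
# accordingly, and keep only the block for the K most common classes.
order = np.argsort(-np.bincount(Y, minlength=num_classes))
C = np.zeros((num_classes, num_classes))
np.add.at(C, (Y, Yhat), 1)
C_topK = C[np.ix_(order, order)][:K, :K]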

I don't see any way this could go wrong. If you could put a little synthetic data example as a pytest with more than 2 classes, go for it. Then once we confirm that it works, I'll test it too, and we'll merge.

You're a hero @justinkay thank you <3

@aangelopoulos
Owner

aangelopoulos commented Sep 23, 2024

Hey @justinkay --- sorry I lost track of this --- do you want to write these tests, or should I write them and merge?

@justinkay
Contributor Author

Hey @aangelopoulos sorry for the delay -- this keeps slipping down my to-do list. Happy to write a couple of tests for good measure, but probably won't get to it until after ECCV.

aangelopoulos merged commit cb245e0 into aangelopoulos:main on Oct 27, 2024 (1 check failed)
@aangelopoulos
Owner

Merged! :)

@aangelopoulos
Owner

Thanks for all your help @justinkay !
