Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to improve performance of contingency_similarity #625

Merged
merged 7 commits into from
Aug 29, 2024

Conversation

amontanez24
Copy link
Contributor

@amontanez24 amontanez24 commented Aug 20, 2024

resolves #622

Below is the profiling of the QualityReport before the changes. As you can see most time was spent in the crosstab method.
old_code_100_000

To improve this, I switched to using groupby instead. Groupby seems to be faster than crosstab and pivot_table according to this post and trials I ran. The one issue is that groupby returns data as a multi-indexed series instead of a table with the different column values on one axis and the different row values on the other. To compute the metric, I could either use unstack to convert the return value of groupby to the table we wanted, or change the logic to sum over the multi-indexed series. I tried both examples and show their results below. I used the same dataset for each method.

Results for all methods at 1000 rows

Screenshot 2024-08-20 at 11 46 11 AM

Results for all methods at 10,000 rows

Screenshot 2024-08-20 at 11 48 15 AM

Results for all methods at 1,000,000 rows

image

As you can see, as the number of rows increases, the gap in time between methods decreases. Despite that, groupby was always faster. The specific method of groupby that was fastest seemed to vary based on number of rows.

Below is the profiling for each method.

Using groupby, unstack and reindexing after unstacking

groupby_unstack_reindex

Using groupby, unsstack and reindexing before unstacking

groupby_reindex_unstack

Using groupby on its own

Screenshot 2024-08-20 at 11 57 15 AM

@amontanez24
Copy link
Contributor Author

@rwedge @pvk-developer I included all the variants I tried. Ultimately we'll just pick one before merging, but I wanted you to see them and give any feedback. Personally I like the option that only uses groupby and not unstack since it seems simplest.

@amontanez24
Copy link
Contributor Author

here are the .prof files if you want to examine yourself
Archive.zip

Comment on lines 102 to 113
elif method == 'groupby_reindex':
columns = real_data.columns[:2]
real = real_data[columns]
synthetic = synthetic_data[columns]
contingency_real = real.groupby(list(columns), dropna=False).size() / len(real)
contingency_synthetic = synthetic.groupby(list(columns), dropna=False).size() / len(synthetic)
combined_index = contingency_real.index.union(contingency_synthetic.index)
contingency_synthetic = contingency_synthetic.reindex(combined_index, fill_value=0)
contingency_real = contingency_real.reindex(combined_index, fill_value=0)
diff = abs(contingency_real - contingency_synthetic).fillna(0)
variation = diff / 2
return 1 - variation.sum()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote for this approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable to me as well

})
assert score == 0.6615076057048345
assert score == 0.6615076057048344
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new method seems to round the last digit down instead of up

@amontanez24 amontanez24 marked this pull request as ready for review August 27, 2024 16:16
@amontanez24 amontanez24 requested a review from a team as a code owner August 27, 2024 16:16
@amontanez24 amontanez24 merged commit be77884 into main Aug 29, 2024
47 checks passed
@amontanez24 amontanez24 deleted the issue-622-speedup-contingency-similarity branch August 29, 2024 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Try to improve performance of contingency_similarity
3 participants