Try to improve performance of contingency_similarity #625
Conversation
@rwedge @pvk-developer I included all the variants I tried. Ultimately we'll just pick one before merging, but I wanted you to see them and give any feedback. Personally I like the option that only uses `groupby` and not `unstack` since it seems simplest.

Here are the .prof files if you want to examine yourself.
```python
elif method == 'groupby_reindex':
    columns = real_data.columns[:2]
    real = real_data[columns]
    synthetic = synthetic_data[columns]
    # Relative frequency of each pair of values, keeping NaN as its own group.
    contingency_real = real.groupby(list(columns), dropna=False).size() / len(real)
    contingency_synthetic = synthetic.groupby(list(columns), dropna=False).size() / len(synthetic)
    # Align both series on every pair seen in either dataset, filling missing pairs with 0.
    combined_index = contingency_real.index.union(contingency_synthetic.index)
    contingency_synthetic = contingency_synthetic.reindex(combined_index, fill_value=0)
    contingency_real = contingency_real.reindex(combined_index, fill_value=0)
    # Score is 1 minus the total variation distance between the two contingency tables.
    diff = abs(contingency_real - contingency_synthetic).fillna(0)
    variation = diff / 2
    return 1 - variation.sum()
```
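For comparison, here is a rough, self-contained sketch of the unstack-based variant (presumably the "reindexing after unstacking" option profiled below). The function name and the toy data are illustrative only and are not the exact code in this branch:

```python
import pandas as pd


def contingency_similarity_unstack(real_data, synthetic_data):
    """Sketch: groupby + unstack, then reindex the resulting 2-D tables."""
    columns = real_data.columns[:2]
    real = real_data[columns]
    synthetic = synthetic_data[columns]
    # Relative frequency of each pair of values, pivoted into a 2-D table.
    contingency_real = (
        real.groupby(list(columns), dropna=False).size().unstack(fill_value=0) / len(real)
    )
    contingency_synthetic = (
        synthetic.groupby(list(columns), dropna=False).size().unstack(fill_value=0) / len(synthetic)
    )
    # Align rows and columns so categories unseen in one table count as frequency 0.
    rows = contingency_real.index.union(contingency_synthetic.index)
    cols = contingency_real.columns.union(contingency_synthetic.columns)
    contingency_real = contingency_real.reindex(index=rows, columns=cols, fill_value=0)
    contingency_synthetic = contingency_synthetic.reindex(index=rows, columns=cols, fill_value=0)
    # 1 minus the total variation distance between the two contingency tables.
    diff = (contingency_real - contingency_synthetic).abs()
    return 1 - diff.to_numpy().sum() / 2


real = pd.DataFrame({'a': [1, 1, 2, 2], 'b': ['x', 'y', 'x', 'x']})
synthetic = pd.DataFrame({'a': [1, 2, 2, 2], 'b': ['x', 'x', 'y', 'x']})
print(contingency_similarity_unstack(real, synthetic))
```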
I vote for this approach.
This seems reasonable to me as well
```diff
 })
-assert score == 0.6615076057048345
+assert score == 0.6615076057048344
```
The new method seems to round the last digit down instead of up
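If this last-digit difference is purely a floating-point summation-order effect, one option (my suggestion, not something proposed in this PR) would be an approximate comparison in the test:

```python
import pytest

# Tolerate differences in the last digit that come from a different
# order of floating-point operations.
assert score == pytest.approx(0.6615076057048345)
```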
resolves #622
Below is the profiling of the `QualityReport` before the changes. As you can see, most time was spent in the `crosstab` method. To improve this, I switched to using `groupby` instead. `groupby` seems to be faster than `crosstab` and `pivot_table` according to this post and trials I ran. The one issue is that `groupby` returns data as a multi-indexed series instead of a table with the different column values on one axis and the different row values on the other. To compute the metric, I could either use `unstack` to convert the return value of `groupby` to the table we wanted, or change the logic to sum over the multi-indexed series. I tried both approaches and show their results below. I used the same dataset for each method.

Results for all methods at 1000 rows
Results for all methods at 10,000 rows
Results for all methods at 1,000,000 rows
As you can see, as the number of rows increases, the gap in time between methods decreases. Despite that, `groupby` was always faster. The specific `groupby` method that was fastest seemed to vary based on the number of rows.
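For readers who want to reproduce this kind of comparison, here is a minimal timing harness of my own (not the profiling setup used for the numbers above); the column names and row counts are arbitrary:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 100_000
data = pd.DataFrame({
    'a': rng.integers(0, 50, n_rows),
    'b': rng.integers(0, 50, n_rows),
})


def with_crosstab():
    return pd.crosstab(data['a'], data['b'], normalize=True)


def with_groupby():
    return data.groupby(['a', 'b'], dropna=False).size() / len(data)


for fn in (with_crosstab, with_groupby):
    seconds = timeit.timeit(fn, number=10)
    print(f'{fn.__name__}: {seconds / 10:.4f} s per call')
```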
Below is the profiling for each method.
Using groupby, unstack and reindexing after unstacking
Using groupby, unstack and reindexing before unstacking
Using groupby on its own
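To spell out the difference between the two unstack variants in code, here is a rough sketch of the "reindexing before unstacking" option; as with the earlier sketch, the function name is illustrative and this is not the exact code in the branch:

```python
def contingency_similarity_reindex_then_unstack(real_data, synthetic_data):
    """Sketch: align the multi-indexed frequency series first, then unstack."""
    columns = real_data.columns[:2]
    real = real_data[columns]
    synthetic = synthetic_data[columns]
    contingency_real = real.groupby(list(columns), dropna=False).size() / len(real)
    contingency_synthetic = synthetic.groupby(list(columns), dropna=False).size() / len(synthetic)
    # Reindex both series to the union of observed pairs, then pivot each into a 2-D table.
    combined_index = contingency_real.index.union(contingency_synthetic.index)
    contingency_real = contingency_real.reindex(combined_index, fill_value=0).unstack(fill_value=0)
    contingency_synthetic = contingency_synthetic.reindex(combined_index, fill_value=0).unstack(fill_value=0)
    # 1 minus the total variation distance, as in the other variants.
    diff = (contingency_real - contingency_synthetic).abs()
    return 1 - diff.to_numpy().sum() / 2
```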