Try to improve performance of contingency_similarity #625
Conversation
@rwedge @pvk-developer I included all the variants I tried. Ultimately we'll just pick one before merging, but I wanted you to see them and give any feedback. Personally I like the option that only uses `groupby` and not `unstack` since it seems simplest.

Here are the .prof files if you want to examine yourself.
```python
elif method == 'groupby_reindex':
    columns = real_data.columns[:2]
    real = real_data[columns]
    synthetic = synthetic_data[columns]
    # Relative frequency of each pair of values, keeping NaN as its own group.
    contingency_real = real.groupby(list(columns), dropna=False).size() / len(real)
    contingency_synthetic = synthetic.groupby(list(columns), dropna=False).size() / len(synthetic)
    # Align both series on every pair seen in either dataset, filling missing pairs with 0.
    combined_index = contingency_real.index.union(contingency_synthetic.index)
    contingency_synthetic = contingency_synthetic.reindex(combined_index, fill_value=0)
    contingency_real = contingency_real.reindex(combined_index, fill_value=0)
    # Score is 1 minus the total variation distance between the two contingency tables.
    diff = abs(contingency_real - contingency_synthetic).fillna(0)
    variation = diff / 2
    return 1 - variation.sum()
```
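For comparison, here is a rough, self-contained sketch of the unstack-based variant (presumably the "reindexing after unstacking" option profiled below). The function name and the toy data are illustrative only and are not the exact code in this branch:

```python
import pandas as pd


def contingency_similarity_unstack(real_data, synthetic_data):
    """Sketch: groupby + unstack, then reindex the resulting 2-D tables."""
    columns = real_data.columns[:2]
    real = real_data[columns]
    synthetic = synthetic_data[columns]
    # Relative frequency of each pair of values, pivoted into a 2-D table.
    contingency_real = (
        real.groupby(list(columns), dropna=False).size().unstack(fill_value=0) / len(real)
    )
    contingency_synthetic = (
        synthetic.groupby(list(columns), dropna=False).size().unstack(fill_value=0) / len(synthetic)
    )
    # Align rows and columns so categories unseen in one table count as frequency 0.
    rows = contingency_real.index.union(contingency_synthetic.index)
    cols = contingency_real.columns.union(contingency_synthetic.columns)
    contingency_real = contingency_real.reindex(index=rows, columns=cols, fill_value=0)
    contingency_synthetic = contingency_synthetic.reindex(index=rows, columns=cols, fill_value=0)
    # 1 minus the total variation distance between the two contingency tables.
    diff = (contingency_real - contingency_synthetic).abs()
    return 1 - diff.to_numpy().sum() / 2


real = pd.DataFrame({'a': [1, 1, 2, 2], 'b': ['x', 'y', 'x', 'x']})
synthetic = pd.DataFrame({'a': [1, 2, 2, 2], 'b': ['x', 'x', 'y', 'x']})
print(contingency_similarity_unstack(real, synthetic))
```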
I vote for this approach.
This seems reasonable to me as well
```diff
 })
-assert score == 0.6615076057048345
+assert score == 0.6615076057048344
```
The new method seems to round the last digit down instead of up
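If this last-digit difference is purely a floating-point summation-order effect, one option (my suggestion, not something proposed in this PR) would be an approximate comparison in the test:

```python
import pytest

# Tolerate differences in the last digit that come from a different
# order of floating-point operations.
assert score == pytest.approx(0.6615076057048345)
```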
resolves #622
Below is the profiling of the `QualityReport` before the changes. As you can see, most time was spent in the `crosstab` method. To improve this, I switched to using `groupby` instead. `groupby` seems to be faster than `crosstab` and `pivot_table` according to this post and trials I ran. The one issue is that `groupby` returns data as a multi-indexed series instead of a table with the different column values on one axis and the different row values on the other. To compute the metric, I could either use `unstack` to convert the return value of `groupby` to the table we wanted, or change the logic to sum over the multi-indexed series. I tried both approaches and show their results below. I used the same dataset for each method.

Results for all methods at 1000 rows
Results for all methods at 10,000 rows
Results for all methods at 1,000,000 rows
As you can see, as the number of rows increases, the gap in time between methods decreases. Despite that, `groupby` was always faster. The specific `groupby` method that was fastest seemed to vary based on the number of rows.
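For readers who want to reproduce this kind of comparison, here is a minimal timing harness of my own (not the profiling setup used for the numbers above); the column names and row counts are arbitrary:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 100_000
data = pd.DataFrame({
    'a': rng.integers(0, 50, n_rows),
    'b': rng.integers(0, 50, n_rows),
})


def with_crosstab():
    return pd.crosstab(data['a'], data['b'], normalize=True)


def with_groupby():
    return data.groupby(['a', 'b'], dropna=False).size() / len(data)


for fn in (with_crosstab, with_groupby):
    seconds = timeit.timeit(fn, number=10)
    print(f'{fn.__name__}: {seconds / 10:.4f} s per call')
```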
Below is the profiling for each method.
Using groupby, unstack and reindexing after unstacking
Using groupby, unstack and reindexing before unstacking
Using groupby on its own
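To spell out the difference between the two unstack variants in code, here is a rough sketch of the "reindexing before unstacking" option; as with the earlier sketch, the function name is illustrative and this is not the exact code in the branch:

```python
def contingency_similarity_reindex_then_unstack(real_data, synthetic_data):
    """Sketch: align the multi-indexed frequency series first, then unstack."""
    columns = real_data.columns[:2]
    real = real_data[columns]
    synthetic = synthetic_data[columns]
    contingency_real = real.groupby(list(columns), dropna=False).size() / len(real)
    contingency_synthetic = synthetic.groupby(list(columns), dropna=False).size() / len(synthetic)
    # Reindex both series to the union of observed pairs, then pivot each into a 2-D table.
    combined_index = contingency_real.index.union(contingency_synthetic.index)
    contingency_real = contingency_real.reindex(combined_index, fill_value=0).unstack(fill_value=0)
    contingency_synthetic = contingency_synthetic.reindex(combined_index, fill_value=0).unstack(fill_value=0)
    # 1 minus the total variation distance, as in the other variants.
    diff = (contingency_real - contingency_synthetic).abs()
    return 1 - diff.to_numpy().sum() / 2
```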