Add metric for sequence length similarity #643

fealho · 2024-10-23T21:48:48Z

CU-86b2evqnr, Resolve #638.

sdv-team · 2024-10-23T21:48:53Z

Task linked: CU-86b2evqnr SDMetrics - Add metric for sequence length similarity #638

R-Palazzo

Looks good!
I just want to check if we should use actual or normalized counts.
Currently, we use non-normalized counts, so the metric score depends on the size of the real and synthetic data; for instance:

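import pandas as pd

# Real data: the six ids repeated twice, so the sequence lengths are [4, 6, 2]
real_data = pd.Series(['id1', 'id1', 'id2', 'id2', 'id2', 'id3'] * 2)
# Synthetic data: the sequence lengths are [2, 1, 3]
synthetic_data = pd.Series(['id4', 'id4', 'id5', 'id6', 'id6', 'id6'])

score = SequenceLengthSimilarity.compute(real_data, synthetic_data)

print(score)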

This prints 0.33, even though the real and synthetic data have the same proportional structure: in each, one sequence makes up half of the rows, one makes up one-third, and one makes up one-sixth. Let me know if this makes sense. @Neha, do you have any thoughts on it?
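To make the observation concrete, here is a minimal sketch in plain Python that does not call SDMetrics at all. It counts the rows per sequence id and then divides by the total number of rows. The helper `sequence_lengths` is a hypothetical name used only for illustration, and the normalization shown is just one possible reading of "normalized counts", not something the PR implements.

```python
from collections import Counter
from fractions import Fraction


def sequence_lengths(ids):
    """Hypothetical helper: number of rows per sequence id, in first-seen order."""
    return list(Counter(ids).values())


real_ids = ['id1', 'id1', 'id2', 'id2', 'id2', 'id3'] * 2
synthetic_ids = ['id4', 'id4', 'id5', 'id6', 'id6', 'id6']

# Raw (non-normalized) sequence lengths differ between the two datasets.
real_lengths = sequence_lengths(real_ids)            # [4, 6, 2]
synthetic_lengths = sequence_lengths(synthetic_ids)  # [2, 1, 3]

# Share of the total rows taken by each sequence (assumed normalization).
real_shares = sorted(Fraction(n, len(real_ids)) for n in real_lengths)
synthetic_shares = sorted(Fraction(n, len(synthetic_ids)) for n in synthetic_lengths)

print(real_lengths, synthetic_lengths)
print(real_shares == synthetic_shares)  # True: both are 1/6, 1/3, 1/2
```

Under this assumed normalization the two datasets look identical, while the raw lengths do not, which is the trade-off being asked about.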

npatki · 2024-10-29T16:50:07Z

"the metric score depends on the size of the real and synthetic data"

As it should! The metric is called SequenceLengthSimilarity, so of course the length of the sequences matters.

In this example, score1 should not be equal to score2:

real_data1 = pd.Series(['id1', 'id1', 'id2', 'id2', 'id2', 'id3'])
real_data2 = pd.Series(['id1', 'id1', 'id2', 'id2', 'id2', 'id3'] * 2)
synthetic_data = pd.Series(['id4', 'id4', 'id5', 'id6', 'id6', 'id6'])

score1 = SequenceLengthSimilarity.compute(real_data1, synthetic_data)
score2 = SequenceLengthSimilarity.compute(real_data2, synthetic_data)

This is because the SequenceLengthSimilarity metric is computing the length of each sequence and then using KSComplement on the resulting distributions.

Length of sequences in real_data1 is: [2, 3, 1]
Length of sequences in real_data2 is: [4, 6, 2]
Length of sequences in synthetic_data is: [2, 1, 3]

As a result, score1 should be 1.0 while score2 should be 0.33333. Normalizing would defeat the purpose of comparing lengths.
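The two scores can be checked by hand with a small pure-Python sketch. This is not the SDMetrics implementation: `sequence_lengths` and `ks_complement` are hypothetical helpers, and the hand-rolled two-sample Kolmogorov-Smirnov statistic below only stands in for what KSComplement is described as doing (one minus the largest gap between the two empirical CDFs).

```python
from collections import Counter


def sequence_lengths(ids):
    """Hypothetical helper: number of rows per sequence id."""
    return list(Counter(ids).values())


def ks_complement(real, synthetic):
    """Hypothetical stand-in: 1 minus the two-sample KS statistic."""
    def ecdf(sample, x):
        # Fraction of the sample that is less than or equal to x.
        return sum(v <= x for v in sample) / len(sample)

    # The largest gap between two step functions occurs at an observed value.
    points = set(real) | set(synthetic)
    statistic = max(abs(ecdf(real, x) - ecdf(synthetic, x)) for x in points)
    return 1 - statistic


real_data1 = ['id1', 'id1', 'id2', 'id2', 'id2', 'id3']
real_data2 = real_data1 * 2
synthetic_data = ['id4', 'id4', 'id5', 'id6', 'id6', 'id6']

synthetic_lengths = sequence_lengths(synthetic_data)  # [2, 1, 3]
score1 = ks_complement(sequence_lengths(real_data1), synthetic_lengths)
score2 = ks_complement(sequence_lengths(real_data2), synthetic_lengths)

print(score1, round(score2, 5))  # 1.0 0.33333
```

For score2, the largest CDF gap is at length 3: only one of the three real sequences (length 2) is that short, while all three synthetic sequences are, giving a gap of 2/3 and a complement of 1/3.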

@amontanez24 could you also review if you get the chance?

amontanez24

This lgtm!

tests/unit/timeseries/test_sequence_length_similarity.py

…nce-similarity

ADd metric

d5a3725

fealho marked this pull request as ready for review October 25, 2024 16:55

fealho requested a review from a team as a code owner October 25, 2024 16:55

fealho requested review from gsheni and frances-h and removed request for a team and gsheni October 25, 2024 16:55

Merge branch 'main' into issue-638-sequence-similarity

b9fab0e

fealho requested a review from R-Palazzo October 28, 2024 16:38

fealho force-pushed the issue-638-sequence-similarity branch from 0e707c2 to ef768d2 Compare October 28, 2024 17:43

Fix ordering of the metric

9524b00

fealho force-pushed the issue-638-sequence-similarity branch from ef768d2 to 9524b00 Compare October 29, 2024 06:09

frances-h approved these changes Oct 29, 2024

R-Palazzo reviewed Oct 29, 2024

R-Palazzo self-requested a review October 29, 2024 15:05

R-Palazzo approved these changes Oct 29, 2024

fealho requested a review from amontanez24 October 29, 2024 17:57

fealho changed the base branch from main to feature-branch-timeseries-metrics October 30, 2024 16:47

amontanez24 approved these changes Oct 30, 2024

tests/unit/timeseries/test_sequence_length_similarity.py

fealho added 2 commits October 31, 2024 04:24

Add test case

ae46e7e

Merge branch 'feature-branch-timeseries-metrics' into issue-638-seque…

036de6a

…nce-similarity

fealho merged commit dd93b1a into feature-branch-timeseries-metrics Nov 5, 2024
47 checks passed

fealho deleted the issue-638-sequence-similarity branch November 5, 2024 03:41

fealho added a commit that referenced this pull request Nov 14, 2024

Add metric for sequence length similarity (#643)

b677125

fealho added a commit that referenced this pull request Nov 15, 2024

Add metric for sequence length similarity (#643)

161808f
