Add metric for sequence length similarity #643

Merged

fealho
Member

@fealho fealho commented Oct 23, 2024

CU-86b2evqnr, Resolve #638.

@fealho fealho marked this pull request as ready for review October 25, 2024 16:55
@fealho fealho requested a review from a team as a code owner October 25, 2024 16:55
@fealho fealho requested review from gsheni and frances-h and removed request for a team and gsheni October 25, 2024 16:55
@fealho fealho requested a review from R-Palazzo October 28, 2024 16:38
@fealho fealho force-pushed the issue-638-sequence-similarity branch from 0e707c2 to ef768d2 Compare October 28, 2024 17:43
@fealho fealho force-pushed the issue-638-sequence-similarity branch from ef768d2 to 9524b00 Compare October 29, 2024 06:09
Contributor

@R-Palazzo R-Palazzo left a comment

Looks good!
I just want to check if we should use actual or normalized counts.
Currently, we use non-normalized counts, so the metric score depends on the size of the real and synthetic data; for instance:

real_data = pd.Series(['id1', 'id1', 'id2', 'id2', 'id2', 'id3'] * 2)
synthetic_data = pd.Series(['id4', 'id4', 'id5', 'id6', 'id6', 'id6'])

score = SequenceLengthSimilarity.compute(real_data, synthetic_data)

print(score)

This prints 0.33, even though the proportions are the same in the real and the synthetic data: in both, one sequence makes up half of the rows, one makes up one-third, and one makes up one-sixth. Let me know if that makes sense. @Neha, do you have any thoughts on it?
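
To make the proportions in this example concrete, here is a minimal sketch. Plain Python lists stand in for the `pd.Series` objects, and `length_shares` is a hypothetical helper written for illustration; it is not part of SDMetrics.

```python
from collections import Counter
from fractions import Fraction


def length_shares(keys):
    """Share of all rows taken up by each sequence, sorted ascending."""
    counts = Counter(keys)
    total = len(keys)
    return sorted(Fraction(count, total) for count in counts.values())


real_data = ['id1', 'id1', 'id2', 'id2', 'id2', 'id3'] * 2
synthetic_data = ['id4', 'id4', 'id5', 'id6', 'id6', 'id6']

real_shares = length_shares(real_data)            # 1/6, 1/3, 1/2
synthetic_shares = length_shares(synthetic_data)  # 1/6, 1/3, 1/2
print(real_shares == synthetic_shares)  # True
```

The shares match exactly, so a metric built on normalized counts would treat the two datasets as identical, while the raw sequence lengths (4, 6, 2 versus 2, 1, 3) differ.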

@R-Palazzo R-Palazzo self-requested a review October 29, 2024 15:05
@npatki
Contributor

npatki commented Oct 29, 2024

the metric score depends on the size of the real and synthetic data

As it should! The metric is called SequenceLengthSimilarity, so of course the length of the sequences matters.

In this example, score1 should not be equal to score2:

real_data1 = pd.Series(['id1', 'id1', 'id2', 'id2', 'id2', 'id3'])
real_data2 = pd.Series(['id1', 'id1', 'id2', 'id2', 'id2', 'id3']*2)
synthetic_data = pd.Series(['id4', 'id4', 'id5', 'id6', 'id6', 'id6'])

score1 = SequenceLengthSimilarity.compute(real_data1, synthetic_data)
score2 = SequenceLengthSimilarity.compute(real_data2, synthetic_data)

This is because the SequenceLengthSimilarity metric is computing the length of each sequence and then using KSComplement on the resulting distributions.

  • Length of sequences in real_data1 is: [2, 3, 1]
  • Length of sequences in real_data2 is: [4, 6, 2]
  • Length of sequences in synthetic_data is: [2, 1, 3]

As a result, score1 should be 1.0 while score2 should be 0.33333. Normalizing would defeat the purpose of comparing lengths.
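
The two scores can be reproduced with a small standalone sketch. It is an illustration under stated assumptions, not the SDMetrics implementation: plain lists stand in for the `pd.Series` objects, and `sequence_lengths` and `ks_complement` are hypothetical helpers that count rows per sequence key and compute 1 minus the two-sample Kolmogorov-Smirnov statistic by hand.

```python
from collections import Counter


def sequence_lengths(keys):
    """Number of rows per sequence key, in first-seen order."""
    return list(Counter(keys).values())


def ks_complement(real, synthetic):
    """1 minus the largest gap between the two empirical CDFs."""
    def ecdf(sample, x):
        return sum(value <= x for value in sample) / len(sample)

    # The gap can only change at observed values, so checking those is enough.
    points = set(real) | set(synthetic)
    statistic = max(abs(ecdf(real, x) - ecdf(synthetic, x)) for x in points)
    return 1 - statistic


real_data1 = ['id1', 'id1', 'id2', 'id2', 'id2', 'id3']
real_data2 = real_data1 * 2
synthetic_data = ['id4', 'id4', 'id5', 'id6', 'id6', 'id6']

synthetic_lengths = sequence_lengths(synthetic_data)  # [2, 1, 3]
score1 = ks_complement(sequence_lengths(real_data1), synthetic_lengths)
score2 = ks_complement(sequence_lengths(real_data2), synthetic_lengths)
print(score1, round(score2, 5))  # 1.0 0.33333
```

For `real_data2`, the largest gap occurs at length 3: every synthetic sequence has length 3 or less, but only one of the three real sequences does, so the statistic is 2/3 and the score is 1/3.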

@amontanez24 could you also review if you get the chance?

@fealho fealho requested a review from amontanez24 October 29, 2024 17:57
@fealho fealho changed the base branch from main to feature-branch-timeseries-metrics October 30, 2024 16:47
Contributor

@amontanez24 amontanez24 left a comment

This lgtm!

@fealho fealho merged commit dd93b1a into feature-branch-timeseries-metrics Nov 5, 2024
47 checks passed
@fealho fealho deleted the issue-638-sequence-similarity branch November 5, 2024 03:41
fealho added a commit that referenced this pull request Nov 14, 2024
fealho added a commit that referenced this pull request Nov 15, 2024
Successfully merging this pull request may close these issues.

Add metric for sequence length similarity
6 participants