-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add metric for sequence length similarity #643
Add metric for sequence length similarity #643
Conversation
0e707c2
to
ef768d2
Compare
ef768d2
to
9524b00
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
I just want to check if we should use actual or normalized counts.
Currently, we use non-normalized counts, so the metric score depends on the size of the real and synthetic data; for instance:
real_data = pd.Series(['id1', 'id1', 'id2', 'id2', 'id2', 'id3'] * 2)
synthetic_data = pd.Series(['id4', 'id4', 'id5', 'id6', 'id6', 'id6'])
score = SequenceLengthSimilarity.compute(real_data, synthetic_data)
print(score)
We will print 0.33
, while in both real and synthetic data, we have a sequence that is half; one is one-third, and the last is one-sixth of the data size. Let me know if it makes sense. @Neha, do you have any thoughts on it?
As it should! The metric is called SequenceLengthSimilarity, so of course the length of the sequences matter. In this example, real_data1 = pd.Series(['id1', 'id1', 'id2', 'id2', 'id2', 'id3'])
real_data2 = pd.Series(['id1', 'id1', 'id2', 'id2', 'id2', 'id3']*2)
synthetic_data = pd.Series(['id4', 'id4', 'id5', 'id6', 'id6', 'id6'])
score1 = SequenceLengthSimilarity.compute(real_data1, synthetic_data)
score2 = SequenceLengthSimilarity.compute(real_data2, synthetic_data) This is because the
As a result, @amontanez24 could you also review if you get the chance? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This lgtm!
CU-86b2evqnr, Resolve #638.