Jaccard similarity differences when using information content #746
Comments
Very nice ticket, subscribing with interest to the thread.
Certainly strange and unexpected. See ontology-access-kit/src/oaklib/interfaces/semsim_interface.py, lines 224 to 228 in aef85c6.
I ran a semantic similarity experiment and noticed that the Jaccard similarity scores of certain records differed when I used the --information-content-file option. I am unsure of the reason for this behavior and have documented the experiment details so that anyone can reproduce it. If anyone can explain these differences, I would appreciate it.
The first run did not use any information content file:
Next, I used the same parameters, adding only the --information-content-file option:
The HP and MP terms' information content files were generated separately and merged into a final file.
Here is some exploratory analysis comparing the Jaccard similarity values from the two runs.
Although the percentiles have the same value, 38,798 records differ in their Jaccard similarity values. To identify the most extreme differences, I selected the top 10 records. Of these 10, five showed an increase in the Jaccard score when an external IC file was passed during calculation, and five showed a decrease.
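For anyone who wants to repeat the comparison above, here is a minimal sketch in plain Python. It assumes each run's output has already been loaded into a mapping from (subject, object) term pairs to Jaccard score. The function name, the tolerance, and the sample identifiers and values are all hypothetical and are not taken from the experiment or from oaklib.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]
Delta = Tuple[Pair, float]


def extreme_differences(
    without_ic: Dict[Pair, float],
    with_ic: Dict[Pair, float],
    top_n: int = 5,
    tolerance: float = 1e-9,
) -> Tuple[int, List[Delta], List[Delta]]:
    """Return (number of differing pairs, largest increases, largest decreases)."""
    # Only pairs present in both runs can be compared.
    shared = without_ic.keys() & with_ic.keys()
    # Positive delta: the score went up when the IC file was supplied.
    deltas = {pair: with_ic[pair] - without_ic[pair] for pair in shared}
    # Ignore differences that are only floating-point noise.
    differing = {p: d for p, d in deltas.items() if abs(d) > tolerance}
    increases = sorted(
        ((p, d) for p, d in differing.items() if d > 0),
        key=lambda item: item[1],
        reverse=True,
    )[:top_n]
    decreases = sorted(
        ((p, d) for p, d in differing.items() if d < 0),
        key=lambda item: item[1],
    )[:top_n]
    return len(differing), increases, decreases


# Illustrative placeholder values only (hypothetical identifiers and scores).
run_without_ic = {("HP:a", "MP:a"): 0.50, ("HP:b", "MP:b"): 0.30, ("HP:c", "MP:c"): 0.80}
run_with_ic = {("HP:a", "MP:a"): 0.50, ("HP:b", "MP:b"): 0.45, ("HP:c", "MP:c"): 0.60}

count, ups, downs = extreme_differences(run_without_ic, run_with_ic)
print(count, ups, downs)
```

Taking the five largest positive and five largest negative deltas separately mirrors the "five increased, five decreased" selection described above; the tolerance is there so that rounding noise is not counted as a real difference.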