Jaccard similarity differences when using information content #746

Open
souzadevinicius opened this issue May 1, 2024 · 2 comments

@souzadevinicius
Contributor

souzadevinicius commented May 1, 2024

While running a semantic similarity experiment, I noticed that the Jaccard similarity scores of some term pairs change when I pass the --information-content-file option. I don't know why this happens, so I have documented the experiment below in case anyone wants to reproduce it. I would appreciate any explanation of these differences.

The first run did not use an information content file:

runoak -i semsimian:sqlite:phenio.db similarity -p i \
--set1-file hp_terms.txt \
--set2-file mp_terms.txt \
--min-jaccard-similarity 0.4 \
-O csv \
-o semsim_without_ic_file.tsv

The second run used the same parameters, adding only the --information-content-file option:

runoak -i semsimian:sqlite:phenio.db similarity -p i \
--set1-file hp_terms.txt \
--set2-file mp_terms.txt \
--min-jaccard-similarity 0.4 \
--information-content-file phenio_monarch_hp_mp_ic.tsv \
-O csv \
-o semsim_with_ic_file.tsv

The information content files for the HP and MP terms were generated separately with the commands below and then merged into the single file passed to --information-content-file:

runoak -i phenio.db -g gene_phenotype.9606.tsv -G hpoa_g2p information-content -p i i^HP: -o phenio_monarch_hp_ic.tsv
runoak -i phenio.db -g gene_phenotype.10090.tsv -G hpoa_g2p information-content -p i i^MP: -o phenio_mp_ic.tsv
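The merge step itself is not shown in the post. A minimal sketch of one way to do it, assuming each IC file is tab-separated with the term CURIE in the first column and its IC value in the second (that layout and the function name `merge_ic_files` are assumptions for illustration, not taken from the thread or from OAK):

```python
import csv


def merge_ic_files(paths, out_path):
    """Merge tab-separated (term, IC) files; a later file wins on duplicate terms."""
    merged = {}
    for path in paths:
        with open(path, newline="") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                # Skip blank or malformed rows rather than guessing their meaning.
                if len(row) < 2:
                    continue
                merged[row[0]] = row[1]
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(merged.items())
    return merged


# Example call using the file names from the commands above:
# merge_ic_files(["phenio_monarch_hp_ic.tsv", "phenio_mp_ic.tsv"],
#                "phenio_monarch_hp_mp_ic.tsv")
```

Because the two files were computed from different association sets (human and mouse), any term that appears in both would carry two different IC values. With this sketch the value from the later file silently wins, so it may be worth checking for such overlaps before merging.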

Here is some exploratory analysis of the Jaccard similarity values from the two runs:

| property | semsim_without_ic | semsim_with_ic |
| --- | --- | --- |
| count | 1,485,387.00 | 1,522,836.00 |
| mean | 0.44 | 0.44 |
| std | 0.03 | 0.03 |
| min | 0.40 | 0.40 |
| 25% | 0.41 | 0.41 |
| 50% | 0.43 | 0.43 |
| 75% | 0.46 | 0.46 |
| max | 0.70 | 0.70 |
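Summary statistics like these can hide per-pair changes, and the differing row counts mean some pairs pass the 0.4 threshold in only one run. A minimal sketch of a pair-level comparison, assuming both output files have a header row with `subject_id`, `object_id`, and `jaccard_similarity` columns (the function names and the default delimiter are assumptions; adjust `delimiter` to match what the files actually contain):

```python
import csv


def load_scores(path, delimiter="\t"):
    """Map (subject_id, object_id) -> jaccard_similarity for one output file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        return {
            (row["subject_id"], row["object_id"]): float(row["jaccard_similarity"])
            for row in reader
        }


def compare_runs(without_ic, with_ic, tolerance=1e-6):
    """Return pairs with a changed score, pairs only in run 1, pairs only in run 2."""
    shared = without_ic.keys() & with_ic.keys()
    changed = {
        pair: (without_ic[pair], with_ic[pair])
        for pair in shared
        if abs(without_ic[pair] - with_ic[pair]) > tolerance
    }
    only_without = without_ic.keys() - with_ic.keys()
    only_with = with_ic.keys() - without_ic.keys()
    return changed, only_without, only_with


# Example call using the output names from the commands above:
# changed, only_without, only_with = compare_runs(
#     load_scores("semsim_without_ic_file.tsv"),
#     load_scores("semsim_with_ic_file.tsv"),
# )
```

Separating changed pairs from pairs present in only one file helps distinguish a genuine score change from a pair that merely moved across the --min-jaccard-similarity cutoff.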

Although the mean, standard deviation, and percentiles match at the displayed precision, 38,798 records differ in their Jaccard similarity values. To identify the most extreme differences, I selected the top 10 records. Of these 10, five have a higher Jaccard score when an external IC file is passed and five have a lower score.

| subject_id | object_id | jaccard_similarity_without_ic | jaccard_similarity_with_ic | difference |
| --- | --- | --- | --- | --- |
| HP:0025477 | MP:0013304 | 0.416667 | 0.481481 | 15.56% |
| HP:0025477 | MP:0012070 | 0.416667 | 0.481481 | 15.56% |
| HP:0025477 | MP:0030485 | 0.416667 | 0.481481 | 15.56% |
| HP:0025477 | MP:0031348 | 0.416667 | 0.481481 | 15.56% |
| HP:0025477 | MP:0005422 | 0.416667 | 0.481481 | 15.56% |
| HP:0002514 | MP:0000783 | 0.465116 | 0.425532 | -9.30% |
| HP:0005671 | MP:0000783 | 0.454545 | 0.416667 | -9.09% |
| HP:0007045 | MP:0000783 | 0.454545 | 0.416667 | -9.09% |
| HP:0002514 | MP:0000787 | 0.5 | 0.458333 | -9.09% |
| HP:0005849 | MP:0000783 | 0.454545 | 0.416667 | -9.09% |

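The post does not define how the difference column was computed. The listed percentages are consistent with dividing the change by the smaller of the two scores; this is an inference from the numbers, not something stated in the thread, and the function name below is hypothetical. A short check against rows from the table:

```python
def relative_difference_pct(without_ic, with_ic):
    """Signed change in percent, relative to the smaller of the two scores.

    Assumption: this definition is inferred from the table values only.
    """
    return (with_ic - without_ic) / min(without_ic, with_ic) * 100


# (jaccard_similarity_without_ic, jaccard_similarity_with_ic) from the table
rows = [
    (0.416667, 0.481481),
    (0.465116, 0.425532),
    (0.454545, 0.416667),
    (0.5, 0.458333),
]
for without_ic, with_ic in rows:
    print(f"{relative_difference_pct(without_ic, with_ic):.2f}%")
# prints 15.56%, -9.30%, -9.09%, -9.09%
```

Dividing by the score without the IC file instead would give about -8.51% for the HP:0002514 / MP:0000783 row, which does not match the table.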
@matentzn
Contributor

matentzn commented May 1, 2024

Very nice ticket, subscribing with interest to the thread.

@caufieldjh
Collaborator

Certainly strange and unexpected.
Is the behavior reproducible with a smaller set of terms?
Or rather, does it happen when you use the basic OAK semsim implementation rather than semsimian?
I ask because I didn't think the semsimian implementation did anything with the information-content-file input; the semsim interface will cache the provided values here (

if self.cached_information_content_map is not None:
    for curie in curies:
        if curie in self.cached_information_content_map:
            yield curie, self.cached_information_content_map[curie]
    return
)
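To make the quoted logic easier to experiment with outside OAK, here is a stand-alone sketch of the same pattern. The function `cached_ic_scores` is hypothetical and only mirrors the snippet above; it is not part of the OAK or semsimian API.

```python
def cached_ic_scores(curies, cached_map):
    """Yield (curie, IC) pairs from a preloaded map, as in the quoted snippet.

    Stand-alone illustration: terms absent from the map are skipped,
    and nothing is recomputed for them.
    """
    if cached_map is not None:
        for curie in curies:
            if curie in cached_map:
                yield curie, cached_map[curie]
        return


# The IC value here is made up for illustration.
scores = list(cached_ic_scores(["HP:0025477", "MP:0000783"], {"HP:0025477": 3.2}))
print(scores)  # [('HP:0025477', 3.2)]
```

Under this pattern, a term missing from the supplied file simply yields no IC value, which may be worth ruling out when comparing runs with and without the file.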
