Possible inconsistencies with DMS ID, DOI, and selection type #26

Open
agitter opened this issue Apr 17, 2024 · 5 comments

@agitter

agitter commented Apr 17, 2024

Thanks for the excellent resource and making all the data so easily accessible. While combing through the csv files, we noticed a few possible inconsistencies I wanted to ask about.

DMS ID

In reference_files/DMS_substitutions.csv, datasets from the mega-scale stability experiment are named after Tsuboyama, e.g. ARGR_ECOLI_Tsuboyama_2023_1AOY. That is also the convention in benchmarks/DMS_zero_shot/substitutions/Spearman/DMS_substitutions_Spearman_DMS_level.csv. However, in benchmarks/DMS_supervised/substitutions/Spearman/DMS_substitutions_Spearman_DMS_level.csv they are named after Rocklin, e.g. ARGR_ECOLI_Rocklin_2023_1AOY.
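For anyone who wants to reproduce this kind of check, here is a minimal sketch that compares the set of IDs in two CSV files. The column name `DMS_id` and the helper names are assumptions for illustration (the benchmark files may label the ID column differently), and the one-row inputs are stand-ins rather than the real file contents.

```python
import csv
import io


def dms_ids(csv_text, column="DMS_id"):
    """Collect the distinct values of `column` from CSV text.

    The default column name is an assumption; adjust it to the real header.
    """
    return {row[column] for row in csv.DictReader(io.StringIO(csv_text))}


def id_mismatches(reference_ids, benchmark_ids):
    """Return (IDs only in the reference, IDs only in the benchmark), each sorted."""
    return sorted(reference_ids - benchmark_ids), sorted(benchmark_ids - reference_ids)


# Illustrative one-row stand-ins for the two files, not their real contents.
reference_csv = "DMS_id\nARGR_ECOLI_Tsuboyama_2023_1AOY\n"
supervised_csv = "DMS_id\nARGR_ECOLI_Rocklin_2023_1AOY\n"

only_reference, only_supervised = id_mismatches(
    dms_ids(reference_csv), dms_ids(supervised_csv)
)
print("Only in reference file:", only_reference)
print("Only in supervised benchmark:", only_supervised)
```

With real data, the file contents would be read from disk and passed in place of the two sample strings.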

DOI

The same mega-scale study appears to have multiple journal DOIs listed in the jo column of reference_files/DMS_substitutions.csv. The first, 10.1038/s41586-023-06328-6, is correct, but the subsequent entries incorrectly increment the final digit, e.g. 10.1038/s41586-023-06328-7 and 10.1038/s41586-023-06328-8.
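A rough way to spot this pattern is to group rows by study and flag any study with more than one distinct DOI. The sketch below assumes the study can be recovered from the `Author_Year` part of the DMS ID and that the DOI sits in the `jo` column as described above; the `DMS_id` column name, the regular expression, and the sample rows (including the placeholder DOI) are illustrative assumptions, not values copied from the file.

```python
import csv
import io
import re
from collections import defaultdict

# Assumed ID layout: <UniProt ID>_<Author>_<Year>[_<suffix>]; captures "Author_Year".
STUDY_RE = re.compile(r"_([A-Za-z]+_\d{4})(?:_|$)")


def studies_with_multiple_dois(csv_text, id_column="DMS_id", doi_column="jo"):
    """Map each study key to its sorted DOIs, keeping only studies with more than one."""
    dois = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        match = STUDY_RE.search(row[id_column])
        if match:
            dois[match.group(1)].add(row[doi_column])
    return {study: sorted(values) for study, values in dois.items() if len(values) > 1}


# Made-up rows that mimic the reported pattern; the last DOI is a placeholder.
sample_csv = (
    "DMS_id,jo\n"
    "ARGR_ECOLI_Tsuboyama_2023_1AOY,10.1038/s41586-023-06328-6\n"
    "PSAE_SYNP2_Tsuboyama_2023_1PSE,10.1038/s41586-023-06328-7\n"
    "HXK4_HUMAN_Gersing_2022,10.0000/placeholder\n"
)

print(studies_with_multiple_dois(sample_csv))
```

Grouping on the parsed `Author_Year` key is a heuristic; two different papers by the same first author in the same year would be merged, so any hit still needs a manual look.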

Selection type

In https://marks.hms.harvard.edu/proteingym/DMS_supervised_substitutions_scores.zip the targets column has the value fitness or fitness_unsupervised_prediction for all rows. Some of these assays have other selection types in reference_files/DMS_substitutions.csv.
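One way to make this comparison concrete is to list, for each assay, the selection type from the reference file next to the distinct values found in the targets column of the scores file. This is only a sketch: the column names `DMS_id` and `coarse_selection_type`, the function name, and the sample rows are assumptions for illustration and may not match the real headers or contents.

```python
import csv
import io
from collections import defaultdict


def targets_vs_selection(
    scores_csv,
    reference_csv,
    id_column="DMS_id",                        # assumed header
    target_column="targets",                   # column named in this issue
    selection_column="coarse_selection_type",  # assumed header
):
    """Return {DMS ID: (reference selection type, sorted target values)}."""
    selection = {
        row[id_column]: row[selection_column]
        for row in csv.DictReader(io.StringIO(reference_csv))
    }
    targets = defaultdict(set)
    for row in csv.DictReader(io.StringIO(scores_csv)):
        targets[row[id_column]].add(row[target_column])
    # IDs missing from the reference file get None as their selection type.
    return {
        dms_id: (selection.get(dms_id), sorted(values))
        for dms_id, values in targets.items()
    }


# Made-up rows for illustration only.
reference_sample = "DMS_id,coarse_selection_type\nARGR_ECOLI_Tsuboyama_2023_1AOY,Stability\n"
scores_sample = (
    "DMS_id,targets\n"
    "ARGR_ECOLI_Tsuboyama_2023_1AOY,fitness\n"
    "ARGR_ECOLI_Tsuboyama_2023_1AOY,fitness_unsupervised_prediction\n"
)

for dms_id, (selection_type, target_values) in targets_vs_selection(
    scores_sample, reference_sample
).items():
    print(dms_id, selection_type, target_values)
```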

@pascalnotin
Contributor

Hi Anthony - thank you very much for flagging all of these, we will fix them all in the next update!

@brycejoh16

brycejoh16 commented May 17, 2024

Hi @pascalnotin,

I manually built a mapping between the DMS_ids in the scoring file (https://marks.hms.harvard.edu/proteingym/DMS_supervised_substitutions_scores.csv) and the DMS_ids used for the sequences in the cross-validation splits (https://marks.hms.harvard.edu/proteingym/cv_folds_singles_substitutions.zip).

I made the mapping by manually inspecting which DMS_id in one file appeared to correspond to which DMS_id in the other.
Please check that each DMS_id in the scoring file corresponds to the correct DMS_id in the splits zip file.

Although both files contain 217 unique DMS_ids, two DMS_ids in the scoring file had no obvious counterpart among the DMS_ids in the splits zip file:

  • HXK4_HUMAN_Gersing_2022
  • B3VI55_LIPST_Klesmith_2015
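The mapping described above was made by hand. As a possible starting point for checking it, the sketch below proposes a candidate for each unmatched ID using fuzzy string matching from Python's standard `difflib` module. The function name, the 0.6 cutoff, and the sample IDs are illustrative assumptions, and this is not how the attached mapping was produced; every suggestion still needs manual review.

```python
import difflib


def propose_mapping(score_ids, split_ids, cutoff=0.6):
    """Suggest the closest split ID for each scoring-file ID without an exact match.

    Returns {score_id: best_candidate_or_None}. IDs that already appear
    verbatim in split_ids are skipped.
    """
    split_ids = list(split_ids)
    known = set(split_ids)
    mapping = {}
    for dms_id in score_ids:
        if dms_id in known:
            continue
        candidates = difflib.get_close_matches(dms_id, split_ids, n=1, cutoff=cutoff)
        mapping[dms_id] = candidates[0] if candidates else None
    return mapping


# Illustrative IDs only; which file uses which naming is an assumption here.
score_ids = ["ARGR_ECOLI_Rocklin_2023_1AOY", "HXK4_HUMAN_Gersing_2022"]
split_ids = ["ARGR_ECOLI_Tsuboyama_2023_1AOY", "HXK4_HUMAN_Gersing_2022"]

print(propose_mapping(score_ids, split_ids))
```

IDs that come back as `None` are the ones with no candidate above the cutoff, which is the situation reported for the two assays listed above.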

Thanks again for providing and maintaining ProteinGym. It is a great reference and a great way to discover new DMS datasets and explore models.

Here is the file of the 88 DMS_ids that are in the scoring file but not in the cross-validation splits zip file, along with my best guess at the mapping for each one. If you use it, please verify that the mappings are correct!

missed_dms_ids.csv

@BarKetPlace

Hi, I am adding a minor question to this thread:
In reference_files/DMS_substitutions.csv, KCNJ2_MOUSE is listed as Human/Homo sapiens.
Is this correct?

@agitter
Author

agitter commented Sep 13, 2024

The UniProt ID for PSAE_SYNP2_Tsuboyama_2023_1PSE was updated. https://www.uniprot.org/uniprot/PSAE_SYNP2 now redirects to https://www.uniprot.org/uniprotkb/P31969/entry with ID PSAE_PICP2 (see the history tab).

@pascalnotin
Contributor

Hi @agitter @BarKetPlace @brycejoh16 - thank you so much for the feedback!
Just fixed the reference files and updated zero-shot model performance accordingly in the latest commit. We will address the few typos re: supervised baselines in a separate commit.
