Possible inconsistencies with DMS ID, DOI, and selection type #26
Comments
Hi Anthony - thank you very much for flagging all of these, we will fix them all in the next update!
Hi @pascalnotin, I ended up manually building a mapping between the DMS_ids in the scoring file (https://marks.hms.harvard.edu/proteingym/DMS_supervised_substitutions_scores.csv) and the DMS_ids used to represent the sequences in the cross-validation splits (https://marks.hms.harvard.edu/proteingym/cv_folds_singles_substitutions.zip). I built the mapping by inspecting which DMS_id in one file appeared to correspond to which DMS_id in the other. Although both files contain 217 unique DMS_ids, two DMS_ids in the scoring file had no obvious counterpart among the DMS_ids in the split zip file.
Thanks again for providing and maintaining ProteinGym. It is a great reference and a good way to discover new DMS datasets and explore models. Here is the file of the 88 DMS_ids that are in the scoring file but not in the cross-validation split zip file, along with my best guess at the mapping for each one. If you use it, please verify that the mappings are correct!
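For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of one way to list the IDs that appear in one file but not the other and to propose candidate matches by string similarity. It is not the procedure used to build the mapping above, which was done by manual inspection. The function names `unmatched_ids` and `suggest_matches` and the 0.6 similarity cutoff are assumptions chosen for illustration, and the toy IDs in the usage example are not taken from the actual files. Any suggested match would still need manual verification.

```python
from difflib import get_close_matches


def unmatched_ids(ids_a, ids_b):
    """Return the IDs present in ids_a but absent from ids_b, sorted."""
    return sorted(set(ids_a) - set(ids_b))


def suggest_matches(scoring_ids, split_ids, cutoff=0.6):
    """Propose a split ID for each scoring ID that has no exact match.

    Only split IDs that are themselves unmatched are considered as
    candidates. The value is None when no candidate clears the cutoff.
    The cutoff of 0.6 is an arbitrary, hypothetical default.
    """
    missing = unmatched_ids(scoring_ids, split_ids)
    candidates = unmatched_ids(split_ids, scoring_ids)
    suggestions = {}
    for dms_id in missing:
        close = get_close_matches(dms_id, candidates, n=1, cutoff=cutoff)
        suggestions[dms_id] = close[0] if close else None
    return suggestions


if __name__ == "__main__":
    # Toy input for illustration only; SHARED_ID is a placeholder.
    scoring = ["ARGR_ECOLI_Rocklin_2023_1AOY", "SHARED_ID"]
    splits = ["ARGR_ECOLI_Tsuboyama_2023_1AOY", "SHARED_ID"]
    print(suggest_matches(scoring, splits))
```

In practice the two ID lists would be read from the DMS_id values of the scoring file and the split files. String similarity only narrows down the candidates, so it complements manual inspection but does not replace it.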
Hi, I am adding a minor question in this thread:
The UniProt ID for |
Hi @agitter @BarKetPlace @brycejoh16 - thank you so much for the feedback!
Thanks for the excellent resource and for making all the data so easily accessible. While combing through the CSV files, we noticed a few possible inconsistencies that we wanted to ask about.
DMS ID
In reference_files/DMS_substitutions.csv, datasets from the mega-scale stability experiment are named after Tsuboyama, e.g. ARGR_ECOLI_Tsuboyama_2023_1AOY. That is also the convention in benchmarks/DMS_zero_shot/substitutions/Spearman/DMS_substitutions_Spearman_DMS_level.csv. However, in benchmarks/DMS_supervised/substitutions/Spearman/DMS_substitutions_Spearman_DMS_level.csv they are named after Rocklin, e.g. ARGR_ECOLI_Rocklin_2023_1AOY.
DOI
The same mega-scale study appears to have multiple journal DOIs listed in the jo column of reference_files/DMS_substitutions.csv. The first, 10.1038/s41586-023-06328-6, is correct, but the following entries incorrectly increment the final digit, e.g. 10.1038/s41586-023-06328-7, 10.1038/s41586-023-06328-8.
Selection type
In https://marks.hms.harvard.edu/proteingym/DMS_supervised_substitutions_scores.zip, the targets column has the value fitness or fitness_unsupervised_prediction for all rows. Some of these assays have other selection types in reference_files/DMS_substitutions.csv.
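For the DOI issue described above, a minimal sketch of one possible check is to group DOIs that are identical except for their trailing number, which is the pattern a spreadsheet auto-fill would produce. The function name `find_incremented_dois` is hypothetical, and the check is only a heuristic: a flagged group still has to be compared against the publisher's record, since the code cannot tell which DOI in a group is the correct one.

```python
import re
from collections import defaultdict

# Split a DOI into everything before its trailing digits and the digits.
_TRAILING_NUMBER = re.compile(r"^(.*?)(\d+)$")


def find_incremented_dois(dois):
    """Group DOIs that share everything except their trailing number.

    Returns {stem: sorted trailing numbers} for each stem that occurs
    with more than one distinct trailing number. DOIs that do not end
    in a digit are ignored.
    """
    groups = defaultdict(set)
    for doi in dois:
        match = _TRAILING_NUMBER.match(doi.strip())
        if match:
            stem, number = match.groups()
            groups[stem].add(int(number))
    return {stem: sorted(nums) for stem, nums in groups.items() if len(nums) > 1}


if __name__ == "__main__":
    # The three DOIs quoted in the issue text.
    example = [
        "10.1038/s41586-023-06328-6",
        "10.1038/s41586-023-06328-7",
        "10.1038/s41586-023-06328-8",
    ]
    print(find_incremented_dois(example))
```

The same function could be applied to the values of the jo column after loading reference_files/DMS_substitutions.csv with any CSV reader.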