Possible inconsistencies with DMS ID, DOI, and selection type #26

agitter · 2024-04-17T16:45:03Z

Thanks for the excellent resource and making all the data so easily accessible. While combing through the csv files, we noticed a few possible inconsistencies I wanted to ask about.

DMS ID

In reference_files/DMS_substitutions.csv, datasets from the mega-scale stability experiment are named after Tsuboyama, e.g. ARGR_ECOLI_Tsuboyama_2023_1AOY. The same convention is used in benchmarks/DMS_zero_shot/substitutions/Spearman/DMS_substitutions_Spearman_DMS_level.csv. However, in benchmarks/DMS_supervised/substitutions/Spearman/DMS_substitutions_Spearman_DMS_level.csv they are named after Rocklin, e.g. ARGR_ECOLI_Rocklin_2023_1AOY.
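For anyone who wants to check this kind of naming mismatch, here is a minimal sketch that lists IDs differing from the reference only by the author token. The function name `find_author_renames` and the placeholder ID `EXAMPLE_PROT_Author_2020` are hypothetical, and the ID lists are typed in by hand rather than read from the ProteinGym files.

```python
def find_author_renames(reference_ids, other_ids, old="Rocklin", new="Tsuboyama"):
    """Map IDs in other_ids to reference IDs that differ only by the author token.

    Assumes IDs embed the author as an underscore-delimited token,
    e.g. ARGR_ECOLI_Rocklin_2023_1AOY.
    """
    reference = set(reference_ids)
    renames = {}
    for dms_id in other_ids:
        if dms_id in reference:
            continue  # already consistent with the reference file
        candidate = dms_id.replace(f"_{old}_", f"_{new}_")
        if candidate != dms_id and candidate in reference:
            renames[dms_id] = candidate
    return renames


# Toy inputs: the ARGR_ECOLI IDs come from this issue, the other ID is a placeholder.
reference_ids = ["ARGR_ECOLI_Tsuboyama_2023_1AOY", "EXAMPLE_PROT_Author_2020"]
supervised_ids = ["ARGR_ECOLI_Rocklin_2023_1AOY", "EXAMPLE_PROT_Author_2020"]

renames = find_author_renames(reference_ids, supervised_ids)
print(renames)
```

In practice the two lists would be read from the ID columns of the respective csv files. Any ID that is neither in the reference nor resolved by the rename would still need manual inspection.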

DOI

The same mega-scale study appears to have multiple journal DOIs listed in the jo column of reference_files/DMS_substitutions.csv. The first, 10.1038/s41586-023-06328-6, is correct, but the subsequent entries incorrectly increment the final digit, e.g. 10.1038/s41586-023-06328-7 and 10.1038/s41586-023-06328-8.
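One way to surface this pattern is to group DOIs by everything except their trailing digits and flag groups with more than one distinct value. This is only a sketch: `doi_stem` and `suspicious_doi_groups` are hypothetical helper names, the input list is typed in by hand, and `10.0000/example-doi-1` is a placeholder rather than a real DOI.

```python
import re
from collections import defaultdict


def doi_stem(doi):
    """Strip the trailing run of digits so that ...06328-6 and ...06328-7 share a stem."""
    return re.sub(r"\d+$", "", doi.strip())


def suspicious_doi_groups(dois):
    """Return stems that map to more than one distinct DOI."""
    groups = defaultdict(set)
    for doi in dois:
        groups[doi_stem(doi)].add(doi.strip())
    return {stem: sorted(values) for stem, values in groups.items() if len(values) > 1}


# The three 10.1038 DOIs are the ones quoted in this issue.
dois = [
    "10.1038/s41586-023-06328-6",
    "10.1038/s41586-023-06328-7",
    "10.1038/s41586-023-06328-8",
    "10.1038/s41586-023-06328-6",
    "10.0000/example-doi-1",  # placeholder, not a real DOI
]

flagged = suspicious_doi_groups(dois)
print(flagged)
```

Distinct papers can legitimately have DOIs that differ only in the last digits, so a flagged group is a prompt for manual review and not proof of an error.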

Selection type

In https://marks.hms.harvard.edu/proteingym/DMS_supervised_substitutions_scores.zip the targets column has the value fitness or fitness_unsupervised_prediction for all rows, even though some of these assays have other selection types listed in reference_files/DMS_substitutions.csv.
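A rough way to see the mismatch is to collect, for each value of targets, the selection types that the reference file lists for the same assays. The sketch below runs on tiny in-memory tables. The column names `DMS_id` and `selection_type` are assumptions about the file layout, the rows are illustrative and not copied from the real files, and `EXAMPLE_ASSAY_Author_2020` is a placeholder ID.

```python
import csv
import io

# Illustrative stand-ins for the two files; column names are assumed.
REFERENCE_CSV = """DMS_id,selection_type
ARGR_ECOLI_Tsuboyama_2023_1AOY,Stability
EXAMPLE_ASSAY_Author_2020,OrganismalFitness
"""

SCORES_CSV = """DMS_id,targets
ARGR_ECOLI_Tsuboyama_2023_1AOY,fitness
EXAMPLE_ASSAY_Author_2020,fitness
"""


def selection_types_by_targets(reference_file, scores_file):
    """For each targets value, collect the reference selection types of its assays."""
    selection = {
        row["DMS_id"]: row["selection_type"] for row in csv.DictReader(reference_file)
    }
    seen = {}
    for row in csv.DictReader(scores_file):
        # Assays missing from the reference table are grouped under UNKNOWN.
        seen.setdefault(row["targets"], set()).add(
            selection.get(row["DMS_id"], "UNKNOWN")
        )
    return seen


result = selection_types_by_targets(io.StringIO(REFERENCE_CSV), io.StringIO(SCORES_CSV))
print(result)
```

A single targets value that maps to several selection types, as in this toy input, is the situation described above.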

pascalnotin · 2024-04-23T02:14:39Z

Hi Anthony - thank you very much for flagging all of these, we will fix them all in the next update!

brycejoh16 · 2024-05-17T20:43:05Z

Hi @pascalnotin ,

I ended up manually building a mapping between the DMS_ids in the scoring file, https://marks.hms.harvard.edu/proteingym/DMS_supervised_substitutions_scores.csv, and the DMS_ids used to represent the sequences in the cross-validation splits, https://marks.hms.harvard.edu/proteingym/cv_folds_singles_substitutions.zip

I made the mapping by manually inspecting which DMS_id in one file appeared to correspond to which DMS_id in the other.
Please check that each DMS_id in the scoring file corresponds to the correct DMS_id in the splits zip file.

Although both contain 217 unique DMS_ids, two DMS_ids in the scoring file had no obvious counterpart among the DMS_ids in the splits zip file:

HXK4_HUMAN_Gersing_2022
B3VI55_LIPST_Klesmith_2015

Thanks again for providing and maintaining ProteinGym. It is a great reference and a good way to discover new DMS datasets and explore models.

Here is the file of the 88 DMS_ids that are in the scoring file but not in the cross-validation splits zip file, with my best-guess mapping for each one. If you use it, please verify that the mappings are correct!

missed_dms_ids.csv
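The mapping in the attached file was made by hand. As a cross-check, a fuzzy string match could propose candidates automatically. The sketch below uses `difflib.get_close_matches` from the Python standard library. The function name `propose_mapping`, the 0.6 cutoff, and the toy ID lists are assumptions for illustration, including which of the two files uses which naming.

```python
import difflib


def propose_mapping(scoring_ids, split_ids, cutoff=0.6):
    """Suggest a split-file ID for every scoring-file ID that has no exact match.

    Returns {scoring_id: best_candidate_or_None}. Suggestions are guesses
    based on string similarity only and need manual verification.
    """
    scoring_set = set(scoring_ids)
    split_set = set(split_ids)
    unmatched = sorted(scoring_set - split_set)
    candidates = sorted(split_set - scoring_set)
    mapping = {}
    for dms_id in unmatched:
        close = difflib.get_close_matches(dms_id, candidates, n=1, cutoff=cutoff)
        mapping[dms_id] = close[0] if close else None
    return mapping


# Toy inputs; SHARED_ID_Author_2020 is a placeholder present in both lists.
scoring_ids = ["ARGR_ECOLI_Rocklin_2023_1AOY", "SHARED_ID_Author_2020"]
split_ids = ["ARGR_ECOLI_Tsuboyama_2023_1AOY", "SHARED_ID_Author_2020"]

mapping = propose_mapping(scoring_ids, split_ids)
print(mapping)
```

IDs that come back as `None` are the ones with no obvious counterpart. Even a confident-looking suggestion can pair the wrong assays when several IDs share a protein name, so the output is best treated as a starting point for review.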

BarKetPlace · 2024-07-10T11:41:20Z

Hi, I am adding a minor question in this thread:
In reference_files/DMS_substitutions.csv, KCNJ2_MOUSE is listed as Human/Homo Sapiens.
Is this correct?
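A quick screen for similar cases is to compare the species suffix of each UniProt entry name with the organism listed next to it. The sketch below is only an illustration: `organism_mismatches` is a hypothetical helper, the suffix table covers just two species, and the rows are typed in by hand rather than read from the csv.

```python
# UniProt entry names end in a species mnemonic; only two are covered here.
MNEMONIC_TO_SPECIES = {"HUMAN": "Homo sapiens", "MOUSE": "Mus musculus"}


def organism_mismatches(rows):
    """Return (uniprot_id, listed, expected) where the listed organism disagrees."""
    mismatches = []
    for uniprot_id, listed in rows:
        suffix = uniprot_id.rsplit("_", 1)[-1]
        expected = MNEMONIC_TO_SPECIES.get(suffix)
        # Unknown suffixes are skipped instead of being guessed.
        if expected and expected.lower() != listed.strip().lower():
            mismatches.append((uniprot_id, listed, expected))
    return mismatches


# The first row mirrors the entry asked about above; the second is a consistent control.
rows = [("KCNJ2_MOUSE", "Homo sapiens"), ("HXK4_HUMAN", "Homo sapiens")]

mismatches = organism_mismatches(rows)
print(mismatches)
```

A flagged row is not necessarily an error, so it only marks an entry worth checking against the original publication.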

agitter · 2024-09-13T20:40:55Z

The UniProt ID for PSAE_SYNP2_Tsuboyama_2023_1PSE was updated. https://www.uniprot.org/uniprot/PSAE_SYNP2 now redirects to https://www.uniprot.org/uniprotkb/P31969/entry with ID PSAE_PICP2 (see the history tab).

pascalnotin · 2024-10-02T04:06:15Z

Hi @agitter @BarKetPlace @brycejoh16 - thank you so much for the feedback!
Just fixed the reference files and updated zero-shot model performance accordingly in the latest commit. We will address the few typos re: supervised baselines in a separate commit.

pascalnotin mentioned this issue on Jul 3, 2024, in "Possible missing data of benchmarking supervised performance" #36 (open).
