
What was the pretraining split for the ProtT5-UniRef50 model? #151

Open
speydril opened this issue Jun 12, 2024 · 1 comment

Comments

@speydril commented Jun 12, 2024

Do you by any chance still have the dataset split (train/val/test set) that was used to pretrain ProtT5-UniRef50? I am trying to investigate data leakage for downstream tasks.

@speydril speydril changed the title What was the training split for the ProtT5-UniRef50 model? What was the pretraining split for the ProtT5-UniRef50 model? Jun 12, 2024
@mheinzinger (Collaborator)

Hi, no, unfortunately we no longer have the data splits for this, as we considered downstream prediction performance the acid test. Looking back, this was obviously a mistake.
To still move forward on your end, you could take a time cut-off of UniRef, i.e., extract all sequences published after ProtT5 training, and redundancy-reduce the newly added sequences against our training set (which will be a pain, sorry, as we also trained on BFD ... ).
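The suggested workflow can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the cutoff date, the k-mer overlap heuristic, and all function names are hypothetical stand-ins (in practice a tool such as MMseqs2 or CD-HIT would be used for the redundancy reduction, and the real ProtT5 training cut-off date would have to be looked up).

```python
# Hypothetical sketch: keep only sequences deposited after the (placeholder)
# training cut-off, then drop any that look redundant against an approximate
# training set. The k-mer overlap below is a crude stand-in for the sequence
# identity a real tool (e.g. MMseqs2) would compute.
from datetime import date

def kmer_set(seq, k=3):
    """Set of overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def is_redundant(seq, train_seqs, k=3, threshold=0.8):
    """True if seq shares >= threshold of its k-mers with any training sequence."""
    kmers = kmer_set(seq, k)
    if not kmers:
        return False
    for train_seq in train_seqs:
        overlap = len(kmers & kmer_set(train_seq, k)) / len(kmers)
        if overlap >= threshold:
            return True
    return False

def build_eval_set(candidates, train_seqs, cutoff=date(2020, 6, 1)):
    """candidates: iterable of (sequence, deposit_date) pairs.

    Returns sequences deposited after the cutoff that are not redundant
    with the training set. The default cutoff is a placeholder, not the
    actual ProtT5 training cut-off.
    """
    return [seq for seq, deposited in candidates
            if deposited > cutoff and not is_redundant(seq, train_seqs)]
```

A sequence identical to one in the training set is filtered out, as is anything deposited before the cut-off; only genuinely new, dissimilar sequences survive for leakage-free downstream evaluation.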
