In this study, we applied the deep sequence model UDSMProt to two new protein classification tasks:
- predict proteins with liquid-liquid phase separation propensity
- predict synaptic proteins
Our results show that, without prior domain knowledge and based only on protein sequences, the fine-tuned language models achieved high classification accuracy and outperformed baseline models using compositional k-mer features in both tasks. For details of this work, please refer to our paper "Deep sequence representation learning for predicting human proteins with liquid-liquid phase separation propensity and synaptic functions" (Wei and Wang, 2022).
Please refer to the original repository of UDSMProt for detailed information.
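For readers unfamiliar with compositional k-mer features, the following is a minimal, generic sketch of how such features can be computed from a protein sequence. It is not taken from the paper: the function name `kmer_composition`, the restriction to the 20 standard amino acids, and the normalization to frequencies are all assumptions made for illustration, and the baselines in the paper may have been built differently.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids; k-mers containing other symbols (e.g. X, U)
# are ignored in this sketch. This is an illustrative choice, not the paper's.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_composition(sequence, k=2):
    """Return normalized k-mer frequencies of a protein sequence as a dict."""
    sequence = sequence.upper()
    # Enumerate every possible k-mer so all sequences share one feature order.
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    # Count overlapping k-mers with a sliding window of width k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Only k-mers made of standard residues contribute to the total.
    total = sum(counts[kmer] for kmer in vocabulary)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in vocabulary}


# Toy sequence for illustration only, not a real protein.
features = kmer_composition("MKVLAAGIVK", k=1)
```

The resulting fixed-length vector (20 values for k=1, 400 for k=2) can be passed to any standard classifier, which is what makes this kind of feature a common sequence-only baseline.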
Users are welcome to use the fine-tuned models in both learning tasks for comparisons in their own research.
Here, we provide one example to show the application of the fine-tuned UDSM-LLPS models in the first learning task. As stated in our paper, in addition to LLPSDB and PhaSepDB data, we also evaluated the performance of UDSM-LLPS on another well-known database, DrLLPS. DrLLPS is currently the most comprehensive database, with the largest collection of LLPS-associated proteins in 164 eukaryotes. In DrLLPS, LLPS-associated proteins can be browsed by three LLPS types:
- scaffolds, proteins that can drive or undergo LLPS;
- clients, proteins that can be recruited by scaffolds for the formation of biomolecular condensates;
- regulators, proteins that have not been identified to undergo LLPS but shown to be involved in regulating LLPS behaviors.

Files used in this example:
- DrLLPS data: `task_1/application/DrLLPS_data.csv` stores 3,627 reviewed human LLPS-associated proteins categorized by the three types, consisting of 100 scaffolds, 2,998 clients, and 529 regulators.
- Fine-tuned UDSM-LLPS models: `UDSM-LLPS_Random.pkl` and `UDSM-LLPS_UniRef.pkl` under `task_1/`
- Utils file: `model_utils.py`, downloaded from the original UDSMProt repository
- Token file: `tok_itos.npy`

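Before running predictions, it can be useful to confirm that the DrLLPS file matches the counts stated above (100 scaffolds, 2,998 clients, 529 regulators). The sketch below shows one way to do this with the standard library. The column name `type` and the label spellings `scaffold`, `client`, and `regulator` are assumptions, since the actual header and labels of `DrLLPS_data.csv` are not described here; adjust them to the real file.

```python
import csv
import io
from collections import Counter

# Counts stated in this README for DrLLPS_data.csv.
# The label spellings used as keys are assumed, not taken from the file.
EXPECTED_COUNTS = {"scaffold": 100, "client": 2998, "regulator": 529}


def count_llps_types(csv_file, type_column="type"):
    """Count rows per LLPS type; `type_column` is an assumed column name."""
    reader = csv.DictReader(csv_file)
    # Normalize case and whitespace so "Client" and "client " count together.
    return Counter(row[type_column].strip().lower() for row in reader)


# Toy stand-in for the real file, only to show the call shape.
toy_csv = io.StringIO("id,type\nP1,Scaffold\nP2,Client\nP3,client\nP4,Regulator\n")
toy_counts = count_llps_types(toy_csv)

# With the real file, one would open it and compare against EXPECTED_COUNTS:
# with open("task_1/application/DrLLPS_data.csv", newline="") as fh:
#     assert count_llps_types(fh) == EXPECTED_COUNTS
```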
Please see the two Jupyter Notebooks under `task_1/application/` for detailed steps:
1. Predict LLPS propensity of DrLLPS data.ipynb
2. UDSM-LLPS prediction results on DrLLPS data.ipynb
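Once per-protein prediction scores are available, a natural follow-up is to compare them across the three LLPS types. The sketch below is a hypothetical illustration and does not reproduce the notebooks: the function `positive_rate_by_type`, the `(type, score)` record layout, and the 0.5 threshold are all assumptions, and the scores shown are made up.

```python
from collections import defaultdict


def positive_rate_by_type(records, threshold=0.5):
    """Fraction of proteins per LLPS type with score >= threshold.

    `records` is an iterable of (llps_type, score) pairs; this layout and
    the default threshold are illustrative assumptions.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for llps_type, score in records:
        totals[llps_type] += 1
        if score >= threshold:
            positives[llps_type] += 1
    return {t: positives[t] / totals[t] for t in totals}


# Made-up scores purely for illustration; these are not results from the paper.
toy_records = [
    ("scaffold", 0.9),
    ("scaffold", 0.4),
    ("client", 0.7),
    ("regulator", 0.2),
]
rates = positive_rate_by_type(toy_records)
```

Reporting the rate per type rather than a single overall number matters here because the three groups are very unequal in size, so an overall rate would be dominated by clients.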