Domain-PFP is a self-supervised method to learn functional representations of protein domains that can be used for protein function prediction.
License: GPL v3. (If you are interested in a different license, for example, for commercial use, please contact us.)
Contact: Daisuke Kihara (dkihara@purdue.edu)
For technical problems or questions, please reach out to Nabil Ibtehaz (nibtehaz@purdue.edu).
Ibtehaz, N., Kagaya, Y. & Kihara, D. Domain-PFP allows protein function prediction using function-aware domain embedding representations. Commun Biol 6, 1103 (2023). https://doi.org/10.1038/s42003-023-05476-9
https://bit.ly/domain-pfp-colab
Domains are functional and structural units of proteins that govern various biological functions performed by the proteins. Therefore, characterizing the domains in a protein can serve as a proper functional representation of the protein. Here, we employ a self-supervised protocol to derive functionally consistent representations for domains by learning domain-Gene Ontology (GO) co-occurrences and associations. Domain embeddings constructed with the self-supervised protocol learned functional associations, which turned out to be effective in actual function prediction tasks. An extensive evaluation shows that protein representations using the domain embeddings are superior to those of large-scale protein language models in GO prediction tasks. Moreover, the new function prediction method, Domain-PFP, significantly outperformed the state-of-the-art function predictors. Notably, Domain-PFP achieved an increase in area under the precision-recall curve of 2.43%, 14.58% and 9.57% over the state-of-the-art method for molecular function (MF), biological process (BP) and cellular component (CC), respectively. Moreover, Domain-PFP demonstrated competitive performance in the CAFA3 evaluation, achieving overall the best performance among the top teams that participated in the assessment.
Overview of Domain-PFP.
- The network architecture used for self-supervised learning of domain embeddings.
- The overall pipeline of learning the functionally aware domain embeddings.
- The steps of computing the embeddings of a protein and inferring the functions.
Python 3.9 : https://www.python.org/downloads/
1. Install git
git clone https://github.com/kiharalab/Domain-PFP && cd Domain-PFP
You have two options to install the dependencies on your computer:
3.1.1 Install pip.
pip3 install -r requirements.txt --user
If you encounter any errors, you can install each library one by one:
pip3 install numpy==1.23.5
pip3 install tqdm==4.64.1
pip3 install scipy==1.9.3
pip3 install matplotlib==3.6.2
pip3 install matplotlib-inline==0.1.6
pip3 install pandas==1.5.2
pip3 install seaborn==0.12.1
pip3 install torch==1.13.0
pip3 install tabulate==0.9.0
pip3 install scikit-learn==1.2.0
pip3 install click==8.0.3
Installing the dependencies requires only a few minutes on a standard desktop computer.
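After installation, it can be useful to confirm that the installed versions match the pinned ones. The following is a minimal sketch, not part of the Domain-PFP code base: the `PINNED` table is copied from the pip commands above, and the helper name `find_mismatches` is hypothetical.

```python
from importlib import metadata

# Pinned versions copied from the pip commands listed above.
PINNED = {
    "numpy": "1.23.5",
    "tqdm": "4.64.1",
    "scipy": "1.9.3",
    "matplotlib": "3.6.2",
    "matplotlib-inline": "0.1.6",
    "pandas": "1.5.2",
    "seaborn": "0.12.1",
    "torch": "1.13.0",
    "tabulate": "0.9.0",
    "scikit-learn": "1.2.0",
    "click": "8.0.3",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for missing or differing packages.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

Run it inside the environment you installed into; no output means every pinned package was found at the expected version.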
3.2.1 Install conda.
conda create -n domainpfp python=3.9
conda activate domainpfp
pip3 install -r requirements.txt
Each time you want to run this code, activate the environment with:
conda activate domainpfp
conda deactivate  # run this when you want to exit the environment
Please download and unzip the data.zip and saved_models.zip files. Optionally, you may also download our BLAST and PPI database (blast_ppi_database.zip) if you wish to use BLAST or PPI in your prediction.
https://kiharalab.org/domainpfp/
wget https://kiharalab.org/domainpfp/data.zip
unzip data.zip
wget https://kiharalab.org/domainpfp/saved_models.zip
unzip saved_models.zip
wget https://kiharalab.org/domainpfp/blast_ppi_database.zip
unzip blast_ppi_database.zip
Our implementation of Domain-PFP is provided in the DomainPFP directory.
All the code to run the experiments presented in the paper is provided in the /experiments directory.
The result files of the CAFA3 and PROBE benchmarks, generated using the official evaluation tool, are provided in the /results directory.
Here we provide the following functionalities:
You can use DomainGO_prob to calculate the association probability of a domain and a GO term by providing the InterPro domain ID and the GO term.
python3 domaingo_prob.py:
--domain input InterPro domain
--GO input GO term
python3 domaingo_prob.py --domain IPR000003 --GO GO:0006355
This usually takes <2 minutes to run.
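When querying many domain/GO pairs, it may help to validate the identifiers before launching the script. The sketch below is an illustration rather than part of Domain-PFP: it relies only on the standard accession formats (InterPro entries are `IPR` followed by six digits, GO terms are `GO:` followed by seven digits) and on the `--domain` and `--GO` options shown above. The helper name `build_domaingo_command` is hypothetical.

```python
import re

# Standard accession formats: IPR + 6 digits, GO: + 7 digits.
INTERPRO_RE = re.compile(r"IPR\d{6}")
GO_RE = re.compile(r"GO:\d{7}")


def build_domaingo_command(domain, go_term):
    """Validate the identifiers and return the argument list for domaingo_prob.py."""
    if not INTERPRO_RE.fullmatch(domain):
        raise ValueError(f"not an InterPro accession: {domain!r}")
    if not GO_RE.fullmatch(go_term):
        raise ValueError(f"not a GO term: {go_term!r}")
    return ["python3", "domaingo_prob.py", "--domain", domain, "--GO", go_term]


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(cmd, check=True) from the repository root to execute it.
    print(" ".join(build_domaingo_command("IPR000003", "GO:0006355")))
```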
You can use Domain-PFP to compute a functionally aware embedding representation of a protein by providing the protein ID or the path to a FASTA file. You also need to provide the path to the savefile, where the embedding will be saved as a pickle file.
python3 compute_embeddings.py:
--protein UniProt ID of protein
--fasta Or provide the fasta file path
--savefile Path to save the protein embeddings (as pickle file)
(default: emb.p)
python3 compute_embeddings.py --protein Q6NYN7 --savefile emb_Q6NYN7.p
This usually takes <5 minutes to run, depending on the availability of InterProScan server.
Note: If you wish to use this representation as a feature for a functionally relevant downstream task, please consider applying proper normalization.
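As an illustration of the normalization note, the sketch below loads a saved embedding and applies L2 normalization. Both choices are assumptions: the README does not specify which normalization is appropriate, and the code assumes the pickle unpickles to a flat sequence of numbers (for example a 1-D array). Inspect the loaded object first if your file has a different structure. The helper names are hypothetical.

```python
import math
import pickle


def load_embedding(path):
    """Load a pickled embedding and return it as a list of floats.

    Assumption: the pickle holds a flat, iterable sequence of numbers.
    Only unpickle files you created yourself or otherwise trust.
    """
    with open(path, "rb") as fh:
        obj = pickle.load(fh)
    return [float(x) for x in obj]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit Euclidean length; near-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm < eps:
        return list(vec)
    return [x / norm for x in vec]
```

For example, `l2_normalize(load_embedding("emb_Q6NYN7.p"))` would give a unit-length version of the embedding saved by the command above, provided the file matches the stated assumption.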
You can use Domain-PFP to predict the functions of a protein by providing either the protein ID or the path to a FASTA file.
python3 predict_functions.py:
--protein UniProt ID of protein
--fasta Or provide the fasta file path
--threshMFO Threshold for MFO prediction (default: 0.36)
--threshBPO Threshold for BPO prediction (default: 0.31)
--threshCCO Threshold for CCO prediction (default: 0.36)
--blast_flag Optional flag to use DiamondBlast for function prediction
(DiamondBlast needs to be installed and assigned to path)
--diamond_path Path to Diamond Blast (by default the colab release path is provided)
(default='/content/Domain-PFP/diamond')
--ppi_flag Optional flag to use String PPI for function prediction
(Only works for UniProt IDs or properly formatted FASTA files)
--outfile Path to the output csv file (optional)
python3 predict_functions.py --protein Q6NYN7 --outfile sample_functions/Q6NYN7_functions.csv
python3 predict_functions.py --fasta sample_protein/Q6NYN7.fasta --outfile sample_functions/Q6NYN7_functions.csv
python3 predict_functions.py --fasta sample_protein/Q6NYN7.fasta --threshCCO 0.5 --outfile sample_functions/Q6NYN7_functions.csv
python3 predict_functions.py --fasta sample_protein/Q6NYN7.fasta --threshCCO 0.5 --outfile sample_functions/Q6NYN7_functions.csv --blast_flag --ppi_flag
This usually takes <5 minutes to run, depending on the availability of InterProScan server.
(Note: we recommend using our Google Colab release, https://bit.ly/domain-pfp-colab, to avoid issues with DiamondBlast installation.)
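To predict functions for several proteins, the commands above can be generated programmatically. The following is a minimal sketch that uses only the options documented above; the helper name `build_predict_command`, its parameters, and the `<ID>_functions.csv` naming pattern (borrowed from the sample commands) are illustrative assumptions.

```python
from pathlib import Path


def build_predict_command(protein_id, out_dir, thresh_cco=None,
                          use_blast=False, use_ppi=False):
    """Return the argument list for one predict_functions.py run."""
    cmd = ["python3", "predict_functions.py", "--protein", protein_id]
    if thresh_cco is not None:
        cmd += ["--threshCCO", str(thresh_cco)]
    # Name the output file after the protein, as in the sample commands.
    out_path = Path(out_dir) / f"{protein_id}_functions.csv"
    cmd += ["--outfile", out_path.as_posix()]
    if use_blast:
        cmd.append("--blast_flag")
    if use_ppi:
        cmd.append("--ppi_flag")
    return cmd


if __name__ == "__main__":
    # Print the commands; to execute one, pass the list to
    # subprocess.run(cmd, check=True) from the repository root.
    for pid in ["Q6NYN7"]:
        print(" ".join(build_predict_command(pid, "sample_functions")))
```

Since each run depends on the availability of the InterProScan server, running the commands one after another rather than in parallel is probably the safer default.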
Protein sequence in FASTA format.
Our example input can be found in the sample_protein directory.
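Before submitting a file, a quick structural check can catch malformed input. This is a generic FASTA reader written for illustration, not the parser used by Domain-PFP; the function name `read_fasta` is hypothetical.

```python
def read_fasta(path):
    """Return a list of (header, sequence) tuples from a FASTA file.

    Raises ValueError if sequence data appears before the first '>' header.
    """
    records = []
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is None:
                raise ValueError("sequence data found before the first '>' header")
            else:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Calling `read_fasta("sample_protein/Q6NYN7.fasta")` should return one record per sequence in the file, with wrapped sequence lines joined into a single string.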
Predicted functions for the protein in CSV format.
Our example output can be found in the sample_functions directory.
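The output file can be inspected with Python's standard `csv` module. The sketch below makes no assumption about column names, since the README does not list them: it returns whatever header the file contains along with the rows. The function name `read_predictions` is hypothetical.

```python
import csv


def read_predictions(path):
    """Return (column_names, rows) from a predictions CSV file.

    Each row is a dict keyed by the column names found in the file's header.
    """
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        rows = list(reader)
        columns = list(reader.fieldnames or [])
    return columns, rows
```

Printing `columns` for a file such as `sample_functions/Q6NYN7_functions.csv` is a simple way to discover the actual layout before filtering or sorting the rows.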