- We recommend using a new Conda environment!
- Install the PROVAL framework:
pip install -e .[all]
- (Optional) Install the Smith-Waterman alignment library:
git clone https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library.git
cd Complete-Striped-Smith-Waterman-Library/src
make
Integration into embeddings.py
- Load pretrained model
- Add a function to embedding_utils.py that takes the train and test sequences as lists of Bio sequences (see read_fasta() in utils.py) and returns the vectors in a dictionary of the form id (string): vector (NumPy array)
- Add the approach to the embedding list (embeddings.py, line 17)
- Add the embedding function call to the if/elif statements, following the form of the existing calls
- Run embeddings.py and the respective comparison scripts
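As an illustration of the function shape described above, here is a minimal sketch. The embedding itself (relative amino-acid frequencies), the function name composition_embedding and the Record stand-in are hypothetical and not part of PROVAL. The sketch assumes that each sequence object exposes id and seq attributes, as Biopython records do, and that a single dictionary covering both the train and test sequences is expected.

```python
from collections import namedtuple

import numpy as np

# Stand-in for a Biopython record; only the `id` and `seq` attributes are used.
Record = namedtuple("Record", ["id", "seq"])

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_embedding(train_seqs, test_seqs):
    """Hypothetical embedding: relative amino-acid frequencies (20 dimensions).

    Returns one dictionary id (str) -> vector (NumPy array) for all sequences.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vectors = {}
    for record in list(train_seqs) + list(test_seqs):
        counts = np.zeros(len(AMINO_ACIDS), dtype=np.float64)
        for residue in str(record.seq).upper():
            if residue in index:  # skip non-standard residues such as X or B
                counts[index[residue]] += 1
        total = counts.sum()
        # Normalise to frequencies; an empty sequence keeps the zero vector.
        vectors[str(record.id)] = counts / total if total > 0 else counts
    return vectors


# Toy usage with made-up records.
train = [Record("P1", "ACDA")]
test = [Record("P2", "WWXY")]
vectors = composition_embedding(train, test)
```

A real integration would replace the frequency computation with the call to the pretrained model and would be registered in embeddings.py as described in the steps above.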
Custom integration through vector file
- Load the train and test sequences as lists of Bio sequences (see read_fasta() in utils.py)
- Use the custom embedding to predict the embedding vector for each sequence, stored in the dictionary format id (string): vector (NumPy array)
- Truncate the vectors to d=100 if necessary (see embeddings.py)
- Save the dictionary as a pickle '.p' file (see embeddings.py)
Note: the extraction of the vectors and the results might not be fully deterministic, so small deviations are possible.
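The truncation and saving steps could look roughly like the sketch below. It assumes that truncating means keeping the first d components of each vector, which should be checked against embeddings.py. The helper names truncate_vectors and save_vectors and the file name custom_embedding.p are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np


def truncate_vectors(vectors, d=100):
    """Keep at most the first d components of every vector (assumed convention)."""
    return {seq_id: np.asarray(vec)[:d] for seq_id, vec in vectors.items()}


def save_vectors(vectors, path):
    """Write the id -> vector dictionary to a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(vectors, handle)


# Usage with random stand-in vectors and a temporary output directory.
rng = np.random.default_rng(0)
vectors = {"P1": rng.normal(size=128), "P2": rng.normal(size=64)}
truncated = truncate_vectors(vectors)
out_path = os.path.join(tempfile.mkdtemp(), "custom_embedding.p")
save_vectors(truncated, out_path)
```

Vectors shorter than d are left unchanged by this sketch. Whether they need padding depends on the comparison scripts.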
Data set (optional)
Steps to reproduce the test.fasta and train.fasta files in the data/ folder:
- Download the full SwissProt data set (release 02/2021): https://ftp.uniprot.org/pub/databases/uniprot/previous_major_releases/release-2021_02/
- Select the sequence IDs, the sequence strings and the molecular function information ('GO:xxxxxx' terms)
- Discard all sequences with more than one molecular function (to reduce the complexity of the experiments)
- Select 1000 random sequences for each of the most frequent 15 molecular functions (=15,000 sequences)
- Randomly split the sequences into training and test sets (70:30)
- Save the sequences in the .fasta format, compare the test.fasta and train.fasta files in the data folder:
<Sequence ID> [<GO-ID>]
<Sequence>
<Sequence ID> [<GO-ID>]
<Sequence>
...
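The filtering, sampling and splitting steps above can be sketched as follows. This is not the script used to create the published files. The function names, the (id, sequence, GO terms) tuple layout and the random seed are assumptions, and parsing the SwissProt release is left out. The sketch also assumes that the most frequent functions are counted after discarding multi-function sequences, and it writes the standard FASTA '>' marker in front of each header, so compare the output with the files in the data folder.

```python
import random
from collections import Counter


def build_splits(entries, n_functions=15, n_per_function=1000,
                 train_fraction=0.7, seed=0):
    """entries: iterable of (sequence_id, sequence, go_terms) with go_terms a list."""
    rng = random.Random(seed)
    # Keep sequences with exactly one molecular function
    # (this also drops sequences without any annotation).
    single = [(sid, seq, gos[0]) for sid, seq, gos in entries if len(gos) == 1]
    # Determine the most frequent molecular functions.
    counts = Counter(go for _, _, go in single)
    top = [go for go, _ in counts.most_common(n_functions)]
    # Draw a random sample of equal size for each selected function.
    selected = []
    for go in top:
        group = [entry for entry in single if entry[2] == go]
        selected.extend(rng.sample(group, min(n_per_function, len(group))))
    # Shuffle, then split into training and test sets.
    rng.shuffle(selected)
    cut = int(round(train_fraction * len(selected)))
    return selected[:cut], selected[cut:]


def write_fasta(entries, path):
    """Write records as a '<Sequence ID> [<GO-ID>]' header line plus sequence line."""
    with open(path, "w") as handle:
        for sid, seq, go in entries:
            handle.write(f">{sid} [{go}]\n{seq}\n")


# Toy usage with made-up entries (real input would come from SwissProt).
toy = [(f"A{i}", "MKV", ["GO:0000001"]) for i in range(20)]
toy += [(f"B{i}", "MLA", ["GO:0000002"]) for i in range(15)]
toy += [(f"C{i}", "MTT", ["GO:0000003"]) for i in range(5)]
toy += [(f"D{i}", "MSS", ["GO:0000001", "GO:0000002"]) for i in range(4)]
train, test = build_splits(toy, n_functions=2, n_per_function=10)
```

With the defaults (15 functions, 1000 sequences each, 70:30 split) the sketch yields 10,500 training and 4,500 test sequences, provided every selected function has at least 1000 single-function sequences.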
Embedding methods
- Install the Smith-Waterman alignment library
- Run embeddings.py to obtain the vectors
Figures
- Run dataset_metrics.py for optional data set plots
- Run semantics.py for the classification results (Table 3)
- Run visualization.py for the visualization results (Figure 7)
- Run eigenspectrum_plot.py for the information theory results (Figure 8)
@article{VATH2022100044,
title = {PROVAL: A framework for comparison of protein sequence embeddings},
journal = {Journal of Computational Mathematics and Data Science},
pages = {100044},
year = {2022},
issn = {2772-4158},
doi = {https://doi.org/10.1016/j.jcmds.2022.100044},
url = {https://www.sciencedirect.com/science/article/pii/S2772415822000128},
author = {Philipp Väth and Maximilian Münch and Christoph Raab and F.-M. Schleif},
}