The conda environment can be constructed from the configuration file env.yaml:
conda env create -f env.yaml
The code has been tested with CUDA 11.7 and PyTorch 1.13.1.
Don't forget to activate the environment before running the code:
conda activate PepGLAD
PyRosetta is used to calculate the interface energy of generated peptides. If you are interested in this metric, please follow the instructions here to install it.
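As a reference, below is a minimal sketch of how the interface energy of a target-peptide complex could be computed with PyRosetta's InterfaceAnalyzerMover. The interface string "A_B" and the mover settings are assumptions for illustration; the evaluation scripts in this repo may configure things differently.

```python
# Minimal sketch: interface energy (dG_separated) of a target-peptide complex with PyRosetta.
# Assumes chain A is the target and chain B is the peptide; not the exact setup of the repo's scripts.
from pyrosetta import init, pose_from_pdb
from pyrosetta.rosetta.protocols.analysis import InterfaceAnalyzerMover

init("-mute all")  # silence PyRosetta logging

pose = pose_from_pdb("assets/1ssc_A_B.pdb")
mover = InterfaceAnalyzerMover("A_B")   # interface between chain A and chain B
mover.set_pack_separated(True)          # repack side chains of the separated partners
mover.apply(pose)

print("interface dG:", mover.get_interface_dG())
```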
These datasets are only used for benchmarking. If you just want to use the trained weights for inference on your own cases, there is no need to download them.
- Download
The datasets, originally introduced in this paper, are uploaded to Zenodo at this URL. You can download them as follows:
mkdir datasets # all datasets will be put into this directory
wget https://zenodo.org/records/13373108/files/train_valid.tar.gz?download=1 -O ./datasets/train_valid.tar.gz # training/validation
wget https://zenodo.org/records/13373108/files/LNR.tar.gz?download=1 -O ./datasets/LNR.tar.gz # test set
wget https://zenodo.org/records/13373108/files/ProtFrag.tar.gz?download=1 -O ./datasets/ProtFrag.tar.gz # augmentation dataset
- Decompress
tar zxvf ./datasets/train_valid.tar.gz -C ./datasets
tar zxvf ./datasets/LNR.tar.gz -C ./datasets
tar zxvf ./datasets/ProtFrag.tar.gz -C ./datasets
- Process
python -m scripts.data_process.process --index ./datasets/train_valid/all.txt --out_dir ./datasets/train_valid/processed # train/validation set
python -m scripts.data_process.process --index ./datasets/LNR/test.txt --out_dir ./datasets/LNR/processed # test set
python -m scripts.data_process.process --index ./datasets/ProtFrag/all.txt --out_dir ./datasets/ProtFrag/processed # augmentation dataset
The indexes of processed data for the train/validation splits need to be generated as follows, which will produce datasets/train_valid/processed/train_index.txt and datasets/train_valid/processed/valid_index.txt:
python -m scripts.data_process.split --train_index datasets/train_valid/train.txt --valid_index datasets/train_valid/valid.txt --processed_dir datasets/train_valid/processed/
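A quick sanity check that the split indexes were written correctly might look like the following (it only assumes the index files are plain text with one entry per line):

```python
# Count entries in the generated split indexes (assumes one entry per line).
from pathlib import Path

for split in ("train_index.txt", "valid_index.txt"):
    path = Path("datasets/train_valid/processed") / split
    n = sum(1 for line in path.open() if line.strip())
    print(f"{split}: {n} entries")
```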
- Download
wget http://huanglab.phys.hust.edu.cn/pepbdb/db/download/pepbdb-20200318.tgz -O ./datasets/pepbdb.tgz
- Decompress
mkdir -p ./datasets/pepbdb && tar zxvf ./datasets/pepbdb.tgz -C ./datasets/pepbdb
- Process
python -m scripts.data_process.pepbdb --index ./datasets/pepbdb/peptidelist.txt --out_dir ./datasets/pepbdb/processed
python -m scripts.data_process.split --train_index ./datasets/pepbdb/train.txt --valid_index ./datasets/pepbdb/valid.txt --test_index ./datasets/pepbdb/test.txt --processed_dir datasets/pepbdb/processed/
mv ./datasets/pepbdb/processed/pdbs ./datasets/pepbdb # re-locate
- codesign:
./checkpoints/codesign.ckpt
- conformation generation:
./checkpoints/fixseq.ckpt
Both can be downloaded from the release page. These checkpoints were trained on PepBench.
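If you want to confirm that a downloaded checkpoint is intact before running inference, a small sketch like the one below can be used. The exact layout of the checkpoint object is an assumption; adjust the inspection to whatever torch.load actually returns.

```python
# Inspect a downloaded checkpoint without running the model.
import torch

ckpt = torch.load("checkpoints/codesign.ckpt", map_location="cpu")
if isinstance(ckpt, dict):
    print("top-level keys:", list(ckpt.keys()))
else:
    print("checkpoint object type:", type(ckpt))
```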
Take ./assets/1ssc_A_B.pdb as an example, where chain A is the target protein:
# obtain the binding site, which might also be manually crafted or derived from other ligands (e.g. small molecules, antibodies)
python -m api.detect_pocket --pdb assets/1ssc_A_B.pdb --target_chains A --ligand_chains B --out assets/1ssc_A_pocket.json
# sequence-structure codesign with length in [8, 15)
CUDA_VISIBLE_DEVICES=0 python -m api.run \
--mode codesign \
--pdb assets/1ssc_A_B.pdb \
--pocket assets/1ssc_A_pocket.json \
--out_dir ./output/codesign \
--length_min 8 \
--length_max 15 \
--n_samples 10
Then 10 generated peptides will be output under the folder ./output/codesign.
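To quickly look at what was generated, a sketch like the following lists the produced files and extracts each peptide sequence with Biopython. The output layout, file naming, and the peptide chain ID are assumptions; adapt them to the actual contents of ./output/codesign.

```python
# Sketch: list generated PDB files and print the sequence of the peptide chain.
# Assumes each sample is a PDB file under ./output/codesign and the peptide is chain B.
from pathlib import Path
from Bio.PDB import PDBParser
from Bio.SeqUtils import seq1

parser = PDBParser(QUIET=True)
for pdb_path in sorted(Path("./output/codesign").glob("**/*.pdb")):
    structure = parser.get_structure(pdb_path.stem, str(pdb_path))
    chain = structure[0]["B"]  # assumed peptide chain ID
    seq = "".join(seq1(res.get_resname()) for res in chain if res.id[0] == " ")
    print(pdb_path.name, seq)
```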
Take ./assets/1ssc_A_B.pdb as an example, where chain A is the target protein:
# obtain the binding site, which might also be manually crafted or derived from other ligands (e.g. small molecules, antibodies)
python -m api.detect_pocket --pdb assets/1ssc_A_B.pdb --target_chains A --ligand_chains B --out assets/1ssc_A_pocket.json
# generate binding conformation
CUDA_VISIBLE_DEVICES=0 python -m api.run \
--mode struct_pred \
--pdb assets/1ssc_A_B.pdb \
--pocket assets/1ssc_A_pocket.json \
--out_dir ./output/struct_pred \
--peptide_seq PYVPVHFDASV \
--n_samples 10
Then 10 conformations will be output under the folder ./output/struct_pred.
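If you want a rough measure of how diverse the sampled conformations are, the sketch below computes pairwise C-alpha RMSD between the generated structures after superposition. The file layout and the peptide chain ID are assumptions.

```python
# Sketch: pairwise C-alpha RMSD between generated conformations of the peptide (assumed chain B).
from itertools import combinations
from pathlib import Path
from Bio.PDB import PDBParser, Superimposer

parser = PDBParser(QUIET=True)

def ca_atoms(pdb_path, chain_id="B"):
    structure = parser.get_structure(pdb_path.stem, str(pdb_path))
    return [res["CA"] for res in structure[0][chain_id] if "CA" in res]

paths = sorted(Path("./output/struct_pred").glob("**/*.pdb"))
for p1, p2 in combinations(paths, 2):
    a1, a2 = ca_atoms(p1), ca_atoms(p2)
    if len(a1) != len(a2):
        continue  # skip pairs with mismatched lengths
    sup = Superimposer()
    sup.set_atoms(a1, a2)  # superimpose the second set onto the first
    print(f"{p1.name} vs {p2.name}: RMSD = {sup.rms:.2f} A")
```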
Each task requires the following steps, which we have integrated into the script ./scripts/run_exp_pipe.sh (a rough sketch of this flow is given after the list):
- Train autoencoder
- Train latent diffusion model
- Calculate distribution of latent distances between consecutive residues
- Generation & Evaluation
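The sketch below only illustrates how the four stages are chained from the script's positional arguments. generate.py and cal_metrics.py are the commands shown later in this section, while the training entry point (train.py), the latent-guidance setup module, and the checkpoint variable are assumptions that may differ from the actual script.

```bash
#!/bin/bash
# Illustrative flow of scripts/run_exp_pipe.sh; entry points for steps 1-3 are assumptions.
EXP_NAME=$1          # experiment name, e.g. pepbench_codesign
AE_CONFIG=$2         # autoencoder training config
LDM_CONFIG=$3        # latent diffusion training config
GUIDANCE_CONFIG=$4   # latent-guidance (distance distribution) config
TEST_CONFIG=$5       # generation/evaluation config

python train.py --config "$AE_CONFIG" --gpu "$GPU"                    # 1. train autoencoder
python train.py --config "$LDM_CONFIG" --gpu "$GPU"                   # 2. train latent diffusion model
python -m scripts.setup_latent_guidance --config "$GUIDANCE_CONFIG"   # 3. latent distance distribution (module name assumed)
# 4. generation & evaluation ($LDM_CKPT is a placeholder for the checkpoint produced in step 2)
python generate.py --config "$TEST_CONFIG" --ckpt "$LDM_CKPT" --gpu "$GPU" --save_dir "./results/$EXP_NAME"
python cal_metrics.py --results "./results/$EXP_NAME/results.jsonl"
```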
Alternatively, if you want to evaluate existing checkpoints, please follow the instructions below (using conformation generation as an example):
# generate results on the test set and save to ./results/fixseq
python generate.py --config configs/pepbench/test_fixseq.yaml --ckpt checkpoints/fixseq.ckpt --gpu 0 --save_dir ./results/fixseq
# calculate metrics
python cal_metrics.py --results ./results/fixseq/results.jsonl
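The results file appears to be in JSON-lines format (one JSON record per line). A quick way to peek at it, without assuming the field names inside each record, is:

```python
# Peek at the generation results (JSON lines, one record per generated sample).
import json

with open("./results/fixseq/results.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

print("number of records:", len(records))
if records:
    print("fields in a record:", sorted(records[0].keys()))
```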
Codesign experiments on PepBench:
GPU=0 bash scripts/run_exp_pipe.sh pepbench_codesign configs/pepbench/autoencoder/train_codesign.yaml configs/pepbench/ldm/train_codesign.yaml configs/pepbench/ldm/setup_latent_guidance.yaml configs/pepbench/test_codesign.yaml
Conformation generation experiments on PepBench:
GPU=0 bash scripts/run_exp_pipe.sh pepbench_fixseq configs/pepbench/autoencoder/train_fixseq.yaml configs/pepbench/ldm/train_fixseq.yaml configs/pepbench/ldm/setup_latent_guidance.yaml configs/pepbench/test_fixseq.yaml
Thank you for your interest in our work!
Please feel free to ask any questions about the algorithms or code, or to report problems encountered while running them, so that we can make the project clearer and better. You can either create an issue in the GitHub repo or contact us at jackie_kxz@outlook.com.
@article{kong2024full,
title={Full-atom peptide design with geometric latent diffusion},
author={Kong, Xiangzhe and Huang, Wenbing and Liu, Yang},
journal={arXiv preprint arXiv:2402.13555},
year={2024}
}