Official implementation of Plug & Play Directed Evolution (PPDE), a fast MCMC-based sampler that mixes and matches unsupervised and supervised sequence models for machine-learning-based directed evolution of proteins.
Please check out https://github.com/NREL/EvoProtGrad for an easy-to-use library that implements PPDE, is installable via pip, and supports 🤗 HuggingFace protein language models.
Create the conda env with necessary dependencies:
conda env create -f environment.yml
Activate the environment:
conda activate ppde
Install the package:
poetry install
Simulated annealing sampler
python3 scripts/mnist_sum.py --seed 1 --sampler simulated_annealing --unsupervised_expert ebm --energy_function product_of_experts --simulated_annealing_temp 10 --muts_per_seq_param 5 --energy_lamda 30 --n_iters 20000 --log_every 50 --wild_type 1
MALA-approx sampler
python3 scripts/mnist_sum.py --seed 1 --sampler MALA-approx --unsupervised_expert ebm --energy_function product_of_experts --diffusion_step_size 0.1 --diffusion_relaxation_tau 0.9 --energy_lamda 5 --n_iters 20000 --log_every 50 --wild_type 1
CMA-ES sampler
python3 scripts/mnist_sum.py --seed 1 --sampler CMAES --unsupervised_expert ebm --energy_function product_of_experts --energy_lamda 20 --cmaes_initial_variance 0.1 --n_iters 20000 --log_every 50 --wild_type 1
PPDE sampler
python3 scripts/mnist_sum.py --seed 1 --sampler PPDE --unsupervised_expert ebm --energy_function product_of_experts --ppde_pas_length 10 --energy_lamda 10 --n_iters 20000 --log_every 50 --wild_type 1
By default, the script saves metrics and visualizations to results/mnist_sum/.
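The four MNIST commands above share most of their flags and differ only in the sampler-specific ones, so they can be generated from a small table. The following is a minimal sketch of such a launcher, not part of the repository: the helper names `build_command` and `run_all` are hypothetical, and the extra seeds 2 and 3 are an assumption (the commands above only use seed 1). All flags and values are copied from the commands above.

```python
import subprocess
import sys

# Flags shared by every MNIST command listed above.
COMMON_FLAGS = [
    "--unsupervised_expert", "ebm",
    "--energy_function", "product_of_experts",
    "--n_iters", "20000",
    "--log_every", "50",
    "--wild_type", "1",
]

# Sampler-specific flags, copied verbatim from the commands above.
SAMPLER_FLAGS = {
    "simulated_annealing": [
        "--simulated_annealing_temp", "10",
        "--muts_per_seq_param", "5",
        "--energy_lamda", "30",
    ],
    "MALA-approx": [
        "--diffusion_step_size", "0.1",
        "--diffusion_relaxation_tau", "0.9",
        "--energy_lamda", "5",
    ],
    "CMAES": ["--energy_lamda", "20", "--cmaes_initial_variance", "0.1"],
    "PPDE": ["--ppde_pas_length", "10", "--energy_lamda", "10"],
}


def build_command(sampler, seed):
    """Return the argument list for one run of scripts/mnist_sum.py."""
    return [
        sys.executable, "scripts/mnist_sum.py",
        "--seed", str(seed),
        "--sampler", sampler,
        *SAMPLER_FLAGS[sampler],
        *COMMON_FLAGS,
    ]


def run_all(seeds=(1, 2, 3), dry_run=True):
    """Print (dry_run=True) or execute every sampler/seed combination."""
    for sampler in SAMPLER_FLAGS:
        for seed in seeds:
            cmd = build_command(sampler, seed)
            if dry_run:
                print(" ".join(cmd))
            else:
                # check=True stops the sweep on the first failing run.
                subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all()
```

Run it from the repository root with the `ppde` environment active. With `dry_run=True` it only prints the commands, so they can be inspected before launching 20000-iteration runs.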
See ./scripts/train_mnist.sh for instructions on training the MNIST models.
Script for training the Denoising Autoencoder (DAE) model: ./scripts/train_binary_mnist_dae.py.
Script for training the supervised experts: ./scripts/train_binary_mnist_regression.py.
UPDATE: We discovered that the PPDE protein sampler was running with a "soft" maximum number of mutations from wild type: the sampler would reset the Markov chain to the wild type whenever a mutation proposal was rejected. We have corrected the accept/reject step code (L77 in ppde/protein_samplers/ppde.py) and added a proper "hard" maximum number of mutations constraint. The hard constraint is easy to implement in our sampler: any mutation that would push the edit distance from the wild type above a threshold has its logit set to negative infinity, so it can never be proposed.
Overall, this improves PPDE's diversity scores, average number of mutations in the population, and predicted activity scores, with a reasonable drop in evolutionary density scores due to increased exploration away from the wild type. See this PDF for updated versions of Table 1 and Table 2 (this is with a "hard" maximum of 10 mutations from wild type, which can be set with the argument --nmut_threshold in our code). To replicate the PPDE protein experiment results from the paper, set the flag --paper_results to use the "soft" maximum number of mutations constraint. Note that these flags only affect the PPDE protein sampler (not the baselines or the MNIST experiments).
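The two behaviours described in this update can be illustrated with a short NumPy sketch. It is not the code in ppde/protein_samplers/ppde.py. The function names, the `(length, vocab)` logit layout, and the restriction to single-site substitutions (so that edit distance reduces to Hamming distance) are all assumptions made for illustration. The `paper_results` argument only mirrors the behaviour the text attributes to the --paper_results flag.

```python
import numpy as np


def accept_or_reject(current, proposal, log_accept_ratio, wild_type, rng,
                     paper_results=False):
    """One Metropolis-Hastings accept/reject decision.

    On rejection a standard chain stays at `current`. With
    paper_results=True it instead resets to `wild_type`, which is the
    "soft" maximum behaviour described above.
    """
    accept_prob = np.exp(min(0.0, log_accept_ratio))
    if rng.random() < accept_prob:
        return proposal
    return wild_type if paper_results else current


def mask_logits_by_edit_distance(logits, current, wild_type, nmut_threshold):
    """Set logits of over-threshold substitutions to -inf ("hard" maximum).

    logits:    (L, V) scores for placing token v at position i (assumed layout)
    current:   (L,) integer-encoded current sequence
    wild_type: (L,) integer-encoded wild-type sequence
    """
    masked = np.array(logits, dtype=float)  # copy, leave the input untouched
    vocab_size = masked.shape[1]
    differs = current != wild_type               # (L,) sites already mutated
    n_mut = int(differs.sum())
    # Would position i differ from the wild type if it held token v?
    would_differ = np.arange(vocab_size)[None, :] != wild_type[:, None]
    # Hamming distance after substituting token v at position i.
    new_dist = n_mut - differs[:, None].astype(int) + would_differ.astype(int)
    masked[new_dist > nmut_threshold] = -np.inf
    return masked


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    wt = np.array([0, 0, 0])
    cur = np.array([1, 0, 0])
    print(mask_logits_by_edit_distance(np.zeros((3, 3)), cur, wt, 1))
```

Because masked entries are exactly negative infinity, a softmax over the masked logits assigns them zero probability. This makes the constraint hard, not merely discouraged.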
protein | unsupervised expert | --energy_lamda |
---|---|---|
PABP_YEAST_Fields2013 | potts | 5 |
UBE4B_MOUSE_Klevit2013-nscor_log2_ratio | potts | 0.5 |
GFP_AEQVI_Sarkisyan2016 | potts | 15 |
PABP_YEAST_Fields2013 | transformer | 5 |
UBE4B_MOUSE_Klevit2013-nscor_log2_ratio | transformer | 3 |
GFP_AEQVI_Sarkisyan2016 | transformer | 1 |
See ./scripts/run_protein_samplers.sh, or run the samplers individually:
Random sampler
python3 scripts/directed_evolution.py --seed 1 --sampler Random --unsupervised_expert potts --energy_function product_of_experts --energy_lamda 5 --n_iters 10000 --log_every 50 --protein PABP_YEAST_Fields2013 --msa_path data/proteins/PABP_YEAST.a2m
Simulated annealing sampler
python3 scripts/directed_evolution.py --seed 1 --sampler simulated_annealing --unsupervised_expert potts --energy_function product_of_experts --energy_lamda 5 --n_iters 10000 --log_every 50 --protein PABP_YEAST_Fields2013 --msa_path data/proteins/PABP_YEAST.a2m
MALA-approx sampler
python3 scripts/directed_evolution.py --seed 1 --sampler MALA-approx --unsupervised_expert potts --energy_function product_of_experts --energy_lamda 5 --n_iters 10000 --log_every 50 --protein PABP_YEAST_Fields2013 --msa_path data/proteins/PABP_YEAST.a2m
CMA-ES sampler
python3 scripts/directed_evolution.py --seed 1 --sampler CMAES --unsupervised_expert potts --energy_function product_of_experts --energy_lamda 5 --n_iters 1000 --log_every 50 --protein PABP_YEAST_Fields2013 --msa_path data/proteins/PABP_YEAST.a2m
PPDE sampler
python3 scripts/directed_evolution.py --seed 1 --sampler PPDE --unsupervised_expert potts --energy_function product_of_experts --energy_lamda 5 --n_iters 100 --log_every 50 --protein PABP_YEAST_Fields2013 --msa_path data/proteins/PABP_YEAST.a2m
By default, the script saves metrics in .npy format to results/proteins/$PROTEIN. Compute metrics with ./scripts/make_figures.py.
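For a quick look at the saved arrays before running ./scripts/make_figures.py, the .npy files can be loaded directly with NumPy. The sketch below assumes only that the results directory contains .npy files somewhere beneath it. The file names and array shapes are not documented here, so the hypothetical helpers `load_metrics` and `summarize` discover them at runtime instead of hard-coding any. If a file was saved as a pickled Python object, not a plain array, `np.load` would need `allow_pickle=True`.

```python
from pathlib import Path

import numpy as np


def load_metrics(results_dir):
    """Return {relative path: array} for every .npy file under results_dir."""
    root = Path(results_dir)
    metrics = {}
    for path in sorted(root.rglob("*.npy")):
        # allow_pickle=False refuses object arrays, which is the safe default.
        metrics[str(path.relative_to(root))] = np.load(path, allow_pickle=False)
    return metrics


def summarize(metrics):
    """Print the shape and dtype of each loaded array."""
    for name, arr in metrics.items():
        print(f"{name}: shape={arr.shape}, dtype={arr.dtype}")


if __name__ == "__main__":
    # The directory follows the results/proteins/$PROTEIN pattern above.
    summarize(load_metrics("results/proteins/PABP_YEAST_Fields2013"))
```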
@article{emami2023plug,
author={Emami, Patrick and Perreault, Aidan and Law, Jeffrey and Biagioni, David and St. John, Peter C},
title={Plug \& Play Directed Evolution of Proteins with Gradient-based Discrete MCMC},
journal={Machine Learning: Science and Technology},
url={http://iopscience.iop.org/article/10.1088/2632-2153/accacd},
year={2023}
}