p-IgGen is a paired antibody auto-regressive language model. This package provides utility functions for generating and scoring antibody sequences using p-IgGen, with model weights stored on HuggingFace.
- Generate full-length antibody sequences.
- Generate a heavy chain given a light chain and vice versa.
- Generate full-length antibody sequences given an initial sequence.
- Calculate log likelihoods of sequences.
- VH and VL chains of generated sequences can optionally be separated using ANARCI.
We advise installing in a conda environment:
- Create a new conda environment:
conda create -n my_env
conda activate my_env
conda install python pip -y
- Install this repository:
pip install git+https://github.com/OliverT1/p-IgGen.git
- Install the optional ANARCI dependency (for the --separate_chains option):
conda install -c bioconda anarci
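Because ANARCI is optional, it can help to check that it is importable before requesting --separate_chains. The following is a minimal sketch, assuming the bioconda package exposes the Python module `anarci`; the helper name `anarci_available` is hypothetical and not part of p-IgGen.

```python
import importlib.util


def anarci_available() -> bool:
    """Return True if the optional `anarci` module can be imported."""
    # find_spec returns None when the module is not installed.
    return importlib.util.find_spec("anarci") is not None


if __name__ == "__main__":
    if anarci_available():
        print("ANARCI found: --separate_chains can be used.")
    else:
        print("ANARCI not found: install it before using --separate_chains.")
```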
To generate new antibody sequences, use the piggen_generate command:
piggen_generate --output_file output_sequences.txt --n_sequences 100
Sequences are generated by default in direction VH->VL, from C-term to N-term. Alternatively, they can be generated in reverse from VL->VH, from N-term to C-term, using the --backwards flag. This allows generation given a VH or VL sequence of any length.
Note:
- If --backwards is used, the --initial_sequence should be provided in reverse, starting from the N-term of the VL chain.
- If --heavy_chain_file or --light_chain_file is used, this inversion is handled automatically, and the VH and VL chains should be provided in the standard direction.
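When --initial_sequence is combined with --backwards, the reversal has to be done by hand. A minimal sketch, assuming "in reverse" simply means the residue string written in the opposite order; `reverse_sequence` is a hypothetical helper, not a function provided by this package.

```python
def reverse_sequence(sequence: str) -> str:
    """Return the amino-acid string in reversed residue order."""
    # Strip surrounding whitespace so a trailing newline does not end up first.
    return sequence.strip()[::-1]


# Illustrative fragment only, not a real antibody sequence.
fragment = "DIQMTQ"
print(reverse_sequence(fragment))  # QTMQID
```

The reversed string could then be passed as the value of --initial_sequence together with --backwards.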
Options:
- --developable: Use the developable model.
- --heavy_chain_file FILE: File containing heavy chain sequences to generate light chains from.
- --light_chain_file FILE: File containing light chain sequences to generate heavy chains from.
- --initial_sequence TEXT: Initial sequence to generate from.
- --n_sequences INTEGER: Number of sequences to generate, per input sequence if applicable.
- --top_p FLOAT: Top-p sampling value (default: 0.95).
- --temp FLOAT: Temperature for generation (default: 1.2).
- --bottom_n_percent FLOAT: Bottom n percent of sequences to discard based on likelihood (default: 5).
- --backwards: Generate sequences in reverse.
- --output_file FILE: File to save the generated sequences (required).
- --separate_chains: Output VH and VL sequences separately, requires ANARCI.
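The --top_p and --temp options are standard sampling controls for auto-regressive models. The toy sketch below shows what they typically do to a single next-token distribution; it is not p-IgGen's own implementation, and the function name and toy logits are illustrative assumptions.

```python
import math
import random


def sample_top_p(logits, temperature=1.2, top_p=0.95, rng=random):
    """Sample one index from `logits` using temperature scaling and top-p."""
    # Temperature scaling: values above 1 flatten the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Sample among the kept tokens; random.choices renormalises the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# Toy logits for a four-token vocabulary (hypothetical values).
toy_logits = [2.0, 1.0, 0.1, -1.0]
print(sample_top_p(toy_logits, rng=random.Random(0)))
```

Lower --top_p values restrict sampling to fewer candidates, while higher --temp values spread probability more evenly across them.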
Using --bottom_n_percent requires --n_sequences to be at least 100; otherwise this option is ignored.
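The likelihood-based filtering can be pictured as follows. This is a sketch of one plausible way to discard the lowest-scoring sequences, assuming each sequence comes paired with a log likelihood; `drop_bottom_percent` and the toy data are hypothetical and do not reproduce the package's internal code.

```python
def drop_bottom_percent(scored, bottom_n_percent=5.0):
    """Drop the lowest-likelihood fraction of (sequence, log_likelihood) pairs."""
    # Number of sequences to discard, rounded down.
    n_drop = int(len(scored) * bottom_n_percent / 100)
    # Sort from most to least likely, then cut off the tail.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return ranked[: len(ranked) - n_drop]


# Toy data: 100 placeholder sequences with decreasing log likelihoods.
scored = [(f"seq{i}", -float(i)) for i in range(100)]
kept = drop_bottom_percent(scored, bottom_n_percent=5.0)
print(len(kept))  # 95
```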
To calculate the log likelihoods of sequences, use the piggen_likelihood command:
piggen_likelihood --sequence_file input_sequences.txt --output_file log_likelihoods.txt
Options:
- --developable: Use the developable model.
- --sequence_file FILE: The file containing sequences to calculate log likelihoods.
- --batch_size INTEGER: Batch size for processing sequences.
- --output_file FILE: File to save the log likelihoods.
- --local: Load model from local path.
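For an auto-regressive model, the log likelihood of a sequence is generally built from the conditional probability of each token given the tokens before it. The sketch below uses made-up probabilities to show the arithmetic; whether p-IgGen reports the summed or a length-normalised value is not stated here, so both are shown, and the function name is hypothetical.

```python
import math


def sequence_log_likelihood(token_probs):
    """Sum the log of each token's conditional probability."""
    return sum(math.log(p) for p in token_probs)


# Hypothetical per-token probabilities p(x_i | x_<i) for a 4-token sequence.
toy_probs = [0.5, 0.25, 0.5, 1.0]
total = sequence_log_likelihood(toy_probs)
print(total)                   # summed log likelihood, equal to log(0.0625)
print(total / len(toy_probs))  # length-normalised variant
```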
Generate Light Chains for Provided Heavy Chains:
piggen_generate --heavy_chain_file heavy_chains.txt --n_sequences 5 --top_p 0.95 --temp 1.25 --output_file generated_sequences.txt
Heavy chains should be separated by newlines. Here, five light chains will be generated for each heavy chain.
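An input file in this one-sequence-per-line layout can be prepared with a few lines of Python. This is a minimal sketch; `write_sequences` is a hypothetical helper and the fragments are placeholders, not real heavy chains.

```python
import tempfile
from pathlib import Path


def write_sequences(path, sequences):
    """Write one sequence per line, skipping empty entries; return the count."""
    cleaned = [s.strip() for s in sequences if s.strip()]
    Path(path).write_text("\n".join(cleaned) + "\n")
    return len(cleaned)


# Placeholder fragments for illustration, not real heavy chains.
heavy_fragments = ["EVQLVESGGG", "", "QVQLQESGPG\n"]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "heavy_chains.txt"
    n_written = write_sequences(target, heavy_fragments)
    print(n_written)           # 2
    print(target.read_text())  # the two fragments, one per line
```

In practice the file would be written to a persistent path and that path passed to --heavy_chain_file.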
Calculate Log Likelihoods for Sequences:
piggen_likelihood --sequence_file sequences.txt --batch_size 2 --output_file log_likelihoods.txt
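The same command can be driven from a Python script. The sketch below only assembles the argument list from the options documented above and runs it when both the piggen_likelihood executable and the input file are present; `build_likelihood_command` is a hypothetical wrapper, not part of the package.

```python
import shutil
import subprocess
from pathlib import Path


def build_likelihood_command(sequence_file, output_file, batch_size=None, developable=False):
    """Assemble the piggen_likelihood argument list from the documented options."""
    cmd = ["piggen_likelihood", "--sequence_file", str(sequence_file),
           "--output_file", str(output_file)]
    if batch_size is not None:
        cmd += ["--batch_size", str(batch_size)]
    if developable:
        cmd.append("--developable")
    return cmd


cmd = build_likelihood_command("sequences.txt", "log_likelihoods.txt", batch_size=2)
print(" ".join(cmd))

# Only run when the CLI is installed and the input file actually exists.
if shutil.which(cmd[0]) is not None and Path("sequences.txt").exists():
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.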