This repo contains the code to train and test NetTCR-2.1 models

mnielLab/NetTCR-2.1

NetTCR-2.1 - Sequence-based prediction of peptide-TCR interactions using CDR1, CDR2 and CDR3 loops

NetTCR-2.1 is a deep learning model used to predict TCR specificity. NetTCR-2.1 uses convolutional neural networks (CNN) to predict whether a given TCR binds a specific peptide. The NetTCR-2.1 publication is available at https://www.frontiersin.org/articles/10.3389/fimmu.2022.1055151/full.

The scripts in this repo allow training and testing of models. It is possible to train/test using CDR3 only (with train_nettcr_cdr3.py and test_nettcr_cdr3.py) or all the CDRs (with train_nettcr_cdr123.py and test_nettcr_cdr123.py).

Data

The input datasets should contain the CDRs and peptide sequences. For CDR3 training/testing, at least the columns A3 and B3 must be present (with headers). For CDR123, the columns A1, A2, A3, B1, B2 and B3 are required. All input files should be comma-separated.

See data/GILGFVFTL/train.csv as an example of a CDR123 dataset.

NB! Since the NetTCR models are peptide-specific, the peptide sequence is not needed in the input file. Make sure that all the TCRs in the input file refer to the same peptide.
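The column requirements above can be checked before training. The following is a minimal sketch using only the Python standard library; the helper `missing_columns`, the `binder` column and the toy sequences are assumptions made for illustration and are not part of the repository.

```python
import csv
import io

# Column names as listed in this README; the helper itself is hypothetical.
REQUIRED_COLUMNS = {
    "cdr3": ["A3", "B3"],
    "cdr123": ["A1", "A2", "A3", "B1", "B2", "B3"],
}


def missing_columns(csv_file, mode):
    """Return the required columns absent from the header of an open CSV file."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [col for col in REQUIRED_COLUMNS[mode] if col not in present]


# Toy in-memory file standing in for a real dataset (made-up row and label column).
example = io.StringIO("A3,B3,binder\nCAVRDSNYQLIW,CASSIRSSYEQYF,1\n")
print(missing_columns(example, "cdr3"))  # [] means no required column is missing
```

To check a real file, open it with `open(path, newline="")` and pass the file object in place of the in-memory example.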

The folder data/ contains the data used to train/validate/test NetTCR-2.1. The data files contain information about the 6 CDR loops, the V/J genes, the target peptide and the HLA. The positive data was retrieved from the IEDB, VDJdb, 10X Genomics and McPAS datasets; the negative data either comes from 10X (denoted as true_neg) or is generated by mismatching positive TCRs and peptides (denoted as swapped_neg).
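The idea behind swapped negatives can be illustrated with a short sketch: each positive TCR is paired with a peptide it is not known to bind. This shows the general mismatching idea only; the sampling scheme, the negative-to-positive ratio and the function name `swapped_negatives` are assumptions, not the procedure used to build the files in data/.

```python
import random


def swapped_negatives(positives, seed=0):
    """Pair each TCR with a peptide other than the ones it is known to bind.

    positives: list of (tcr, peptide) tuples of known binders.
    Returns a list of (tcr, peptide, "swapped_neg") tuples.
    """
    rng = random.Random(seed)
    known = set(positives)
    peptides = sorted({pep for _, pep in positives})
    negatives = []
    for tcr, pep in positives:
        # Candidate peptides: any other peptide not already a known binder of this TCR.
        candidates = [p for p in peptides if p != pep and (tcr, p) not in known]
        if candidates:
            negatives.append((tcr, rng.choice(candidates), "swapped_neg"))
    return negatives


# Placeholder TCR identifiers; the peptides are two of those named in this README.
positives = [("tcr_1", "GILGFVFTL"), ("tcr_2", "RAKFKQLL")]
print(swapped_negatives(positives))
```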

The redundancy in the dataset was reduced using the Hobohm1 algorithm [1], with the kernel similarity measure [2] and a similarity threshold of 0.95. Thus the training, validation and test datasets do not share similar TCR sequences (up to the 0.95 similarity threshold).
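Hobohm1 is a greedy selection: sequences are visited in order, and a sequence is kept only if it is not too similar to any sequence already kept. The sketch below shows that loop under loud assumptions: `toy_similarity` is a stand-in based on `difflib` and is NOT the kernel similarity of [2], the strict `<` comparison is a guess, and the input strings are illustrative.

```python
from difflib import SequenceMatcher


def toy_similarity(a, b):
    """Stand-in similarity in [0, 1]; NOT the kernel similarity of [2]."""
    return SequenceMatcher(None, a, b).ratio()


def hobohm1(sequences, similarity=toy_similarity, threshold=0.95):
    """Greedy Hobohm1 selection.

    Keep a sequence only if its similarity to every already-kept
    sequence is below `threshold`.
    """
    kept = []
    for seq in sequences:
        if all(similarity(seq, rep) < threshold for rep in kept):
            kept.append(seq)
    return kept


# Illustrative strings: the second is an exact duplicate of the first.
seqs = ["CASSIRSSYEQYF", "CASSIRSSYEQYF", "CASSLAPGATNEKLFF"]
print(hobohm1(seqs))  # ['CASSIRSSYEQYF', 'CASSLAPGATNEKLFF']
```

Because the selection is greedy, the result depends on the order in which sequences are visited.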

Environment setup

First, create the conda environment by running conda env create -f environment.yml. This will create a conda environment called nettcr_env with the necessary dependencies.

Network training

The input files for the training scripts are the training dataset and the validation dataset; the latter is used for early stopping.

Example:

python src/train_nettcr_cdr3.py --train_data data/RAKFKQLL/train.csv --val_data data/RAKFKQLL/validation.csv --outdir out/<model_name>/

This will generate and save a .pt file with the trained model. The output directory must be specified with the --outdir option.

The other input arguments to the script are --epochs, --learning_rate and --verbose. If a GPU is available, the script will detect it and use it.
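Early stopping on the validation set can be sketched in a few lines. This is a generic illustration rather than the logic of the training scripts: the `patience` parameter and the use of validation loss as the monitored quantity are assumptions, since this README lists only --epochs, --learning_rate and --verbose.

```python
def best_epoch_with_early_stopping(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the loss has not improved for `patience`
    consecutive epochs; `patience` is a hypothetical parameter.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Made-up losses: the best value is at epoch 2, and training stops at epoch 5.
print(best_epoch_with_early_stopping([0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5]))  # (2, 5)
```

In a real training loop the model weights from the best epoch would be the ones saved.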

Network testing

The test scripts can be used to make predictions for test TCRs, using a pre-trained model.

Example:

python src/test_nettcr_cdr3.py --test_data data/RAKFKQLL/test.csv --trained_model out/<model_name>/trained_model_cdr3_ab.pt --outdir out/<model_name>/

This will generate and save a .csv file with the predictions. The file will be saved in the specified output directory.

Pre-trained models

The folder pretrained_models contains the models from [3]. The pretrained models cover both the NetTCR-2.1 CDR3 and CDR123 architectures, with paired alpha and beta chains. For each network configuration, the peptide-specific models are provided. For each peptide, the network was trained using 5-fold nested cross-validation; this results in 20 models per peptide. The final prediction score is given by the average of the 20 predictions. The following example shows how to test the pretrained models.

python src/test_pretrained_cdr3.py --test_data data/RAKFKQLL/test.csv --trained_models_dir pretrained_models/cdr3_pep/RAKFKQLL/ --outdir <path/for/prediction/file>

NB! NetTCR-2.1 is a peptide-specific model. Make sure that the pretrained model and the test data refer to the same peptide.
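The averaging over the cross-validation models can be sketched as follows. The function `ensemble_scores` and the toy scores are assumptions for illustration; loading the actual .pt models and scoring TCRs is left to test_pretrained_cdr3.py.

```python
from statistics import mean


def ensemble_scores(per_model_scores):
    """Average predictions across models.

    per_model_scores: one list of scores per model, all covering the
    same TCRs in the same order.
    """
    n_tcrs = {len(scores) for scores in per_model_scores}
    if len(n_tcrs) != 1:
        raise ValueError("every model must score the same TCRs")
    # zip(*...) regroups the scores so that each tuple holds one TCR's scores.
    return [mean(tcr_scores) for tcr_scores in zip(*per_model_scores)]


# Three toy models scoring two TCRs (20 models per peptide in the real setting).
scores = [[0.2, 0.8], [0.4, 0.6], [0.3, 0.7]]
print(ensemble_scores(scores))
```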

References

[1] Hobohm, Uwe, et al. "Selection of representative protein data sets." Protein Science 1.3 (1992): 409-417.

[2] Shen, Wen-Jun, et al. "Towards a mathematical foundation of immunology and amino acid chains." arXiv preprint arXiv:1205.6031 (2012).

[3] Montemurro, Alessandro, et al. "NetTCR-2.1: Lessons and guidance on how to develop models for TCR specificity predictions." Frontiers in Immunology Volume 13 (2022).
