mnielLab/train_NetTCR

NetTCR-2.1 - Sequence-based prediction of peptide-TCR interactions using CDR1, CDR2 and CDR3 loops

NetTCR-2.1 is a deep learning model for predicting TCR specificity. It uses convolutional neural networks (CNNs) to predict whether a given TCR binds a specific peptide.

The scripts in this repo allow training and testing of models. It is possible to train/test using CDR3 only (with train_nettcr_cdr3.py and test_nettcr_cdr3.py) or all the CDRs (with train_nettcr_cdr123.py and test_nettcr_cdr123.py). It is also possible to choose, with the --chain option, which chains of the TCRs to use for training.

Data

The input datasets should contain the CDR and peptide sequences. For CDR3 training/testing, at least the columns peptide, A3, B3 must be present (with headers). For CDR123, the columns must be peptide, A1, A2, A3, B1, B2, B3. All input files should be comma-separated.
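As a quick sanity check before training, the header of an input file can be compared against the column names listed above. The following is a minimal sketch, not part of the repository: the helper `missing_columns` and the in-memory example row are hypothetical, and only the column names come from this README.

```python
import csv
import io

# Column names are taken from the README; the helper itself is hypothetical.
REQUIRED_COLUMNS = {
    "cdr3": ["peptide", "A3", "B3"],
    "cdr123": ["peptide", "A1", "A2", "A3", "B1", "B2", "B3"],
}


def missing_columns(csv_file, mode="cdr3"):
    """Return the required columns that are absent from the CSV header."""
    reader = csv.reader(csv_file)
    header = [name.strip() for name in next(reader)]
    return [col for col in REQUIRED_COLUMNS[mode] if col not in header]


# Tiny in-memory file; the CDR3 sequences are illustrative values only.
example = io.StringIO("peptide,A3,B3\nGILGFVFTL,CAVSGGSQGNLIF,CASSIRSSYEQYF\n")
print(missing_columns(example, mode="cdr3"))    # columns missing for CDR3 mode
example.seek(0)
print(missing_columns(example, mode="cdr123"))  # columns missing for CDR123 mode
```

For a file on disk, the same function can be called with an open file handle, for example `open("test/train_data", newline="")`.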

See test/train_data for an example.

The folder data/ contains the data used to train/validate/test NetTCR-2.1. The data file contains information about the 6 CDR loops, the V/J genes, the target peptide and the HLA. The positive data was retrieved from the IEDB, VDJdb, 10X Genomics and McPAS datasets; the negative data either comes from 10X (denoted as true_neg) or is generated by mismatching positive TCRs and peptides (denoted as swapped_neg).
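The idea of swapped negatives can be illustrated as follows. This is only a sketch under stated assumptions: the function `make_swapped_negatives`, the `origin` field, the one-negative-per-positive ratio and the example CDR3 sequences are all hypothetical, and the actual procedure used to build the dataset may differ.

```python
import random


def make_swapped_negatives(positives, seed=0):
    """Pair each positive TCR with a peptide it was not observed to bind."""
    rng = random.Random(seed)
    peptides = sorted({row["peptide"] for row in positives})
    # Known (TCR, peptide) pairs must never be emitted as negatives.
    observed = {(row["A3"], row["B3"], row["peptide"]) for row in positives}
    negatives = []
    for row in positives:
        candidates = [
            pep for pep in peptides
            if (row["A3"], row["B3"], pep) not in observed
        ]
        if not candidates:
            continue  # this TCR is positive for every peptide in the set
        negatives.append(
            {**row, "peptide": rng.choice(candidates), "origin": "swapped_neg"}
        )
    return negatives


# Illustrative rows only; not taken from the repository data.
positives = [
    {"peptide": "GILGFVFTL", "A3": "CAVSGGSQGNLIF", "B3": "CASSIRSSYEQYF"},
    {"peptide": "NLVPMVATV", "A3": "CAVRDNDYKLSF", "B3": "CASSLAPGATNEKLFF"},
]
for neg in make_swapped_negatives(positives):
    print(neg["peptide"], neg["A3"], neg["B3"], neg["origin"])
```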

The redundancy in the dataset was reduced using the Hobohm1 algorithm [1], with the kernel similarity measure [2] and a similarity threshold of 0.95. The data is then split into 6 partitions: 5 partitions can be used for nested cross-validation, and the left-out partition can be used as an external evaluation set.
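The greedy selection performed by Hobohm1 can be sketched as below. Note the assumptions: the `similarity` function is a stand-in based on Python's `difflib` and is not the kernel similarity of [2]; the function names and example sequences are hypothetical; and the order in which sequences are visited (here, input order) affects which representatives are kept.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; NOT the kernel similarity of [2]."""
    return SequenceMatcher(None, a, b).ratio()


def hobohm1(sequences, threshold=0.95, sim=similarity):
    """Greedy redundancy reduction in the style of Hobohm1.

    A sequence is kept only if its similarity to every sequence
    already kept is below the threshold.
    """
    kept = []
    for seq in sequences:
        if all(sim(seq, rep) < threshold for rep in kept):
            kept.append(seq)
    return kept


# Illustrative sequences: the second is an exact duplicate of the first.
sequences = ["CASSIRSSYEQYF", "CASSIRSSYEQYF", "CAVSGGSQGNLIF"]
print(hobohm1(sequences))
```

Passing a different `sim` callable lets the same loop be reused with another similarity measure.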

Network training

The input files for the training scripts are the training dataset and the validation dataset, which is used for early stopping.
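Early stopping on a validation set generally works as sketched below. This is a generic illustration and not the logic of the training scripts: the `patience` parameter, the function name and the example loss values are assumptions, and the precomputed loss list stands in for a real training loop.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    stop_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        stop_epoch = epoch
        if loss < best_loss:
            # A real training script would save the model checkpoint here.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, stop_epoch


# Made-up validation losses, one per epoch.
losses = [0.70, 0.62, 0.58, 0.60, 0.59, 0.61, 0.57]
print(early_stopping_epoch(losses, patience=3))
```

The model saved at `best_epoch` is the one that would be kept, which is why the validation set must be separate from the data used for final evaluation.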

Example:

python src/train_nettcr_cdr3.py --train_data test/train_data --val_data test/val_data --outdir test/models/ --chain ab

This will generate and save a .pt file with the trained model. The output directory has to be specified with the option --outdir.

The other input arguments to the script are --epochs, --learning_rate and --verbose. If a GPU is available, the script will detect it and use it.
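For orientation, the options named in this section could be declared with `argparse` roughly as follows. Only the option names come from this README; the defaults, the types, the set of accepted `--chain` values and which options are required are assumptions made for illustration, so check the scripts' own `--help` output for the real values.

```python
import argparse


def build_parser():
    """Sketch of a parser for the options named in the README."""
    parser = argparse.ArgumentParser(description="Illustrative training CLI")
    parser.add_argument("--train_data", required=True, help="training CSV")
    parser.add_argument("--val_data", required=True, help="validation CSV")
    parser.add_argument("--outdir", required=True, help="output directory")
    # Accepted values and all defaults below are assumptions, not documented.
    parser.add_argument("--chain", default="ab", choices=["a", "b", "ab"])
    parser.add_argument("--epochs", type=int, default=100)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--verbose", type=int, default=0)
    return parser


# Parse the same arguments as the example command above.
args = build_parser().parse_args([
    "--train_data", "test/train_data",
    "--val_data", "test/val_data",
    "--outdir", "test/models/",
    "--chain", "ab",
])
print(args.train_data, args.val_data, args.outdir, args.chain)
```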

Network testing

The test scripts can be used to make predictions for test TCRs using a pre-trained model.

Example:

python src/test_nettcr_cdr3.py --test_data test/test_data --trained_model test/models/trained_model_cdr3_ab.pt --outdir test/models/ --chain ab

This will generate and save a .csv file with the predictions. The file will be saved in the specified output directory.

References

[1] Hobohm, Uwe, et al. "Selection of representative protein data sets." Protein Science 1.3 (1992): 409-417.

[2] Shen, Wen-Jun, et al. "Towards a mathematical foundation of immunology and amino acid chains." arXiv preprint arXiv:1205.6031 (2012).
