ContraTCR: Predicting TCR-Epitope Binding with Features Generated by Fine-Tuned Protein Language Model and Contrastive Learning
Welcome to ContraTCR, a tool for training models that predict binding between T-cell receptors (TCRs) and epitopes using contrastive learning. This guide walks you through running the project, from training the model to making predictions.
ContraTCR uses contrastive learning to model the interaction between T-cell receptors and epitopes. It is trained on PyTDC data and aims to predict binding specificity. The workflow has three steps:
- Train the ESM-2 model along with the projection head using PyTDC data.
- Extract features using the trained model.
- Use the extracted features to train an XGBoost model for binding specificity prediction.
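To make the contrastive idea concrete, here is a minimal sketch of a triplet loss, one common contrastive objective. It is an illustration only: the loss that `train.py` actually uses is not shown in this guide, and the function names, the margin value, and the toy embeddings are all assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(0, d(anchor, positive) - d(anchor, negative) + margin).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy 2-D embeddings, made up for illustration (not real model outputs).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # should end up close to the anchor
negative = [3.0, 4.0]   # should end up far from the anchor

loss_easy = triplet_loss(anchor, positive, negative)  # negative already far away
loss_hard = triplet_loss(anchor, negative, positive)  # roles swapped: large penalty
print(loss_easy, loss_hard)
```

Minimizing this quantity pulls matching pairs together in embedding space and pushes non-matching pairs apart, which is what makes the extracted features useful for a downstream classifier.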
The project consists of several key files and directories:
- run.py: Main script to run different modes (train, extract, predict).
- config/: Contains configuration files (e.g., midterm400_clean.yaml).
- model.py: Defines the ESM-2 model and projection head.
- data.py: Data loading and preprocessing utilities.
- train.py: Training routines for different contrastive modes.
- extract.py: Feature extraction functions.
- xgb.py: Functions for training and evaluating the XGBoost model.
- result/: Default directory where results, logs, and checkpoints are saved.
The project operates in three primary modes: train, extract, and predict. Below are detailed instructions for each step, including how to incorporate your own data.
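The three modes can also be chained from a single Python script. The sketch below only assembles the command lines documented in this guide; `build_command` is a hypothetical helper, not part of the repository, and the placeholder paths must be replaced with your own.

```python
import shlex

def build_command(mode, config_path, **extra):
    """Return the argument list for one run.py invocation (hypothetical helper)."""
    if mode not in ("train", "extract", "predict"):
        raise ValueError(f"unknown mode: {mode}")
    cmd = ["python", "./run.py", "--config_path", config_path]
    # Extra keyword arguments become --flag value pairs, e.g. --resume_path.
    for flag, value in extra.items():
        cmd += [f"--{flag}", value]
    cmd += ["--mode", mode]
    return cmd

config = "./config/default/your_config.yaml"
pipeline = [
    build_command("train", config),
    build_command("extract", config,
                  resume_path="/path/to/your/model_checkpoint.pth"),
    build_command("predict", config,
                  train_feature_path="/path/to/your/feature_data_train.csv",
                  test_feature_path="/path/to/your/feature_data_test.csv"),
]

for cmd in pipeline:
    # Print only; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Building each command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.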
Description: Train the ESM-2 model along with a projection head.
Command:
!python ./run.py --config_path './config/default/your_config.yaml' --mode train
Instructions:
- Prepare Your Configuration File:
  - Copy the example configuration file midterm400_clean.yaml and rename it (e.g., your_config.yaml).
  - Open your_config.yaml and update the hyperparameters.
- Run Training:
  - Execute the command in a Colab notebook cell, or in your terminal without the leading `!`, replacing your_config.yaml with the name of your configuration file.
  - This will start the training process using your specified settings.
- Monitor Training:
  - Training logs will be printed to the console and saved in the log directory specified in your configuration.
  - Model checkpoints will be saved in the checkpoint directory.
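The configuration copy in the first step can be scripted. This is a small sketch using only the standard library; `copy_config` is a hypothetical helper, and the exact location of midterm400_clean.yaml inside config/ is an assumption you should check against your checkout.

```python
import shutil
from pathlib import Path

def copy_config(example, new_name):
    """Copy the example YAML next to itself under a new name.

    Refuses to overwrite an existing file so that edited
    hyperparameters are never lost by accident.
    """
    example = Path(example)
    target = example.with_name(new_name)
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    shutil.copy(example, target)
    return target

# Example call (path is an assumption, adjust to your repository layout):
# copy_config("./config/default/midterm400_clean.yaml", "your_config.yaml")
```

After copying, edit the new file by hand and pass its path to `--config_path`.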
Description: Utilize the trained ESM-2 model and projection head to extract features from the dataset.
Command:
!python ./run.py --config_path ./config/default/your_config.yaml --resume_path '/path/to/your/model_checkpoint.pth' --mode extract
Instructions:
- Ensure Model Checkpoint Exists:
  - After training, the model checkpoint should be saved in the checkpoint directory specified in your configuration file.
  - Locate the checkpoint file (e.g., model_triplet.pth or a similarly named file).
- Run Feature Extraction:
  - Replace /path/to/your/model_checkpoint.pth with the actual path to your model checkpoint.
- Verify Extraction:
  - The extracted features will be saved to paths within the result directory.
  - Ensure that the feature files are generated successfully.
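If several training runs have written checkpoints, finding the right file by hand can be tedious. The sketch below picks the most recently modified `.pth` file under a directory; `latest_checkpoint` is a hypothetical helper, and both the directory you pass in and the `*.pth` pattern are assumptions based on the example file name above.

```python
from pathlib import Path

def latest_checkpoint(checkpoint_dir, pattern="*.pth"):
    """Return the most recently modified checkpoint under checkpoint_dir, or None."""
    # rglob searches subdirectories too, since each run may have its own folder.
    candidates = list(Path(checkpoint_dir).rglob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)

# Example call (directory is an assumption, use the one from your configuration):
# print(latest_checkpoint("./result"))
```

The returned path can be passed directly to `--resume_path`. Note that "most recent" is not necessarily "best"; check the training logs if your run saves several checkpoints.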
Description: Use the extracted features to train an XGBoost model and perform binding specificity prediction.
Command:
!python ./run.py --config_path './config/default/your_config.yaml' \
--train_feature_path '/path/to/your/feature_data_train.csv' \
--test_feature_path '/path/to/your/feature_data_test.csv' \
--mode predict
Instructions:
- Locate Extracted Features:
  - Identify where the feature extraction step saved your feature files.
  - Typically, these are named feature_data_train.csv and feature_data_test.csv.
- Update Paths:
  - Replace /path/to/your/feature_data_train.csv and /path/to/your/feature_data_test.csv with the actual paths to your feature files.
- Run Prediction:
  - Execute the command to start training the XGBoost model and perform predictions.
  - The script will output performance metrics such as precision, recall, F1 score, and ROC AUC.
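For readers who want to sanity-check the reported numbers, the metrics named above can be computed by hand on a small example. This is a dependency-free sketch, not the code in `xgb.py`; the function names, the 0.5 decision threshold, and the toy labels and scores are assumptions for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = binds, 0 = does not bind)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def roc_auc(y_true, scores):
    """ROC AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of samples,
    so suitable for small checks only.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted binding probabilities (made up for illustration).
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(precision_recall_f1(y_true, y_pred))
print(roc_auc(y_true, scores))
```

Precision, recall, and F1 depend on the chosen threshold, whereas ROC AUC is computed from the raw scores and summarizes ranking quality across all thresholds.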
- ESM-2 Model: Evolutionary Scale Modeling (ESM) GitHub Repository
- Contrastive Learning: A Simple Framework for Contrastive Learning of Visual Representations
- XGBoost: XGBoost Documentation