Meta learning addresses noisy and under-labeled data in machine learning-guided antibody engineering
This repository contains code to perform the analysis described in Minot & Reddy 2024. [Cell Systems] [bioRxiv].
- Prepare Working Environment
- Datasets
- Reproducing Study Results
- Citing This Work
- Citing Supporting Repositories
conda env create -f meta_env.yml
conda activate meta_env
For virtualenv setup:
python -m venv meta_env
- Windows: meta_env\Scripts\activate.bat
- Unix / macOS: source meta_env/bin/activate
pip install -r requirements.txt
Note: the study results were produced with torch 1.11.0+cu113 (the CUDA 11.3 build). The provided environment installs torch 1.11.0.
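To confirm which torch build is active after installation, a small check such as the following may help. It is a minimal sketch that uses only the Python standard library; the helper names `base_version` and `torch_matches` are hypothetical and not part of this repository.

```python
from importlib import metadata

# Version stated in the note above; a local tag such as "+cu113" is ignored.
EXPECTED_TORCH = "1.11.0"


def base_version(version: str) -> str:
    """Drop a local build tag such as '+cu113' from a version string."""
    return version.split("+", 1)[0]


def torch_matches(expected: str = EXPECTED_TORCH):
    """Return (installed_version_or_None, whether it matches `expected`)."""
    try:
        installed = metadata.version("torch")
    except metadata.PackageNotFoundError:
        # torch is not installed in the active environment.
        return None, False
    return installed, base_version(installed) == expected


if __name__ == "__main__":
    installed, ok = torch_matches()
    print(f"torch installed: {installed}; matches {EXPECTED_TORCH}: {ok}")
```

Run it inside the activated environment. A mismatch does not necessarily mean the code will fail, only that the setup differs from the one used for the study.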
The following data are provided in data/ for ease of use:
- Preprocessed NGS data in data/4d5 and data/5a12
- Fully processed, clean datasets in data/preprocessed_full_datasets
- Train, meta, validation, and test sets are planned for upload. In the meantime, one can curate the data into train, val, meta, and test sets for each learning task via the preprocessing pipeline described in Step 1 below.
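Before running the pipeline, it can be useful to verify that the data directories listed above are present in the local clone. The following is a minimal sketch; the function name `missing_data_dirs` is hypothetical, and the directory names are taken from the list above.

```python
from pathlib import Path

# Directories named in the list above, relative to the repository root.
EXPECTED_DIRS = ("data/4d5", "data/5a12", "data/preprocessed_full_datasets")


def missing_data_dirs(repo_root="."):
    """Return the expected data directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_data_dirs()
    if missing:
        print("Missing data directories:", ", ".join(missing))
    else:
        print("All expected data directories are present.")
```

Run it from the repository root, or pass the path to the clone as `repo_root`.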
The full pipeline to reproduce the study, written in Python, can be summarised into three consecutive steps:
- Data preprocessing.
- Model training and evaluation.
- Plot results.
To download and start from raw deep sequencing (NGS) data, execute step A). To start from preprocessed NGS data contained in this repo, skip A) and head to step B). In scripts/:
Raw data is available at [NCBI] or on Mendeley in two parts: [Part 1] and [Part 2]. Save the data to data/raw_ngs/.
After installing [BBDUK], modify the paths to BBDUK and to this repo before executing 5a12_preprocessing_raw_ngs.sh and 4d5_preprocessing_raw_ngs.sh in preprocessing/. Then proceed to B).
To execute preprocessing and train/test splitting for each task, run 5a12_preprocessing.sh and 4d5_preprocessing.sh in preprocessing/.
Processed datasets for the full analysis in the paper will then be added to data/.
Run the scripts in scripts/run_models/. This will populate the folder results/ with .csv files in the appropriate format for plotting in Step 3.
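To check that the model runs produced output before moving on to plotting, one could list the .csv files under results/. This is a minimal sketch; the function name `list_result_csvs` is hypothetical, and searching subfolders recursively is an assumption, since the exact layout of results/ is not described here.

```python
from pathlib import Path


def list_result_csvs(results_dir="results"):
    """Return a sorted list of .csv files found under results_dir.

    Subfolders are searched recursively (an assumption about the layout).
    An empty list is returned if the directory does not exist.
    """
    root = Path(results_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.csv"))


if __name__ == "__main__":
    files = list_result_csvs()
    print(f"Found {len(files)} result file(s)")
    for path in files:
        print(" ", path)
```

An empty listing suggests that the scripts in scripts/run_models/ have not yet been run or wrote their output elsewhere.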
Run the Python scripts in plot/.
Please cite our work when referencing this repository.
@article{minot_meta_2024,
title = {Meta learning addresses noisy and under-labeled data in machine learning-guided antibody engineering},
issn = {2405-4712},
url = {https://www.sciencedirect.com/science/article/pii/S2405471223003332},
doi = {https://doi.org/10.1016/j.cels.2023.12.003},
journal = {Cell Systems},
author = {Minot, Mason and Reddy, Sai T.},
year = {2024},
}
@inproceedings{ren_learning_2018,
title = {Learning to {Reweight} {Examples} for {Robust} {Deep} {Learning}},
url = {https://proceedings.mlr.press/v80/ren18a.html},
booktitle = {Proceedings of the 35th {International} {Conference} on {Machine} {Learning}},
publisher = {PMLR},
author = {Ren, Mengye and Zeng, Wenyuan and Yang, Bin and Urtasun, Raquel},
month = jul,
year = {2018},
note = {ISSN: 2640-3498},
pages = {4334--4343},
}
@article{zheng_meta_2021,
title = {Meta {Label} {Correction} for {Noisy} {Label} {Learning}},
volume = {35},
copyright = {Copyright (c) 2021 Association for the Advancement of Artificial Intelligence},
issn = {2374-3468},
url = {https://ojs.aaai.org/index.php/AAAI/article/view/17319},
doi = {10.1609/aaai.v35i12.17319},
journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
author = {Zheng, Guoqing and Awadallah, Ahmed Hassan and Dumais, Susan},
}
For PUDMS [GitHub]:
@article{SONG202192,
title = {Inferring Protein Sequence-Function Relationships with Large-Scale Positive-Unlabeled Learning},
journal = {Cell Systems},
volume = {12},
number = {1},
pages = {92-101.e8},
year = {2021},
issn = {2405-4712},
doi = {https://doi.org/10.1016/j.cels.2020.10.007},
url = {https://www.sciencedirect.com/science/article/pii/S2405471220304142},
author = {Hyebin Song and Bennett J. Bremer and Emily C. Hinds and Garvesh Raskutti and Philip A. Romero},
}
For ElkaNoto [GitHub] [License_pulearn] [License_ElkaNoto]:
@inproceedings{elkan_learning_2008,
address = {New York, NY, USA},
series = {{KDD} '08},
title = {Learning classifiers from only positive and unlabeled data},
isbn = {978-1-60558-193-4},
url = {https://doi.org/10.1145/1401890.1401920},
doi = {10.1145/1401890.1401920},
booktitle = {Proceedings of the 14th {ACM} {SIGKDD} international conference on {Knowledge} discovery and data mining},
publisher = {Association for Computing Machinery},
author = {Elkan, Charles and Noto, Keith},
month = aug,
year = {2008},
}