MolPMoFiT

Implementation of "Inductive Transfer Learning for Molecular Activity Prediction: Next-Gen QSAR Models with MolPMoFiT".

Molecular Prediction Model Fine-Tuning (MolPMoFiT) is a transfer learning method for QSPR/QSAR modeling that combines self-supervised pre-training with task-specific fine-tuning.

MolPMoFiT is adapted from ULMFiT and implemented with PyTorch and fastai v1. A large-scale molecular structure prediction model is pre-trained on one million unlabeled molecules from ChEMBL in a self-supervised manner. It can then be fine-tuned on various QSPR/QSAR tasks, using smaller chemical datasets with specific endpoints.
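To make "self-supervised" concrete: assuming the pre-training objective follows ULMFiT's language-model setup (predict the next token from the preceding ones), the training targets come from the SMILES string itself, so no activity labels are needed. The sketch below only illustrates how such input/target pairs are formed. The function name and the `<bos>`/`<eos>` marker tokens are hypothetical and are not taken from this repository.

```python
def next_token_pairs(tokens, bos="<bos>", eos="<eos>"):
    """Build (inputs, targets) for next-token prediction.

    The target sequence is the input sequence shifted left by one
    position, so every token is predicted from the tokens before it.
    The bos/eos markers are illustrative placeholders (assumption).
    """
    sequence = [bos] + list(tokens) + [eos]
    inputs = sequence[:-1]   # everything except the final token
    targets = sequence[1:]   # the same sequence shifted by one
    return inputs, targets


# Ethanol ("CCO") split into atom-level tokens.
inputs, targets = next_token_pairs(["C", "C", "O"])
for context_token, target_token in zip(inputs, targets):
    print(f"{context_token:>5} -> {target_token}")
```

Each row pairs the current token with the one the model should predict next. Fine-tuning then reuses the weights learned this way on a labeled QSAR dataset.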

Environment

We recommend building the environment with Conda.

conda env create -f molpmofit.yml
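After creating and activating the environment, a quick sanity check can confirm that PyTorch and fastai v1 are importable. This is a minimal sketch, not part of the repository. It assumes the packages are importable as `torch` and `fastai` and that each exposes a `__version__` attribute. The helper names are hypothetical.

```python
import importlib


def installed_version(module_name):
    """Return the module's __version__ string, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def is_fastai_v1(version):
    """True when a version string such as '1.0.61' has major version 1."""
    return version is not None and version.split(".")[0] == "1"


def environment_report(module_names=("torch", "fastai")):
    """Map each module name to its version string (or None)."""
    return {name: installed_version(name) for name in module_names}
```

Calling `environment_report()` inside the activated environment shows the detected versions, and `is_fastai_v1(installed_version("fastai"))` should be `True`, because the notebooks target fastai v1.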

Datasets

We provide all the datasets needed to reproduce the experiments in the data folder.

data/MSPM contains the dataset to train the general domain molecular structure prediction model.
data/QSAR contains the datasets for QSAR tasks.
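To see which files ship in each folder before opening the notebooks, a short listing helper is enough. The sketch below assumes it is run from the repository root. It makes no assumption about the file names inside `data/MSPM` and `data/QSAR`, and the function name is hypothetical.

```python
from pathlib import Path


def list_dataset_files(data_root="data", subfolders=("MSPM", "QSAR")):
    """Return {subfolder: sorted file names}; missing folders map to []."""
    root = Path(data_root)
    listing = {}
    for name in subfolders:
        folder = root / name
        if folder.is_dir():
            # Only regular files are listed, not nested directories.
            listing[name] = sorted(p.name for p in folder.iterdir() if p.is_file())
        else:
            listing[name] = []
    return listing


for folder, files in list_dataset_files().items():
    print(f"data/{folder}: {len(files)} file(s)")
    for file_name in files:
        print(f"  {file_name}")
```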

Experiments

The code is provided as Jupyter notebooks in the notebooks folder. All code was developed on an Ubuntu 18.04 workstation with two Quadro P4000 GPUs.

01_MSPM_Pretraining.ipynb: Training the general-domain molecular structure prediction model (MSPM).
02_MSPM_TS_finetuning.ipynb: (1) Fine-tuning the general MSPM on a target dataset to generate a task-specific MSPM. (2) Fine-tuning the task-specific MSPM to train a QSAR model.
03_QSAR_Classifcation.ipynb: Fine-tuning the general domain MSPM to train a classification model.
04_QSAR_Regression.ipynb: Fine-tuning the general domain MSPM to train a regression model.

Pre-trained Models Download

Download ChEMBL_1M_atom. See notebooks/05_Pretrained_Models.ipynb for usage instructions.
- This model is trained on 1M ChEMBL molecules with the atomwise tokenization method (the original MolPMoFiT).
Download ChEMBL_1M_SPE. See notebooks/06_SPE_Pretrained_Models.ipynb for usage instructions.
- This model is trained on 1M ChEMBL molecules with the SMILES pair encoding tokenization method.
- SMILES Pair Encoding (SmilesPE) is a data-driven substructure tokenization algorithm for deep learning.
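The difference between the two tokenization schemes can be illustrated with a small sketch. The atomwise tokenizer below uses a regular expression that is common in the SMILES-modeling literature, and the repository's own tokenizer may differ. The merge step is a simplified, byte-pair-encoding-style illustration of pair encoding, and the merge list shown is hypothetical rather than the vocabulary learned by SmilesPE.

```python
import re

# Commonly used atom-level SMILES pattern (assumption: not necessarily
# the exact pattern used in this repository).
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def atomwise_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    return SMILES_PATTERN.findall(smiles)


def apply_merges(tokens, merges):
    """Apply an ordered list of (left, right) merges to a token list.

    Each merge joins every adjacent occurrence of the pair into a single
    substructure token, in the order the merges are given.
    """
    tokens = list(tokens)
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
atoms = atomwise_tokenize(aspirin)
print(atoms)

# Hypothetical merges, for illustration only.
toy_merges = [("=", "O"), ("(", "=O"), ("(=O", ")")]
print(apply_merges(atoms, toy_merges))
```

With atomwise tokenization every atom, bond, and ring-closure digit is its own token. Pair encoding yields fewer, longer tokens that correspond to recurring substructures.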

XinhaoLi74/MolPMoFiT