This repository contains message passing neural networks for molecular property prediction as described in the paper Analyzing Learned Molecular Representations for Property Prediction.
While it is possible to run all of the code on a CPU-only machine, GPUs make training significantly faster. To run with GPUs, you will need:
- cuda >= 8.0
- cuDNN
The easiest way to install the chemprop dependencies is via conda. Here are the steps:
- Install Miniconda from https://conda.io/miniconda.html
- cd /path/to/chemprop
- conda env create -f environment.yml
- source activate chemprop (or conda activate chemprop for newer versions of conda)
- (Optional) pip install git+https://github.com/bp-kelley/descriptastorus
The optional descriptastorus package is only necessary if you plan to incorporate computed RDKit features into your model (see Additional Features). The addition of these features improves model performance on some datasets but is not necessary for the base model.
Note that on machines with GPUs, you may need to manually install a GPU-enabled version of PyTorch by following the instructions here.
Docker provides a nice way to isolate the chemprop code and environment. To install and run our code in a Docker container, follow these steps:
- Install Docker from https://docs.docker.com/install/
- cd /path/to/chemprop
- docker build -t chemprop .
- docker run -it chemprop:latest /bin/bash
Note that you will need to run the latter command with nvidia-docker if you are on a GPU machine in order to be able to access the GPUs.
If you would like to use functions or classes from chemprop in your own code, you can install chemprop as a pip package as follows:
cd /path/to/chemprop
pip install -e .
Then you can use import chemprop or from chemprop import ... in your other code.
PyTorch GPU: Although PyTorch is installed automatically along with chemprop, you may need to install the GPU version manually. Instructions are available here.
kyotocabinet: If you get warning messages about kyotocabinet not being installed, it's safe to ignore them.
For those less familiar with the command line, we also have a web interface which allows for basic training and predicting. After installing the dependencies following the instructions above, you can start the web interface in two ways:
- Run python web/run.py and then navigate to localhost:5000 in a web browser. This will start the site in development mode.
- Run gunicorn --bind {host}:{port} 'wsgi:build_app()'. This will start the site in production mode.
  - To run this server in the background, add the --daemon flag.
  - Arguments including init_db and demo can be passed with this pattern: 'wsgi:build_app(init_db=True, demo=True)'
  - Gunicorn documentation can be found here.
In order to train a model, you must provide training data containing molecules (as SMILES strings) and known target values. Targets can either be real numbers, if performing regression, or binary (i.e. 0s and 1s), if performing classification. Target values which are unknown can be left as blanks.
Our model can either train on a single target ("single tasking") or on multiple targets simultaneously ("multi-tasking").
The data file must be a CSV file with a header row. For example:
smiles,NR-AR,NR-AR-LBD,NR-AhR,NR-Aromatase,NR-ER,NR-ER-LBD,NR-PPAR-gamma,SR-ARE,SR-ATAD5,SR-HSE,SR-MMP,SR-p53
CCOc1ccc2nc(S(N)(=O)=O)sc2c1,0,0,1,,,0,0,1,0,0,0,0
CCN1C(=O)NC(c2ccccc2)C1=O,0,0,0,0,0,0,0,,0,,0,0
...
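As a rough illustration of this format, the sketch below reads the example rows above with Python's standard csv module and treats blank cells as unknown targets. The helper name read_targets is hypothetical and is not part of chemprop; it only shows one way such a file could be interpreted.

```python
import csv
import io

# The header and two data rows from the example above.
SAMPLE = """smiles,NR-AR,NR-AR-LBD,NR-AhR,NR-Aromatase,NR-ER,NR-ER-LBD,NR-PPAR-gamma,SR-ARE,SR-ATAD5,SR-HSE,SR-MMP,SR-p53
CCOc1ccc2nc(S(N)(=O)=O)sc2c1,0,0,1,,,0,0,1,0,0,0,0
CCN1C(=O)NC(c2ccccc2)C1=O,0,0,0,0,0,0,0,,0,,0,0
"""


def read_targets(handle):
    """Return (task_names, rows), where each row is (smiles, targets).

    Hypothetical helper for illustration: blank targets become None.
    """
    reader = csv.reader(handle)
    header = next(reader)            # header row: SMILES column, then task names
    task_names = header[1:]
    rows = []
    for line in reader:
        if not line:                 # skip empty lines
            continue
        smiles, raw = line[0], line[1:]
        targets = [float(v) if v != "" else None for v in raw]
        rows.append((smiles, targets))
    return task_names, rows


task_names, rows = read_targets(io.StringIO(SAMPLE))
```

Here each molecule carries twelve targets (multi-tasking), and the unknown entries stay as None rather than being guessed.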
Datasets from MoleculeNet and a 450K subset of ChEMBL from http://www.bioinf.jku.at/research/lsc/index.html have been preprocessed and are available in data.tar.gz. To uncompress them, run tar xvzf data.tar.gz.
To train a model, run:
python train.py --data_path <path> --dataset_type <type> --save_dir <dir>
where <path> is the path to a CSV file containing a dataset, <type> is either "classification" or "regression" depending on the type of the dataset, and <dir> is the directory where model checkpoints will be saved.
For example:
python train.py --data_path data/tox21.csv --dataset_type classification --save_dir tox21_checkpoints
Notes:
- The default metric for classification is AUC and the default metric for regression is RMSE. Other metrics may be specified with --metric <metric>.
- --save_dir may be left out if you don't want to save model checkpoints.
- --quiet can be added to reduce the amount of debugging information printed to the console. Both a quiet and verbose version of the logs are saved in the save_dir.
Our code supports several methods of splitting data into train, validation, and test sets.
Random: By default, the data will be split randomly into train, validation, and test sets.
Scaffold: Alternatively, the data can be split by molecular scaffold so that the same scaffold never appears in more than one split. This can be specified by adding --split_type scaffold_balanced.
Separate val/test: If you have separate data files you would like to use as the validation or test set, you can specify them with --separate_val_path <val_path> and/or --separate_test_path <test_path>.
Note: By default, both random and scaffold split the data into 80% train, 10% validation, and 10% test. This can be changed with --split_sizes <train_frac> <val_frac> <test_frac>. For example, the default setting is --split_sizes 0.8 0.1 0.1. Both also involve a random component and can be seeded with --seed <seed>. The default setting is --seed 0.
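To make the roles of the split sizes and the seed concrete, here is a minimal sketch of a seeded random split. It is not chemprop's actual splitting code; the function random_split and its exact rounding behavior are assumptions made for illustration.

```python
import random


def random_split(items, split_sizes=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items with a seeded RNG and cut them into train/val/test.

    Illustrative only: mirrors the idea behind --split_sizes and --seed.
    """
    if abs(sum(split_sizes) - 1.0) > 1e-9:
        raise ValueError("split sizes must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)    # same seed gives the same order
    train_end = int(split_sizes[0] * len(shuffled))
    val_end = int((split_sizes[0] + split_sizes[1]) * len(shuffled))
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


train, val, test = random_split(range(100))
```

Calling the function again with the same seed reproduces the same three subsets, while a different seed generally produces a different assignment.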
k-fold cross-validation can be run by specifying --num_folds <k>. The default is --num_folds 1.
To train an ensemble, specify the number of models in the ensemble with --ensemble_size <n>. The default is --ensemble_size 1.
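If you launch training from a script, the flags described above can be assembled programmatically. The following sketch builds a train.py command line from optional settings; the helper build_train_command is hypothetical, and the fold and ensemble counts in the usage lines are arbitrary example values.

```python
import shlex


def build_train_command(data_path, dataset_type, save_dir=None,
                        split_type=None, split_sizes=None, seed=None,
                        num_folds=None, ensemble_size=None):
    """Assemble a train.py argument list from the flags described above.

    Hypothetical helper: options left as None are omitted so that train.py
    falls back to its own defaults.
    """
    cmd = ["python", "train.py",
           "--data_path", data_path,
           "--dataset_type", dataset_type]
    if save_dir is not None:
        cmd += ["--save_dir", save_dir]
    if split_type is not None:
        cmd += ["--split_type", split_type]
    if split_sizes is not None:
        cmd += ["--split_sizes"] + [str(s) for s in split_sizes]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if num_folds is not None:
        cmd += ["--num_folds", str(num_folds)]
    if ensemble_size is not None:
        cmd += ["--ensemble_size", str(ensemble_size)]
    return cmd


# Example values only: 5 folds and an ensemble of 3 models.
command = build_train_command(
    "data/tox21.csv", "classification", save_dir="tox21_checkpoints",
    split_type="scaffold_balanced", num_folds=5, ensemble_size=3,
)
command_line = shlex.join(command)
```

The resulting list can be handed to subprocess.run, or the joined string can be pasted into a shell.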
Although the default message passing architecture works quite well on a variety of datasets, optimizing the hyperparameters for a particular dataset often leads to marked improvement in predictive performance. We have automated hyperparameter optimization via Bayesian optimization (using the hyperopt package) in hyperparameter_optimization.py. This script finds the optimal hidden size, depth, dropout, and number of feed-forward layers for our model. Optimization can be run as follows:
python hyperparameter_optimization.py --data_path <data_path> --dataset_type <type> --num_iters <n> --config_save_path <config_path>
where <n> is the number of hyperparameter settings to try and <config_path> is the path to a .json file where the optimal hyperparameters will be saved. Once hyperparameter optimization is complete, the optimal hyperparameters can be applied during training by specifying the config path as follows:
python train.py --data_path <data_path> --dataset_type <type> --config_path <config_path>
Note that the hyperparameter optimization script sees all the data given to it. The intended use is to run the hyperparameter optimization script on a dataset with the eventual test set held out. If you need to optimize hyperparameters separately for several different cross validation splits, you should e.g. set up a bash script to run hyperparameter_optimization.py separately on each split's training and validation data with test held out.
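The per-split workflow described above could equally be driven from Python instead of bash. The sketch below only builds the commands and does not run them. It assumes you have already written one CSV per split containing just that split's training and validation data; the file names and the configs/ output directory are hypothetical.

```python
def hyperopt_commands(split_paths, dataset_type, num_iters):
    """Build one hyperparameter_optimization.py command per split file.

    split_paths is assumed to hold CSV files containing only each split's
    training and validation data, with the test set already held out.
    """
    commands = []
    for index, data_path in enumerate(split_paths):
        commands.append([
            "python", "hyperparameter_optimization.py",
            "--data_path", data_path,
            "--dataset_type", dataset_type,
            "--num_iters", str(num_iters),
            "--config_save_path", f"configs/split_{index}.json",
        ])
    return commands


# Hypothetical file names; each command could be passed to subprocess.run.
commands = hyperopt_commands(
    ["splits/split_0_train_val.csv", "splits/split_1_train_val.csv"],
    dataset_type="regression",
    num_iters=20,
)
```

Each split then gets its own config file, which can be supplied to train.py with --config_path when training on that split.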
While the model works very well on its own, especially after hyperparameter optimization, we have seen that adding computed molecule-level features can further improve performance on certain datasets. Features can be added to the model using the --features_generator <generator> flag.
As a starting point, we recommend using pre-normalized RDKit features via the --features_generator rdkit_2d_normalized --no_features_scaling flags. In general, we recommend NOT using the --no_features_scaling flag (i.e. allow the code to automatically perform feature scaling), but the rdkit_2d_normalized features have already been normalized and don't require further scaling.
Note: In order to use the rdkit_2d_normalized features, you must have descriptastorus installed. If you installed via conda, you can install descriptastorus by running pip install git+https://github.com/bp-kelley/descriptastorus. If you installed via Docker, descriptastorus should already be installed.
The full list of available features for --features_generator is as follows.
morgan is binary Morgan fingerprints, radius 2 and 2048 bits.
morgan_count is count-based Morgan, radius 2 and 2048 bits.
rdkit_2d is an unnormalized version of 200 assorted rdkit descriptors. Full list can be found at the bottom of our paper: https://arxiv.org/pdf/1904.01561.pdf
rdkit_2d_normalized is the CDF-normalized version of the 200 rdkit descriptors.
If you would like to load custom features, you can do so in two ways:
- Generate features: If you want to generate features in code, you can write a custom features generator function in chemprop/features/features_generators.py. Scroll down to the bottom of that file to see a features generator code template.
- Load features: If you have features saved as a numpy .npy file or as a .csv file, you can load the features by using --features_path /path/to/features. Note that the features must be in the same order as the SMILES strings in your data file. Also note that .csv files must have a header row and the features should be comma-separated with one line per molecule.
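For the "load features" route, the following sketch shows one way to write a features .csv that satisfies the constraints above: a header row, comma-separated values, one line per molecule, and the same order as the SMILES strings in the data file. The helper write_features_csv, the feature names, and the feature values are all made up for illustration.

```python
import csv
import io


def write_features_csv(handle, smiles_list, features_by_smiles, feature_names):
    """Write one comma-separated feature row per molecule, in SMILES order."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(feature_names)                 # required header row
    for smiles in smiles_list:                     # keep the data file's order
        row = features_by_smiles[smiles]
        if len(row) != len(feature_names):
            raise ValueError(f"wrong number of features for {smiles}")
        writer.writerow(row)


# SMILES strings in the order they appear in the data file.
smiles_list = ["CCOc1ccc2nc(S(N)(=O)=O)sc2c1", "CCN1C(=O)NC(c2ccccc2)C1=O"]
# Placeholder feature values, made up for illustration only.
features_by_smiles = {
    "CCN1C(=O)NC(c2ccccc2)C1=O": [0.5, 2.0],
    "CCOc1ccc2nc(S(N)(=O)=O)sc2c1": [1.5, 3.0],
}
buffer = io.StringIO()
write_features_csv(buffer, smiles_list, features_by_smiles, ["feat_a", "feat_b"])
features_csv = buffer.getvalue()
```

Iterating over smiles_list rather than over the dictionary is what keeps the rows aligned with the data file, regardless of the order in which the features were computed.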
To load a trained model and make predictions, run predict.py and specify:
- --test_path <path>: Path to the data to predict on.
- A checkpoint by using either:
  - --checkpoint_dir <dir>: Directory where the model checkpoint(s) are saved (i.e. --save_dir during training). This will walk the directory, load all .pt files it finds, and treat the models as an ensemble.
  - --checkpoint_path <path>: Path to a model checkpoint file (.pt file).
- --preds_path: Path where a CSV file containing the predictions will be saved.
For example:
python predict.py --test_path data/tox21.csv --checkpoint_dir tox21_checkpoints --preds_path tox21_preds.csv
or
python predict.py --test_path data/tox21.csv --checkpoint_path tox21_checkpoints/fold_0/model_0/model.pt --preds_path tox21_preds.csv
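To illustrate what "walking the directory" for --checkpoint_dir means, here is a small standard-library sketch that collects every .pt file below a directory. It is not the code predict.py uses; find_checkpoints is a hypothetical helper, and the temporary directory layout (including the second model and the notes.txt file) is invented for the demonstration.

```python
import os
import tempfile


def find_checkpoints(checkpoint_dir):
    """Recursively collect all .pt files under checkpoint_dir, sorted by path."""
    paths = []
    for root, _dirs, files in os.walk(checkpoint_dir):
        for name in files:
            if name.endswith(".pt"):
                paths.append(os.path.join(root, name))
    return sorted(paths)


# Build a throwaway directory tree with two checkpoints and one other file.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("fold_0/model_0/model.pt",
                "fold_0/model_1/model.pt",
                "fold_0/notes.txt"):
        path = os.path.join(tmp, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    found = [os.path.relpath(p, tmp).replace(os.sep, "/")
             for p in find_checkpoints(tmp)]
```

In this layout both model.pt files are found and the text file is ignored, which corresponds to the two checkpoints being treated as an ensemble.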
During training, TensorBoard logs are automatically saved to the same directory as the model checkpoints. To view TensorBoard logs, run tensorboard --logdir=<dir> where <dir> is the path to the checkpoint directory. Then navigate to http://localhost:6006.
We compared our model against MolNet by Wu et al. on all of the MolNet datasets for which we could reproduce their splits (all but Bace, Toxcast, and qm7). When there was only one fold provided (scaffold split for BBBP and HIV), we ran our model multiple times and reported average performance. In each case we optimized hyperparameters on separate folds, used rdkit_2d_normalized features when useful, and compared to the best-performing model in MolNet as reported by Wu et al. We did not ensemble our model in these results.
Results on regression datasets (lower is better)
Dataset | Size | Metric | Ours | MolNet Best Model |
---|---|---|---|---|
QM8 | 21,786 | MAE | 0.011 ± 0.000 | 0.0143 ± 0.0011 |
QM9 | 133,885 | MAE | 2.666 ± 0.006 | 2.4 ± 1.1 |
ESOL | 1,128 | RMSE | 0.555 ± 0.047 | 0.58 ± 0.03 |
FreeSolv | 642 | RMSE | 1.075 ± 0.054 | 1.15 ± 0.12 |
Lipophilicity | 4,200 | RMSE | 0.555 ± 0.023 | 0.655 ± 0.036 |
PDBbind (full) | 9,880 | RMSE | 1.391 ± 0.012 | 1.25 ± 0 |
PDBbind (core) | 168 | RMSE | 2.173 ± 0.090 | 1.92 ± 0.07 |
PDBbind (refined) | 3,040 | RMSE | 1.486 ± 0.026 | 1.38 ± 0 |
Results on classification datasets (higher is better)
Dataset | Size | Metric | Ours | MolNet Best Model |
---|---|---|---|---|
PCBA | 437,928 | PRC-AUC | 0.335 ± 0.001 | 0.136 ± 0.004 |
MUV | 93,087 | PRC-AUC | 0.041 ± 0.007 | 0.184 ± 0.02 |
HIV | 41,127 | ROC-AUC | 0.776 ± 0.007 | 0.792 ± 0 |
BBBP | 2,039 | ROC-AUC | 0.737 ± 0.001 | 0.729 ± 0 |
Tox21 | 7,831 | ROC-AUC | 0.851 ± 0.002 | 0.829 ± 0.006 |
SIDER | 1,427 | ROC-AUC | 0.676 ± 0.014 | 0.648 ± 0.009 |
ClinTox | 1,478 | ROC-AUC | 0.864 ± 0.017 | 0.832 ± 0.037 |
Lastly, you can find the code to our original repo at https://github.com/wengong-jin/chemprop and for the Mayr et al. baseline at https://github.com/yangkevin2/lsc_experiments .