- Title: Estimating Uncertainty in Machine Learning Models for Drug Discovery
- Type: MSc dissertation
- Author: George Batchkala, https://www.linkedin.com/in/george-batchkala/
- Supervisor: Professor Garrett M. Morris, garrett.morris@dtc.ox.ac.uk
- Institution: University of Oxford
- Department: Department of Statistics, 24-29 St Giles', Oxford, OX1 3LB
- Project's dates: June 1st, 2020 - September 14th, 2020
- Data: MoleculeNet, Physical Chemistry Datasets (http://moleculenet.ai/datasets-1)
- GitHub repository: https://github.com/GeorgeBatch/moleculenet
This repository contains all code, results, and plots I produced while completing my MSc dissertation. The PDF of the full dissertation will be uploaded after it has been marked and I have officially completed my degree.
"My model says that I had just found an ultimate drug. Can I trust it?"
In this work, I explore ways of quantifying the confidence of machine learning models used in drug discovery. To do this, I start by exploring methods for predicting physicochemical properties of drugs and drug-like molecules that are crucial to drug discovery. I first attempt to reproduce and improve upon a subset of results concerning a drug's aqueous solubility, taken from a popular benchmark set called "MoleculeNet". Using XGBoost, which in the era of Deep Neural Networks is already considered a "conventional" machine learning method, I show that I am able to achieve state-of-the-art results. After that, I explore Gaussian Processes and the Infinitesimal Jackknife for Random Forests, together with their associated uncertainty estimates. Finally, I attempt to understand whether the confidence of a model's prediction can be used to answer a similar but more general question: "How do we know when to trust our models?" The answer depends on the model: we can trust Gaussian Processes when they are confident, but the confidence estimates from Random Forests do not give us any such assurance.
This work is mostly based on four papers:
- "MoleculeNet: A Benchmark for Molecular Machine Learning" by Wu et al.;
- "Learning From the Ligand: Using Ligand-Based Features to Improve Binding Affinity Prediction" by Boyles et al.;
- "The Photoswitch Dataset: A Molecular Machine Learning Benchmark for the Advancement of Synthetic Chemistry" by Thawani et al.; and
- "Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife" by Wager et al..
In this dissertation I aim to achieve three primary goals:
- Reproduce a subset of solubility-related prediction results from the MoleculeNet benchmarking paper;
- Improve upon the reproduced results; and
- Use uncertainty estimation methods with the best-performing models to obtain per-prediction uncertainty estimates, and use these to evaluate and compare the methods.
I used the MoleculeNet datasets, which accompany the MoleculeNet benchmarking paper; in particular, I focused on the Physical Chemistry datasets: ESOL, FreeSolv, and Lipophilicity. The MoleculeNet datasets are widely used to validate machine learning models that estimate a particular property directly from small molecules, including drug-like compounds.
The Physical Chemistry datasets can be downloaded from the MoleculeNet benchmark dataset collection.
I use the following four models for the regression task of physicochemical property prediction:
I obtained per-prediction confidence intervals with:
- Gaussian Processes (notes, chapter 7, section 7.2)
- Bias-corrected Infinitesimal Jackknife estimate for Random Forests (paper)
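As an illustration of the latter, here is a minimal sketch (on placeholder random data, assuming the forestci API of the time) of obtaining per-prediction variance estimates for a Random Forest:

```python
import numpy as np
import forestci as fci
from sklearn.ensemble import RandomForestRegressor

# Placeholder data, purely for illustration
rng = np.random.RandomState(0)
X_train, y_train = rng.rand(100, 10), rng.rand(100)
X_test = rng.rand(20, 10)

forest = RandomForestRegressor(n_estimators=500, random_state=0)
forest.fit(X_train, y_train)

y_pred = forest.predict(X_test)
# Bias-corrected infinitesimal jackknife variance, one value per prediction
y_var = fci.random_forest_error(forest, X_train, X_test)
```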
All the data preparation, experiments, and visualisations were done in Python.
To convert molecules from their SMILES string representations to either Molecular Descriptors or Extended-Connectivity Fingerprints, I used the open-source cheminformatics software, RDKit (GitHub).
Wu et al. suggest using their Python library, DeepChem (GitHub), to reproduce the results. I decided not to use it, since its API only gives the user high-level access, while I wanted more control over the implementation. To keep the results comparable, I used the tools on which the DeepChem library is built.
For most of the machine learning pipeline, I used Scikit-Learn (GitHub) for preprocessing, splitting, modelling, prediction, and validation. To obtain the confidence intervals for Random Forests, I used the forestci (GitHub) extension for Scikit-Learn. The custom Tanimoto (Jaccard) kernel for Gaussian Process Regression, and all subsequent GP experiments, were implemented with GPflow (GitHub).
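For reference, below is a minimal sketch of such a Tanimoto kernel in GPflow, in the spirit of the Photoswitch Dataset paper; the fingerprints and labels are random placeholders, not the actual experimental set-up:

```python
import numpy as np
import tensorflow as tf
import gpflow


class Tanimoto(gpflow.kernels.Kernel):
    """Tanimoto (Jaccard) similarity kernel for bit-vector fingerprints."""

    def __init__(self):
        super().__init__()
        self.variance = gpflow.Parameter(1.0, transform=gpflow.utilities.positive())

    def K(self, X, X2=None):
        if X2 is None:
            X2 = X
        cross = tf.matmul(X, X2, transpose_b=True)     # <x, x'>
        sq_x = tf.reduce_sum(tf.square(X), axis=-1)    # ||x||^2
        sq_x2 = tf.reduce_sum(tf.square(X2), axis=-1)  # ||x'||^2
        denominator = sq_x[:, None] + sq_x2[None, :] - cross
        return self.variance * cross / denominator

    def K_diag(self, X):
        # k(x, x) = variance for any non-zero bit vector
        return self.variance * tf.ones(tf.shape(X)[:-1], dtype=X.dtype)


# Random placeholder fingerprints and labels, purely for illustration
X = np.random.binomial(1, 0.1, size=(50, 1024)).astype(np.float64)
y = np.random.randn(50, 1)

model = gpflow.models.GPR(data=(X, y), kernel=Tanimoto())
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)
mean, var = model.predict_f(X)  # per-prediction mean and variance
```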
In this section I outline the set-up steps required to start reproducing my results. It covers the following stages:
- Directory set-up;
- Creating an environment with conda;
- Data preparation; and
- Creation of features.
This section describes how to set up your directory, either via `git clone` or manually.
To make the reproduction process as simple as possible, clone this repository to your local machine. To do this, run the following command in your terminal/command prompt:
>>> git clone https://github.com/GeorgeBatch/moleculenet.git
If you decide not to clone the repository from GitHub but still want to reproduce the results, choose a directory where you will store the data and code, and organise it as follows:
- moleculenet
  |---- data
  |---- figures
  |---- notebooks
  |---- scripts
  |---- results
  |---- environment.yml
Download the Physical Chemistry datasets from the MoleculeNet datasets page.
Populate the `~/data/` directory with these six files:
- from the FreeSolv folder: `SAMPL.csv`, `FreeSolv_README`
- from the ESOL folder: `delaney-processed.csv`, `ESOL_README`
- from the lipophilicity folder: `Lipophilicity.csv`, `Lipo_README`
Rename the CSV files as follows:
- `SAMPL.csv` to `freesolv_original.csv`
- `delaney-processed.csv` to `esol_original.csv`
- `Lipophilicity.csv` to `lipophilicity_original.csv`
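For example, assuming the three CSV files are already in the `~/data/` directory, the renaming can be done from a terminal:
>>> mv SAMPL.csv freesolv_original.csv
>>> mv delaney-processed.csv esol_original.csv
>>> mv Lipophilicity.csv lipophilicity_original.csv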
In the root (`moleculenet`) directory, create a project environment from the `environment.yml` file using:
>>> conda env create -f environment.yml
The environment's name is `batch-msc`, and we activate it using:
>>> conda activate batch-msc
Conda environments make managing Python library dependencies and reproducing research much easier. Another reason why we use conda is that some packages, e.g. RDKit: Open-Source Cheminformatics Software, are not available via `pip install`.
This section covers two data preparation stages: standardising input files and producing the features.
To automate the process of working with three different datasets (ESOL, FreeSolv, and Lipophilicity), we standardise the column names from the original CSV files and store the results in new CSV files.
We need to get hold of the ID/name, SMILES string representation, and measured label value for each compound in all three datasets. To do this, run the following commands in the `~/scripts/` directory:
>>> python get_original_id_smiles_labels_lipophilicity.py
>>> python get_original_id_smiles_labels_esol.py
>>> python get_original_id_smiles_labels_freesolv.py
The resulting files are saved in the `~/data/` directory:
- `esol_original_IdSmilesLabels.csv`, `esol_original_extra_features.csv`
- `freesolv_original_IdSmilesLabels.csv`
- `lipophilicity_original_IdSmilesLabels.csv`
Note: the original file for the ESOL dataset contained extra features, which we save here as well.
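For illustration, here is a minimal sketch of the standardisation step for one dataset; the original column names ("iupac", "smiles", "expt") are assumptions based on the FreeSolv file, so check the actual CSV headers before running:

```python
import pandas as pd

# Illustrative sketch: standardise the FreeSolv column names.
# The original names ("iupac", "smiles", "expt") are assumptions;
# check the actual CSV headers before running.
df = pd.read_csv("../data/freesolv_original.csv")
df = df.rename(columns={"iupac": "ID", "smiles": "SMILES", "expt": "Label"})
df[["ID", "SMILES", "Label"]].to_csv(
    "../data/freesolv_original_IdSmilesLabels.csv", index=False
)
```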
We show how to produce the features and store them in CSV files.
From the SMILES string representations of the molecules in all three datasets, we compute Extended-Connectivity Fingerprints and RDKit Molecular Descriptors to use as features. We do this once, at the very beginning, and never need to worry about it again.
Note that we produce four different versions of Extended-Connectivity Fingerprints (an illustrative sketch follows the list):
- ECFP_4 hashed with 1024 bits
- ECFP_6 hashed with 1024 bits
- ECFP_4 hashed with 2048 bits
- ECFP_6 hashed with 2048 bits
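All four variants can be computed with RDKit's Morgan fingerprint implementation, where ECFP_{2r} corresponds to a Morgan fingerprint of radius r. A minimal sketch for a single molecule (the scripts below do this for every molecule in each dataset):

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CCO")  # ethanol, purely for illustration

# ECFP_4 -> radius 2; ECFP_6 -> radius 3
fingerprints = {}
for diameter, radius in [(4, 2), (6, 3)]:
    for n_bits in [1024, 2048]:
        bit_vect = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        fingerprints[f"ECFP_{diameter}_{n_bits}"] = np.array(bit_vect)
```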
To compute and record the features, run the corresponding commands in the `~/scripts/` directory:
>>> python get_all_fingerprints_for_all_datasets.py
>>> python get_rdkit_descriptors_for_all_datasets.py
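Similarly, here is a minimal sketch of computing the RDKit Molecular Descriptors for a single molecule (illustrative only; the script above iterates over all datasets):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.ML.Descriptors.MoleculeDescriptors import MolecularDescriptorCalculator

# Build a calculator over the full list of RDKit descriptors
descriptor_names = [name for name, _ in Descriptors.descList]
calculator = MolecularDescriptorCalculator(descriptor_names)

mol = Chem.MolFromSmiles("CCO")  # ethanol, purely for illustration
values = calculator.CalcDescriptors(mol)  # tuple aligned with descriptor_names
```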