Experiment replication of Rahman Pujianto's master's thesis research (Universitas Indonesia, 2017).
Prepared (trainable) datasets are provided in dataset/dataset.tar.gz. The information below explains how to prepare the dataset from raw sources (*.sdf or *.mol2 files).
Required tools:
- OpenBabel http://openbabel.org
- PaDEL Descriptor http://www.yapcwsoft.com/dd/padeldescriptor/
Extracting positive (label = 1) training data:
- Convert sdf to mol2: obabel ../dataset/pubchem-compound-active-hiv1-protease.sdf -O ../dataset/pubchem-compound-active-hiv1-protease_mol2/hiv1-protease.mol2
- Convert mol2 to csv: java -jar PaDEL-Descriptor.jar -2d -addhydrogens -removesalt -dir ../dataset/pubchem-compound-active-hiv1-protease_mol2/ -file ../dataset/pubchem-compound-active-hiv1-protease.csv
Extracting negative (label = 0) training data:
- Convert sdf to mol2: obabel ../dataset/decoys_final.sdf -O ../dataset/decoys_final_mol2/decoys_final.mol2
- Convert mol2 to csv: java -jar PaDEL-Descriptor.jar -2d -addhydrogens -removesalt -dir ../dataset/decoys_final_mol2/ -file ../dataset/decoys.csv
Extracting test data (unlabeled):
- Convert mol2 to csv: java -jar PaDEL-Descriptor.jar -2d -addhydrogens -removesalt -dir ../dataset/HerbalDB_mol2/ -file ../dataset/HerbalDB.csv
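The conversion commands above follow the same two-step pattern for each dataset, so they can be scripted. The sketch below only builds the command lines as argument lists and prints them. The helper names obabel_cmd, padel_cmd and run_all are hypothetical and not part of this repo, and it assumes obabel and PaDEL-Descriptor.jar are reachable from the working directory. The flags and paths are copied from the positive-data commands above.

```python
import subprocess


def obabel_cmd(sdf_path, mol2_path):
    """Build the OpenBabel command that converts an sdf file to mol2."""
    return ["obabel", sdf_path, "-O", mol2_path]


def padel_cmd(mol2_dir, csv_path, jar="PaDEL-Descriptor.jar"):
    """Build the PaDEL-Descriptor command that writes 2D descriptors to csv."""
    return ["java", "-jar", jar, "-2d", "-addhydrogens", "-removesalt",
            "-dir", mol2_dir, "-file", csv_path]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


base = "../dataset/pubchem-compound-active-hiv1-protease"
commands = [
    obabel_cmd(base + ".sdf", base + "_mol2/hiv1-protease.mol2"),
    padel_cmd(base + "_mol2/", base + ".csv"),
]
run_all(commands)  # dry run: prints the commands without executing them
```

Passing dry_run=False would execute the commands in order and stop at the first failure, since check=True raises on a non-zero exit code.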
Datasets provided in this repo:
- dataset/dataset.tar.gz:
  - dataset.csv: 3,665 HIV-1 protease inhibitors from PubChem BioAssay + 3,665 HIV-1 protease decoys from DUD-E (Mysinger, Carchia, Irwin, & Shoichet, 2012)
  - dataset_test.csv: 10 compounds from the top 10 protease inhibitors in the Indonesian herbal database (Yanuar et al., 2014)
- dataset/daftar-senyawa-beserta-binding-energy.csv: docking results of 368 molecules from the Indonesian herbal database (Yanuar et al., 2014) that the machine learning model in this research predicted to be HIV-1 protease inhibitors
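As a rough illustration of how a labeled training file could be assembled from the two descriptor files, the sketch below appends a label column (1 for inhibitors, 0 for decoys) and concatenates the rows. This is an assumption about the general approach, not the contents of 01-prepare-data.py. The column name label, the function merge_labeled, and the tiny in-memory inputs are all hypothetical.

```python
import csv
import io


def merge_labeled(positive_csv, negative_csv, out):
    """Concatenate two descriptor csv streams, appending a 0/1 label column."""
    writer = csv.writer(out, lineterminator="\n")
    first_header = None
    for stream, label in ((positive_csv, 1), (negative_csv, 0)):
        reader = csv.reader(stream)
        header = next(reader)
        if first_header is None:
            first_header = header
            writer.writerow(header + ["label"])
        elif header != first_header:
            raise ValueError("descriptor columns differ between files")
        for row in reader:
            writer.writerow(row + [label])


# Tiny stand-ins for the PaDEL outputs (real files have far more columns).
active = io.StringIO("Name,nAtomP,nsOH\nmolA,3,2\nmolB,5,1\n")
decoys = io.StringIO("Name,nAtomP,nsOH\nmolX,0,0\n")
merged = io.StringIO()
merge_labeled(active, decoys, merged)
print(merged.getvalue())
```

The header check guards against merging files whose descriptor columns are not in the same order, which would silently misalign features.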
Raw datasets (*.sdf and *.mol2) can be downloaded at https://drive.google.com/open?id=1X_wkpvSLXXXUPbxmFd7tE5pe0t_njMe_
Dependencies:
- Python 3.x
- Python3-tk (on Ubuntu: sudo apt install python3-tk)
- Virtualenv (optional, for an isolated environment)
Dependency library installation: pip install -r requirements.txt
Steps:
- Extract preprocessed data from dataset/dataset.tar.gz (if you have raw csv data, use python 01-prepare-data.py)
- Feature selection with SVM-RFE: python 02-feature-selection-svm-rfe.py
- Feature selection with Wrapper Method (GA + SVM): python 02-feature-selection-wm.py
- Evaluate selected features using the PubChem dataset: python 03-evaluate-1.py
- Evaluate selected features using the Indonesian Herbal dataset: python 03-evaluate-2.py
Evaluation scripts display accuracy scores in the console, save raw results in csv files, and display result chart(s) on screen.
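To make the SVM-RFE step more concrete: recursive feature elimination repeatedly trains a linear SVM, drops the feature with the smallest absolute weight, and retrains on the rest. The pure-Python sketch below illustrates that loop on a toy dataset. It is not the code in 02-feature-selection-svm-rfe.py. The trainer (batch subgradient descent on the hinge loss, no bias term), the hyperparameters, and the data are illustrative assumptions. In practice, scikit-learn's RFE class wrapped around a linear SVM is a common route.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit weights w (no bias) by batch subgradient descent on the hinge loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [lam * wj for wj in w]  # gradient of the L2 penalty
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1:  # sample violates the margin
                for j in range(d):
                    grad[j] -= yi * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def svm_rfe(X, y, n_keep=1):
    """Return (kept feature indices, eliminated indices in removal order)."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > n_keep:
        X_sub = [[row[j] for j in remaining] for row in X]
        w = train_linear_svm(X_sub, y)
        # Drop the feature whose weight has the smallest magnitude.
        weakest = min(range(len(w)), key=lambda j: abs(w[j]))
        eliminated.append(remaining.pop(weakest))
    return remaining, eliminated


# Toy data: feature 0 separates the classes, features 1 and 2 are uninformative.
X = [[2, 1, 0.5], [2, -1, 0.5], [2, 1, -0.5], [2, -1, -0.5],
     [-2, 1, 0.5], [-2, -1, 0.5], [-2, 1, -0.5], [-2, -1, -0.5]]
y = [1, 1, 1, 1, -1, -1, -1, -1]
kept, eliminated = svm_rfe(X, y)
print("kept:", kept, "eliminated:", eliminated)
```

Recording accuracy at each elimination step, rather than only the final survivors, is what produces an accuracy-versus-feature-count curve.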
PubChem dataset visualization using t-SNE (generated by running python visualize-dataset.py).
Top 10 PubChem feature importance ranking (using Extra Trees):
- feature 520 (maxsOH): 0.08817
- feature 401 (minsOH): 0.06929
- feature 282 (SsOH): 0.02738
- feature 110 (nHsOH): 0.02464
- feature 163 (nsOH): 0.0211
- feature 35 (BCUTw-1l): 0.01432
- feature 467 (maxHsOH): 0.01308
- feature 406 (minsOm): 0.01307
- feature 588 (nAtomP): 0.01254
- feature 142 (nsssCH): 0.01231
SVM-RFE also showed that even a single feature of the PubChem dataset already gives > 80% accuracy (generated by running python 02-feature-selection-svm-rfe.py).
Comparison of classification metrics for Linear SVM (no feature selection), Linear SVM + RFE, and SVM + Wrapper Method (WM) on the PubChem dataset, generated by running python 03-evaluate-1.py:
SVM + RFE Accuracy: 0.9898
SVM + WM Accuracy: 0.9883
SVM Accuracy: 0.9894
Evaluation using selected features on the top 10 herbal data (Yanuar et al., 2014), generated by running python 03-evaluate-2.py:
SVM + RFE Accuracy: 0.5000
SVM + WM Accuracy: 0.6000
SVM Accuracy: 0.5000
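With only 10 test compounds, each accuracy value above corresponds to a whole number of correct predictions (0.6000 means 6 of 10). A minimal sketch of that calculation, using made-up label vectors rather than the actual model outputs:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels for 10 compounds (not the repository's real predictions).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
print(f"Accuracy: {accuracy(y_true, y_pred):.4f}")
```

On a test set this small, one compound changes the score by 0.1, so the gap between 0.5000 and 0.6000 is a single prediction.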
K-Means silhouette analysis, generated by running python 04-evaluate_herbaldb.py:
For n_clusters = 2 The average silhouette_score is : 0.9359476619168386
For n_clusters = 3 The average silhouette_score is : 0.5895330105970298
For n_clusters = 4 The average silhouette_score is : 0.5863432755171203
For n_clusters = 5 The average silhouette_score is : 0.5664882163184368
The larger the silhouette_score, the better the choice of n_clusters.
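For reference, the silhouette coefficient of a sample is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. The reported score is the average over all samples. The pure-Python sketch below computes it for 1-D toy points. The data and the function are illustrative only; the repository's script presumably relies on a library implementation.

```python
def silhouette_score(points, labels):
    """Mean silhouette coefficient for 1-D points with given cluster labels."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    total = 0.0
    for p, c in zip(points, labels):
        own = clusters[c]
        if len(own) == 1:
            continue  # by convention a singleton cluster contributes 0
        # a: mean distance to the other members of the same cluster
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to any other cluster
        b = min(sum(abs(p - q) for q in members) / len(members)
                for other, members in clusters.items() if other != c)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


points = [0.0, 1.0, 10.0, 11.0]
good = silhouette_score(points, [0, 0, 1, 1])  # compact, well separated
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters interleaved
print(f"good={good:.4f} bad={bad:.4f}")
```

The well-separated labeling scores close to 1, while the interleaved labeling scores below 0, which is the contrast the n_clusters comparison relies on.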
K-Means clustering accuracy on the top 10 herbal data (Yanuar et al., 2014), generated by running python 04-clustering_herbaldb.py:
K-Means accuracy on top 10 data (Yanuar et al., 2014): 0.30
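Cluster ids produced by K-Means are arbitrary, so an accuracy figure depends on how clusters are matched to class labels. The source does not say how 04-clustering_herbaldb.py does this. As one possible convention, the sketch below tries every one-to-one assignment of cluster ids to labels and reports the best. The function name, the default classes (0, 1), and the data are hypothetical.

```python
from itertools import permutations


def best_mapping_accuracy(y_true, cluster_ids, classes=(0, 1)):
    """Best accuracy over all one-to-one maps from cluster ids to class labels."""
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes; no one-to-one map")
    best = 0.0
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[c] == t for c, t in zip(cluster_ids, y_true))
        best = max(best, hits / len(y_true))
    return best


# Hypothetical labels and cluster ids, not results from this repository.
y_true = [1, 1, 0, 0, 1]
cluster_ids = [0, 0, 1, 1, 1]
print(best_mapping_accuracy(y_true, cluster_ids))
```

With two clusters and binary labels, this convention can never score below 0.5, so a figure such as 0.30 would have to come from a different matching rule, for example comparing raw cluster ids with labels directly.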
@mastersthesis{pujianto2017thesis,
author={Rahman {Pujianto}},
title={Drug Candidates Virtual Screening on Indonesian Herbal Plants Database using Machine Learning and Various Feature Selection Strategies},
school={Universitas Indonesia},
year={2017},
}