This repository has been archived by the owner on Jul 8, 2023. It is now read-only.

Experiment replication of Rahman Pujianto master thesis research (Universitas Indonesia, 2017)

yohanesgultom/drug-discovery-feature-selection

Drug Discovery Feature Selections

Experiment replication of Rahman Pujianto's master's thesis research (Universitas Indonesia, 2017).

Dataset Preparation

Prepared (trainable) datasets are provided in dataset/dataset.tar.gz. The steps below are additional information on how to prepare the datasets from raw sources (*.sdf or *.mol2 files).

Required tools: Open Babel (provides the obabel command) and PaDEL-Descriptor (a Java application, run with java -jar).

Extracting positive (label = 1) training data:

  1. Convert sdf to mol2: obabel ../dataset/pubchem-compound-active-hiv1-protease.sdf -O ../dataset/pubchem-compound-active-hiv1-protease_mol2/hiv1-protease.mol2
  2. Convert mol2 to csv: java -jar PaDEL-Descriptor.jar -2d -addhydrogens -removesalt -dir ../dataset/pubchem-compound-active-hiv1-protease_mol2/ -file ../dataset/pubchem-compound-active-hiv1-protease.csv

Extracting negative (label = 0) training data:

  1. Convert sdf to mol2: obabel ../dataset/decoys_final.sdf -O ../dataset/decoys_final_mol2/decoys_final.mol2
  2. Convert mol2 to csv: java -jar PaDEL-Descriptor.jar -2d -addhydrogens -removesalt -dir ../dataset/decoys_final_mol2/ -file ../dataset/decoys.csv

Extracting test data (unlabeled):

  1. Convert mol2 to csv: java -jar PaDEL-Descriptor.jar -2d -addhydrogens -removesalt -dir ../dataset/HerbalDB_mol2/ -file ../dataset/HerbalDB.csv
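
Once the positive and negative CSV files exist, they still need a label column before they can be used for training. The sketch below shows one way this merge could be done with the standard library only. It is not the repository's 01-prepare-data.py; the function name `merge_labeled`, the column name `label`, and the tiny in-memory sample rows are all hypothetical, and both inputs are assumed to share the same header.

```python
import csv
import io


def merge_labeled(positive_csv, negative_csv, out, label_column="label"):
    """Write rows from both inputs to `out`, appending 1 (active) or 0 (decoy).

    Assumes both inputs have identical headers; `label_column` is a
    hypothetical name chosen for this sketch.
    """
    writer = None
    n_rows = 0
    for handle, label in ((positive_csv, 1), (negative_csv, 0)):
        reader = csv.DictReader(handle)
        for row in reader:
            if writer is None:
                # Build the output header from the first file that has rows.
                fieldnames = list(reader.fieldnames) + [label_column]
                writer = csv.DictWriter(out, fieldnames=fieldnames)
                writer.writeheader()
            row[label_column] = label
            writer.writerow(row)
            n_rows += 1
    return n_rows


# Made-up rows standing in for PaDEL-Descriptor output.
active = io.StringIO("Name,nAtom,nsOH\nmol_a,30,2\n")
decoys = io.StringIO("Name,nAtom,nsOH\nmol_b,25,0\nmol_c,41,1\n")
merged = io.StringIO()
count = merge_labeled(active, decoys, merged)
print(merged.getvalue())
```

With real files, the `io.StringIO` objects would be replaced by handles opened with `open(path, newline="")`.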

Dataset Description

Datasets provided in this repo:

  1. dataset/dataset.tar.gz:
    1. dataset.csv: 3,665 HIV-1 protease inhibitors from PubChem BioAssay + 3,665 HIV-1 protease decoys from DUD-E (Mysinger, Carchia, Irwin, & Shoichet, 2012)
    2. dataset_test.csv: the top 10 protease inhibitors from the Indonesian herbal database (Yanuar et al., 2014)
  2. dataset/daftar-senyawa-beserta-binding-energy.csv: docking results of 368 molecules from the Indonesian herbal database (Yanuar et al., 2014) that the machine learning model in this research predicted to be HIV-1 protease inhibitors

Raw datasets (*.sdf and *.mol2) can be downloaded at https://drive.google.com/open?id=1X_wkpvSLXXXUPbxmFd7tE5pe0t_njMe_

Experiments

Dependencies:

  • Python 3.x
  • Python3-tk (on Ubuntu: sudo apt install python3-tk)
  • Virtualenv (optional, for an isolated environment)

Install the library dependencies: pip install -r requirements.txt

Steps:

  1. Extract the preprocessed data from dataset/dataset.tar.gz (if you have raw csv data, use python 01-prepare-data.py instead)
  2. Feature selection with SVM-RFE: python 02-feature-selection-svm-rfe.py
  3. Feature selection with the Wrapper Method (GA + SVM): python 02-feature-selection-wm.py
  4. Evaluate the selected features on the PubChem dataset: python 03-evaluate-1.py
  5. Evaluate the selected features on the Indonesian herbal dataset: python 03-evaluate-2.py

The evaluation scripts print accuracy scores to the console, save raw results to csv files, and display the result chart(s) on screen.

PubChem Dataset Analysis

PubChem dataset visualizations using t-SNE. Generated by running python visualize-dataset.py:

PubChem t-SNE perplexity=5

PubChem t-SNE perplexity=100
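
For readers unfamiliar with how such projections are produced, here is a minimal sketch of running t-SNE at the two perplexity values named above, using scikit-learn. It is not visualize-dataset.py: the data is a small synthetic stand-in generated with `make_classification`, and the sample count, `init="pca"`, and `random_state` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE

# Synthetic stand-in for the descriptor matrix (assumption: not the real dataset).
# n_samples must exceed the largest perplexity used below.
X, y = make_classification(
    n_samples=120, n_features=20, n_informative=5, random_state=0
)

embeddings = {}
for perplexity in (5, 100):
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    # Each row of the result is a 2-D coordinate for one sample.
    embeddings[perplexity] = tsne.fit_transform(X)
    print(perplexity, embeddings[perplexity].shape)
```

Each embedding can then be drawn as a scatter plot colored by `y`. A low perplexity emphasizes local neighborhoods, while a high one emphasizes broader structure.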

Top 10 PubChem feature importance ranking (using Extra Trees):

  1. feature 520 (maxsOH): 0.08817
  2. feature 401 (minsOH): 0.06929
  3. feature 282 (SsOH): 0.02738
  4. feature 110 (nHsOH): 0.02464
  5. feature 163 (nsOH): 0.0211
  6. feature 35 (BCUTw-1l): 0.01432
  7. feature 467 (maxHsOH): 0.01308
  8. feature 406 (minsOm): 0.01307
  9. feature 588 (nAtomP): 0.01254
  10. feature 142 (nsssCH): 0.01231

PubChem Extra Trees feature importance
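
A ranking like the one above can be obtained from the `feature_importances_` attribute of a fitted Extra Trees model. The following is a sketch under assumptions, not the repository's script: it uses synthetic data, and `n_estimators=100` and `random_state=0` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (assumption: the real input is the PubChem descriptor table).
X, y = make_classification(
    n_samples=300, n_features=30, n_informative=5, random_state=0
)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalized so that they sum to 1.
importances = forest.feature_importances_

# Column indices of the 10 most important features, highest first.
ranking = np.argsort(importances)[::-1][:10]
for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}. feature {idx}: {importances[idx]:.5f}")
```

With a real descriptor table, `idx` would be mapped back to the column name to obtain labels such as the descriptor names shown in the list.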

SVM-RFE also shows that even a single feature from the PubChem dataset already gives > 80% accuracy. Generated by running python 02-feature-selection-svm-rfe.py:

PubChem Linear SVM + RFE Accuracy per feature set
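
To illustrate how an accuracy-per-feature-set curve of this kind may be computed, the sketch below combines scikit-learn's `RFE` with a linear SVM and cross-validation. It is an assumption-laden example rather than the contents of 02-feature-selection-svm-rfe.py: the data is synthetic, and the helper name `rfe_cv_accuracy`, the scaling step, the 5-fold setup, and the feature counts tried are all choices made here.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption: not the PubChem dataset).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=4, n_redundant=0, random_state=0
)


def rfe_cv_accuracy(X, y, n_features):
    """Mean cross-validated accuracy of a linear SVM on `n_features` RFE-selected features."""
    selector = RFE(
        LinearSVC(dual=False, max_iter=5000),
        n_features_to_select=n_features,
        step=1,  # drop one feature per elimination round
    )
    # Selection happens inside the pipeline so each fold selects on its own training split.
    pipeline = make_pipeline(
        StandardScaler(), selector, LinearSVC(dual=False, max_iter=5000)
    )
    return cross_val_score(pipeline, X, y, cv=5, scoring="accuracy").mean()


accuracy_per_size = {k: rfe_cv_accuracy(X, y, k) for k in (1, 2, 5, 10)}
for k, acc in accuracy_per_size.items():
    print(f"{k} feature(s): accuracy = {acc:.4f}")
```

Keeping the selector inside the pipeline avoids leaking information from the validation folds into the feature ranking.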

Comparison of classification metrics for Linear SVM (no feature selection), Linear SVM + RFE, and SVM + Wrapper Method (WM) on the PubChem dataset. Generated by running python 03-evaluate-1.py:

Receiver Operating Characteristic (ROC) Curves

Classification Accuracy, Sensitivity, Precision and Specificity

SVM + RFE Accuracy: 0.9898
SVM + WM Accuracy: 0.9883
SVM Accuracy: 0.9894
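
For reference, the metrics reported by the evaluation scripts can all be derived from the four confusion-matrix counts. The sketch below computes accuracy, sensitivity and precision, plus specificity, which is assumed here as the usual fourth metric. The function name and the sample labels are hypothetical and are not taken from the repository.

```python
def classification_metrics(y_true, y_pred):
    """Compute binary classification metrics, treating label 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Return NaN instead of raising when a class is absent.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # recall on the positive class
        "precision": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),  # recall on the negative class
    }


# Made-up labels: 4 actives and 4 decoys.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
# → {'accuracy': 0.625, 'sensitivity': 0.75, 'precision': 0.6, 'specificity': 0.5}
```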

HerbalDB Dataset Analysis

Evaluation using selected features on top 10 herbal data (Yanuar et al., 2014). Generated by running python 03-evaluate-2.py:

Classification Accuracy, Sensitivity, Precision and Specificity

SVM + RFE Accuracy: 0.5000
SVM + WM Accuracy: 0.6000
SVM Accuracy: 0.5000

K-Means silhouette analysis. Generated by running python 04-evaluate_herbaldb.py:

Silhouette Analysis

For n_clusters = 2 The average silhouette_score is : 0.9359476619168386
For n_clusters = 3 The average silhouette_score is : 0.5895330105970298
For n_clusters = 4 The average silhouette_score is : 0.5863432755171203
For n_clusters = 5 The average silhouette_score is : 0.5664882163184368

The higher the silhouette_score, the better the corresponding n_clusters fits the data.

K-Means clustering accuracy on top 10 herbal data (Yanuar et al., 2014). Generated by running python 04-clustering_herbaldb.py:
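
The loop behind output of this shape can be sketched as follows with scikit-learn's `KMeans` and `silhouette_score`. This is an illustration on synthetic blobs, not the repository's script; the data, `n_init=10`, and `random_state=0` are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (assumption: the real input is the HerbalDB descriptor table).
X, _ = make_blobs(n_samples=150, centers=2, n_features=5, random_state=0)

scores = {}
for n_clusters in range(2, 6):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette coefficient over all samples, in the range [-1, 1].
    scores[n_clusters] = silhouette_score(X, labels)
    print(
        f"For n_clusters = {n_clusters} "
        f"The average silhouette_score is : {scores[n_clusters]}"
    )

# Pick the cluster count with the highest average silhouette score.
best_n_clusters = max(scores, key=scores.get)
print("Best n_clusters:", best_n_clusters)
```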

K-Means accuracy on top 10 data (Yanuar et al., 2014): 0.30

Citation

@mastersthesis{pujianto2017thesis,
	author={Rahman {Pujianto}},
    title={Drug Candidates Virtual Screening on Indonesian Herbal Plants Database using Machine Learning and Various Feature Selection Strategies},
	school={Universitas Indonesia},
	year={2017},
}
