- Title: Facilitating Prediction of Adverse Drug Reactions by Using Knowledge Graphs and Multi-Label Learning Models
- Authors: Emir Muñoz, Vít Novácek, Pierre-Yves Vandenbussche
- Contact: Emir Muñoz, emir.munoz@insight-centre.org
- URL: http://purl.org/bib-adr-prediction/
This page describes the data sets used in the manuscript, all of which are made available for download. These data sets were used to evaluate all approaches reviewed in the manuscript.
All data set files are publicly available for download at https://doi.org/10.6084/m9.figshare.4823203.
This data set was originally proposed by Liu et al. (2012) [1] and later processed by Zhang et al. (2015) [2] and Zhang et al. (2016) [3] for machine learning. Liu's data set contains 832 drugs with 2892 features and 1385 ADRs.
The results obtained using this data set are in Table 4 and Table 5 of the article.
Folder: liu/
Files:
- `Liu_drug_lists.csv`: list of 832 drugs. The file is in `csv` format: `DrugBank ID, Drug Name, PubChem ID`.
- `Liu_dataset.mat`: a file with the features for each drug. The file uses MATLAB `mat` format and contains a dictionary with the feature names and values for each of the 832 drugs. Drugs are represented by binary vectors whose elements encode the presence or absence of each feature as 1 or 0, respectively.
- `feature_description/`: folder that contains the description of each feature mentioned above.
  - `chemical_feature_index.txt` (see `pubchem_fingerprints.txt` for a description)
  - `enzyme_feature_index.txt`
  - `pathway_feature_index.txt`
  - `target_feature_index.txt`
  - `transporter_feature_index.txt`
  - `treatment_feature_index.txt`
  - `sideeffect_index.txt`
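The drug list can be read with Python's standard `csv` module. The following is a minimal sketch, assuming the file has no header row and exactly the three columns named above; the `Drug` tuple and the `read_drug_list` helper are illustrative names, not part of the data set.

```python
import csv
from typing import NamedTuple


class Drug(NamedTuple):
    drugbank_id: str
    name: str
    pubchem_id: str


def read_drug_list(path, has_header=False):
    """Read (DrugBank ID, Drug Name, PubChem ID) rows from a CSV file.

    Assumption: three columns in that order. Set has_header=True if the
    first row holds column names rather than a drug.
    """
    drugs = []
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the column-name row
        for row in reader:
            if not row:
                continue  # ignore blank lines
            drugbank_id, name, pubchem_id = (cell.strip() for cell in row[:3])
            drugs.append(Drug(drugbank_id, name, pubchem_id))
    return drugs
```

If these assumptions hold, `read_drug_list("liu/Liu_drug_lists.csv")` should return 832 entries.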
The feature types, sources, and IDs are described as follows:
Feature type | Specific feature | Source | ID | Dimension | Dictionary key |
---|---|---|---|---|---|
Chemical | Substructures | PubChem | Substructure Fingerprints* | 881 | chemical |
Biological | Targets | DrugBank | GeneBank Gene IDs | 786 | Targets |
Biological | Transporters | DrugBank | HGNC IDs | 72 | Transporters |
Biological | Enzymes | DrugBank | GeneBank Gene IDs | 111 | Enzymes |
Biological | Pathways | KEGG | KEGG IDs | 173 | Pathways |
Phenotypic | Treatment indications | SIDER | CUI disease code | 869 | Treatment |
Label | Side effects | SIDER | CUI disease code | 1385 | side_effect |
(*) A full description of PubChem Substructure Fingerprints can be found at ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt
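As a quick consistency check, the six feature dimensions in the table add up to the 2892 features reported for Liu's data set. The sketch below also derives a column range for each feature type, assuming the blocks are concatenated in table order; that ordering is a choice made for this illustration, not something the files prescribe.

```python
# Per-type feature dimensions, copied from the table above.
LIU_DIMENSIONS = {
    "chemical": 881,
    "Targets": 786,
    "Transporters": 72,
    "Enzymes": 111,
    "Pathways": 173,
    "Treatment": 869,
}


def column_ranges(dimensions):
    """Map each key to its half-open (start, stop) column range.

    Assumption: blocks are concatenated in the dictionary's order.
    """
    ranges, start = {}, 0
    for key, width in dimensions.items():
        ranges[key] = (start, start + width)
        start += width
    return ranges


ranges = column_ranges(LIU_DIMENSIONS)
total = sum(LIU_DIMENSIONS.values())
print(total)              # 2892
print(ranges["Targets"])  # (881, 1667)
```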
We consider the list of drugs from Liu's data set but not their features. Instead, we extract the features from the Knowledge Graph generated from the Bio2RDF v1 DrugBank and SIDER data sets (Muñoz et al., 2016) [4]. This generates 30161 features for the 832 drugs, and we consider the same set of 1385 ADRs as in Liu's data set.
The original Bio2RDF RDF files can be downloaded at http://purl.org/bib-adr-prediction/data. For details on the feature extraction from those files, please check the supplemental material of the article.
The results obtained using this data set are in Table 6 of the article.
Folder: bio2rdf_v1/
Files:
- `matrices.mat`: contains the design matrix `X` and the target matrix `y` that are passed to the machine learning methods.
- `X_column_labels.json`: enumerates the 30161 feature labels extracted from the Bio2RDF v1 data set.
- `X_row_labels.json`: list of the 832 drugs, giving the ID for each row of matrix `X`.
- `y_column_labels.json`: list of the 1385 ADRs, giving the ID for each column of matrix `y`.
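The files above can be loaded together and checked against each other. This is a minimal sketch, assuming `matrices.mat` is stored in a MATLAB format that `scipy.io.loadmat` can read (v7.2 or earlier) and that each JSON file holds one entry per row or column; the helper names `load_labels`, `check_shapes`, and `load_dataset` are hypothetical.

```python
import json
from pathlib import Path


def load_labels(folder):
    """Load the three JSON label files that accompany matrices.mat."""
    folder = Path(folder)
    labels = {}
    for name in ("X_row_labels", "X_column_labels", "y_column_labels"):
        with open(folder / f"{name}.json", encoding="utf-8") as handle:
            labels[name] = json.load(handle)
    return labels


def check_shapes(x_shape, y_shape, labels):
    """Raise ValueError if matrix shapes and label counts disagree."""
    n_drugs = len(labels["X_row_labels"])
    expected_x = (n_drugs, len(labels["X_column_labels"]))
    expected_y = (n_drugs, len(labels["y_column_labels"]))
    if tuple(x_shape) != expected_x:
        raise ValueError(f"X has shape {tuple(x_shape)}, expected {expected_x}")
    if tuple(y_shape) != expected_y:
        raise ValueError(f"y has shape {tuple(y_shape)}, expected {expected_y}")


def load_dataset(folder):
    """Load X, y and their labels from one data set folder.

    Assumption: the .mat file exposes the variables "X" and "y".
    """
    # Imported here so the helpers above work without SciPy installed.
    from scipy.io import loadmat

    mats = loadmat(str(Path(folder) / "matrices.mat"))
    labels = load_labels(folder)
    X, y = mats["X"], mats["y"]
    check_shapes(X.shape, y.shape, labels)
    return X, y, labels
```

For `bio2rdf_v1/`, the counts on this page imply shapes of (832, 30161) for `X` and (832, 1385) for `y`.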
We consider the list of drugs from Liu's data set but not their features. Instead, we extract the features from the Knowledge Graph generated from the Bio2RDF v2 DrugBank, SIDER and KEGG data sets. This generates 37368 features for the 832 drugs, and we consider the same set of 1385 ADRs as in Liu's data set.
The original Bio2RDF RDF files can be downloaded at http://purl.org/bib-adr-prediction/data. For details on the feature extraction from those files, please check the supplemental material of the article.
The results obtained using this data set are in Table 7 of the article.
Folder: bio2rdf_v2/
Files:
- `matrices.mat`: contains the design matrix `X` and the target matrix `y` that are passed to the machine learning methods.
- `X_column_labels.json`: enumerates the 37368 feature labels extracted from the Bio2RDF v2 data set.
- `X_row_labels.json`: list of the 832 drugs, giving the ID for each row of matrix `X`.
- `y_column_labels.json`: list of the 1385 ADRs, giving the ID for each column of matrix `y`.
We also consider the integration of features from both Liu's and the Bio2RDF v2 data sets for the 832 drugs. This generates 40260 features in total, which are used to train the machine learning models.
The results obtained using this data set are in Table 8 of the article.
Folder: liubio2rdf_v2/
Files:
- `matrices.mat`: contains the design matrix `X` and the target matrix `y` that are passed to the machine learning methods.
- `X_column_labels.json`: enumerates the 40260 feature labels extracted from Liu's and the Bio2RDF v2 data sets.
- `X_row_labels.json`: list of the 832 drugs, giving the ID for each row of matrix `X`.
- `y_column_labels.json`: list of the 1385 ADRs, giving the ID for each column of matrix `y`.
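The feature counts are consistent with a column-wise concatenation of the two sources: 2892 (Liu) + 37368 (Bio2RDF v2) = 40260. The sketch below illustrates such a concatenation aligned on drug ID. It assumes the integration is a plain concatenation, which this page does not state explicitly, and all IDs and values in the toy example are hypothetical.

```python
def concatenate_features(ids_a, rows_a, ids_b, rows_b):
    """Join two feature matrices column-wise, aligned on drug ID.

    ids_* are lists of drug IDs; rows_* are the matching lists of feature
    vectors. The output follows the row order of the first matrix.
    """
    if set(ids_a) != set(ids_b):
        raise ValueError("both matrices must describe the same drugs")
    index_b = {drug_id: i for i, drug_id in enumerate(ids_b)}
    return [
        list(row) + list(rows_b[index_b[drug_id]])
        for drug_id, row in zip(ids_a, rows_a)
    ]


# Toy example with two drugs whose rows are stored in different orders.
merged = concatenate_features(
    ["D1", "D2"], [[1, 0], [0, 1]],
    ["D2", "D1"], [[1, 1, 1], [0, 0, 0]],
)
print(merged)  # [[1, 0, 0, 0, 0], [0, 1, 1, 1, 1]]

# Feature counts reported on this page.
print(2892 + 37368)  # 40260
```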
We also performed an independent evaluation using the SIDER 4 data set provided by Zhang et al. (2015) [2], which comprises a subset of the drugs from Liu's data set plus some newly added drugs.
Zhang, Wen; Liu, Feng; Luo, Longqiang; Zhang, Jingxia (2015): Predicting drug side effects by multi-label learning and ensemble learning. figshare. http://doi.org/10.6084/m9.figshare.c.3608738 Retrieved: 12:34, May 09, 2017 (GMT)
The results obtained using this data set are in Table 9 of the article.
Folder: sider4/
Files:
- `sider_test_dataset.mat`: contains the features for the 309 test set drugs.
- `sider_train_dataset.mat`: contains the features for the 771 training set drugs.
- `sider_test_dataset_drug_list.csv`: enumerates the 309 drugs in the test set.
- `sider_train_dataset_drug_list.csv`: enumerates the 771 drugs in the training set.
- `feature_description/`: folder that contains the description of each feature mentioned above.
  - `enzyme_feature_index.txt`
  - `pathway_feature_index.txt`
  - `target_feature_index.txt`
  - `transporter_feature_index.txt`
  - `treatment_feature_index.txt`
  - `sideeffect_index.txt`
The feature types, sources, and IDs are described as follows:
Feature type | Specific feature | Source | ID | Dimension | Dictionary key |
---|---|---|---|---|---|
Chemical | Substructures | PubChem | Substructure Fingerprints* | 881 | chemical |
Biological | Targets | DrugBank | GeneBank Gene IDs | 1046 | Targets |
Biological | Transporters | DrugBank | HGNC IDs | 96 | Transporters |
Biological | Enzymes | DrugBank | GeneBank Gene IDs | 160 | Enzymes |
Biological | Pathways | KEGG | KEGG IDs | 268 | Pathways |
Phenotypic | Treatment indications | SIDER | CUI disease code | 2537 | Treatment |
Label | Side effects | SIDER | CUI disease code | 5579 | side_effect |
(*) A full description of PubChem Substructure Fingerprints can be found at ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt
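The dictionary keys in the table can be used to assemble a single design matrix and a label matrix. The following sketch assumes that each dictionary entry is a dense 2-D array-like with one row per drug, in the same drug order for every key; the `assemble` helper and the toy values are illustrative only.

```python
FEATURE_KEYS = ("chemical", "Targets", "Transporters",
                "Enzymes", "Pathways", "Treatment")
LABEL_KEY = "side_effect"


def assemble(data, feature_keys=FEATURE_KEYS, label_key=LABEL_KEY):
    """Build (X, y) as lists of rows from a dict of per-type matrices.

    Assumption: every entry in `data` has one row per drug, and all
    entries list the drugs in the same order.
    """
    n_drugs = len(data[label_key])
    for key in feature_keys:
        if len(data[key]) != n_drugs:
            raise ValueError(
                f"{key!r} has {len(data[key])} rows, expected {n_drugs}"
            )
    X = []
    for i in range(n_drugs):
        row = []
        for key in feature_keys:
            row.extend(int(v) for v in data[key][i])
        X.append(row)
    y = [[int(v) for v in row] for row in data[label_key]]
    return X, y


# Toy dictionary with two drugs and two feature types.
toy = {
    "chemical": [[1, 0], [0, 1]],
    "Targets": [[1], [0]],
    "side_effect": [[0, 1], [1, 1]],
}
X, y = assemble(toy, feature_keys=("chemical", "Targets"))
print(X)  # [[1, 0, 1], [0, 1, 0]]
print(y)  # [[0, 1], [1, 1]]
```

With the dimensions listed for SIDER 4, concatenating all six feature types would give 881 + 1046 + 96 + 160 + 268 + 2537 = 4988 columns.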
As with Liu's data set, we consider the list of drugs in the SIDER 4 data set but not their features. Instead, we extract the features from the Knowledge Graph generated from the Bio2RDF v2 DrugBank, SIDER and KEGG data sets. This generates 43843 features for the 1080 drugs (771 for training and 309 for testing), and we consider the same set of 5579 ADRs as in the SIDER 4 data set.
The results obtained using this data set are in Table 10 of the article.
Folder: sider4bio2rdf_v2_sider/
Files:
- `matrices.mat`: contains the design matrices `X_train` and `X_test`, and the target matrices `y_train` and `y_test` that are passed to the machine learning methods.
- `X_train_row_labels.json`: list of the 771 drugs in the training set, giving the ID for each row of matrix `X_train`.
- `X_test_row_labels.json`: list of the 309 drugs in the test set, giving the ID for each row of matrix `X_test`.
- `y_column_labels.json`: list of the 5579 ADRs, giving the ID for each column of matrices `y_train` and `y_test`.
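The four matrices should agree on their dimensions: train and test share feature and ADR columns, and their row counts match the label files. A small sketch of that check follows; `check_split` is a hypothetical helper, and the shapes below are those implied by the counts on this page rather than values read from the file.

```python
def check_split(shapes, n_train, n_test, n_adrs):
    """Validate the shapes of a train/test split.

    `shapes` maps matrix names to (rows, columns) tuples.
    Returns the number of feature columns shared by X_train and X_test.
    """
    n_features = shapes["X_train"][1]
    expected = {
        "X_train": (n_train, n_features),
        "X_test": (n_test, n_features),
        "y_train": (n_train, n_adrs),
        "y_test": (n_test, n_adrs),
    }
    for name, shape in expected.items():
        if tuple(shapes[name]) != shape:
            raise ValueError(
                f"{name} has shape {tuple(shapes[name])}, expected {shape}"
            )
    return n_features


# Shapes implied by the counts reported on this page.
reported = {
    "X_train": (771, 43843),
    "X_test": (309, 43843),
    "y_train": (771, 5579),
    "y_test": (309, 5579),
}
print(check_split(reported, n_train=771, n_test=309, n_adrs=5579))  # 43843
print(771 + 309)  # 1080
```

With the real file, `shapes` could be built from the `.shape` attribute of each matrix after loading `matrices.mat`, for example with `scipy.io.loadmat`.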
Additionally, we evaluate the predictions on newly added ADRs that were discovered (reported) after the generation of the SIDER 4 data set. These relationships are published in the Aeolus data set, which is generated from the FAERS reports. The matrices have the same shapes as in SIDER 4, and we update the matrix `y_test` with drug-ADR relations from Aeolus.
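One way to picture this update is as setting additional entries of the test label matrix to 1 while keeping its shape. The sketch below assumes that "update" means adding positives for drug-ADR pairs whose drug and ADR already appear in the label files; this page does not spell out the exact procedure, and the identifiers in the toy example are hypothetical.

```python
def add_relations(y_test, drug_ids, adr_ids, relations):
    """Return a copy of y_test with extra drug-ADR pairs set to 1.

    Pairs whose drug or ADR is not among the labels are skipped, so the
    matrix keeps its shape. Existing positives are left untouched.
    """
    row_of = {drug: i for i, drug in enumerate(drug_ids)}
    col_of = {adr: j for j, adr in enumerate(adr_ids)}
    updated = [list(row) for row in y_test]
    for drug, adr in relations:
        if drug in row_of and adr in col_of:
            updated[row_of[drug]][col_of[adr]] = 1
    return updated


# Toy example: one new pair applies, one refers to an unknown drug.
y_old = [[1, 0], [0, 0]]
y_new = add_relations(
    y_old,
    drug_ids=["D1", "D2"],
    adr_ids=["A1", "A2"],
    relations=[("D2", "A2"), ("D3", "A1")],
)
print(y_new)  # [[1, 0], [0, 1]]
```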
The results obtained using this data set are in Table 11 of the article.
Folder: sider4bio2rdf_v2_aeolus/
Files:
- `matrices.mat`: contains the design matrices `X_train` and `X_test`, and the target matrices `y_train` and `y_test` that are passed to the machine learning methods.
- `X_train_row_labels.json`: list of the 771 drugs in the training set, giving the ID for each row of matrix `X_train`.
- `X_test_row_labels.json`: list of the 309 drugs in the test set, giving the ID for each row of matrix `X_test`.
- `y_column_labels.json`: list of the 5579 ADRs, giving the ID for each column of matrices `y_train` and `y_test`.
- Version 0.0.3 (June 16, 2017 4:40 PM)
  - Updating URLs with persistent URLs
  - Update links to references
- Version 0.0.2 (May 9, 2017 1:50 PM)
  - Adding table with detailed description of Liu's and SIDER 4 data sets features
- Version 0.0.1 (April 6, 2017 1:51 PM)
  - Adding description to all data sets
Footnotes

1. Liu M, Wu Y, Chen Y, et al. Large-scale prediction of adverse drug reactions using chemical, biological, and phenotypic properties of drugs. Journal of the American Medical Informatics Association. 2012.
2. Zhang W, Liu F, Luo L, et al. Predicting drug side effects by multi-label learning and ensemble learning. BMC bioinformatics. 2015;16:1.
3. Zhang W, Chen Y, Tu S, et al. Drug side effect prediction through linear neighborhoods and multiple data source integration. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2016. p. 427–434.
4. Muñoz E, Novacek V, Vandenbussche PY. Using drug similarities for discovery of possible adverse reactions. In: AMIA 2016, American Medical Informatics Association Annual Symposium. American Medical Informatics Association; 2016. p. 924–933.