Repository with source code for paper "Disease classification for whole blood DNA methylation: meta-analysis, missing values imputation, and XAI" by A. Kalyakulina, I. Yusipov, M.G. Bacalini, C. Franceschi, M. Vedunova, M. Ivanchenko.
Install dependencies
# clone project
git clone https://github.com/GillianGrayson/DNAmClassMeta
cd DNAmClassMeta
# [OPTIONAL] create conda environment
conda create -n env_name python=3.8
conda activate env_name
# install pytorch according to instructions
# https://pytorch.org/get-started/
# install requirements
pip install -r requirements.txt
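To check that the environment is set up correctly, a minimal sanity check (not part of the repository) is:

# Verify PyTorch installation and CUDA availability
import torch

print(torch.__version__)          # installed PyTorch version
print(torch.cuda.is_available())  # True if a CUDA-capable GPU can be used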
The data directory contains harmonized and non-harmonized data (methylation M-values for the top 1000 CpGs) for Parkinson's disease and schizophrenia for all considered GSE datasets:
└── data <- Project data
├── Parkinson <- Data for Parkinson disease
│ ├── non_harmonized <- Non-harmonized data
│ │ ├── data.xlsx <- Dataframe with methylation data and additional subjects info
│ │ ├── features.xlsx <- Dataframe with features (models input)
│ │ └── labels.xlsx <- Dataframe with class labels (models output)
│ └── harmonized <- Harmonized data
│ ├── data.xlsx <- Dataframe with methylation data and additional subjects info
│ ├── features.xlsx <- Dataframe with features (models input)
│ └── labels.xlsx <- Dataframe with class labels (models output)
└── Schizophrenia <- Data for Schizophrenia
├── non_harmonized <- Non-harmonized data
│ ├── data.xlsx <- Dataframe with methylation data and additional subjects info
│ ├── features.xlsx <- Dataframe with features (models input)
│ └── labels.xlsx <- Dataframe with class labels (models output)
└── harmonized <- Harmonized data
├── data.xlsx <- Dataframe with methylation data and additional subjects info
├── features.xlsx <- Dataframe with features (models input)
└── labels.xlsx <- Dataframe with class labels (models output)
data.xlsx is a dataframe in which each row corresponds to a subject (GSM) and each column to a feature. In addition to methylation levels (M-values), there are additional columns: Status (control or case), Dataset (original GSE), and Partition (Train, Validation, or Test).
features.xlsx is a dataframe containing the features (CpGs) that are used as model inputs. Modifying this file changes the set of CpGs used for building a model.
labels.xlsx is a dataframe with class labels. Modifying this file allows selecting the subset of subjects that participate in the model.
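For quick inspection, these files can be loaded with pandas (a minimal sketch, not part of the repository; the paths follow the tree above, and index_col=0 is an assumption about the file layout):

# Minimal sketch: load the harmonized Parkinson data and split it by partition
import pandas as pd

data = pd.read_excel("data/Parkinson/harmonized/data.xlsx", index_col=0)  # subjects (GSM) x features
features = pd.read_excel("data/Parkinson/harmonized/features.xlsx")       # CpGs used as model inputs
labels = pd.read_excel("data/Parkinson/harmonized/labels.xlsx")           # class labels

train = data[data["Partition"] == "Train"]
val = data[data["Partition"] == "Validation"]
test = data[data["Partition"] == "Test"]
print(train.shape, val.shape, test.shape)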
There are two types of experiments, based on the type of model:
- Stand-Alone (SA) models (Logistic Regression, SVM, XGBoost, CatBoost, LightGBM)
- PyTorch Lightning (PL) based models (TabNet, NODE)
Configuration files for the experiments can be found in the following directory:
└── configs
└── experiment
└── classification
└── trn_val_tst
├── sa.yaml <- Configuration file for Stand-Alone (SA) models
└── pl.yaml <- Configuration file for PyTorch Lightning (PL) based models
These configuration files share a common part:
disease: Parkinson # Disease type. Options: [Parkinson, Schizophrenia]
data_type: harmonized # Data type. Options: [non_harmonized, harmonized]
model_type: catboost # Model type. Options: for SA: [logistic_regression, svm, xgboost, catboost, lightgbm], for PL: [tabnet, node]
outcome: "Status" # Which column in `data.xlsx` contains class labels
optimized_metric: "accuracy_weighted" # Target metric in optimization process. Options: [accuracy_weighted, f1_weighted, auroc_weighted]
optimized_mean: "" # Whether to optimize the mean metric value across all cross-validation splits. Options: ["", cv_mean]
optimized_part: "val" # Which partition should be optimized? Options: [trn, val, tst]
direction: "max" # Optimization metrics should be minimized or maximized? Options: [max, min]
max_epochs: 2000 # Maximum number of epochs in training process
patience: 100 # Number of validation epochs without improvement before training stops
is_shap: True # SHAP values calculation. Options: [True, False]
is_shap_save: False # Save SHAP values to file?
shap_explainer: Tree # Explainer type for SHAP values. Options: for SA: [Tree, Kernel], for PL: [Deep, Kernel]
shap_bkgrd: tree_path_dependent # Background for calculating SHAP values. Options: [trn, all, tree_path_dependent]. Last option works only for GBDT models
datamodule:
_target_: src.datamodules.dnam.DNAmDataModule # Instantiated datamodule
feat_fn: "features.xlsx" # Filename of the features file
label_fn: "labels.xlsx" # Filename of the labels file
data_fn: "data.xlsx" # Filename of data file
outcome: "Status" # Which column in `data_fn' contains class labels
batch_size: 128 # How many samples per batch to load
num_workers: 0 # How many subprocesses to use for data loading
pin_memory: False # Data loader will copy Tensors into CUDA pinned memory before returning them
seed: 42 # Random seed
imputation: "fast_knn" # Imputation method for missing values. Options: [median, mean, fast_knn, random, mice, em, mode]
k: 1 # k for 'fast_knn' imputation method
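For reference, the shap_explainer and shap_bkgrd options correspond to the standard SHAP API. A self-contained illustration of the Tree explainer with the tree_path_dependent background on synthetic data (not the repository's pipeline):

# Illustration of SHAP's Tree explainer with tree_path_dependent background
import numpy as np
import shap
from catboost import CatBoostClassifier

# Synthetic stand-in data (the real inputs are M-values of the CpGs from features.xlsx)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)

model = CatBoostClassifier(iterations=50, verbose=False).fit(X, y)

# tree_path_dependent needs no explicit background data (works for GBDT models only)
explainer = shap.TreeExplainer(model, feature_perturbation="tree_path_dependent")
shap_values = explainer.shap_values(X)  # per-feature contribution for each subject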
Model parameters can be changed in the corresponding blocks of the configuration files:
logistic_regression: # Logistic Regression model params
...
svm: # SVM params
...
xgboost: # XGBoost model params
...
catboost: # CatBoost model params
...
lightgbm: # LightGBM model params
...
tabnet: # TabNet model params
...
node: # NODE model params
...
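For example, the CatBoost block could look like this (illustrative values only; the actual keys and defaults are those defined in sa.yaml):

catboost: # CatBoost model params
  learning_rate: 0.05
  depth: 6
  iterations: 1000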
Train an SA model with the configuration from the sa.yaml file:
python run_classification_trn_val_tst_sa.py experiment=classification/trn_val_tst/sa.yaml
Train a PL model with the configuration from the pl.yaml file:
python run_classification_trn_val_tst_pl.py experiment=classification/trn_val_tst/pl.yaml
To perform a hyperparameter search for the optimal combination of parameters on a grid, add the --multirun key and specify the parameter values:
python run_classification_trn_val_tst_pl.py experiment=classification/trn_val_tst/pl.yaml --multirun model_tabnet.optimizer_lr=0.001,0.005,0.01
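Several comma-separated lists can be combined, in which case Hydra runs every combination on the grid, e.g. (model_tabnet.optimizer_weight_decay is a hypothetical parameter name; use the names defined in pl.yaml):

python run_classification_trn_val_tst_pl.py experiment=classification/trn_val_tst/pl.yaml --multirun model_tabnet.optimizer_lr=0.001,0.01 model_tabnet.optimizer_weight_decay=0.0,0.0001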
After running an experiment, a new working directory with the results will be created in logs:
└── logs # Folder for the logs generated by experiments
├── runs # Folder for single runs
│ └── experiment_name # Experiment name
│ ├── YYYY-MM-DD_HH-MM-SS # Datetime of the run
│ │ ├── csv # csv logs
│ │ ├── wandb # Weights&Biases logs
│ │ └── ... # Resulting tables, figures, etc.
│ └── ...
└── multiruns # Folder for multiruns
└── experiment_name # Experiment name
├── YYYY-MM-DD_HH-MM-SS # Datetime of the multirun
│ ├── 1 # Multirun job number
│ ├── 2
│ └── ...
└── ...
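The csv logs can be aggregated programmatically, e.g. with a sketch like the following (it assumes the metrics.csv file name that PyTorch Lightning's CSVLogger uses by default; the exact directory layout may differ):

# Sketch: collect metric histories from all single runs
import glob
import pandas as pd

for path in glob.glob("logs/runs/*/*/csv/**/metrics.csv", recursive=True):
    metrics = pd.read_csv(path)
    print(path, metrics.columns.tolist())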
The research was supported by the Ministry of Science and Higher Education of the Russian Federation, agreement No. 075-15-2020-808. The authors acknowledge the use of computational resources provided by the “Lobachevsky” supercomputer.
This project is licensed under the MIT License.