Adaptable and interpretable Multi-task Learning based gene-level methylation estimation

Introduction

Explored an adaptable and interpretable neural network to find common genotypes from 480k-dimensional site-level data with only hundreds of samples.
Designed an explainable site-gene-pathway ontology constraint on the neural network, so that new biomarkers can be discovered by inspecting the learned weights.
Implemented a Variational Auto-Encoder whose gene-level embedding is shared among datasets to enable multi-task learning.
Optimized a pretrain-finetune training scheme that increased accuracy by over 10%.

Datasets

The method is tested on six datasets:

Rheumatoid arthritis
Systemic lupus erythematosus
Multiple sclerosis
Inflammatory bowel disease
Psoriasis
Type 1 diabetes

and is shown to have good performance in identifying common functions of DNA methylation in phenotypes.

How to run the program

Use the following command to install the prerequisites:

pip install -r requirements.txt

Use the following command to run the program:

python ./main.py

How to run benchmark

Traditional machine learning baselines (scikit-learn based): in ./main.py, set the parameter justToCheckBaseline to True and change datasetNameList = [ 'IBD','MS', 'Psoriasis', 'RA','SLE','diabetes1'] to the specific dataset you want to benchmark, for example datasetNameList = [ 'IBD']. Then run

python ./main.py
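
As a rough illustration of the edit described above, the two settings in main.py might look like the following after the change. Only the names justToCheckBaseline and datasetNameList and the six dataset names come from this README. The constant KNOWN_DATASETS and the helper check_dataset_names are hypothetical additions for illustration and are not part of the repository.

```python
# Hypothetical excerpt illustrating the baseline settings; the real main.py may differ.
KNOWN_DATASETS = ('IBD', 'MS', 'Psoriasis', 'RA', 'SLE', 'diabetes1')

justToCheckBaseline = True   # run only the scikit-learn baselines
datasetNameList = ['IBD']    # restrict the benchmark to a single dataset


def check_dataset_names(names, known=KNOWN_DATASETS):
    """Return the names as a list, or raise if any name is not a known dataset."""
    unknown = [name for name in names if name not in known]
    if unknown:
        raise ValueError(f"Unknown dataset name(s): {unknown}")
    return list(names)


# Catch typos such as 'Diabetes1' before a long run starts.
datasetNameList = check_dataset_names(datasetNameList)
```

A check like this is optional. It only guards against a misspelled dataset name.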

NN-based method, single-task baseline:

python ./main_test_single_task.py

Output of this program

After running main.py:

There will be some partial results in ./result/

The settings of the run and all test accuracy results are written to ./result-all/, in a file named by output_file_name in ./tools.py (around line 127). For example, ./result-all/1-10results-together.csv.
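
To inspect such a results file outside the program, a minimal sketch using only the Python standard library could look like this. The function load_results is a hypothetical helper, not part of the repository, and the column names of the CSV are not documented here, so the sketch simply returns whatever header the file provides.

```python
import csv
from pathlib import Path


def load_results(path):
    """Read a results CSV into a list of dicts, one per row, keyed by the header row."""
    with Path(path).open(newline="") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    # Example file name taken from the README; adjust to the file your run produced.
    results_path = Path("./result-all/1-10results-together.csv")
    if results_path.exists():
        for row in load_results(results_path):
            print(row)
    else:
        print(f"{results_path} not found; run main.py first")
```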

There will be logs in ./log/ named by the date.

There will be cache files in ./cache/. If you plan to reuse the same number-of-residues setting across datasets or across runs, keep these files to save preprocessing time.

There will be some data in ./tensorboard_log/. You can use the command

tensorboard --logdir="./tensorboard_log/"

to start TensorBoard and view the validation accuracy and weight distribution for each dataset, stage, and setting.

Research Project Background

This project focuses on DNA methylation, a chemical modification of DNA that can cause dysfunction. We want to use residual methylation data to predict diseases. However, the dimension of the residual data is enormous while the number of samples is comparatively small. Therefore, we propose a method to reduce the dimension and improve performance.

We extract information from the residual methylation to obtain gene-level methylation, which is much lower in dimension. Moreover, gene-level methylation may carry common and critical information that can be transferred among datasets. We can use representation learning methods to extract features of the mechanism of gene-level methylation.

The method we propose

First, we designed a refined auto-encoder architecture. The input is residual methylation and the output is the restored residual methylation. As the network gets deeper, the number of nodes per layer shrinks in the first half and then grows back to the input dimension. The two halves are called the encoder and the decoder. The encoder distills an inherently critical, low-dimensional embedding from the enormous number of residuals without human labeling. We assume the bottleneck layer of the encoder represents pathways, which provide information about the basic units of heredity.
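
The shrinking-then-growing shape can be sketched as a list of layer widths. This is only an illustration under assumptions: the helper autoencoder_layer_sizes is hypothetical, the encoder is assumed to mirror the decoder by passing through a gene-level layer, and the gene and pathway counts below are made-up placeholders. Only the 480k input dimension comes from this README.

```python
def autoencoder_layer_sizes(n_sites, n_genes, n_pathways):
    """Layer widths for a symmetric auto-encoder: sites -> genes -> pathways -> genes -> sites."""
    if not n_sites > n_genes > n_pathways > 0:
        raise ValueError("expected n_sites > n_genes > n_pathways > 0")
    encoder = [n_sites, n_genes, n_pathways]
    decoder = encoder[-2::-1]  # mirror of the encoder, without repeating the bottleneck
    return encoder + decoder


# 480_000 matches the input dimension in the README; the other two counts are placeholders.
sizes = autoencoder_layer_sizes(n_sites=480_000, n_genes=20_000, n_pathways=300)
```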

For the decoder, we designed an explainable neural network that prunes connections. The network restores the data from the pathway level to gene-level methylation, and then to residual methylation. At each step the dimension grows, and each node in the former layer is a collection of nodes in the latter layer. It is explainable because we keep only the connections between a gene and its residuals, and between a pathway and its genes, according to expert knowledge. This pruning reduces the amount of computation and can be explained by gene rules. Moreover, the auto-encoder adapts to different datasets because the embedding can be shared between input data sources, which supports transfer learning.
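
One common way to realize this kind of pruning is a binary mask multiplied into each decoder weight matrix, so that only annotated pathway-gene and gene-residual connections contribute. The sketch below shows that idea in plain Python on a made-up toy ontology. It is not the repository's implementation: build_mask and masked_linear are hypothetical helpers, the layers are purely linear with no bias or activation, and training is omitted.

```python
import random


def build_mask(groups, n_children):
    """Binary mask with n_children rows and len(groups) columns.

    groups[p] lists the child indices that belong to parent p, for example the
    residuals annotated to gene p. mask[c][p] is 1 only for annotated pairs.
    """
    mask = [[0] * len(groups) for _ in range(n_children)]
    for parent, children in enumerate(groups):
        for child in children:
            mask[child][parent] = 1
    return mask


def masked_linear(inputs, weights, mask):
    """Compute one decoder layer using only the connections kept by the mask."""
    return [
        sum(w * m * x for w, m, x in zip(w_row, m_row, inputs))
        for w_row, m_row in zip(weights, mask)
    ]


# Toy ontology (made up): 1 pathway -> 2 genes -> 4 residuals.
pathway_to_genes = [[0, 1]]
gene_to_sites = [[0, 1], [2, 3]]

gene_mask = build_mask(pathway_to_genes, n_children=2)  # 2 rows, 1 column
site_mask = build_mask(gene_to_sites, n_children=4)     # 4 rows, 2 columns

rng = random.Random(0)  # fixed seed so the toy weights are reproducible
gene_weights = [[rng.uniform(-1, 1) for _ in row] for row in gene_mask]
site_weights = [[rng.uniform(-1, 1) for _ in row] for row in site_mask]

pathway_embedding = [0.5]
gene_level = masked_linear(pathway_embedding, gene_weights, gene_mask)
site_level = masked_linear(gene_level, site_weights, site_mask)
```

Because masked weights never contribute, each remaining weight links one named gene to one of its residuals, or one pathway to one of its genes, which is what makes inspecting the weights meaningful.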



explcre/Adaptable-and-intrepretable-multi-task-learning-based-gene-level-methylation-estimation
