Extracting Biologically Relevant Genes using AFExNet from Cancer Transcriptomes [Paper]
In this project, we introduce AFExNet, a neural-network-based adversarial autoencoder (AAE) model that extracts biologically relevant features from RNA-Seq data. We also developed a method named TopGene to find highly interactive genes from the latent space. Used together, AFExNet and TopGene identify important genes that could be useful for finding cancer biomarkers.
The following instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
The following libraries are required to reproduce this project:
- Keras (2.0.6)
- Keras-adversarial (0.0.3)
- Tensorflow (1.13.1)
- Scikit-Learn (0.20.3)
- Numpy (1.16.3)
- Imbalanced-Learn (0.4.3)
Supports both Python 2.5.0 and Python 3.5.6
├── results
│ ├── saved_results
│ │ ├── Gene_Analysis_Breast_Cancer.xlsx
│ │ ├── Gene_Analysis_UCEC.xlsx
│ ├── AAE
│ │ ├── aae_encoded.tsv
│ │ ├── aae_sorted_gene.tsv
│ │ ├── aae_weight_distribution.png
│ │ ├── aae_weight_matrix
│ ├── PCA
│ ├── ... # add LDA, SVD etc
├── data
│ ├── data will be stored here
├── feature_extraction
│ ├── AAE
│ │ ├── aae_encoder.h5
│ │ ├── aae_decoder.h5
│ │ ├── aae_discriminator.h5
│ │ ├── aae_history.csv
│ ├── PCA
│ ├── VAE
│ ├── ...
├── README.md
├── figures
│ ├── saved_figures
│ │ ├── Olfactory__Transduction_pathway.png
└── .gitignore
Run the following to extract features using the different autoencoders:
main.py
Run the following to extract features with PCA, NMF, FastICA, ICA, RBM, etc.:
main_pca.py
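For orientation, here is a minimal sketch of what a PCA-style baseline does to a samples x genes expression matrix. It is not the code in main_pca.py: it uses plain NumPy SVD rather than scikit-learn, the function name `pca_features` is hypothetical, and the input is random data standing in for RNA-Seq values.

```python
import numpy as np


def pca_features(expression, n_components):
    """Project a samples x genes matrix onto its top principal components.

    Hypothetical helper for illustration; not part of this repository.
    """
    # Center each gene (column) so components describe variance, not the mean.
    centered = expression - expression.mean(axis=0)
    # Thin SVD: the rows of vt are the principal axes in gene space.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Low-dimensional representation of every sample.
    encoded = centered @ components.T
    # Fraction of total variance captured by each kept component.
    explained = singular_values[:n_components] ** 2 / np.sum(singular_values ** 2)
    return encoded, components, explained


# Synthetic stand-in for an expression matrix (30 samples, 200 genes).
rng = np.random.RandomState(0)
expression = rng.normal(size=(30, 200))

encoded, components, explained = pca_features(expression, n_components=5)
print(encoded.shape)     # (30, 5)
print(components.shape)  # (5, 200)
```

Each row of `components` holds one weight per gene, so the same matrix that produces the encoded features can also be inspected to see which genes contribute most to a component.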
Gene ontology analysis of molecular function was performed using DAVID 6.7: https://david-d.ncifcrf.gov/
More on gene ontology: http://geneontology.org/docs/ontology-documentation/
- cBioPortal - Cancer Genomics Datasets
- Breast Invasive Carcinoma (TCGA, Cell 2015) - Clinical information is used to label various molecular subtypes
Breast Invasive Carcinoma (BRCA)
Molecular Subtypes | Number of Patients | Label |
---|---|---|
Luminal A | 304 | 0 |
Luminal B | 121 | 1 |
Basal & Triple Negative | 137 | 2 |
HER2 Enriched | 43 | 3 |
Total Number of Samples (Patients) | Total Number of Features (Genes) |
---|---|
605 | 20439 |
- Uterine Corpus Endometrial Carcinoma (TCGA, Nature 2013) - Clinical information is used to label various molecular subtypes.
Uterine Corpus Endometrial Carcinoma (UCEC)
Molecular Subtypes | Number of Patients | Label |
---|---|---|
Copy Number High | 60 | 0 |
Copy Number Low | 90 | 1 |
Hyper Mutated (MSI) | 64 | 2 |
Ultra Mutated (POLE) | 16 | 3 |
Total Number of Samples (Patients) | Total Number of Features (Genes) |
---|---|
230 | 20482 |
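
The subtype labels in the tables above can be written down as a small lookup, which also makes it easy to check that the per-subtype counts add up to the reported sample totals. The dictionary and function names below are illustrative assumptions and are not taken from the repository.

```python
# Subtype -> (integer label, number of patients), copied from the tables above.
BRCA_SUBTYPES = {
    "Luminal A": (0, 304),
    "Luminal B": (1, 121),
    "Basal & Triple Negative": (2, 137),
    "HER2 Enriched": (3, 43),
}

UCEC_SUBTYPES = {
    "Copy Number High": (0, 60),
    "Copy Number Low": (1, 90),
    "Hyper Mutated (MSI)": (2, 64),
    "Ultra Mutated (POLE)": (3, 16),
}


def encode_subtypes(subtype_names, table):
    """Map subtype names to their integer class labels (hypothetical helper)."""
    return [table[name][0] for name in subtype_names]


def total_patients(table):
    """Sum the per-subtype patient counts."""
    return sum(count for _, count in table.values())


print(total_patients(BRCA_SUBTYPES))  # 605
print(total_patients(UCEC_SUBTYPES))  # 230
print(encode_subtypes(["Luminal B", "Luminal A"], BRCA_SUBTYPES))  # [1, 0]
```

Both datasets are imbalanced across subtypes (for example, 304 Luminal A patients versus 43 HER2 Enriched), which is worth keeping in mind when splitting or evaluating.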
If you want to contribute to this project and make it better, your help is very welcome. When contributing to this repository, please make a clean pull request.
- The proposed architecture is inspired by https://github.com/bstriner/keras-adversarial