Cancer-Molecular-Classification

A study of various ML methods to classify leukemia from gene expression monitoring.

This study is based on a proof-of-concept study published in 1999 by Golub et al.¹
The dataset is available on Kaggle: https://www.kaggle.com/crawford/gene-expression#data_set_ALL_AML_independent.csv

Machine learning (ML) is increasingly used in medicine to predict or classify medical conditions in patients. Golub et al. studied leukemia classification in 1999. Their study used an approach called self-organizing maps (SOM) to classify leukemia patients into acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), based on gene expression monitoring by DNA microarrays.

The attached notebook asks what results modern ML algorithms can achieve on the same problem. It uses algorithms from the scikit-learn library to classify the dataset. An alternative method proposed by McClintick and Edenberg² is also studied and compared with the ML solutions.
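
As a rough illustration of this kind of comparison, the sketch below fits several scikit-learn classifiers on one training set and scores them on a held-out test set. It is not the notebook's code: the helper name `compare_classifiers`, the hyperparameters, and the synthetic data (random values standing in for the gene expression matrix, with arbitrary sample and gene counts) are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X_train, y_train, X_test, y_test, seed=0):
    """Fit several classifiers and return their test accuracy by name.

    Hypothetical helper; the hyperparameters are illustrative defaults,
    not the values used in the notebook.
    """
    models = {
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "Extra Trees": ExtraTreesClassifier(n_estimators=100, random_state=seed),
        "kNN": KNeighborsClassifier(n_neighbors=5),
        "AdaBoost": AdaBoostClassifier(random_state=seed),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "Logistic": LogisticRegression(max_iter=1000),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # score() returns mean accuracy on the given test data
        scores[name] = model.score(X_test, y_test)
    return scores


# Synthetic stand-in for the expression data (assumption, not the real dataset):
# rows are patients, columns are genes, labels are 0 (ALL) or 1 (AML).
rng = np.random.default_rng(0)
n_genes = 200
y_train = rng.integers(0, 2, size=40)
y_test = rng.integers(0, 2, size=30)
X_train = rng.normal(size=(40, n_genes))
X_test = rng.normal(size=(30, n_genes))
# Shift the first 10 "genes" for class 1 so there is some signal to learn.
X_train[y_train == 1, :10] += 2.0
X_test[y_test == 1, :10] += 2.0

scores = compare_classifiers(X_train, y_train, X_test, y_test)
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {100 * acc:.1f}%")
```

With the real data, `X_train` and `X_test` would come from the Kaggle CSV files after transposing them so that each row is a patient.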

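The results show that Random Forest achieves the highest score among the plain ML algorithms (67.6%, tied with Extra Tree). For this reason it was selected as the classifier to combine with the other methods, such as PCA and the Golub et al. method. The Golub et al. method combined with Random Forest provides the highest overall score (70.6%). PCA with Random Forest gives the fastest time and a lower score than Random Forest alone, which is expected because PCA reduces the number of features the forest sees at the cost of discarding some information.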
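
One way to chain PCA with a random forest in scikit-learn is a `Pipeline`, sketched below. The number of components, the scaling step, and the synthetic data are assumptions chosen for the example; the notebook may be configured differently.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_pca_forest(n_components=10, seed=0):
    """Return a scale -> PCA -> random forest pipeline (illustrative settings)."""
    return make_pipeline(
        StandardScaler(),
        PCA(n_components=n_components, random_state=seed),
        RandomForestClassifier(n_estimators=100, random_state=seed),
    )


# Synthetic stand-in data (assumption): 40 training and 30 test patients,
# 500 genes, with a shifted mean on the first 10 genes for class 1.
rng = np.random.default_rng(1)
y_train = rng.integers(0, 2, size=40)
y_test = rng.integers(0, 2, size=30)
X_train = rng.normal(size=(40, 500))
X_test = rng.normal(size=(30, 500))
X_train[y_train == 1, :10] += 2.0
X_test[y_test == 1, :10] += 2.0

pipe = build_pca_forest(n_components=10)
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)

# The forest only ever sees the 10 principal components, not the 500 genes.
n_seen = pipe.named_steps["pca"].n_components_
print(f"features seen by the forest: {n_seen}, accuracy: {100 * accuracy:.1f}%")
```

Because the forest is trained on a handful of components instead of every gene, fitting is cheaper, while any class signal outside the retained components is lost.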

Method	Score (%)	Execution Time
Golub et al., Random Forest	70.6	11.4 ms ± 5.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Extra Tree	67.6	160 ms ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Random Forest	67.6	183 ms ± 63.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
PCA, Random Forest	64.7	10.2 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
kNN	64.7	66.6 ms ± 9.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
50% Present, Random Forest	61.8	57.3 ms ± 23.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
AdaBoost	61.8	137 ms ± 55.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Decision Tree	61.8	121 ms ± 40.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Logistic	58.8	299 ms ± 172 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
KMeans	55.9	189 ms ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
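
The "50% Present" row refers to filtering genes by their Present call before classification. The sketch below shows one plausible reading of that filter: keep a gene only if it is called Present (`P`) in at least half of the samples. It assumes the detection calls are available as a genes x samples array of `P`, `M`, and `A` values, and that the fraction is computed over all samples; the helper name `present_mask` and the toy values are hypothetical, and the notebook's exact criterion may differ.

```python
import numpy as np


def present_mask(calls, min_fraction=0.5):
    """Return a boolean mask of genes called Present often enough.

    calls: 2-D array of shape (n_genes, n_samples) holding 'P', 'M' or 'A'.
    A gene is kept when its fraction of 'P' calls is >= min_fraction.
    """
    calls = np.asarray(calls)
    present_fraction = (calls == "P").mean(axis=1)
    return present_fraction >= min_fraction


# Toy example (made-up values): 3 genes measured in 4 samples.
expression = np.array([
    [120, 90, 110, 100],   # gene A
    [15, 20, 5, 8],        # gene B
    [300, 280, 10, 12],    # gene C
])
calls = np.array([
    ["P", "P", "P", "M"],  # gene A: 3 of 4 Present
    ["A", "A", "A", "P"],  # gene B: 1 of 4 Present
    ["P", "P", "A", "A"],  # gene C: 2 of 4 Present
])

mask = present_mask(calls, min_fraction=0.5)
filtered = expression[mask]   # rows for gene A and gene C remain
print(mask, filtered.shape)
```

The filtered matrix would then be transposed and passed to the classifier in place of the full gene set.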

¹ Golub T. R., Slonim D. K., Tamayo P., et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286(5439):531–537. doi: 10.1126/science.286.5439.531

² McClintick JN, Edenberg HJ. Effects of filtering by Present call on analysis of microarray experiments. BMC Bioinformatics. 2006;7:49. Published 2006 Jan 31. doi: 10.1186/1471-2105-7-49
