Skip to content

A study of various ML methods to classify Leukemia from gene expression monitoring.

License

Notifications You must be signed in to change notification settings

alebaho/Cancer-Molecular-Classification

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Cancer-Molecular-Classification

A study of various ML methods to classify leukemia from gene expression monitoring.

This study is based on a proof-of-concept study published in 1999 by Golub et al.1
The dataset is available at Kaggle : https://www.kaggle.com/crawford/gene-expression#data_set_ALL_AML_independent.csv

Machine learning (ML) has been gaining a lot of popularity in medicine where it helps predict or classify various medical conditions in patients. Leukemia classification has been studied in 1999 by Golub et al. It uses an approach called SOM (self-organizing maps) which classifies patients having leukemia into acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL), based on gene expression monitoring by DNA microarrays.

The attached notebook provides work that helps to answers a question about what type of results we could obtain by using modern ML algorithms. The ML algorithms come from the scikit-learn library to classify the dataset. An alternative method proposed by McClintick and Edenberg2 is also studied and compared to the ML solutions.

The results show that the Random Forest ML algorithm provides the highest ML score. For this reason it was selected as the ML algorithm to complement other, such PCA, Golup et al., etc. The Golup et al. method provides the highest overall score. PCA with random forest gives the fastest time and a lesser score than random forest alone, which is expected.

Method Score (%) Execution Time
Golup et al., Random Forest 70.6 11.4 ms ± 5.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Extra Tree 67.6 160 ms ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Random Forest 67.6 183 ms ± 63.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
PCA, Random Forest 64.7 10.2 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
kNN 64.7 66.6 ms ± 9.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
50% Present, Random Forest 61.8 57.3 ms ± 23.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
AddaBoost 61.8 137 ms ± 55.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Decision Tree 61.8 121 ms ± 40.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Logistic 58.8 299 ms ± 172 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
KMeans 55.9 189 ms ± 19.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



1 Golub T. R., Slonim D. K., Tamayo P., et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286(5439):531–537. doi: 10.1126/science.286.5439.531

2 McClintick JN, Edenberg HJ. Effects of filtering by Present call on analysis of microarray experiments. BMC Bioinformatics. 2006;7:49. Published 2006 Jan 31. doi: 10.1186/1471-2105-7-49

About

A study of various ML methods to classify Leukemia from gene expression monitoring.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published