Prostate Cancer Feature Prediction

Predicts clinical features of prostate cancer from DNA-methylation and RNA-sequencing data available on cBioPortal. The target features are Gleason score, tumor stage and tumor recurrence.

Environment and Libraries

The code is implemented in Python 3 with the help of the following modules -

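pandas
scikit-learn
imbalanced-learn (imported as imblearn)
Matplotlib
mpl_toolkits
pprint

Third-party packages can be installed using pip install <package_name>. pprint is part of the Python standard library and mpl_toolkits is installed together with Matplotlib, so neither needs a separate install.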
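
A quick way to confirm the environment is ready is to check that each module can be found before running the program. The sketch below is not part of the repository: the helper name missing_modules is hypothetical, and the import names (sklearn for scikit-learn, imblearn for imbalanced-learn) are the ones these packages normally expose.

```python
from importlib.util import find_spec

# Import names of the modules this project relies on.
REQUIRED = ["pandas", "sklearn", "imblearn", "matplotlib", "mpl_toolkits", "pprint"]


def missing_modules(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("Missing modules:", missing if missing else "none")
```

Any name reported as missing would need to be installed before prostate_cancer.py can run.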

Files

There are two code files in the repository: prostate_cancer.py and feature_processing.py. prostate_cancer.py contains the main function where execution begins, and feature_processing.py contains the separate pipelines used for each of the target features.

Background

Prostate cancer is a cancer of the prostate gland in men. Compared to other types of cancer, it typically progresses slowly and may not show symptoms for several years, so it often goes undetected. Tackling early detection is out of the scope of this project; instead, it aims to characterise the cancer (Gleason score, tumor stage and recurrence) once patients' data has been collected. This implementation uses DNA and RNA data available at cBioPortal.

Data

There are two main datasets: the DNA methylation data and the RNA sequencing data of each patient. A third dataset contains the patients' clinical records (their files at the hospital). The clinical records contain a lot of information, including the three target features for every patient. The DNA and RNA data are used to predict these target features.
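
How the three datasets fit together can be sketched with pandas. This is an illustration under assumptions, not the repository's code: the patient IDs, column names and values below are invented, and the real cBioPortal files may be laid out differently.

```python
import pandas as pd

# Hypothetical toy tables indexed by patient ID; all values are invented.
dna = pd.DataFrame(
    {"cg_site_1": [0.12, 0.85, 0.40], "cg_site_2": [0.66, 0.31, 0.09]},
    index=["P1", "P2", "P3"],
)
rna = pd.DataFrame(
    {"gene_a": [5.1, 2.3, 7.8], "gene_b": [0.4, 1.9, 3.3]},
    index=["P1", "P2", "P4"],
)
clinical = pd.DataFrame(
    {"gleason_score": [7, 9, 6, 8], "recurrence": ["no", "yes", "no", "no"]},
    index=["P1", "P2", "P3", "P4"],
)

# Keep only patients present in both molecular datasets, then attach the targets.
features = dna.join(rna, how="inner")
dataset = features.join(clinical, how="inner")

X = dataset[features.columns]  # model inputs (DNA and RNA features)
y = dataset["gleason_score"]   # one of the three target features
```

The inner joins drop patients that are missing from either molecular dataset, so every row of X has a matching target in y.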

The distribution of values in each target feature is captured in the following pie charts (gleason-valcounts.png, t_stage-valcounts.png and recurrence-valcounts.png in the repository) -
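
A chart of this kind can be produced from a target column with pandas and Matplotlib. The sketch below assumes a hypothetical helper save_target_pie and invented labels; it is not taken from the repository.

```python
import pandas as pd
from matplotlib.figure import Figure


def save_target_pie(target, path):
    """Save a pie chart of the class shares in `target` and return the counts."""
    counts = target.value_counts().sort_index()
    fig = Figure(figsize=(4, 4))
    ax = fig.subplots()
    ax.pie(
        counts.to_numpy(),
        labels=[str(label) for label in counts.index],
        autopct="%1.0f%%",
    )
    ax.set_title(str(target.name))
    fig.savefig(path)
    return counts


# Invented example values for a recurrence-style target.
recurrence = pd.Series(["no", "no", "yes", "no", "no", "yes"], name="recurrence")
counts = save_target_pie(recurrence, "recurrence-example.png")
```

The returned counts make the class imbalance explicit, which is what motivates the resampling step in the pipeline.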

Implementation Details

To run the program you will need the data files from here. There are two download links on the page, one for the DNA and RNA data (direct download link) and one for the clinical data. Once the files are downloaded and extracted, set the filepath variable in prostate_cancer.py to the path of the extracted data.

In the main function in prostate_cancer.py, the control variable mode determines how much text is printed when the program runs. The default is "partial"; setting mode to "show" prints the data at each step as it is modified and updated. Running the code opens four plots, three of which are interactive 3D plots showing the spread of the targets with respect to lower-dimensional DNA data.
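
The 3D plots can be approximated as follows: project the DNA features onto three components and colour each patient by a target value. This is a sketch under assumptions, not the plotting code from prostate_cancer.py. It uses random stand-in data and a hypothetical helper plot_targets_3d, and it assumes PCA as the reduction, which may not be what the original plots use.

```python
import numpy as np
from matplotlib.figure import Figure
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (makes the 3D projection available)
from sklearn.decomposition import PCA


def plot_targets_3d(X, y, path):
    """Project X onto 3 principal components and save a scatter coloured by y."""
    projected = PCA(n_components=3).fit_transform(X)
    fig = Figure(figsize=(5, 5))
    ax = fig.add_subplot(projection="3d")
    ax.scatter(projected[:, 0], projected[:, 1], projected[:, 2], c=y)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    ax.set_zlabel("PC 3")
    fig.savefig(path)
    return projected


# Random stand-in for methylation data: 40 patients, 25 features.
rng = np.random.default_rng(0)
X = rng.random((40, 25))
y = rng.integers(6, 11, size=40)  # stand-in for Gleason scores 6 to 10
projected = plot_targets_3d(X, y, "targets-3d-example.png")
```

This version saves a static image. Building the same axes through matplotlib.pyplot and calling plt.show() would give a rotatable, interactive window instead.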

The machine learning pipeline used is as follows -

Feature selection from the DNA and RNA data (select k-best features, variance thresholds, model-based selection, chi-square test)
Resampling (upsampling and downsampling) to handle the imbalanced dataset
Dimensionality reduction of the training data (PCA, LDA; both can be toggled on or off in feature_processing.py)
Classification (SVM, Random Forest, Naïve Bayes, K-Nearest Neighbors)
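
The four steps above can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions, not the pipelines in feature_processing.py. It uses random stand-in data, picks one option per step (chi-square k-best, upsampling, PCA, Random Forest), and uses sklearn.utils.resample for the upsampling in place of Imblearn to keep the example to one dependency. The helper upsample_to_balance and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.utils import resample


def upsample_to_balance(X, y, random_state=0):
    """Upsample smaller classes with replacement to the size of the largest class."""
    classes, counts = np.unique(y, return_counts=True)
    target_size = int(counts.max())
    X_parts, y_parts = [], []
    for label, count in zip(classes, counts):
        X_cls = X[y == label]
        if count < target_size:
            X_cls = resample(
                X_cls, replace=True, n_samples=target_size, random_state=random_state
            )
        X_parts.append(X_cls)
        y_parts.append(np.full(len(X_cls), label))
    return np.vstack(X_parts), np.concatenate(y_parts)


# Random stand-in data: 120 patients, 50 non-negative features, imbalanced labels.
rng = np.random.default_rng(0)
X = rng.random((120, 50))
y = np.array([0] * 100 + [1] * 20)
X[y == 1, :5] += 0.5  # make the first five features informative

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 1: chi-square k-best selection, fitted on the training split only.
selector = SelectKBest(chi2, k=10).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Step 2: balance the training split; the test split stays untouched.
X_bal, y_bal = upsample_to_balance(X_train_sel, y_train)

# Steps 3 and 4: dimensionality reduction followed by a classifier.
model = make_pipeline(
    PCA(n_components=5),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(X_bal, y_bal)
accuracy = model.score(X_test_sel, y_test)
```

Fitting the selector and resampling on the training split only keeps information from the test split out of the model, so the reported accuracy is not inflated by leakage.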

Notes

Machine learning is a quickly expanding field and is useful in several unrelated domains, including health and medicine. There are many things that must be considered before we start trusting machines with lives. In addition to moral dilemmas, there are also legal and societal aspects that have to be figured out. Still, this is no reason to reject intelligent machines, especially since they have the potential to be more efficient than humans in certain areas.

shivsondhi/prostate_cancer
