GitHub - eotles/CS838: Cancer Bioinformatics Class Project

.gitignore
.project
.pydevproject
Algorithms.py
Kmean2.m
Kmean4.m
PCA_2.m
PipelineScript.m
ProjectPresentation.key
ProjectPresentation.pptx
README
README.tm
genes.mat
helloWorld.py
labels.mat
main.py
merge_data.m
parser.py
pca_data_3d.mat
pipeline.py
plot3d_kmean.m
quantile2.m
quantile4.m
quantile_v2.m
samples.mat


This README explains how to use the files for the project Accessing Clustering Pipeline.
The code is divided into two parts: (1) the parser and (2) the algorithms.

(1) The parser is written in Python, in the file parser.py.
    To use the parser:
    
    data["TCGA"] = prs.tcga(tcga_filepath)
    data["CCLE"] = prs.ccle(ccle_filepath) 

    These two functions parse the RMA-normalized datasets from TCGA and CCLE separately and store them in the data object.
    
    The two dataset files come from different versions of the U133 chip. To align the data so that both datasets have the same genes, we implemented an alignment function that filters out the genes not shared between the datasets. To use this function:

    algn = align([data["TCGA"], data["CCLE"]])
    
    The function returns the combined samples (in the example above, TCGA first, then CCLE) restricted to the shared genes. We then use scipy.io.savemat to export the gene names, sample labels, and data matrix to Matlab.
    An example of exporting data to Matlab:

    scipy.io.savemat('tcga_samples.mat', mdict={'tcga_samples': data["TCGA"].samples})
    scipy.io.savemat('tcga_genes.mat', mdict={'tcga_genes': data["TCGA"].names})
    scipy.io.savemat('ccle_samples.mat', mdict={'ccle_samples': data["CCLE"].samples})
    scipy.io.savemat('ccle_genes.mat', mdict={'ccle_genes': data["CCLE"].names})
    scipy.io.savemat('aligned_data.mat', mdict={'aligned_data': algn})
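
    The alignment step described above can be sketched as follows. This is a minimal illustration, not the code in this repository: the Dataset container and its values field are hypothetical (only the names and samples attributes appear in the examples above), and the return type of the real align function may differ. It keeps the genes present in every dataset and joins the samples in the order the datasets are given.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical container: one row of values per gene, one column per sample."""
    names: list    # gene names, one per row
    samples: list  # sample labels, one per column
    values: list   # values[i][j] = expression of gene i in sample j


def align(datasets):
    """Keep only genes present in every dataset and join the samples side by side."""
    shared = set(datasets[0].names)
    for ds in datasets[1:]:
        shared &= set(ds.names)
    # Preserve the gene order of the first dataset.
    genes = [g for g in datasets[0].names if g in shared]

    samples = []
    rows = [[] for _ in genes]
    for ds in datasets:
        # Map each gene name to its row in this dataset.
        index = {g: i for i, g in enumerate(ds.names)}
        samples.extend(ds.samples)
        for row, g in zip(rows, genes):
            row.extend(ds.values[index[g]])
    return Dataset(names=genes, samples=samples, values=rows)


# Toy data, invented for illustration only.
tcga = Dataset(["TP53", "EGFR", "MYC"], ["t1", "t2"],
               [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
ccle = Dataset(["MYC", "TP53", "BRCA1"], ["c1"], [[7.0], [8.0], [9.0]])
algn = align([tcga, ccle])
print(algn.names)    # genes shared by both datasets
print(algn.samples)  # TCGA samples first, then CCLE
```

    Keeping the first dataset's gene order makes the row order of the exported matrix predictable, which helps when the gene list is saved to a separate .mat file.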

(2) Once we have the parsed data, we use Matlab for normalization and clustering.
    The Matlab scripts assume the datasets are named:
    
    { CCLE_data_gbmlgg,CCLE_data_ov,TCGA_data_gbmlgg,TCGA_data_ov }

    PipelineScript.m is the script for plotting hierarchical clustering with and without PCA.
    
    For the other functions,

     quantile2.m and quantile4.m are used for full quantile and pair quantile normalization.

     Kmean2.m and Kmean4.m are wrappers for running k-means with k = 2 and k = 4, respectively, and displaying the confusion matrix.

     PCA_2.m is the script for PCA dimension reduction to 2 dimensions, with a scatter plot of the result.

     merge_data.m is the script that selects the first 50 samples from each dataset, combines them, and runs a simple k-means with k = 4.

     The four Matlab data files are 'Raw.mat', 'Quantiled.mat', 'pair_quantiled.mat', and 'z-score.mat', which contain the raw and normalized data. 'Raw.mat' also contains the sample labels and the gene list.
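
     For readers unfamiliar with the normalizations behind 'Quantiled.mat' and 'z-score.mat', here is a minimal Python sketch of full quantile normalization and z-scoring. It is not a translation of quantile2.m or quantile4.m. It assumes a genes-by-samples matrix stored as a list of rows, breaks ties by original order, and z-scores each gene across samples; the Matlab scripts may make different choices. The function names are hypothetical.

```python
from statistics import mean, pstdev


def quantile_normalize(matrix):
    """Full quantile normalization; matrix[i][j] is gene i in sample j."""
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[matrix[i][j] for i in range(n_genes)] for j in range(n_samples)]
    # Reference distribution: mean of the k-th smallest value across samples.
    sorted_cols = [sorted(col) for col in columns]
    reference = [mean(col[k] for col in sorted_cols) for k in range(n_genes)]
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        # Rank genes within the sample (ties broken by original order).
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = reference[rank]
    return result


def zscore_rows(matrix):
    """Z-score each gene (row) across samples using the population
    standard deviation; constant rows map to zeros."""
    out = []
    for row in matrix:
        mu, sigma = mean(row), pstdev(row)
        out.append([(x - mu) / sigma if sigma else 0.0 for x in row])
    return out


# Toy matrix: 3 genes (rows) by 2 samples (columns), invented for illustration.
raw = [[2.0, 4.0], [5.0, 8.0], [3.0, 12.0]]
print(quantile_normalize(raw))
print(zscore_rows(raw))
```

     After quantile normalization every sample (column) contains the same set of values, so differences between samples reflect rank order rather than overall intensity.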
