These scripts will download, manipulate, and analyse TCGA data obtained through the R package TCGAbiolinks
$ python3 --version
Python 3.6.7
$ R --version
R version 3.5.2 (2018-12-20) -- "Eggshell Igloo"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.
> install.packages(c("tidyverse", "BiocManager"))
> BiocManager::install("TCGAbiolinks")
$ pip3 install --user pandas
$ pip3 install --user scipy
The os and glob modules are part of the Python standard library and do not need to be installed with pip.
functions.R - This will define all general functions used in this project
Download.R - This will download, format, and save all clinical and mutational data to csv files. The files are written as tab-separated ("\t") flat files and saved to the Datasets directory
select_col.py - This is a python script that will take the tab-delimited flat files created by Download.R and extract the important columns to make smaller, more efficient files (~/BiolinksAnalysis/Datasets/*_select.csv)
MannU_Test.py - This python file will calculate the pValues for all flat files written out by Analysis*_2.R. For processing reasons, you must specify the individual comparison directory (Smoke/Race/Gender) and then a valid cancer cohort within that directory (more info in file)
combine_pValues.R - This file will take the output of MannU_Test.py and combine them into one flat file per cancer to make later analysis easier
significant_genes.R - This file will read in the outputs of combine_pValues.R and create two flat files. One has all the genes for each pipeline with a pValue less than 0.05 and the other is a summary of the number of significant genes for each pipeline and cancer
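As a rough illustration of the test behind MannU_Test.py and the 0.05 cutoff used by significant_genes.R, here is a standard-library-only sketch of a two-sided Mann-Whitney U test. It is not the project's code: the function names are hypothetical, and it uses the normal approximation without tie or continuity corrections, so its p-values will differ slightly from those of a statistics library.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U, two-sided p) using the normal approximation.

    Assumption: no tie correction and no continuity correction are applied.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    u = min(u1, u2)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p


# Hypothetical per-person mutation counts for two groups.
group_a = [1, 2, 3, 4, 5]
group_b = [6, 7, 8, 9, 10]
u_stat, p_value = mann_whitney_u(group_a, group_b)
is_significant = p_value < 0.05  # same threshold as significant_genes.R
```

The two groups here are made-up numbers chosen so that they do not overlap; real inputs would be the per-gene frequencies written out by the analysis scripts.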
Analysis1.R - Read in the csv data from Download.R and generate pValues/Quartile data for each Cancer, location, and mutational pipeline only comparing smokers and nonsmokers
Analysis1_1.R - This will perform the same action as Analysis1.R except that it will look at the whole cancer, not specific sites
Analysis1_2.R - This script will calculate the frequencies of genes in each population and write out a tab-delimited file
Analysis1_3.R - This script will perform the same analysis as Analysis1_1.R except that it will focus only on stage I cancers
Analysis1_4.R - This script will calculate the number of mutations for each age and look for a relation
Analysis1_5.R - This script looks at which genes are significantly different across cancers and identifies any common genes
Analysis1_6.R - This script will look at variant_classifications and identify which ones are significant
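To make the age analysis in Analysis1_4.R more concrete, the sketch below fits a least-squares line to mutation count against age and reports the r-squared value. It is a plain-Python illustration under assumed inputs, not the R code the project runs; the function name and the sample numbers are hypothetical.

```python
def linear_fit_r_squared(xs, ys):
    """Least-squares fit of ys on xs; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, r-squared is the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical data: age at diagnosis and number of mutations per person.
ages = [40, 50, 60, 70]
mutation_counts = [10, 20, 30, 40]
slope, intercept, r2 = linear_fit_r_squared(ages, mutation_counts)
```

An r-squared near 1 means age explains most of the variation in mutation count along the fitted line; a value near 0 means the line explains very little.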
This folder will contain all info generated as a result of R/py files such as pValues and graphs and frequency tables
The data was calculated for each group, for each cancer, for each valid site in each cancer, and for each somatic variant pipeline (muse, mutect, somaticsniper, varscan2)
Graphs/nucChange_Graphs contains boxplots showing the different nucleotide changes (e.g. A > G) and their frequency distribution per person in the population
pValues/nucChange_pVal shows the pValue, calculated by wilcox.test, between each respective population for each nucleotide change
Graphs/TiTv_Graphs shows the frequency distribution of Transitions and Transversions per person between two populations
pValues/TiTv_pVal shows the pValue between each population for Transitions and Transversions
Graphs/total_mut_Graphs shows the distribution of the total number of somatic point mutations per person in each population
pValues/total_mut_pVal shows the pValue between each group for the distribution of somatic point mutations
Age contains graphs showing the relation between number of mutations and age.
Age/Files contains flat files with the rsquared values of the linear regression line
Age/Graphs contains the .jpg files of the graphs
Genes_Pvalues contains the gene frequency flat files, the files containing pValues (*_FINAL_PVALUES), and the combined files of FINAL_PVALUES (*_pValues_Combined)
Genes_Pvalues/Summary contains flat files with all the genes that have pValues less than 0.05 and a combined summary table showing the number of significant genes for each pipeline
Graphs/var_class_Graph contains the jpgs showing the proportion of each variant classification for each of the three groups and which ones are significant
pValues/var_class_pval contains the flat files with the pValue for each variant classification, as calculated by Analysis1_6.R
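For readers unfamiliar with the transition/transversion split behind the TiTv graphs and pValues, the following sketch classifies single-nucleotide changes such as A > G. It only illustrates the standard definition (purine to purine or pyrimidine to pyrimidine is a transition, anything else is a transversion); the function names are hypothetical and this is not how the R scripts are necessarily written.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_change(ref, alt):
    """Label a single-nucleotide change as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a valid point mutation: {ref} > {alt}")
    # A transition stays within the same chemical class of base.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def titv_counts(changes):
    """Count transitions and transversions in an iterable of (ref, alt) pairs."""
    counts = {"transition": 0, "transversion": 0}
    for ref, alt in changes:
        counts[classify_change(ref, alt)] += 1
    return counts


# Hypothetical point mutations for one person.
example_changes = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T"), ("T", "C")]
example_counts = titv_counts(example_changes)
```

Per-person counts like these are the kind of values whose distributions could then be compared between two populations.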
This directory puts all the scripts together into a summarised .rmd file that is rendered as a GitHub markdown document and as an HTML webpage