From aa88556deeb04c4885137248d122e2278f930eb0 Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Tue, 10 Nov 2020 09:54:25 -0500 Subject: [PATCH 01/11] Push a basic draft that doesn't exactly work --- .../network-analysis_rnaseq_01_wgcna.Rmd | 365 ++++++++++++++++++ 1 file changed, 365 insertions(+) create mode 100644 03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd diff --git a/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd b/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd new file mode 100644 index 00000000..1af08711 --- /dev/null +++ b/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd @@ -0,0 +1,365 @@ +--- +title: "Over-Representation Analysis - RNA-seq" +author: "CCDL for ALSF" +date: "October 2020" +output: + html_notebook: + toc: true + toc_float: true + number_sections: true +--- + +**DRAFT** + +# Purpose of this analysis + +This analysis 1) Normalizes and transforms data wth DESeq2. + 2) Run WGCNA to identify coexpressed gene modules. + +⬇️ [**Jump to the analysis code**](#analysis) ⬇️ + +# How to run this example + +For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). +We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. + +## Obtain the `.Rmd` file + +To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/differential_expression_rnaseq_01_rnaseq.Rmd). + +Clicking this link will most likely send this to your downloads folder on your computer. +Move this `.Rmd` file to where you would like this example and its files to be stored. + +You can open this `.Rmd` file in RStudio and follow the rest of these steps from there. (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.) + +## Set up your analysis folders + +Good file organization is helpful for keeping your data analysis project on track! +We have set up some code that will automatically set up a folder structure for you. +Run this next chunk to set up your folders! + +If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. + +```{r} +# Create the data folder if it doesn't exist +if (!dir.exists("data")) { + dir.create("data") +} + +# Define the file path to the plots directory +plots_dir <- "plots" # Can replace with path to desired output plots directory + +# Create the plots folder if it doesn't exist +if (!dir.exists(plots_dir)) { + dir.create(plots_dir) +} + +# Define the file path to the results directory +results_dir <- "results" # Can replace with path to desired output results directory + +# Create the results folder if it doesn't exist +if (!dir.exists(results_dir)) { + dir.create(results_dir) +} +``` + +In the same place you put this `.Rmd` file, you should now have three new empty folders called `data`, `plots`, and `results`! + +## Obtain the dataset from refine.bio + +For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data). + +Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP078441/rna-seq-of-primary-patient-aml-samples). + +Click the "Download Now" button on the right side of this screen. + + + +Fill out the pop up window with your email and our Terms and Conditions: + + + +We are going to use non-quantile normalized data for this analysis. +To get this data, you will need to check the box that says "Skip quantile normalization for RNA-seq samples". +Note that this option will only be available for RNA-seq datasets. + + + +It may take a few minutes for the dataset to process. +You will get an email when it is ready. + +## About the dataset we are using for this example + +For this example analysis, we will use this [acute myeloid leukemia (AML) dataset](https://www.refine.bio/experiments/SRP078441/rna-seq-of-primary-patient-aml-samples) [@Micol2017] + +@Micol2017 performed RNA-seq on primary peripheral blood and bone marrow samples from AML patients with and without _ASXL1/2_ mutations. + +## Place the dataset in your new `data/` folder + +refine.bio will send you a download button in the email when it is ready. +Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. +Double clicking should unzip this for you and create a folder of the same name. + + + +For more details on the contents of this folder see [these docs on refine.bio](http://docs.refine.bio/en/latest/main_text.html#downloadable-files). + +The `` folder has the data and metadata TSV files you will need for this example analysis. +Experiment accession ids usually look something like `GSE1235` or `SRP12345`. + +Copy and paste the `SRP078441` folder into your newly created `data/` folder. + +## Check out our file structure! + +Your new analysis folder should contain: + +- The example analysis `.Rmd` you downloaded +- A folder called "data" which contains: + - The `SRP078441` folder which contains: + - The gene expression + - The metadata TSV +- A folder for `plots` (currently empty) +- A folder for `results` (currently empty) + +Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using): + + + +In order for our example here to run without a hitch, we need these files to be in these locations so we've constructed a test to check before we get started with the analysis. +These chunks will declare your file paths and double check that your files are in the right place. + +First we will declare our file paths to our data and metadata files, which should be in our data directory. +This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started. + +```{r} +# Define the file path to the data directory +data_dir <- file.path("data", "SRP078441") # Replace with accession number which will be the name of the folder the files will be in + +# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` +data_file <- file.path(data_dir, "SRP078441.tsv") # Replace with file path to your dataset + +# Declare the file path to the metadata file using the data directory saved as `data_dir` +metadata_file <- file.path(data_dir, "metadata_SRP078441.tsv") # Replace with file path to your metadata +``` + +Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. + +```{r} +# Check if the gene expression matrix file is at the file path stored in `data_file` +file.exists(data_file) + +# Check if the metadata file is at the file path stored in `metadata_file` +file.exists(metadata_file) +``` + +If the chunk above printed out `FALSE` to either of those tests, you won't be able to run this analysis _as is_ until those files are in the appropriate place. + +If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). + +# Using a different refine.bio dataset with this analysis? + +If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code). +We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. +From here you can customize this analysis example to fit your own scientific questions and preferences. + +*** + +   + +# Over-Representation Analysis with `clusterProfiler` - RNA-seq + +## Install libraries + +See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. + +In this analysis, we will be using [`clusterProfiler`](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) package to perform ORA and the [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package which contains gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) already in the tidy format required by `clusterProfiler` [@Igor2020; @Subramanian2005]. + +We will also need the [`org.Dr.eg.db`](https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html) package to perform gene identifier conversion and [`ggupset`](https://cran.r-project.org/web/packages/ggupset/readme/README.html) to make an UpSet plot [@Carlson2019-human; @Ahlmann-Eltze2020]. + +```{r} +if (!("clusterProfiler" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("clusterProfiler", update = FALSE) +} + +if (!("DESeq2" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("DESeq2", update = FALSE) +} + +# This is required to make one of the plots that clusterProfiler will make +if (!("ggupset" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("ggupset", update = FALSE) +} + +if (!("msigdbr" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("msigdbr", update = FALSE) +} + +if (!("org.Hs.eg.db" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("org.Hs.eg.db", update = FALSE) +} + +if (!("impute" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("impute") +} + +if (!("WGCNA" %in% installed.packages())) { + # Install this package if it isn't installed yet + install.packages("WGCNA") +} +``` + +Attach the packages we need for this analysis. + +```{r} +# Attach the library +library(clusterProfiler) + +# Attach the DESeq2 library +library(DESeq2) + +# Package that contains MSigDB gene sets in tidy format +library(msigdbr) + +# Homo sapiens annotation package we'll use for gene identifier conversion +library(org.Hs.eg.db) + +# We will need this so we can use the pipe: %>% +library(magrittr) + +# We'll need this for finding gene modules +library(WGCNA) +``` + +## Import and set up data + +Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. +This chunk of code will read the both TSV files and add them as data frames to your environment. + +We stored our file paths as objects named `metadata_file` and `data_file` in [this previous step](#check-out-our-file-structure). + +```{r} +# Read in metadata TSV file +metadata <- readr::read_tsv(metadata_file) + +# Read in data TSV file +df <- readr::read_tsv(data_file) %>% + # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later + tibble::column_to_rownames("Gene") +``` + +Let's ensure that the metadata and data are in the same sample order. + +```{r} +# Make the data in the order of the metadata +df <- df %>% + dplyr::select(metadata$refinebio_accession_code) + +# Check if this is in the same order +all.equal(colnames(df), metadata$refinebio_accession_code) +``` + +### Prepare data for `DESeq2` + +We need to make sure all of the values in our data are converted to integers as required by a `DESeq2` function we will use later. + +```{r} +# The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers +round_df <- df %>% + # Mutate numeric variables to be integers + dplyr::mutate_if(is.numeric, round) +``` + +## Create a DESeqDataset + +We will be using the `DESeq2` package for [normalizing and transforming our data](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods), which requires us to format our data into a `DESeqDataSet` object. +We turn the data frame (or matrix) into a [`DESeqDataSet` object](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#02_About_DESeq2). ) and specify which variable labels our experimental groups using the [`design` argument](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#multi-factor-designs) [@Love2014]. +In this chunk of code, we will not provide a specific model to the `design` argument because we are not performing a differential expression analysis. + +```{r} +# Create a `DESeqDataSet` object +dds <- DESeqDataSetFromMatrix( + countData = round_df, # This is the data frame with the counts values for all replicates in our dataset + colData = metadata, # This is the data frame with the annotation data for the replicates in the counts data frame + design = ~1 # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis +) +``` + +## Perform DESeq2 normalization and transformation + +We are going to use the `vst()` function from the `DESeq2` package to normalize and transform the data. +For more information about these transformation methods, [see here](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods). + +```{r} +# Normalize and transform the data in the `DESeqDataSet` object using the `vst()` function from the `DESEq2` R package +dds_norm <- vst(dds) +``` + +## Identify a set of genes of interest + +Here we will use WGCNA to identify co-expressed gene modules. +We'll continue with the top gene module with ORA. + +Extract the normalized counts to a matrix and transpose it so we can pass it to WGCNA. + +```{r} +# First we are going to retrieve the normalized data from the `DESeqDataSet` object using the `assay()` function +normalized_counts <- assay(dds_norm) %>% + t() # We need to transpose this data in preparation for the `WGCNA::blockwiseModules()` function +``` + +Run WGCNA! + +```{r} +net <- WGCNA::blockwiseModules(normalized_counts, + power = 6, + TOMType = "unsigned", + minModuleSize = 30, + reassignThreshold = 0, + mergeCutHeight = 0.25, + numericLabels = TRUE, + pamRespectsDendro = FALSE, + saveTOMs = TRUE, + saveTOMFileBase = "SRP078441", + verbose = 3) +``` + +Save the gene IDs for this module as a vector. + +```{r} +module_39_genes <- rownames(df)[which(net$colors == 39)] +``` + +## Write results to file + +```{r} +readr::write_tsv( + kegg_result_df, + file.path( + results_dir, + "SRP078441_wgcna_results.tsv" + ) +) +``` + +# Resources for further learning + +- [WGCNA tutorial]() +- [WGCNA paper]() + +# Session info + +At the end of every analysis, before saving your notebook, we recommend printing out your session info. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. + +```{r} +# Print session info +sessionInfo() +``` + +# References From c0296006dec7ebdf95c5ef70daa1558f77638216 Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Wed, 11 Nov 2020 09:39:53 -0500 Subject: [PATCH 02/11] More words and polishing --- .../dimension-reduction_rnaseq_02_umap.Rmd | 3 - .../network-analysis_rnaseq_01_wgcna.Rmd | 161 +++++++++++------- 2 files changed, 95 insertions(+), 69 deletions(-) diff --git a/03-rnaseq/dimension-reduction_rnaseq_02_umap.Rmd b/03-rnaseq/dimension-reduction_rnaseq_02_umap.Rmd index a5d9d88b..2a4ee8af 100644 --- a/03-rnaseq/dimension-reduction_rnaseq_02_umap.Rmd +++ b/03-rnaseq/dimension-reduction_rnaseq_02_umap.Rmd @@ -90,9 +90,6 @@ You will get an email when it is ready. ## About the dataset we are using for this example For this example analysis, we will use this [prostate cancer dataset](https://www.refine.bio/experiments/SRP133573). - - - The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens. diff --git a/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd b/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd index 1af08711..7441476b 100644 --- a/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd +++ b/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd @@ -13,8 +13,8 @@ output: # Purpose of this analysis -This analysis 1) Normalizes and transforms data wth DESeq2. - 2) Run WGCNA to identify coexpressed gene modules. +In this example, we use [weighted gene co-expression network analysis (WGCNA)]() to identify co-expressed gene modules. +WGCNA using a series of correlations to determine what genes are expressed together in your dataset. ⬇️ [**Jump to the analysis code**](#analysis) ⬇️ @@ -69,7 +69,7 @@ In the same place you put this `.Rmd` file, you should now have three new empty For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data). -Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP078441/rna-seq-of-primary-patient-aml-samples). +Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP133573/identification-of-transcription-factor-relationships-associated-with-androgen-deprivation-therapy-response-and-metastatic-progression-in-prostate-cancer). Click the "Download Now" button on the right side of this screen. @@ -90,9 +90,9 @@ You will get an email when it is ready. ## About the dataset we are using for this example -For this example analysis, we will use this [acute myeloid leukemia (AML) dataset](https://www.refine.bio/experiments/SRP078441/rna-seq-of-primary-patient-aml-samples) [@Micol2017] - -@Micol2017 performed RNA-seq on primary peripheral blood and bone marrow samples from AML patients with and without _ASXL1/2_ mutations. +For this example analysis, we will use this [prostate cancer dataset](https://www.refine.bio/experiments/SRP133573). +The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. +Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens. ## Place the dataset in your new `data/` folder @@ -107,7 +107,7 @@ For more details on the contents of this folder see [these docs on refine.bio](h The `` folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like `GSE1235` or `SRP12345`. -Copy and paste the `SRP078441` folder into your newly created `data/` folder. +Copy and paste the `SRP133573` folder into your newly created `data/` folder. ## Check out our file structure! @@ -115,7 +115,7 @@ Your new analysis folder should contain: - The example analysis `.Rmd` you downloaded - A folder called "data" which contains: - - The `SRP078441` folder which contains: + - The `SRP133573` folder which contains: - The gene expression - The metadata TSV - A folder for `plots` (currently empty) @@ -133,13 +133,13 @@ This is handy to do because if we want to switch the dataset (see next section f ```{r} # Define the file path to the data directory -data_dir <- file.path("data", "SRP078441") # Replace with accession number which will be the name of the folder the files will be in +data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in # Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` -data_file <- file.path(data_dir, "SRP078441.tsv") # Replace with file path to your dataset +data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset # Declare the file path to the metadata file using the data directory saved as `data_dir` -metadata_file <- file.path(data_dir, "metadata_SRP078441.tsv") # Replace with file path to your metadata +metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata ``` Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. @@ -162,6 +162,8 @@ If you'd like to adapt an example analysis to use a different dataset from [refi We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences. +Keep in mind that WGCNA requires at least 15 samples to produce a meaningful result [according to its authors](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html). So you will need to make sure the dataset you switch this to is sufficiently large. + ***   @@ -172,37 +174,15 @@ From here you can customize this analysis example to fit your own scientific que See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. -In this analysis, we will be using [`clusterProfiler`](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) package to perform ORA and the [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package which contains gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) already in the tidy format required by `clusterProfiler` [@Igor2020; @Subramanian2005]. - -We will also need the [`org.Dr.eg.db`](https://bioconductor.org/packages/release/data/annotation/html/org.Dr.eg.db.html) package to perform gene identifier conversion and [`ggupset`](https://cran.r-project.org/web/packages/ggupset/readme/README.html) to make an UpSet plot [@Carlson2019-human; @Ahlmann-Eltze2020]. +In this analysis, we will be using [`WGCNA`](https://cran.r-project.org/web/packages/WGCNA/index.html) package. +WGCNA also requires a package called `impute` that is sometimes has trouble installing so we recommend installing that first. ```{r} -if (!("clusterProfiler" %in% installed.packages())) { - # Install this package if it isn't installed yet - BiocManager::install("clusterProfiler", update = FALSE) -} - if (!("DESeq2" %in% installed.packages())) { # Install this package if it isn't installed yet BiocManager::install("DESeq2", update = FALSE) } -# This is required to make one of the plots that clusterProfiler will make -if (!("ggupset" %in% installed.packages())) { - # Install this package if it isn't installed yet - BiocManager::install("ggupset", update = FALSE) -} - -if (!("msigdbr" %in% installed.packages())) { - # Install this package if it isn't installed yet - BiocManager::install("msigdbr", update = FALSE) -} - -if (!("org.Hs.eg.db" %in% installed.packages())) { - # Install this package if it isn't installed yet - BiocManager::install("org.Hs.eg.db", update = FALSE) -} - if (!("impute" %in% installed.packages())) { # Install this package if it isn't installed yet BiocManager::install("impute") @@ -217,15 +197,9 @@ if (!("WGCNA" %in% installed.packages())) { Attach the packages we need for this analysis. ```{r} -# Attach the library -library(clusterProfiler) - # Attach the DESeq2 library library(DESeq2) -# Package that contains MSigDB gene sets in tidy format -library(msigdbr) - # Homo sapiens annotation package we'll use for gene identifier conversion library(org.Hs.eg.db) @@ -270,7 +244,9 @@ We need to make sure all of the values in our data are converted to integers as ```{r} # The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers -round_df <- df %>% +df <- df %>% + # Bring out rownames and store as their own column + tibble::rownames_to_column("Gene") %>% # Mutate numeric variables to be integers dplyr::mutate_if(is.numeric, round) ``` @@ -284,9 +260,10 @@ In this chunk of code, we will not provide a specific model to the `design` argu ```{r} # Create a `DESeqDataSet` object dds <- DESeqDataSetFromMatrix( - countData = round_df, # This is the data frame with the counts values for all replicates in our dataset + countData = df, # This is the data frame with the counts values for all replicates in our dataset colData = metadata, # This is the data frame with the annotation data for the replicates in the counts data frame - design = ~1 # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis + design = ~1, # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis + tidy = TRUE ) ``` @@ -300,7 +277,7 @@ For more information about these transformation methods, [see here](https://alex dds_norm <- vst(dds) ``` -## Identify a set of genes of interest +## Format normalized data for WGCNA Here we will use WGCNA to identify co-expressed gene modules. We'll continue with the top gene module with ORA. @@ -313,38 +290,90 @@ normalized_counts <- assay(dds_norm) %>% t() # We need to transpose this data in preparation for the `WGCNA::blockwiseModules()` function ``` -Run WGCNA! +## Determine parameters for WGCNA + +WGCNA has to decide what genes are included in which +Soft thresholds are a way for us give a parameter too WGCNA without it being absolute. + + +```{r} +sft <- pickSoftThreshold(normalized_counts, + dataIsExpr = TRUE, + corFnc = cor, + corOptions = list(use = 'p'), + networkType = "signed") +``` + +This `sft` object has a lot of information, we will want to plot some of it to ```{r} -net <- WGCNA::blockwiseModules(normalized_counts, - power = 6, - TOMType = "unsigned", - minModuleSize = 30, - reassignThreshold = 0, - mergeCutHeight = 0.25, - numericLabels = TRUE, - pamRespectsDendro = FALSE, - saveTOMs = TRUE, - saveTOMFileBase = "SRP078441", - verbose = 3) +sft_df <- data.frame(sft$fitIndices) %>% + dplyr::mutate(model_fit = -sign(slope)*SFT.R.sq) + +head(sft_df) +``` + +```{r} +sft_df %>% + ggplot2::ggplot() + + ggplot2::geom_label(ggplot2::aes(x = Power, y = model_fit), label = sft_df$Power) + + ggplot2::xlab("Soft Threshold (power)") + + ggplot2::ylab("Scale Free Topology Model Fit, signed R^2") + + ggplot2::ggtitle("Scale independence") + + ggplot2::theme_classic() ``` -Save the gene IDs for this module as a vector. +12 is the peak, we can use that for our `power` parameter in WGCNA. + +## Run WGCNA! + +We will use the `blockwiseModules()` function to build a gene coexpression network. +The `block` part refers to that these calculations will be done on chunks of your data at a time to help with computing resources. + +This will take some time to run. ```{r} -module_39_genes <- rownames(df)[which(net$colors == 39)] +bwnet <- blockwiseModules(normalized_counts, + TOMType = "signed", # topological overlap matrix + power = 12, # soft threshold for network construction + randomSeed = 1234) +``` + +There are a lot of other settings you can tweak -- look at `?blockwiseModules` help page to see more. + +There are two main parameters in particular that we think you should be aware of: + +- `TOMtype` - what kind of topological overlap matrix (TOM) should be used to make gene modules. + - `"signed"` means that the TOM put positive or negative regulations clustered together -- this is what WGCNA's docs recommend for most situations. + - `"unsigned"` means that when the matrix is being built you won't pay attention to directionality, and focus on strength of relationship. + +For more on signed and unsigned network constructions: see [this article](https://peterlangfelder.com/2018/11/25/signed-or-unsigned-which-network-type-is-preferable/). + +- `power` - the "soft" threshold for determining gene module networks. +- Here we will set `power` to 12, based on the above graph. + + +This object has a lot of information. +We can pull out the parts we are most interested in using for plotting. + +```{r} + +``` + +## Visualize how related these eigengenes are + +```{r} +correlate_eigegenes <- cor(bwnet$MEs) + +pheatmap::pheatmap(correlate_eigegenes) ``` ## Write results to file ```{r} -readr::write_tsv( - kegg_result_df, - file.path( - results_dir, - "SRP078441_wgcna_results.tsv" - ) -) +readr::write_rds(bwnet, + file = file.path("results", "SRP133573_wgcna_results.RDS") + ) ``` # Resources for further learning From 9ac81215aabc7dfbb86edd0a78779a5f07345caf Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Wed, 11 Nov 2020 11:06:45 -0500 Subject: [PATCH 03/11] Try out some plots --- .../network-analysis_rnaseq_01_wgcna.Rmd | 140 ++++++++++++++---- 1 file changed, 115 insertions(+), 25 deletions(-) diff --git a/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd b/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd index 7441476b..ae7c2c8c 100644 --- a/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd +++ b/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd @@ -267,8 +267,23 @@ dds <- DESeqDataSetFromMatrix( ) ``` +## Define a minimum counts cutoff + +We want to filter out the genes that have not been expressed or that have low expression counts. +This is recommended by [WGCNA docs for RNA-seq data](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html#:~:text=Can%20WGCNA%20be%20used%20to,Yes.&text=Whether%20one%20uses%20RPKM%2C%20FPKM,were%20processed%20the%20same%20way.). +Removing low count genes can help improve your WGCNA results. +We are going to do some pre-filtering to keep only genes with 10 or more reads in total across the samples. + +```{r} +# Define a minimum counts cutoff and filter `DESeqDataSet` object to include +# only rows that have counts above the cutoff +genes_to_keep <- rowSums(counts(dds)) >= 50 +dds <- dds[genes_to_keep, ] +``` + ## Perform DESeq2 normalization and transformation +We often suggest normalizing and transforming your data for various applications and in this instance WGCNA's authors [suggest using variance stablizing transformation before running WGCNA](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html#:~:text=Can%20WGCNA%20be%20used%20to,Yes.&text=Whether%20one%20uses%20RPKM%2C%20FPKM,were%20processed%20the%20same%20way.). We are going to use the `vst()` function from the `DESeq2` package to normalize and transform the data. For more information about these transformation methods, [see here](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods). @@ -292,80 +307,154 @@ normalized_counts <- assay(dds_norm) %>% ## Determine parameters for WGCNA -WGCNA has to decide what genes are included in which -Soft thresholds are a way for us give a parameter too WGCNA without it being absolute. - +WGCNA has to decide what genes are included in which modules, but it might encounter some genes that are close to call between modules. +By providing a soft-threshold power parameter, we can guide it on {{I still can't figure out what power actually represents here}} +The "soft" part of a soft threshold just means that this will not be an absolute cutoff and WGCNA will use this parameter with some flexibility. ```{r} sft <- pickSoftThreshold(normalized_counts, dataIsExpr = TRUE, corFnc = cor, - corOptions = list(use = 'p'), networkType = "signed") ``` -This `sft` object has a lot of information, we will want to plot some of it to +This `sft` object has a lot of information, we will want to plot some of it to figure out what our `power` soft-threshold should be. +We have to first calculate the model fit, the R squared and make that a new variable. ```{r} sft_df <- data.frame(sft$fitIndices) %>% dplyr::mutate(model_fit = -sign(slope)*SFT.R.sq) - -head(sft_df) ``` +Now, let's plot the model fitting by the `power` soft threshold so we can decide on a soft-threshold for power. + ```{r} sft_df %>% ggplot2::ggplot() + ggplot2::geom_label(ggplot2::aes(x = Power, y = model_fit), label = sft_df$Power) + + # We will plot what WGCNA recommends as an R^2 cutoff + ggplot2::geom_hline(yintercept = 0.80, col = "red") + + # Just in case our values are low, we want to make sure we can still see the 0.80 level + ggplot2::ylim(c(min(sft_df$model_fit), 1)) + + # We can add more sensible labels for our axis ggplot2::xlab("Soft Threshold (power)") + ggplot2::ylab("Scale Free Topology Model Fit, signed R^2") + ggplot2::ggtitle("Scale independence") + + # This adds some nicer aesthetics to our plot ggplot2::theme_classic() ``` -12 is the peak, we can use that for our `power` parameter in WGCNA. +Using this plot we can decide on a power parameter. +WGCNA's authors recommend using a `power` that has an signed R^2 above `0.80`, otherwise they warn your results may be too noisy to be meaningful. + +If you have multiple power values with signed R^2 above `0.80`, then picking the one at an inflection point, in other words where the R^2 values seem to have reached their saturation [{{SOURCE}}](https://dibernardo.tigem.it/files/papers/2008/zhangbin-statappsgeneticsmolbio.pdf). +You want to a `power` that gives you a big enough R^2 but is not excessively large. + +So using the plot above, going with a power soft-threshold of `14`! + +If you find you have all very low R^2 values this may be because there are too many genes with low values that are cluttering up the calculations. +You can try returning to [gene filtering step](#define-a-minimum-counts-cutoff) and choosing a more stringent cutoff (you'll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped). + ## Run WGCNA! We will use the `blockwiseModules()` function to build a gene coexpression network. -The `block` part refers to that these calculations will be done on chunks of your data at a time to help with computing resources. +The `blockwise` part refers to that these calculations will be done on chunks of your data at a time to help with computing resources. This will take some time to run. +Note that for the `power` argument we are using `14` which we determined above. + ```{r} bwnet <- blockwiseModules(normalized_counts, TOMType = "signed", # topological overlap matrix - power = 12, # soft threshold for network construction - randomSeed = 1234) + power = 14, # soft threshold for network construction + randomSeed = 1234 # there's some randomness associated with this calculation + # so we should set a seed + ) ``` There are a lot of other settings you can tweak -- look at `?blockwiseModules` help page to see more. -There are two main parameters in particular that we think you should be aware of: - -- `TOMtype` - what kind of topological overlap matrix (TOM) should be used to make gene modules. - - `"signed"` means that the TOM put positive or negative regulations clustered together -- this is what WGCNA's docs recommend for most situations. - - `"unsigned"` means that when the matrix is being built you won't pay attention to directionality, and focus on strength of relationship. +The `TOMtype` argument specifies what kind of topological overlap matrix (TOM) should be used to make gene modules. +You can safely assume for most situations a `signed` network represents what you want -- we want WGCNA to pay attention to directionality. +However if you suspect you may benefit from an `unsigned` network, where positive/negative is ignored see [this article](https://peterlangfelder.com/2018/11/25/signed-or-unsigned-which-network-type-is-preferable/) to help you figure that out. -For more on signed and unsigned network constructions: see [this article](https://peterlangfelder.com/2018/11/25/signed-or-unsigned-which-network-type-is-preferable/). +## Explore our WGCNA results -- `power` - the "soft" threshold for determining gene module networks. -- Here we will set `power` to 12, based on the above graph. +This object, `bwnet` has a lot of information. +We can pull out the parts we are most interested in and may want to use use for plotting. +In `bwnet` we have a data frame of eigengene module data for each sample in the `MEs` slot. +These represent the collapsed and combined expression of the genes a part of each module. -This object has a lot of information. -We can pull out the parts we are most interested in using for plotting. +```{r} +module_eigengenes <- bwnet$MEs + +# Print out a preview +head(module_eigengenes) +``` +If you want to know which of your genes make up which modules, you can look at the `$colors` slot. +This is a named list which associates the genes with the module they are a part of. ```{r} +gene_modules <- bwnet$colors +# Print out a preview +head(gene_modules) ``` -## Visualize how related these eigengenes are +We can turn this into a data frame for handy use. + +```{r} +gene_modules_df <- data.frame(gene_modules) %>% + tibble::rownames_to_column("Gene") +``` + +Now we can print out how many genes there are per module. + +```{r} +gene_modules_df %>% dplyr::count(gene_modules) +``` + +Note that the color assignments don't have any particular biological meaning and just function as labels. + +## Take a look at eigengenes with your metadata + +We can also see if our eigengenes relate to our metadata labels. + +First we would have to make an annotation data frame that we can give to heatmap. + +```{r} +# Let's prepare the annotation data.frame for taking a look at our eigengenes +annotation_df <- metadata %>% + # We want to select the variables that we want for annotating the technical replicates heatmap + dplyr::select( + refinebio_accession_code, + refinebio_treatment + ) %>% + # The `pheatmap()` function requires that the row names of our annotation object matches the column names of our `DESeaDataSet` object + tibble::column_to_rownames("refinebio_accession_code") +``` + +Now we can create a heatmap. + +```{r} +pheatmap::pheatmap(module_eigengenes, + show_rownames = FALSE, + annotation_row = annotation_df) +``` + +## Visualize how correlated these eigengenes are ```{r} correlate_eigegenes <- cor(bwnet$MEs) +``` -pheatmap::pheatmap(correlate_eigegenes) +```{r} +pheatmap::pheatmap(correlate_eigegenes, + show_rownames = FALSE, + show_colnames = FALSE) ``` ## Write results to file @@ -378,8 +467,9 @@ readr::write_rds(bwnet, # Resources for further learning -- [WGCNA tutorial]() -- [WGCNA paper]() +- [WGCNA FAQ page](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html) +- [WGCNA tutorial](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/) +- [WGCNA paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-559) # Session info From 51bd54d5a2f18318b6371b889ff05a564ad1e341 Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Wed, 11 Nov 2020 11:08:52 -0500 Subject: [PATCH 04/11] Put some DRAFT and REVIEW designations --- 03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd b/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd index ae7c2c8c..4e4d2121 100644 --- a/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd +++ b/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd @@ -9,8 +9,6 @@ output: number_sections: true --- -**DRAFT** - # Purpose of this analysis In this example, we use [weighted gene co-expression network analysis (WGCNA)]() to identify co-expressed gene modules. @@ -292,6 +290,8 @@ For more information about these transformation methods, [see here](https://alex dds_norm <- vst(dds) ``` +**REVIEW** + ## Format normalized data for WGCNA Here we will use WGCNA to identify co-expressed gene modules. @@ -380,6 +380,8 @@ The `TOMtype` argument specifies what kind of topological overlap matrix (TOM) s You can safely assume for most situations a `signed` network represents what you want -- we want WGCNA to pay attention to directionality. However if you suspect you may benefit from an `unsigned` network, where positive/negative is ignored see [this article](https://peterlangfelder.com/2018/11/25/signed-or-unsigned-which-network-type-is-preferable/) to help you figure that out. +**DRAFT** + ## Explore our WGCNA results This object, `bwnet` has a lot of information. From 743fb45676a26db0da660be50dd7de693ecd00c4 Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Wed, 11 Nov 2020 11:20:41 -0500 Subject: [PATCH 05/11] Add to Dockerfile --- docker/Dockerfile | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/docker/Dockerfile b/docker/Dockerfile index 89ab3b16..0e2aa178 100644 --- a/docker/Dockerfile +++ b/docker/Dockerfile @@ -35,6 +35,9 @@ RUN pip3 install \ #### R packages ############### +# We need to install impute so WGCNA will install in the next step +RUN R -e "options(warn = 2); BiocManager::install('impute', update = FALSE)" + # Commonly used R packages RUN Rscript -e "install.packages( \ c('cluster', \ @@ -45,7 +48,8 @@ RUN Rscript -e "install.packages( \ 'rprojroot', \ 'viridis', \ 'spelling', \ - 'styler'))" + 'styler', \ + 'WGCNA'))" # Install bioconductor packages # org.Mm.eg.db and org.Dr.eg.db are required for gene mapping From cff76f6e9b7a06636fda5d8d9016374958ba5e6a Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Wed, 11 Nov 2020 11:29:06 -0500 Subject: [PATCH 06/11] Get rid of umap edit --- 03-rnaseq/dimension-reduction_rnaseq_02_umap.Rmd | 3 +++ 1 file changed, 3 insertions(+) diff --git a/03-rnaseq/dimension-reduction_rnaseq_02_umap.Rmd b/03-rnaseq/dimension-reduction_rnaseq_02_umap.Rmd index 2a4ee8af..a5d9d88b 100644 --- a/03-rnaseq/dimension-reduction_rnaseq_02_umap.Rmd +++ b/03-rnaseq/dimension-reduction_rnaseq_02_umap.Rmd @@ -90,6 +90,9 @@ You will get an email when it is ready. ## About the dataset we are using for this example For this example analysis, we will use this [prostate cancer dataset](https://www.refine.bio/experiments/SRP133573). + + + The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens. From 4f778623576fda2c23a96a4c6d08135bdf1c8976 Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Wed, 11 Nov 2020 12:11:16 -0500 Subject: [PATCH 07/11] Add wgcna example to snakefile and run --- .../network-analysis_rnaseq_01_wgcna.Rmd | 61 +- .../network-analysis_rnaseq_01_wgcna.html | 4387 +++++++++++++++++ Snakefile | 1 + 3 files changed, 4420 insertions(+), 29 deletions(-) create mode 100644 03-rnaseq/network-analysis_rnaseq_01_wgcna.html diff --git a/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd b/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd index 4e4d2121..d4328f44 100644 --- a/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd +++ b/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd @@ -229,7 +229,7 @@ Let's ensure that the metadata and data are in the same sample order. ```{r} # Make the data in the order of the metadata -df <- df %>% +df <- df %>% dplyr::select(metadata$refinebio_accession_code) # Check if this is in the same order @@ -246,7 +246,7 @@ df <- df %>% # Bring out rownames and store as their own column tibble::rownames_to_column("Gene") %>% # Mutate numeric variables to be integers - dplyr::mutate_if(is.numeric, round) + dplyr::mutate_if(is.numeric, round) ``` ## Create a DESeqDataset @@ -312,34 +312,35 @@ By providing a soft-threshold power parameter, we can guide it on {{I still can' The "soft" part of a soft threshold just means that this will not be an absolute cutoff and WGCNA will use this parameter with some flexibility. ```{r} -sft <- pickSoftThreshold(normalized_counts, - dataIsExpr = TRUE, - corFnc = cor, - networkType = "signed") +sft <- pickSoftThreshold(normalized_counts, + dataIsExpr = TRUE, + corFnc = cor, + networkType = "signed" +) ``` This `sft` object has a lot of information, we will want to plot some of it to figure out what our `power` soft-threshold should be. We have to first calculate the model fit, the R squared and make that a new variable. ```{r} -sft_df <- data.frame(sft$fitIndices) %>% - dplyr::mutate(model_fit = -sign(slope)*SFT.R.sq) +sft_df <- data.frame(sft$fitIndices) %>% + dplyr::mutate(model_fit = -sign(slope) * SFT.R.sq) ``` Now, let's plot the model fitting by the `power` soft threshold so we can decide on a soft-threshold for power. ```{r} -sft_df %>% - ggplot2::ggplot() + +sft_df %>% + ggplot2::ggplot() + ggplot2::geom_label(ggplot2::aes(x = Power, y = model_fit), label = sft_df$Power) + # We will plot what WGCNA recommends as an R^2 cutoff ggplot2::geom_hline(yintercept = 0.80, col = "red") + # Just in case our values are low, we want to make sure we can still see the 0.80 level ggplot2::ylim(c(min(sft_df$model_fit), 1)) + # We can add more sensible labels for our axis - ggplot2::xlab("Soft Threshold (power)") + - ggplot2::ylab("Scale Free Topology Model Fit, signed R^2") + - ggplot2::ggtitle("Scale independence") + + ggplot2::xlab("Soft Threshold (power)") + + ggplot2::ylab("Scale Free Topology Model Fit, signed R^2") + + ggplot2::ggtitle("Scale independence") + # This adds some nicer aesthetics to our plot ggplot2::theme_classic() ``` @@ -366,12 +367,12 @@ This will take some time to run. Note that for the `power` argument we are using `14` which we determined above. ```{r} -bwnet <- blockwiseModules(normalized_counts, - TOMType = "signed", # topological overlap matrix - power = 14, # soft threshold for network construction - randomSeed = 1234 # there's some randomness associated with this calculation - # so we should set a seed - ) +bwnet <- blockwiseModules(normalized_counts, + TOMType = "signed", # topological overlap matrix + power = 14, # soft threshold for network construction + randomSeed = 1234 # there's some randomness associated with this calculation + # so we should set a seed +) ``` There are a lot of other settings you can tweak -- look at `?blockwiseModules` help page to see more. @@ -409,7 +410,7 @@ head(gene_modules) We can turn this into a data frame for handy use. ```{r} -gene_modules_df <- data.frame(gene_modules) %>% +gene_modules_df <- data.frame(gene_modules) %>% tibble::rownames_to_column("Gene") ``` @@ -442,9 +443,10 @@ annotation_df <- metadata %>% Now we can create a heatmap. ```{r} -pheatmap::pheatmap(module_eigengenes, - show_rownames = FALSE, - annotation_row = annotation_df) +pheatmap::pheatmap(module_eigengenes, + show_rownames = FALSE, + annotation_row = annotation_df +) ``` ## Visualize how correlated these eigengenes are @@ -454,17 +456,18 @@ correlate_eigegenes <- cor(bwnet$MEs) ``` ```{r} -pheatmap::pheatmap(correlate_eigegenes, - show_rownames = FALSE, - show_colnames = FALSE) +pheatmap::pheatmap(correlate_eigegenes, + show_rownames = FALSE, + show_colnames = FALSE +) ``` ## Write results to file ```{r} -readr::write_rds(bwnet, - file = file.path("results", "SRP133573_wgcna_results.RDS") - ) +readr::write_rds(bwnet, + file = file.path("results", "SRP133573_wgcna_results.RDS") +) ``` # Resources for further learning diff --git a/03-rnaseq/network-analysis_rnaseq_01_wgcna.html b/03-rnaseq/network-analysis_rnaseq_01_wgcna.html new file mode 100644 index 00000000..f17c7f9b --- /dev/null +++ b/03-rnaseq/network-analysis_rnaseq_01_wgcna.html @@ -0,0 +1,4387 @@ + + + + + + + + + + + + + + +Over-Representation Analysis - RNA-seq + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+
+
+
+ +
+ + + + + + + + + +
+

1 Purpose of this analysis

+

In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules. WGCNA using a series of correlations to determine what genes are expressed together in your dataset.

+

⬇️ Jump to the analysis code ⬇️

+
+
+

2 How to run this example

+

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

+
+

2.1 Obtain the .Rmd file

+

To run this example yourself, download the .Rmd for this analysis by clicking this link.

+

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

+

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

+
+
+

2.2 Set up your analysis folders

+

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

+

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots" # Can replace with path to desired output plots directory
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results" # Can replace with path to desired output results directory
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}
+

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

+
+
+

2.3 Obtain the dataset from refine.bio

+

For general information about downloading data for these examples, see our ‘Getting Started’ section.

+

Go to this dataset’s page on refine.bio.

+

Click the “Download Now” button on the right side of this screen.

+

+

Fill out the pop up window with your email and our Terms and Conditions:

+

+

We are going to use non-quantile normalized data for this analysis. To get this data, you will need to check the box that says “Skip quantile normalization for RNA-seq samples”. Note that this option will only be available for RNA-seq datasets.

+

+

It may take a few minutes for the dataset to process. You will get an email when it is ready.

+
+
+

2.4 About the dataset we are using for this example

+

For this example analysis, we will use this prostate cancer dataset. The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens.

+
+
+

2.5 Place the dataset in your new data/ folder

+

refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

+

+

For more details on the contents of this folder see these docs on refine.bio.

+

The <experiment_accession_id> folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

+

Copy and paste the SRP133573 folder into your newly created data/ folder.

+
+
+

2.6 Check out our file structure!

+

Your new analysis folder should contain:

+
    +
  • The example analysis .Rmd you downloaded
    +
  • +
  • A folder called “data” which contains: +
      +
    • The SRP133573 folder which contains: +
        +
      • The gene expression
        +
      • +
      • The metadata TSV
        +
      • +
    • +
  • +
  • A folder for plots (currently empty)
  • +
  • A folder for results (currently empty)
  • +
+

Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

+

+

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

+

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

+
# Define the file path to the data directory
+data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in
+
+# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
+data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset
+
+# Declare the file path to the metadata file using the data directory saved as `data_dir`
+metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata
+

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

+
# Check if the gene expression matrix file is at the file path stored in `data_file`
+file.exists(data_file)
+
## [1] TRUE
+
# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
+
## [1] TRUE
+

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

+

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

+
+
+
+

3 Using a different refine.bio dataset with this analysis?

+

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

+

Keep in mind that WGCNA requires at least 15 samples to produce a meaningful result according to its authors. So you will need to make sure the dataset you switch this to is sufficiently large.

+
+ +

 

+
+
+

4 Over-Representation Analysis with clusterProfiler - RNA-seq

+
+

4.1 Install libraries

+

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

+

In this analysis, we will be using WGCNA package. WGCNA also requires a package called impute that is sometimes has trouble installing so we recommend installing that first.

+
if (!("DESeq2" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("DESeq2", update = FALSE)
+}
+
+if (!("impute" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("impute")
+}
+
+if (!("WGCNA" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  install.packages("WGCNA")
+}
+

Attach the packages we need for this analysis.

+
# Attach the DESeq2 library
+library(DESeq2)
+
## Loading required package: S4Vectors
+
## Loading required package: stats4
+
## Loading required package: BiocGenerics
+
## Loading required package: parallel
+
## 
+## Attaching package: 'BiocGenerics'
+
## The following objects are masked from 'package:parallel':
+## 
+##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
+##     clusterExport, clusterMap, parApply, parCapply, parLapply,
+##     parLapplyLB, parRapply, parSapply, parSapplyLB
+
## The following objects are masked from 'package:stats':
+## 
+##     IQR, mad, sd, var, xtabs
+
## The following objects are masked from 'package:base':
+## 
+##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
+##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
+##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
+##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
+##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
+##     union, unique, unsplit, which.max, which.min
+
## 
+## Attaching package: 'S4Vectors'
+
## The following object is masked from 'package:base':
+## 
+##     expand.grid
+
## Loading required package: IRanges
+
## Loading required package: GenomicRanges
+
## Loading required package: GenomeInfoDb
+
## Loading required package: SummarizedExperiment
+
## Loading required package: MatrixGenerics
+
## Loading required package: matrixStats
+
## 
+## Attaching package: 'MatrixGenerics'
+
## The following objects are masked from 'package:matrixStats':
+## 
+##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
+##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
+##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
+##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
+##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
+##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
+##     colWeightedMeans, colWeightedMedians, colWeightedSds,
+##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
+##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
+##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
+##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
+##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
+##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
+##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
+##     rowWeightedSds, rowWeightedVars
+
## Loading required package: Biobase
+
## Welcome to Bioconductor
+## 
+##     Vignettes contain introductory material; view with
+##     'browseVignettes()'. To cite Bioconductor, see
+##     'citation("Biobase")', and for packages 'citation("pkgname")'.
+
## 
+## Attaching package: 'Biobase'
+
## The following object is masked from 'package:MatrixGenerics':
+## 
+##     rowMedians
+
## The following objects are masked from 'package:matrixStats':
+## 
+##     anyMissing, rowMedians
+
# Homo sapiens annotation package we'll use for gene identifier conversion
+library(org.Hs.eg.db)
+
## Loading required package: AnnotationDbi
+
## 
+
# We will need this so we can use the pipe: %>%
+library(magrittr)
+
+# We'll need this for finding gene modules
+library(WGCNA)
+
## Loading required package: dynamicTreeCut
+
## Loading required package: fastcluster
+
## 
+## Attaching package: 'fastcluster'
+
## The following object is masked from 'package:stats':
+## 
+##     hclust
+
## 
+
## 
+## Attaching package: 'WGCNA'
+
## The following object is masked from 'package:IRanges':
+## 
+##     cor
+
## The following object is masked from 'package:S4Vectors':
+## 
+##     cor
+
## The following object is masked from 'package:stats':
+## 
+##     cor
+
+
+

4.2 Import and set up data

+

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.

+

We stored our file paths as objects named metadata_file and data_file in this previous step.

+
# Read in metadata TSV file
+metadata <- readr::read_tsv(metadata_file)
+
## 
+## ── Column specification ────────────────────────────────────────────────────────────────────────────────────
+## cols(
+##   .default = col_character(),
+##   refinebio_age = col_logical(),
+##   refinebio_cell_line = col_logical(),
+##   refinebio_compound = col_logical(),
+##   refinebio_disease_stage = col_logical(),
+##   refinebio_genetic_information = col_logical(),
+##   refinebio_processed = col_logical(),
+##   refinebio_sex = col_logical(),
+##   refinebio_source_archive_url = col_logical(),
+##   refinebio_specimen_part = col_logical(),
+##   refinebio_time = col_logical()
+## )
+## ℹ Use `spec()` for the full column specifications.
+
# Read in data TSV file
+df <- readr::read_tsv(data_file) %>%
+  # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
+  tibble::column_to_rownames("Gene")
+
## 
+## ── Column specification ────────────────────────────────────────────────────────────────────────────────────
+## cols(
+##   .default = col_double(),
+##   Gene = col_character()
+## )
+## ℹ Use `spec()` for the full column specifications.
+

Let’s ensure that the metadata and data are in the same sample order.

+
# Make the data in the order of the metadata
+df <- df %>%
+  dplyr::select(metadata$refinebio_accession_code)
+
+# Check if this is in the same order
+all.equal(colnames(df), metadata$refinebio_accession_code)
+
## [1] TRUE
+
+

4.2.1 Prepare data for DESeq2

+

We need to make sure all of the values in our data are converted to integers as required by a DESeq2 function we will use later.

+
# The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers
+df <- df %>%
+  # Bring out rownames and store as their own column
+  tibble::rownames_to_column("Gene") %>%
+  # Mutate numeric variables to be integers
+  dplyr::mutate_if(is.numeric, round)
+
+
+
+

4.3 Create a DESeqDataset

+

We will be using the DESeq2 package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet object. We turn the data frame (or matrix) into a DESeqDataSet object. ) and specify which variable labels our experimental groups using the design argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design argument because we are not performing a differential expression analysis.

+
# Create a `DESeqDataSet` object
+dds <- DESeqDataSetFromMatrix(
+  countData = df, # This is the data frame with the counts values for all replicates in our dataset
+  colData = metadata, # This is the data frame with the annotation data for the replicates in the counts data frame
+  design = ~1, # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis
+  tidy = TRUE
+)
+
## converting counts to integer mode
+
+
+

4.4 Define a minimum counts cutoff

+

We want to filter out the genes that have not been expressed or that have low expression counts. This is recommended by WGCNA docs for RNA-seq data. Removing low count genes can help improve your WGCNA results. We are going to do some pre-filtering to keep only genes with 10 or more reads in total across the samples.

+
# Define a minimum counts cutoff and filter `DESeqDataSet` object to include
+# only rows that have counts above the cutoff
+genes_to_keep <- rowSums(counts(dds)) >= 50
+dds <- dds[genes_to_keep, ]
+
+
+

4.5 Perform DESeq2 normalization and transformation

+

We often suggest normalizing and transforming your data for various applications and in this instance WGCNA’s authors suggest using variance stablizing transformation before running WGCNA.
+We are going to use the vst() function from the DESeq2 package to normalize and transform the data. For more information about these transformation methods, see here.

+
# Normalize and transform the data in the `DESeqDataSet` object using the `vst()` function from the `DESEq2` R package
+dds_norm <- vst(dds)
+

REVIEW

+
+
+

4.6 Format normalized data for WGCNA

+

Here we will use WGCNA to identify co-expressed gene modules. We’ll continue with the top gene module with ORA.

+

Extract the normalized counts to a matrix and transpose it so we can pass it to WGCNA.

+
# First we are going to retrieve the normalized data from the `DESeqDataSet` object using the `assay()` function
+normalized_counts <- assay(dds_norm) %>%
+  t() # We need to transpose this data in preparation for the `WGCNA::blockwiseModules()` function
+
+
+

4.7 Determine parameters for WGCNA

+

WGCNA has to decide what genes are included in which modules, but it might encounter some genes that are close to call between modules. By providing a soft-threshold power parameter, we can guide it on {{I still can’t figure out what power actually represents here}} The “soft” part of a soft threshold just means that this will not be an absolute cutoff and WGCNA will use this parameter with some flexibility.

+
sft <- pickSoftThreshold(normalized_counts,
+  dataIsExpr = TRUE,
+  corFnc = cor,
+  networkType = "signed"
+)
+
## Warning: executing %dopar% sequentially: no parallel backend registered
+
##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
+## 1      1  0.58200 12.200          0.957 13500.0   13600.0  15500
+## 2      2  0.44500  5.130          0.972  7630.0    7650.0   9910
+## 3      3  0.26300  2.570          0.985  4480.0    4450.0   6680
+## 4      4  0.06480  0.914          0.985  2730.0    2680.0   4720
+## 5      5  0.00662 -0.236          0.964  1720.0    1660.0   3450
+## 6      6  0.15900 -1.010          0.965  1120.0    1060.0   2580
+## 7      7  0.36500 -1.470          0.971   746.0     689.0   1980
+## 8      8  0.50000 -1.730          0.972   509.0     459.0   1550
+## 9      9  0.59700 -1.910          0.972   356.0     313.0   1220
+## 10    10  0.67000 -2.060          0.973   253.0     217.0    982
+## 11    12  0.74000 -2.260          0.970   135.0     110.0    651
+## 12    14  0.79400 -2.320          0.978    76.9      58.6    447
+## 13    16  0.82000 -2.350          0.981    45.9      32.7    315
+## 14    18  0.83800 -2.360          0.985    28.6      18.9    227
+## 15    20  0.84500 -2.350          0.987    18.5      11.2    167
+

This sft object has a lot of information, we will want to plot some of it to figure out what our power soft-threshold should be. We have to first calculate the model fit, the R squared and make that a new variable.

+
sft_df <- data.frame(sft$fitIndices) %>%
+  dplyr::mutate(model_fit = -sign(slope) * SFT.R.sq)
+

Now, let’s plot the model fitting by the power soft threshold so we can decide on a soft-threshold for power.

+
sft_df %>%
+  ggplot2::ggplot() +
+  ggplot2::geom_label(ggplot2::aes(x = Power, y = model_fit), label = sft_df$Power) +
+  # We will plot what WGCNA recommends as an R^2 cutoff
+  ggplot2::geom_hline(yintercept = 0.80, col = "red") +
+  # Just in case our values are low, we want to make sure we can still see the 0.80 level
+  ggplot2::ylim(c(min(sft_df$model_fit), 1)) +
+  # We can add more sensible labels for our axis
+  ggplot2::xlab("Soft Threshold (power)") +
+  ggplot2::ylab("Scale Free Topology Model Fit, signed R^2") +
+  ggplot2::ggtitle("Scale independence") +
+  # This adds some nicer aesthetics to our plot
+  ggplot2::theme_classic()
+

+

Using this plot we can decide on a power parameter. WGCNA’s authors recommend using a power that has an signed R^2 above 0.80, otherwise they warn your results may be too noisy to be meaningful.

+

If you have multiple power values with signed R^2 above 0.80, then picking the one at an inflection point, in other words where the R^2 values seem to have reached their saturation {{SOURCE}}. You want to a power that gives you a big enough R^2 but is not excessively large.

+

So using the plot above, going with a power soft-threshold of 14!

+

If you find you have all very low R^2 values this may be because there are too many genes with low values that are cluttering up the calculations. You can try returning to gene filtering step and choosing a more stringent cutoff (you’ll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped).

+
+
+

4.8 Run WGCNA!

+

We will use the blockwiseModules() function to build a gene coexpression network. The blockwise part refers to that these calculations will be done on chunks of your data at a time to help with computing resources.

+

This will take some time to run.

+

Note that for the power argument we are using 14 which we determined above.

+
bwnet <- blockwiseModules(normalized_counts,
+  TOMType = "signed", # topological overlap matrix
+  power = 14, # soft threshold for network construction
+  randomSeed = 1234 # there's some randomness associated with this calculation
+  # so we should set a seed
+)
+

There are a lot of other settings you can tweak – look at ?blockwiseModules help page to see more.

+

The TOMtype argument specifies what kind of topological overlap matrix (TOM) should be used to make gene modules. You can safely assume for most situations a signed network represents what you want – we want WGCNA to pay attention to directionality. However if you suspect you may benefit from an unsigned network, where positive/negative is ignored see this article to help you figure that out.

+

DRAFT

+
+
+

4.9 Explore our WGCNA results

+

This object, bwnet has a lot of information. We can pull out the parts we are most interested in and may want to use use for plotting.

+

In bwnet we have a data frame of eigengene module data for each sample in the MEs slot. These represent the collapsed and combined expression of the genes a part of each module.

+
module_eigengenes <- bwnet$MEs
+
+# Print out a preview
+head(module_eigengenes)
+
+ +
+

If you want to know which of your genes make up which modules, you can look at the $colors slot. This is a named list which associates the genes with the module they are a part of.

+
gene_modules <- bwnet$colors
+
+# Print out a preview
+head(gene_modules)
+
## ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460 
+##          "grey"           "tan"          "grey"          "grey"          "grey" 
+## ENSG00000000938 
+##          "grey"
+

We can turn this into a data frame for handy use.

+
gene_modules_df <- data.frame(gene_modules) %>%
+  tibble::rownames_to_column("Gene")
+

Now we can print out how many genes there are per module.

+
gene_modules_df %>% dplyr::count(gene_modules)
+
+ +
+

Note that the color assignments don’t have any particular biological meaning and just function as labels.

+
+
+

4.10 Take a look at eigengenes with your metadata

+

We can also see if our eigengenes relate to our metadata labels.

+

First we would have to make an annotation data frame that we can give to heatmap.

+
# Let's prepare the annotation data.frame for taking a look at our eigengenes
+annotation_df <- metadata %>%
+  # We want to select the variables that we want for annotating the technical replicates heatmap
+  dplyr::select(
+    refinebio_accession_code,
+    refinebio_treatment
+  ) %>%
+  # The `pheatmap()` function requires that the row names of our annotation object matches the column names of our `DESeaDataSet` object
+  tibble::column_to_rownames("refinebio_accession_code")
+

Now we can create a heatmap.

+
pheatmap::pheatmap(module_eigengenes,
+  show_rownames = FALSE,
+  annotation_row = annotation_df
+)
+

+
+
+

4.11 Visualize how correlated these eigengenes are

+
correlate_eigegenes <- cor(bwnet$MEs)
+
pheatmap::pheatmap(correlate_eigegenes,
+  show_rownames = FALSE,
+  show_colnames = FALSE
+)
+

+
+
+

4.12 Write results to file

+
readr::write_rds(bwnet,
+  file = file.path("results", "SRP133573_wgcna_results.RDS")
+)
+
+
+
+

5 Resources for further learning

+ +
+
+

6 Session info

+

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

+
# Print session info
+sessionInfo()
+
## R version 4.0.2 (2020-06-22)
+## Platform: x86_64-pc-linux-gnu (64-bit)
+## Running under: Ubuntu 20.04 LTS
+## 
+## Matrix products: default
+## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
+## 
+## locale:
+##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
+##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
+##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
+##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
+##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
+## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
+## 
+## attached base packages:
+## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
+## [8] methods   base     
+## 
+## other attached packages:
+##  [1] WGCNA_1.69                  fastcluster_1.1.25         
+##  [3] dynamicTreeCut_1.63-1       magrittr_1.5               
+##  [5] org.Hs.eg.db_3.12.0         AnnotationDbi_1.52.0       
+##  [7] DESeq2_1.30.0               SummarizedExperiment_1.20.0
+##  [9] Biobase_2.50.0              MatrixGenerics_1.2.0       
+## [11] matrixStats_0.57.0          GenomicRanges_1.42.0       
+## [13] GenomeInfoDb_1.26.0         IRanges_2.24.0             
+## [15] S4Vectors_0.28.0            BiocGenerics_0.36.0        
+## [17] optparse_1.6.6             
+## 
+## loaded via a namespace (and not attached):
+##  [1] colorspace_1.4-1       ellipsis_0.3.1         htmlTable_2.1.0       
+##  [4] XVector_0.30.0         base64enc_0.1-3        rstudioapi_0.11       
+##  [7] farver_2.0.3           getopt_1.20.3          bit64_4.0.5           
+## [10] fansi_0.4.1            codetools_0.2-16       splines_4.0.2         
+## [13] R.methodsS3_1.8.1      doParallel_1.0.15      impute_1.64.0         
+## [16] geneplotter_1.68.0     knitr_1.30             jsonlite_1.7.1        
+## [19] Formula_1.2-3          annotate_1.68.0        cluster_2.1.0         
+## [22] GO.db_3.12.1           png_0.1-7              R.oo_1.24.0           
+## [25] pheatmap_1.0.12        readr_1.4.0            compiler_4.0.2        
+## [28] httr_1.4.2             backports_1.1.10       assertthat_0.2.1      
+## [31] Matrix_1.2-18          cli_2.1.0              htmltools_0.5.0       
+## [34] tools_4.0.2            gtable_0.3.0           glue_1.4.2            
+## [37] GenomeInfoDbData_1.2.4 dplyr_1.0.2            Rcpp_1.0.5            
+## [40] styler_1.3.2           vctrs_0.3.4            preprocessCore_1.52.0 
+## [43] iterators_1.0.12       xfun_0.18              stringr_1.4.0         
+## [46] ps_1.4.0               lifecycle_0.2.0        XML_3.99-0.5          
+## [49] zlibbioc_1.36.0        scales_1.1.1           hms_0.5.3             
+## [52] rematch2_2.1.2         RColorBrewer_1.1-2     yaml_2.2.1            
+## [55] memoise_1.1.0          gridExtra_2.3          ggplot2_3.3.2         
+## [58] rpart_4.1-15           latticeExtra_0.6-29    stringi_1.5.3         
+## [61] RSQLite_2.2.1          genefilter_1.72.0      foreach_1.5.0         
+## [64] checkmate_2.0.0        BiocParallel_1.24.1    rlang_0.4.8           
+## [67] pkgconfig_2.0.3        bitops_1.0-6           evaluate_0.14         
+## [70] lattice_0.20-41        purrr_0.3.4            htmlwidgets_1.5.2     
+## [73] labeling_0.3           bit_4.0.4              tidyselect_1.1.0      
+## [76] R6_2.4.1               generics_0.0.2         Hmisc_4.4-1           
+## [79] DelayedArray_0.16.0    DBI_1.1.0              pillar_1.4.6          
+## [82] foreign_0.8-80         survival_3.1-12        RCurl_1.98-1.2        
+## [85] nnet_7.3-14            tibble_3.0.4           crayon_1.3.4          
+## [88] rmarkdown_2.4          jpeg_0.1-8.1           locfit_1.5-9.4        
+## [91] grid_4.0.2             data.table_1.13.0      blob_1.2.1            
+## [94] digest_0.6.25          xtable_1.8-4           R.cache_0.14.0        
+## [97] R.utils_2.10.1         munsell_0.5.0
+
+
+

References

+
+
+

Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8

+
+
+
+ + + + +
+
+ +
+ + + + + + + + + + + + + + + + diff --git a/Snakefile b/Snakefile index 7fe64d3f..526e7d2a 100644 --- a/Snakefile +++ b/Snakefile @@ -18,6 +18,7 @@ rule target: "03-rnaseq/dimension-reduction_rnaseq_02_umap.html", "03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html", "03-rnaseq/ortholog-mapping_rnaseq_01_ensembl.html", + "03-rnaseq/network-analysis_rnaseq_01_wgcna.html", "04-advanced-topics/00-intro-to-advanced-topics.html" rule render_citations: From 364e1757b676cef069a5b1f817ded178b11e28bf Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Wed, 11 Nov 2020 16:57:09 -0500 Subject: [PATCH 08/11] Move WGCNA to the "advanced topics" folder --- .../network-analysis_rnaseq_01_wgcna.Rmd | 489 ++ .../network-analysis_rnaseq_01_wgcna.html | 4387 +++++++++++++++++ Snakefile | 5 +- 3 files changed, 4879 insertions(+), 2 deletions(-) create mode 100644 04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd create mode 100644 04-advanced-topics/network-analysis_rnaseq_01_wgcna.html diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd new file mode 100644 index 00000000..d4328f44 --- /dev/null +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -0,0 +1,489 @@ +--- +title: "Over-Representation Analysis - RNA-seq" +author: "CCDL for ALSF" +date: "October 2020" +output: + html_notebook: + toc: true + toc_float: true + number_sections: true +--- + +# Purpose of this analysis + +In this example, we use [weighted gene co-expression network analysis (WGCNA)]() to identify co-expressed gene modules. +WGCNA using a series of correlations to determine what genes are expressed together in your dataset. + +⬇️ [**Jump to the analysis code**](#analysis) ⬇️ + +# How to run this example + +For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). +We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. + +## Obtain the `.Rmd` file + +To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/differential_expression_rnaseq_01_rnaseq.Rmd). + +Clicking this link will most likely send this to your downloads folder on your computer. +Move this `.Rmd` file to where you would like this example and its files to be stored. + +You can open this `.Rmd` file in RStudio and follow the rest of these steps from there. (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.) + +## Set up your analysis folders + +Good file organization is helpful for keeping your data analysis project on track! +We have set up some code that will automatically set up a folder structure for you. +Run this next chunk to set up your folders! + +If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. + +```{r} +# Create the data folder if it doesn't exist +if (!dir.exists("data")) { + dir.create("data") +} + +# Define the file path to the plots directory +plots_dir <- "plots" # Can replace with path to desired output plots directory + +# Create the plots folder if it doesn't exist +if (!dir.exists(plots_dir)) { + dir.create(plots_dir) +} + +# Define the file path to the results directory +results_dir <- "results" # Can replace with path to desired output results directory + +# Create the results folder if it doesn't exist +if (!dir.exists(results_dir)) { + dir.create(results_dir) +} +``` + +In the same place you put this `.Rmd` file, you should now have three new empty folders called `data`, `plots`, and `results`! + +## Obtain the dataset from refine.bio + +For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data). + +Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP133573/identification-of-transcription-factor-relationships-associated-with-androgen-deprivation-therapy-response-and-metastatic-progression-in-prostate-cancer). + +Click the "Download Now" button on the right side of this screen. + + + +Fill out the pop up window with your email and our Terms and Conditions: + + + +We are going to use non-quantile normalized data for this analysis. +To get this data, you will need to check the box that says "Skip quantile normalization for RNA-seq samples". +Note that this option will only be available for RNA-seq datasets. + + + +It may take a few minutes for the dataset to process. +You will get an email when it is ready. + +## About the dataset we are using for this example + +For this example analysis, we will use this [prostate cancer dataset](https://www.refine.bio/experiments/SRP133573). +The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. +Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens. + +## Place the dataset in your new `data/` folder + +refine.bio will send you a download button in the email when it is ready. +Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. +Double clicking should unzip this for you and create a folder of the same name. + + + +For more details on the contents of this folder see [these docs on refine.bio](http://docs.refine.bio/en/latest/main_text.html#downloadable-files). + +The `` folder has the data and metadata TSV files you will need for this example analysis. +Experiment accession ids usually look something like `GSE1235` or `SRP12345`. + +Copy and paste the `SRP133573` folder into your newly created `data/` folder. + +## Check out our file structure! + +Your new analysis folder should contain: + +- The example analysis `.Rmd` you downloaded +- A folder called "data" which contains: + - The `SRP133573` folder which contains: + - The gene expression + - The metadata TSV +- A folder for `plots` (currently empty) +- A folder for `results` (currently empty) + +Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using): + + + +In order for our example here to run without a hitch, we need these files to be in these locations so we've constructed a test to check before we get started with the analysis. +These chunks will declare your file paths and double check that your files are in the right place. + +First we will declare our file paths to our data and metadata files, which should be in our data directory. +This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started. + +```{r} +# Define the file path to the data directory +data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in + +# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` +data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset + +# Declare the file path to the metadata file using the data directory saved as `data_dir` +metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata +``` + +Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. + +```{r} +# Check if the gene expression matrix file is at the file path stored in `data_file` +file.exists(data_file) + +# Check if the metadata file is at the file path stored in `metadata_file` +file.exists(metadata_file) +``` + +If the chunk above printed out `FALSE` to either of those tests, you won't be able to run this analysis _as is_ until those files are in the appropriate place. + +If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). + +# Using a different refine.bio dataset with this analysis? + +If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code). +We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. +From here you can customize this analysis example to fit your own scientific questions and preferences. + +Keep in mind that WGCNA requires at least 15 samples to produce a meaningful result [according to its authors](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html). So you will need to make sure the dataset you switch this to is sufficiently large. + +*** + +   + +# Over-Representation Analysis with `clusterProfiler` - RNA-seq + +## Install libraries + +See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. + +In this analysis, we will be using [`WGCNA`](https://cran.r-project.org/web/packages/WGCNA/index.html) package. +WGCNA also requires a package called `impute` that is sometimes has trouble installing so we recommend installing that first. + +```{r} +if (!("DESeq2" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("DESeq2", update = FALSE) +} + +if (!("impute" %in% installed.packages())) { + # Install this package if it isn't installed yet + BiocManager::install("impute") +} + +if (!("WGCNA" %in% installed.packages())) { + # Install this package if it isn't installed yet + install.packages("WGCNA") +} +``` + +Attach the packages we need for this analysis. + +```{r} +# Attach the DESeq2 library +library(DESeq2) + +# Homo sapiens annotation package we'll use for gene identifier conversion +library(org.Hs.eg.db) + +# We will need this so we can use the pipe: %>% +library(magrittr) + +# We'll need this for finding gene modules +library(WGCNA) +``` + +## Import and set up data + +Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. +This chunk of code will read the both TSV files and add them as data frames to your environment. + +We stored our file paths as objects named `metadata_file` and `data_file` in [this previous step](#check-out-our-file-structure). + +```{r} +# Read in metadata TSV file +metadata <- readr::read_tsv(metadata_file) + +# Read in data TSV file +df <- readr::read_tsv(data_file) %>% + # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later + tibble::column_to_rownames("Gene") +``` + +Let's ensure that the metadata and data are in the same sample order. + +```{r} +# Make the data in the order of the metadata +df <- df %>% + dplyr::select(metadata$refinebio_accession_code) + +# Check if this is in the same order +all.equal(colnames(df), metadata$refinebio_accession_code) +``` + +### Prepare data for `DESeq2` + +We need to make sure all of the values in our data are converted to integers as required by a `DESeq2` function we will use later. + +```{r} +# The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers +df <- df %>% + # Bring out rownames and store as their own column + tibble::rownames_to_column("Gene") %>% + # Mutate numeric variables to be integers + dplyr::mutate_if(is.numeric, round) +``` + +## Create a DESeqDataset + +We will be using the `DESeq2` package for [normalizing and transforming our data](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods), which requires us to format our data into a `DESeqDataSet` object. +We turn the data frame (or matrix) into a [`DESeqDataSet` object](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#02_About_DESeq2). ) and specify which variable labels our experimental groups using the [`design` argument](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#multi-factor-designs) [@Love2014]. +In this chunk of code, we will not provide a specific model to the `design` argument because we are not performing a differential expression analysis. + +```{r} +# Create a `DESeqDataSet` object +dds <- DESeqDataSetFromMatrix( + countData = df, # This is the data frame with the counts values for all replicates in our dataset + colData = metadata, # This is the data frame with the annotation data for the replicates in the counts data frame + design = ~1, # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis + tidy = TRUE +) +``` + +## Define a minimum counts cutoff + +We want to filter out the genes that have not been expressed or that have low expression counts. +This is recommended by [WGCNA docs for RNA-seq data](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html#:~:text=Can%20WGCNA%20be%20used%20to,Yes.&text=Whether%20one%20uses%20RPKM%2C%20FPKM,were%20processed%20the%20same%20way.). +Removing low count genes can help improve your WGCNA results. +We are going to do some pre-filtering to keep only genes with 10 or more reads in total across the samples. + +```{r} +# Define a minimum counts cutoff and filter `DESeqDataSet` object to include +# only rows that have counts above the cutoff +genes_to_keep <- rowSums(counts(dds)) >= 50 +dds <- dds[genes_to_keep, ] +``` + +## Perform DESeq2 normalization and transformation + +We often suggest normalizing and transforming your data for various applications and in this instance WGCNA's authors [suggest using variance stablizing transformation before running WGCNA](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html#:~:text=Can%20WGCNA%20be%20used%20to,Yes.&text=Whether%20one%20uses%20RPKM%2C%20FPKM,were%20processed%20the%20same%20way.). +We are going to use the `vst()` function from the `DESeq2` package to normalize and transform the data. +For more information about these transformation methods, [see here](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods). + +```{r} +# Normalize and transform the data in the `DESeqDataSet` object using the `vst()` function from the `DESEq2` R package +dds_norm <- vst(dds) +``` + +**REVIEW** + +## Format normalized data for WGCNA + +Here we will use WGCNA to identify co-expressed gene modules. +We'll continue with the top gene module with ORA. + +Extract the normalized counts to a matrix and transpose it so we can pass it to WGCNA. + +```{r} +# First we are going to retrieve the normalized data from the `DESeqDataSet` object using the `assay()` function +normalized_counts <- assay(dds_norm) %>% + t() # We need to transpose this data in preparation for the `WGCNA::blockwiseModules()` function +``` + +## Determine parameters for WGCNA + +WGCNA has to decide what genes are included in which modules, but it might encounter some genes that are close to call between modules. +By providing a soft-threshold power parameter, we can guide it on {{I still can't figure out what power actually represents here}} +The "soft" part of a soft threshold just means that this will not be an absolute cutoff and WGCNA will use this parameter with some flexibility. + +```{r} +sft <- pickSoftThreshold(normalized_counts, + dataIsExpr = TRUE, + corFnc = cor, + networkType = "signed" +) +``` + +This `sft` object has a lot of information, we will want to plot some of it to figure out what our `power` soft-threshold should be. +We have to first calculate the model fit, the R squared and make that a new variable. + +```{r} +sft_df <- data.frame(sft$fitIndices) %>% + dplyr::mutate(model_fit = -sign(slope) * SFT.R.sq) +``` + +Now, let's plot the model fitting by the `power` soft threshold so we can decide on a soft-threshold for power. + +```{r} +sft_df %>% + ggplot2::ggplot() + + ggplot2::geom_label(ggplot2::aes(x = Power, y = model_fit), label = sft_df$Power) + + # We will plot what WGCNA recommends as an R^2 cutoff + ggplot2::geom_hline(yintercept = 0.80, col = "red") + + # Just in case our values are low, we want to make sure we can still see the 0.80 level + ggplot2::ylim(c(min(sft_df$model_fit), 1)) + + # We can add more sensible labels for our axis + ggplot2::xlab("Soft Threshold (power)") + + ggplot2::ylab("Scale Free Topology Model Fit, signed R^2") + + ggplot2::ggtitle("Scale independence") + + # This adds some nicer aesthetics to our plot + ggplot2::theme_classic() +``` + +Using this plot we can decide on a power parameter. +WGCNA's authors recommend using a `power` that has an signed R^2 above `0.80`, otherwise they warn your results may be too noisy to be meaningful. + +If you have multiple power values with signed R^2 above `0.80`, then picking the one at an inflection point, in other words where the R^2 values seem to have reached their saturation [{{SOURCE}}](https://dibernardo.tigem.it/files/papers/2008/zhangbin-statappsgeneticsmolbio.pdf). +You want to a `power` that gives you a big enough R^2 but is not excessively large. + +So using the plot above, going with a power soft-threshold of `14`! + +If you find you have all very low R^2 values this may be because there are too many genes with low values that are cluttering up the calculations. +You can try returning to [gene filtering step](#define-a-minimum-counts-cutoff) and choosing a more stringent cutoff (you'll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped). + + +## Run WGCNA! + +We will use the `blockwiseModules()` function to build a gene coexpression network. +The `blockwise` part refers to that these calculations will be done on chunks of your data at a time to help with computing resources. + +This will take some time to run. + +Note that for the `power` argument we are using `14` which we determined above. + +```{r} +bwnet <- blockwiseModules(normalized_counts, + TOMType = "signed", # topological overlap matrix + power = 14, # soft threshold for network construction + randomSeed = 1234 # there's some randomness associated with this calculation + # so we should set a seed +) +``` + +There are a lot of other settings you can tweak -- look at `?blockwiseModules` help page to see more. + +The `TOMtype` argument specifies what kind of topological overlap matrix (TOM) should be used to make gene modules. +You can safely assume for most situations a `signed` network represents what you want -- we want WGCNA to pay attention to directionality. +However if you suspect you may benefit from an `unsigned` network, where positive/negative is ignored see [this article](https://peterlangfelder.com/2018/11/25/signed-or-unsigned-which-network-type-is-preferable/) to help you figure that out. + +**DRAFT** + +## Explore our WGCNA results + +This object, `bwnet` has a lot of information. +We can pull out the parts we are most interested in and may want to use use for plotting. + +In `bwnet` we have a data frame of eigengene module data for each sample in the `MEs` slot. +These represent the collapsed and combined expression of the genes a part of each module. + +```{r} +module_eigengenes <- bwnet$MEs + +# Print out a preview +head(module_eigengenes) +``` +If you want to know which of your genes make up which modules, you can look at the `$colors` slot. +This is a named list which associates the genes with the module they are a part of. + +```{r} +gene_modules <- bwnet$colors + +# Print out a preview +head(gene_modules) +``` + +We can turn this into a data frame for handy use. + +```{r} +gene_modules_df <- data.frame(gene_modules) %>% + tibble::rownames_to_column("Gene") +``` + +Now we can print out how many genes there are per module. + +```{r} +gene_modules_df %>% dplyr::count(gene_modules) +``` + +Note that the color assignments don't have any particular biological meaning and just function as labels. + +## Take a look at eigengenes with your metadata + +We can also see if our eigengenes relate to our metadata labels. + +First we would have to make an annotation data frame that we can give to heatmap. + +```{r} +# Let's prepare the annotation data.frame for taking a look at our eigengenes +annotation_df <- metadata %>% + # We want to select the variables that we want for annotating the technical replicates heatmap + dplyr::select( + refinebio_accession_code, + refinebio_treatment + ) %>% + # The `pheatmap()` function requires that the row names of our annotation object matches the column names of our `DESeaDataSet` object + tibble::column_to_rownames("refinebio_accession_code") +``` + +Now we can create a heatmap. + +```{r} +pheatmap::pheatmap(module_eigengenes, + show_rownames = FALSE, + annotation_row = annotation_df +) +``` + +## Visualize how correlated these eigengenes are + +```{r} +correlate_eigegenes <- cor(bwnet$MEs) +``` + +```{r} +pheatmap::pheatmap(correlate_eigegenes, + show_rownames = FALSE, + show_colnames = FALSE +) +``` + +## Write results to file + +```{r} +readr::write_rds(bwnet, + file = file.path("results", "SRP133573_wgcna_results.RDS") +) +``` + +# Resources for further learning + +- [WGCNA FAQ page](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html) +- [WGCNA tutorial](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/) +- [WGCNA paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-559) + +# Session info + +At the end of every analysis, before saving your notebook, we recommend printing out your session info. +This helps make your code more reproducible by recording what versions of software and packages you used to run this. + +```{r} +# Print session info +sessionInfo() +``` + +# References diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html new file mode 100644 index 00000000..f17c7f9b --- /dev/null +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.html @@ -0,0 +1,4387 @@ + + + + + + + + + + + + + + +Over-Representation Analysis - RNA-seq + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + +
+
+
+
+
+ +
+ + + + + + + + + +
+

1 Purpose of this analysis

+

In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules. WGCNA using a series of correlations to determine what genes are expressed together in your dataset.

+

⬇️ Jump to the analysis code ⬇️

+
+
+

2 How to run this example

+

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

+
+

2.1 Obtain the .Rmd file

+

To run this example yourself, download the .Rmd for this analysis by clicking this link.

+

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

+

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

+
+
+

2.2 Set up your analysis folders

+

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

+

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

+
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+  dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots" # Can replace with path to desired output plots directory
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+  dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results" # Can replace with path to desired output results directory
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+  dir.create(results_dir)
+}
+

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

+
+
+

2.3 Obtain the dataset from refine.bio

+

For general information about downloading data for these examples, see our ‘Getting Started’ section.

+

Go to this dataset’s page on refine.bio.

+

Click the “Download Now” button on the right side of this screen.

+

+

Fill out the pop up window with your email and our Terms and Conditions:

+

+

We are going to use non-quantile normalized data for this analysis. To get this data, you will need to check the box that says “Skip quantile normalization for RNA-seq samples”. Note that this option will only be available for RNA-seq datasets.

+

+

It may take a few minutes for the dataset to process. You will get an email when it is ready.

+
+
+

2.4 About the dataset we are using for this example

+

For this example analysis, we will use this prostate cancer dataset. The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens.

+
+
+

2.5 Place the dataset in your new data/ folder

+

refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

+

+

For more details on the contents of this folder see these docs on refine.bio.

+

The <experiment_accession_id> folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

+

Copy and paste the SRP133573 folder into your newly created data/ folder.

+
+
+

2.6 Check out our file structure!

+

Your new analysis folder should contain:

+
    +
  • The example analysis .Rmd you downloaded
    +
  • +
  • A folder called “data” which contains: +
      +
    • The SRP133573 folder which contains: +
        +
      • The gene expression
        +
      • +
      • The metadata TSV
        +
      • +
    • +
  • +
  • A folder for plots (currently empty)
  • +
  • A folder for results (currently empty)
  • +
+

Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

+

+

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

+

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

+
# Define the file path to the data directory
+data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in
+
+# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
+data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset
+
+# Declare the file path to the metadata file using the data directory saved as `data_dir`
+metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata
+

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

+
# Check if the gene expression matrix file is at the file path stored in `data_file`
+file.exists(data_file)
+
## [1] TRUE
+
# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
+
## [1] TRUE
+

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

+

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

+
+
+
+

3 Using a different refine.bio dataset with this analysis?

+

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

+

Keep in mind that WGCNA requires at least 15 samples to produce a meaningful result according to its authors. So you will need to make sure the dataset you switch this to is sufficiently large.

+
+ +

 

+
+
+

4 Over-Representation Analysis with clusterProfiler - RNA-seq

+
+

4.1 Install libraries

+

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

+

In this analysis, we will be using WGCNA package. WGCNA also requires a package called impute that is sometimes has trouble installing so we recommend installing that first.

+
if (!("DESeq2" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("DESeq2", update = FALSE)
+}
+
+if (!("impute" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  BiocManager::install("impute")
+}
+
+if (!("WGCNA" %in% installed.packages())) {
+  # Install this package if it isn't installed yet
+  install.packages("WGCNA")
+}
+

Attach the packages we need for this analysis.

+
# Attach the DESeq2 library
+library(DESeq2)
+
## Loading required package: S4Vectors
+
## Loading required package: stats4
+
## Loading required package: BiocGenerics
+
## Loading required package: parallel
+
## 
+## Attaching package: 'BiocGenerics'
+
## The following objects are masked from 'package:parallel':
+## 
+##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
+##     clusterExport, clusterMap, parApply, parCapply, parLapply,
+##     parLapplyLB, parRapply, parSapply, parSapplyLB
+
## The following objects are masked from 'package:stats':
+## 
+##     IQR, mad, sd, var, xtabs
+
## The following objects are masked from 'package:base':
+## 
+##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
+##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
+##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
+##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
+##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
+##     union, unique, unsplit, which.max, which.min
+
## 
+## Attaching package: 'S4Vectors'
+
## The following object is masked from 'package:base':
+## 
+##     expand.grid
+
## Loading required package: IRanges
+
## Loading required package: GenomicRanges
+
## Loading required package: GenomeInfoDb
+
## Loading required package: SummarizedExperiment
+
## Loading required package: MatrixGenerics
+
## Loading required package: matrixStats
+
## 
+## Attaching package: 'MatrixGenerics'
+
## The following objects are masked from 'package:matrixStats':
+## 
+##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
+##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
+##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
+##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
+##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
+##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
+##     colWeightedMeans, colWeightedMedians, colWeightedSds,
+##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
+##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
+##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
+##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
+##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
+##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
+##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
+##     rowWeightedSds, rowWeightedVars
+
## Loading required package: Biobase
+
## Welcome to Bioconductor
+## 
+##     Vignettes contain introductory material; view with
+##     'browseVignettes()'. To cite Bioconductor, see
+##     'citation("Biobase")', and for packages 'citation("pkgname")'.
+
## 
+## Attaching package: 'Biobase'
+
## The following object is masked from 'package:MatrixGenerics':
+## 
+##     rowMedians
+
## The following objects are masked from 'package:matrixStats':
+## 
+##     anyMissing, rowMedians
+
# Homo sapiens annotation package we'll use for gene identifier conversion
+library(org.Hs.eg.db)
+
## Loading required package: AnnotationDbi
+
## 
+
# We will need this so we can use the pipe: %>%
+library(magrittr)
+
+# We'll need this for finding gene modules
+library(WGCNA)
+
## Loading required package: dynamicTreeCut
+
## Loading required package: fastcluster
+
## 
+## Attaching package: 'fastcluster'
+
## The following object is masked from 'package:stats':
+## 
+##     hclust
+
## 
+
## 
+## Attaching package: 'WGCNA'
+
## The following object is masked from 'package:IRanges':
+## 
+##     cor
+
## The following object is masked from 'package:S4Vectors':
+## 
+##     cor
+
## The following object is masked from 'package:stats':
+## 
+##     cor
+
+
+

4.2 Import and set up data

+

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.

+

We stored our file paths as objects named metadata_file and data_file in this previous step.

+
# Read in metadata TSV file
+metadata <- readr::read_tsv(metadata_file)
+
## 
+## ── Column specification ────────────────────────────────────────────────────────────────────────────────────
+## cols(
+##   .default = col_character(),
+##   refinebio_age = col_logical(),
+##   refinebio_cell_line = col_logical(),
+##   refinebio_compound = col_logical(),
+##   refinebio_disease_stage = col_logical(),
+##   refinebio_genetic_information = col_logical(),
+##   refinebio_processed = col_logical(),
+##   refinebio_sex = col_logical(),
+##   refinebio_source_archive_url = col_logical(),
+##   refinebio_specimen_part = col_logical(),
+##   refinebio_time = col_logical()
+## )
+## ℹ Use `spec()` for the full column specifications.
+
# Read in data TSV file
+df <- readr::read_tsv(data_file) %>%
+  # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
+  tibble::column_to_rownames("Gene")
+
## 
+## ── Column specification ────────────────────────────────────────────────────────────────────────────────────
+## cols(
+##   .default = col_double(),
+##   Gene = col_character()
+## )
+## ℹ Use `spec()` for the full column specifications.
+

Let’s ensure that the metadata and data are in the same sample order.

+
# Make the data in the order of the metadata
+df <- df %>%
+  dplyr::select(metadata$refinebio_accession_code)
+
+# Check if this is in the same order
+all.equal(colnames(df), metadata$refinebio_accession_code)
+
## [1] TRUE
+
+

4.2.1 Prepare data for DESeq2

+

We need to make sure all of the values in our data are converted to integers as required by a DESeq2 function we will use later.

+
# The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers
+df <- df %>%
+  # Bring out rownames and store as their own column
+  tibble::rownames_to_column("Gene") %>%
+  # Mutate numeric variables to be integers
+  dplyr::mutate_if(is.numeric, round)
+
+
+
+

4.3 Create a DESeqDataset

+

We will be using the DESeq2 package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet object. We turn the data frame (or matrix) into a DESeqDataSet object. ) and specify which variable labels our experimental groups using the design argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design argument because we are not performing a differential expression analysis.

+
# Create a `DESeqDataSet` object
+dds <- DESeqDataSetFromMatrix(
+  countData = df, # This is the data frame with the counts values for all replicates in our dataset
+  colData = metadata, # This is the data frame with the annotation data for the replicates in the counts data frame
+  design = ~1, # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis
+  tidy = TRUE
+)
+
## converting counts to integer mode
+
+
+

4.4 Define a minimum counts cutoff

+

We want to filter out the genes that have not been expressed or that have low expression counts. This is recommended by WGCNA docs for RNA-seq data. Removing low count genes can help improve your WGCNA results. We are going to do some pre-filtering to keep only genes with 10 or more reads in total across the samples.

+
# Define a minimum counts cutoff and filter `DESeqDataSet` object to include
+# only rows that have counts above the cutoff
+genes_to_keep <- rowSums(counts(dds)) >= 50
+dds <- dds[genes_to_keep, ]
+
+
+

4.5 Perform DESeq2 normalization and transformation

+

We often suggest normalizing and transforming your data for various applications and in this instance WGCNA’s authors suggest using variance stablizing transformation before running WGCNA.
+We are going to use the vst() function from the DESeq2 package to normalize and transform the data. For more information about these transformation methods, see here.

+
# Normalize and transform the data in the `DESeqDataSet` object using the `vst()` function from the `DESEq2` R package
+dds_norm <- vst(dds)
+

REVIEW

+
+
+

4.6 Format normalized data for WGCNA

+

Here we will use WGCNA to identify co-expressed gene modules. We’ll continue with the top gene module with ORA.

+

Extract the normalized counts to a matrix and transpose it so we can pass it to WGCNA.

+
# First we are going to retrieve the normalized data from the `DESeqDataSet` object using the `assay()` function
+normalized_counts <- assay(dds_norm) %>%
+  t() # We need to transpose this data in preparation for the `WGCNA::blockwiseModules()` function
+
+
+

4.7 Determine parameters for WGCNA

+

WGCNA has to decide what genes are included in which modules, but it might encounter some genes that are close to call between modules. By providing a soft-threshold power parameter, we can guide it on {{I still can’t figure out what power actually represents here}} The “soft” part of a soft threshold just means that this will not be an absolute cutoff and WGCNA will use this parameter with some flexibility.

+
sft <- pickSoftThreshold(normalized_counts,
+  dataIsExpr = TRUE,
+  corFnc = cor,
+  networkType = "signed"
+)
+
## Warning: executing %dopar% sequentially: no parallel backend registered
+
##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
+## 1      1  0.58200 12.200          0.957 13500.0   13600.0  15500
+## 2      2  0.44500  5.130          0.972  7630.0    7650.0   9910
+## 3      3  0.26300  2.570          0.985  4480.0    4450.0   6680
+## 4      4  0.06480  0.914          0.985  2730.0    2680.0   4720
+## 5      5  0.00662 -0.236          0.964  1720.0    1660.0   3450
+## 6      6  0.15900 -1.010          0.965  1120.0    1060.0   2580
+## 7      7  0.36500 -1.470          0.971   746.0     689.0   1980
+## 8      8  0.50000 -1.730          0.972   509.0     459.0   1550
+## 9      9  0.59700 -1.910          0.972   356.0     313.0   1220
+## 10    10  0.67000 -2.060          0.973   253.0     217.0    982
+## 11    12  0.74000 -2.260          0.970   135.0     110.0    651
+## 12    14  0.79400 -2.320          0.978    76.9      58.6    447
+## 13    16  0.82000 -2.350          0.981    45.9      32.7    315
+## 14    18  0.83800 -2.360          0.985    28.6      18.9    227
+## 15    20  0.84500 -2.350          0.987    18.5      11.2    167
+

This sft object has a lot of information, we will want to plot some of it to figure out what our power soft-threshold should be. We have to first calculate the model fit, the R squared and make that a new variable.

+
sft_df <- data.frame(sft$fitIndices) %>%
+  dplyr::mutate(model_fit = -sign(slope) * SFT.R.sq)
+

Now, let’s plot the model fitting by the power soft threshold so we can decide on a soft-threshold for power.

+
sft_df %>%
+  ggplot2::ggplot() +
+  ggplot2::geom_label(ggplot2::aes(x = Power, y = model_fit), label = sft_df$Power) +
+  # We will plot what WGCNA recommends as an R^2 cutoff
+  ggplot2::geom_hline(yintercept = 0.80, col = "red") +
+  # Just in case our values are low, we want to make sure we can still see the 0.80 level
+  ggplot2::ylim(c(min(sft_df$model_fit), 1)) +
+  # We can add more sensible labels for our axis
+  ggplot2::xlab("Soft Threshold (power)") +
+  ggplot2::ylab("Scale Free Topology Model Fit, signed R^2") +
+  ggplot2::ggtitle("Scale independence") +
+  # This adds some nicer aesthetics to our plot
+  ggplot2::theme_classic()
+

+

Using this plot we can decide on a power parameter. WGCNA’s authors recommend using a power that has an signed R^2 above 0.80, otherwise they warn your results may be too noisy to be meaningful.

+

If you have multiple power values with signed R^2 above 0.80, then picking the one at an inflection point, in other words where the R^2 values seem to have reached their saturation {{SOURCE}}. You want to a power that gives you a big enough R^2 but is not excessively large.

+

So using the plot above, going with a power soft-threshold of 14!

+

If you find you have all very low R^2 values this may be because there are too many genes with low values that are cluttering up the calculations. You can try returning to gene filtering step and choosing a more stringent cutoff (you’ll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped).

+
+
+

4.8 Run WGCNA!

+

We will use the blockwiseModules() function to build a gene coexpression network. The blockwise part refers to that these calculations will be done on chunks of your data at a time to help with computing resources.

+

This will take some time to run.

+

Note that for the power argument we are using 14 which we determined above.

+
bwnet <- blockwiseModules(normalized_counts,
+  TOMType = "signed", # topological overlap matrix
+  power = 14, # soft threshold for network construction
+  randomSeed = 1234 # there's some randomness associated with this calculation
+  # so we should set a seed
+)
+

There are a lot of other settings you can tweak – look at ?blockwiseModules help page to see more.

+

The TOMtype argument specifies what kind of topological overlap matrix (TOM) should be used to make gene modules. You can safely assume for most situations a signed network represents what you want – we want WGCNA to pay attention to directionality. However if you suspect you may benefit from an unsigned network, where positive/negative is ignored see this article to help you figure that out.

+

DRAFT

+
+
+

4.9 Explore our WGCNA results

+

This object, bwnet has a lot of information. We can pull out the parts we are most interested in and may want to use use for plotting.

+

In bwnet we have a data frame of eigengene module data for each sample in the MEs slot. These represent the collapsed and combined expression of the genes a part of each module.

+
module_eigengenes <- bwnet$MEs
+
+# Print out a preview
+head(module_eigengenes)
+
+ +
+

If you want to know which of your genes make up which modules, you can look at the $colors slot. This is a named list which associates the genes with the module they are a part of.

+
gene_modules <- bwnet$colors
+
+# Print out a preview
+head(gene_modules)
+
## ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460 
+##          "grey"           "tan"          "grey"          "grey"          "grey" 
+## ENSG00000000938 
+##          "grey"
+

We can turn this into a data frame for handy use.

+
gene_modules_df <- data.frame(gene_modules) %>%
+  tibble::rownames_to_column("Gene")
+

Now we can print out how many genes there are per module.

+
gene_modules_df %>% dplyr::count(gene_modules)
+
+ +
+

Note that the color assignments don’t have any particular biological meaning and just function as labels.

+
+
+

4.10 Take a look at eigengenes with your metadata

+

We can also see if our eigengenes relate to our metadata labels.

+

First we would have to make an annotation data frame that we can give to heatmap.

+
# Let's prepare the annotation data.frame for taking a look at our eigengenes
+annotation_df <- metadata %>%
+  # We want to select the variables that we want for annotating the technical replicates heatmap
+  dplyr::select(
+    refinebio_accession_code,
+    refinebio_treatment
+  ) %>%
+  # The `pheatmap()` function requires that the row names of our annotation object matches the column names of our `DESeaDataSet` object
+  tibble::column_to_rownames("refinebio_accession_code")
+

Now we can create a heatmap.

+
pheatmap::pheatmap(module_eigengenes,
+  show_rownames = FALSE,
+  annotation_row = annotation_df
+)
+

+
+
+

4.11 Visualize how correlated these eigengenes are

+
correlate_eigegenes <- cor(bwnet$MEs)
+
pheatmap::pheatmap(correlate_eigegenes,
+  show_rownames = FALSE,
+  show_colnames = FALSE
+)
+

+
+
+

4.12 Write results to file

+
readr::write_rds(bwnet,
+  file = file.path("results", "SRP133573_wgcna_results.RDS")
+)
+
+
+
+

5 Resources for further learning

+ +
+
+

6 Session info

+

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

+
# Print session info
+sessionInfo()
+
## R version 4.0.2 (2020-06-22)
+## Platform: x86_64-pc-linux-gnu (64-bit)
+## Running under: Ubuntu 20.04 LTS
+## 
+## Matrix products: default
+## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
+## 
+## locale:
+##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
+##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
+##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
+##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
+##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
+## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
+## 
+## attached base packages:
+## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
+## [8] methods   base     
+## 
+## other attached packages:
+##  [1] WGCNA_1.69                  fastcluster_1.1.25         
+##  [3] dynamicTreeCut_1.63-1       magrittr_1.5               
+##  [5] org.Hs.eg.db_3.12.0         AnnotationDbi_1.52.0       
+##  [7] DESeq2_1.30.0               SummarizedExperiment_1.20.0
+##  [9] Biobase_2.50.0              MatrixGenerics_1.2.0       
+## [11] matrixStats_0.57.0          GenomicRanges_1.42.0       
+## [13] GenomeInfoDb_1.26.0         IRanges_2.24.0             
+## [15] S4Vectors_0.28.0            BiocGenerics_0.36.0        
+## [17] optparse_1.6.6             
+## 
+## loaded via a namespace (and not attached):
+##  [1] colorspace_1.4-1       ellipsis_0.3.1         htmlTable_2.1.0       
+##  [4] XVector_0.30.0         base64enc_0.1-3        rstudioapi_0.11       
+##  [7] farver_2.0.3           getopt_1.20.3          bit64_4.0.5           
+## [10] fansi_0.4.1            codetools_0.2-16       splines_4.0.2         
+## [13] R.methodsS3_1.8.1      doParallel_1.0.15      impute_1.64.0         
+## [16] geneplotter_1.68.0     knitr_1.30             jsonlite_1.7.1        
+## [19] Formula_1.2-3          annotate_1.68.0        cluster_2.1.0         
+## [22] GO.db_3.12.1           png_0.1-7              R.oo_1.24.0           
+## [25] pheatmap_1.0.12        readr_1.4.0            compiler_4.0.2        
+## [28] httr_1.4.2             backports_1.1.10       assertthat_0.2.1      
+## [31] Matrix_1.2-18          cli_2.1.0              htmltools_0.5.0       
+## [34] tools_4.0.2            gtable_0.3.0           glue_1.4.2            
+## [37] GenomeInfoDbData_1.2.4 dplyr_1.0.2            Rcpp_1.0.5            
+## [40] styler_1.3.2           vctrs_0.3.4            preprocessCore_1.52.0 
+## [43] iterators_1.0.12       xfun_0.18              stringr_1.4.0         
+## [46] ps_1.4.0               lifecycle_0.2.0        XML_3.99-0.5          
+## [49] zlibbioc_1.36.0        scales_1.1.1           hms_0.5.3             
+## [52] rematch2_2.1.2         RColorBrewer_1.1-2     yaml_2.2.1            
+## [55] memoise_1.1.0          gridExtra_2.3          ggplot2_3.3.2         
+## [58] rpart_4.1-15           latticeExtra_0.6-29    stringi_1.5.3         
+## [61] RSQLite_2.2.1          genefilter_1.72.0      foreach_1.5.0         
+## [64] checkmate_2.0.0        BiocParallel_1.24.1    rlang_0.4.8           
+## [67] pkgconfig_2.0.3        bitops_1.0-6           evaluate_0.14         
+## [70] lattice_0.20-41        purrr_0.3.4            htmlwidgets_1.5.2     
+## [73] labeling_0.3           bit_4.0.4              tidyselect_1.1.0      
+## [76] R6_2.4.1               generics_0.0.2         Hmisc_4.4-1           
+## [79] DelayedArray_0.16.0    DBI_1.1.0              pillar_1.4.6          
+## [82] foreign_0.8-80         survival_3.1-12        RCurl_1.98-1.2        
+## [85] nnet_7.3-14            tibble_3.0.4           crayon_1.3.4          
+## [88] rmarkdown_2.4          jpeg_0.1-8.1           locfit_1.5-9.4        
+## [91] grid_4.0.2             data.table_1.13.0      blob_1.2.1            
+## [94] digest_0.6.25          xtable_1.8-4           R.cache_0.14.0        
+## [97] R.utils_2.10.1         munsell_0.5.0
+
+
+

References

+
+
+

Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8

+
+
+
+ + + + +
+
+ +
+ + + + + + + + + + + + + + + + diff --git a/Snakefile b/Snakefile index 526e7d2a..803918b5 100644 --- a/Snakefile +++ b/Snakefile @@ -18,8 +18,9 @@ rule target: "03-rnaseq/dimension-reduction_rnaseq_02_umap.html", "03-rnaseq/gene-id-annotation_rnaseq_01_ensembl.html", "03-rnaseq/ortholog-mapping_rnaseq_01_ensembl.html", - "03-rnaseq/network-analysis_rnaseq_01_wgcna.html", - "04-advanced-topics/00-intro-to-advanced-topics.html" + "04-advanced-topics/00-intro-to-advanced-topics.html", + "04-advanced-topics/network-analysis_rnaseq_01_wgcna.html" + rule render_citations: input: From 58b4550d90fa74c16442a41b9cc2d703a89985b3 Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Wed, 11 Nov 2020 17:28:11 -0500 Subject: [PATCH 09/11] remove rna-seq files --- .../network-analysis_rnaseq_01_wgcna.Rmd | 489 -- .../network-analysis_rnaseq_01_wgcna.html | 4387 ----------------- 2 files changed, 4876 deletions(-) delete mode 100644 03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd delete mode 100644 03-rnaseq/network-analysis_rnaseq_01_wgcna.html diff --git a/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd b/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd deleted file mode 100644 index d4328f44..00000000 --- a/03-rnaseq/network-analysis_rnaseq_01_wgcna.Rmd +++ /dev/null @@ -1,489 +0,0 @@ ---- -title: "Over-Representation Analysis - RNA-seq" -author: "CCDL for ALSF" -date: "October 2020" -output: - html_notebook: - toc: true - toc_float: true - number_sections: true ---- - -# Purpose of this analysis - -In this example, we use [weighted gene co-expression network analysis (WGCNA)]() to identify co-expressed gene modules. -WGCNA using a series of correlations to determine what genes are expressed together in your dataset. - -⬇️ [**Jump to the analysis code**](#analysis) ⬇️ - -# How to run this example - -For general information about our tutorials and the basic software packages you will need, please see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-this-tutorial-is-structured). -We recommend taking a look at our [Resources for Learning R](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#resources-for-learning-r) if you have not written code in R before. - -## Obtain the `.Rmd` file - -To run this example yourself, [download the `.Rmd` for this analysis by clicking this link](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/differential_expression_rnaseq_01_rnaseq.Rmd). - -Clicking this link will most likely send this to your downloads folder on your computer. -Move this `.Rmd` file to where you would like this example and its files to be stored. - -You can open this `.Rmd` file in RStudio and follow the rest of these steps from there. (See our [section about getting started with R notebooks](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) if you are unfamiliar with `.Rmd` files.) - -## Set up your analysis folders - -Good file organization is helpful for keeping your data analysis project on track! -We have set up some code that will automatically set up a folder structure for you. -Run this next chunk to set up your folders! - -If you have trouble running this chunk, see our [introduction to using `.Rmd`s](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-and-use-rmds) for more resources and explanations. - -```{r} -# Create the data folder if it doesn't exist -if (!dir.exists("data")) { - dir.create("data") -} - -# Define the file path to the plots directory -plots_dir <- "plots" # Can replace with path to desired output plots directory - -# Create the plots folder if it doesn't exist -if (!dir.exists(plots_dir)) { - dir.create(plots_dir) -} - -# Define the file path to the results directory -results_dir <- "results" # Can replace with path to desired output results directory - -# Create the results folder if it doesn't exist -if (!dir.exists(results_dir)) { - dir.create(results_dir) -} -``` - -In the same place you put this `.Rmd` file, you should now have three new empty folders called `data`, `plots`, and `results`! - -## Obtain the dataset from refine.bio - -For general information about downloading data for these examples, see our ['Getting Started' section](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#how-to-get-the-data). - -Go to this [dataset's page on refine.bio](https://www.refine.bio/experiments/SRP133573/identification-of-transcription-factor-relationships-associated-with-androgen-deprivation-therapy-response-and-metastatic-progression-in-prostate-cancer). - -Click the "Download Now" button on the right side of this screen. - - - -Fill out the pop up window with your email and our Terms and Conditions: - - - -We are going to use non-quantile normalized data for this analysis. -To get this data, you will need to check the box that says "Skip quantile normalization for RNA-seq samples". -Note that this option will only be available for RNA-seq datasets. - - - -It may take a few minutes for the dataset to process. -You will get an email when it is ready. - -## About the dataset we are using for this example - -For this example analysis, we will use this [prostate cancer dataset](https://www.refine.bio/experiments/SRP133573). -The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. -Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens. - -## Place the dataset in your new `data/` folder - -refine.bio will send you a download button in the email when it is ready. -Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in `.zip`. -Double clicking should unzip this for you and create a folder of the same name. - - - -For more details on the contents of this folder see [these docs on refine.bio](http://docs.refine.bio/en/latest/main_text.html#downloadable-files). - -The `` folder has the data and metadata TSV files you will need for this example analysis. -Experiment accession ids usually look something like `GSE1235` or `SRP12345`. - -Copy and paste the `SRP133573` folder into your newly created `data/` folder. - -## Check out our file structure! - -Your new analysis folder should contain: - -- The example analysis `.Rmd` you downloaded -- A folder called "data" which contains: - - The `SRP133573` folder which contains: - - The gene expression - - The metadata TSV -- A folder for `plots` (currently empty) -- A folder for `results` (currently empty) - -Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using): - - - -In order for our example here to run without a hitch, we need these files to be in these locations so we've constructed a test to check before we get started with the analysis. -These chunks will declare your file paths and double check that your files are in the right place. - -First we will declare our file paths to our data and metadata files, which should be in our data directory. -This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started. - -```{r} -# Define the file path to the data directory -data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in - -# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir` -data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset - -# Declare the file path to the metadata file using the data directory saved as `data_dir` -metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata -``` - -Now that our file paths are declared, we can use the `file.exists()` function to check that the files are where we specified above. - -```{r} -# Check if the gene expression matrix file is at the file path stored in `data_file` -file.exists(data_file) - -# Check if the metadata file is at the file path stored in `metadata_file` -file.exists(metadata_file) -``` - -If the chunk above printed out `FALSE` to either of those tests, you won't be able to run this analysis _as is_ until those files are in the appropriate place. - -If the concept of a "file path" is unfamiliar to you; we recommend taking a look at our [section about file paths](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#an-important-note-about-file-paths-and-Rmds). - -# Using a different refine.bio dataset with this analysis? - -If you'd like to adapt an example analysis to use a different dataset from [refine.bio](https://www.refine.bio/), we recommend placing the files in the `data/` directory you created and changing the filenames and paths in the notebook to match these files (we've put comments to signify where you would need to change the code). -We suggest saving plots and results to `plots/` and `results/` directories, respectively, as these are automatically created by the notebook. -From here you can customize this analysis example to fit your own scientific questions and preferences. - -Keep in mind that WGCNA requires at least 15 samples to produce a meaningful result [according to its authors](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html). So you will need to make sure the dataset you switch this to is sufficiently large. - -*** - -   - -# Over-Representation Analysis with `clusterProfiler` - RNA-seq - -## Install libraries - -See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. - -In this analysis, we will be using [`WGCNA`](https://cran.r-project.org/web/packages/WGCNA/index.html) package. -WGCNA also requires a package called `impute` that is sometimes has trouble installing so we recommend installing that first. - -```{r} -if (!("DESeq2" %in% installed.packages())) { - # Install this package if it isn't installed yet - BiocManager::install("DESeq2", update = FALSE) -} - -if (!("impute" %in% installed.packages())) { - # Install this package if it isn't installed yet - BiocManager::install("impute") -} - -if (!("WGCNA" %in% installed.packages())) { - # Install this package if it isn't installed yet - install.packages("WGCNA") -} -``` - -Attach the packages we need for this analysis. - -```{r} -# Attach the DESeq2 library -library(DESeq2) - -# Homo sapiens annotation package we'll use for gene identifier conversion -library(org.Hs.eg.db) - -# We will need this so we can use the pipe: %>% -library(magrittr) - -# We'll need this for finding gene modules -library(WGCNA) -``` - -## Import and set up data - -Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. -This chunk of code will read the both TSV files and add them as data frames to your environment. - -We stored our file paths as objects named `metadata_file` and `data_file` in [this previous step](#check-out-our-file-structure). - -```{r} -# Read in metadata TSV file -metadata <- readr::read_tsv(metadata_file) - -# Read in data TSV file -df <- readr::read_tsv(data_file) %>% - # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later - tibble::column_to_rownames("Gene") -``` - -Let's ensure that the metadata and data are in the same sample order. - -```{r} -# Make the data in the order of the metadata -df <- df %>% - dplyr::select(metadata$refinebio_accession_code) - -# Check if this is in the same order -all.equal(colnames(df), metadata$refinebio_accession_code) -``` - -### Prepare data for `DESeq2` - -We need to make sure all of the values in our data are converted to integers as required by a `DESeq2` function we will use later. - -```{r} -# The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers -df <- df %>% - # Bring out rownames and store as their own column - tibble::rownames_to_column("Gene") %>% - # Mutate numeric variables to be integers - dplyr::mutate_if(is.numeric, round) -``` - -## Create a DESeqDataset - -We will be using the `DESeq2` package for [normalizing and transforming our data](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods), which requires us to format our data into a `DESeqDataSet` object. -We turn the data frame (or matrix) into a [`DESeqDataSet` object](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#02_About_DESeq2). ) and specify which variable labels our experimental groups using the [`design` argument](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#multi-factor-designs) [@Love2014]. -In this chunk of code, we will not provide a specific model to the `design` argument because we are not performing a differential expression analysis. - -```{r} -# Create a `DESeqDataSet` object -dds <- DESeqDataSetFromMatrix( - countData = df, # This is the data frame with the counts values for all replicates in our dataset - colData = metadata, # This is the data frame with the annotation data for the replicates in the counts data frame - design = ~1, # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis - tidy = TRUE -) -``` - -## Define a minimum counts cutoff - -We want to filter out the genes that have not been expressed or that have low expression counts. -This is recommended by [WGCNA docs for RNA-seq data](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html#:~:text=Can%20WGCNA%20be%20used%20to,Yes.&text=Whether%20one%20uses%20RPKM%2C%20FPKM,were%20processed%20the%20same%20way.). -Removing low count genes can help improve your WGCNA results. -We are going to do some pre-filtering to keep only genes with 10 or more reads in total across the samples. - -```{r} -# Define a minimum counts cutoff and filter `DESeqDataSet` object to include -# only rows that have counts above the cutoff -genes_to_keep <- rowSums(counts(dds)) >= 50 -dds <- dds[genes_to_keep, ] -``` - -## Perform DESeq2 normalization and transformation - -We often suggest normalizing and transforming your data for various applications and in this instance WGCNA's authors [suggest using variance stablizing transformation before running WGCNA](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html#:~:text=Can%20WGCNA%20be%20used%20to,Yes.&text=Whether%20one%20uses%20RPKM%2C%20FPKM,were%20processed%20the%20same%20way.). -We are going to use the `vst()` function from the `DESeq2` package to normalize and transform the data. -For more information about these transformation methods, [see here](https://alexslemonade.github.io/refinebio-examples/03-rnaseq/00-intro-to-rnaseq.html#deseq2-transformation-methods). - -```{r} -# Normalize and transform the data in the `DESeqDataSet` object using the `vst()` function from the `DESEq2` R package -dds_norm <- vst(dds) -``` - -**REVIEW** - -## Format normalized data for WGCNA - -Here we will use WGCNA to identify co-expressed gene modules. -We'll continue with the top gene module with ORA. - -Extract the normalized counts to a matrix and transpose it so we can pass it to WGCNA. - -```{r} -# First we are going to retrieve the normalized data from the `DESeqDataSet` object using the `assay()` function -normalized_counts <- assay(dds_norm) %>% - t() # We need to transpose this data in preparation for the `WGCNA::blockwiseModules()` function -``` - -## Determine parameters for WGCNA - -WGCNA has to decide what genes are included in which modules, but it might encounter some genes that are close to call between modules. -By providing a soft-threshold power parameter, we can guide it on {{I still can't figure out what power actually represents here}} -The "soft" part of a soft threshold just means that this will not be an absolute cutoff and WGCNA will use this parameter with some flexibility. - -```{r} -sft <- pickSoftThreshold(normalized_counts, - dataIsExpr = TRUE, - corFnc = cor, - networkType = "signed" -) -``` - -This `sft` object has a lot of information, we will want to plot some of it to figure out what our `power` soft-threshold should be. -We have to first calculate the model fit, the R squared and make that a new variable. - -```{r} -sft_df <- data.frame(sft$fitIndices) %>% - dplyr::mutate(model_fit = -sign(slope) * SFT.R.sq) -``` - -Now, let's plot the model fitting by the `power` soft threshold so we can decide on a soft-threshold for power. - -```{r} -sft_df %>% - ggplot2::ggplot() + - ggplot2::geom_label(ggplot2::aes(x = Power, y = model_fit), label = sft_df$Power) + - # We will plot what WGCNA recommends as an R^2 cutoff - ggplot2::geom_hline(yintercept = 0.80, col = "red") + - # Just in case our values are low, we want to make sure we can still see the 0.80 level - ggplot2::ylim(c(min(sft_df$model_fit), 1)) + - # We can add more sensible labels for our axis - ggplot2::xlab("Soft Threshold (power)") + - ggplot2::ylab("Scale Free Topology Model Fit, signed R^2") + - ggplot2::ggtitle("Scale independence") + - # This adds some nicer aesthetics to our plot - ggplot2::theme_classic() -``` - -Using this plot we can decide on a power parameter. -WGCNA's authors recommend using a `power` that has an signed R^2 above `0.80`, otherwise they warn your results may be too noisy to be meaningful. - -If you have multiple power values with signed R^2 above `0.80`, then picking the one at an inflection point, in other words where the R^2 values seem to have reached their saturation [{{SOURCE}}](https://dibernardo.tigem.it/files/papers/2008/zhangbin-statappsgeneticsmolbio.pdf). -You want to a `power` that gives you a big enough R^2 but is not excessively large. - -So using the plot above, going with a power soft-threshold of `14`! - -If you find you have all very low R^2 values this may be because there are too many genes with low values that are cluttering up the calculations. -You can try returning to [gene filtering step](#define-a-minimum-counts-cutoff) and choosing a more stringent cutoff (you'll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped). - - -## Run WGCNA! - -We will use the `blockwiseModules()` function to build a gene coexpression network. -The `blockwise` part refers to that these calculations will be done on chunks of your data at a time to help with computing resources. - -This will take some time to run. - -Note that for the `power` argument we are using `14` which we determined above. - -```{r} -bwnet <- blockwiseModules(normalized_counts, - TOMType = "signed", # topological overlap matrix - power = 14, # soft threshold for network construction - randomSeed = 1234 # there's some randomness associated with this calculation - # so we should set a seed -) -``` - -There are a lot of other settings you can tweak -- look at `?blockwiseModules` help page to see more. - -The `TOMtype` argument specifies what kind of topological overlap matrix (TOM) should be used to make gene modules. -You can safely assume for most situations a `signed` network represents what you want -- we want WGCNA to pay attention to directionality. -However if you suspect you may benefit from an `unsigned` network, where positive/negative is ignored see [this article](https://peterlangfelder.com/2018/11/25/signed-or-unsigned-which-network-type-is-preferable/) to help you figure that out. - -**DRAFT** - -## Explore our WGCNA results - -This object, `bwnet` has a lot of information. -We can pull out the parts we are most interested in and may want to use use for plotting. - -In `bwnet` we have a data frame of eigengene module data for each sample in the `MEs` slot. -These represent the collapsed and combined expression of the genes a part of each module. - -```{r} -module_eigengenes <- bwnet$MEs - -# Print out a preview -head(module_eigengenes) -``` -If you want to know which of your genes make up which modules, you can look at the `$colors` slot. -This is a named list which associates the genes with the module they are a part of. - -```{r} -gene_modules <- bwnet$colors - -# Print out a preview -head(gene_modules) -``` - -We can turn this into a data frame for handy use. - -```{r} -gene_modules_df <- data.frame(gene_modules) %>% - tibble::rownames_to_column("Gene") -``` - -Now we can print out how many genes there are per module. - -```{r} -gene_modules_df %>% dplyr::count(gene_modules) -``` - -Note that the color assignments don't have any particular biological meaning and just function as labels. - -## Take a look at eigengenes with your metadata - -We can also see if our eigengenes relate to our metadata labels. - -First we would have to make an annotation data frame that we can give to heatmap. - -```{r} -# Let's prepare the annotation data.frame for taking a look at our eigengenes -annotation_df <- metadata %>% - # We want to select the variables that we want for annotating the technical replicates heatmap - dplyr::select( - refinebio_accession_code, - refinebio_treatment - ) %>% - # The `pheatmap()` function requires that the row names of our annotation object matches the column names of our `DESeaDataSet` object - tibble::column_to_rownames("refinebio_accession_code") -``` - -Now we can create a heatmap. - -```{r} -pheatmap::pheatmap(module_eigengenes, - show_rownames = FALSE, - annotation_row = annotation_df -) -``` - -## Visualize how correlated these eigengenes are - -```{r} -correlate_eigegenes <- cor(bwnet$MEs) -``` - -```{r} -pheatmap::pheatmap(correlate_eigegenes, - show_rownames = FALSE, - show_colnames = FALSE -) -``` - -## Write results to file - -```{r} -readr::write_rds(bwnet, - file = file.path("results", "SRP133573_wgcna_results.RDS") -) -``` - -# Resources for further learning - -- [WGCNA FAQ page](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/faq.html) -- [WGCNA tutorial](https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/) -- [WGCNA paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-559) - -# Session info - -At the end of every analysis, before saving your notebook, we recommend printing out your session info. -This helps make your code more reproducible by recording what versions of software and packages you used to run this. - -```{r} -# Print session info -sessionInfo() -``` - -# References diff --git a/03-rnaseq/network-analysis_rnaseq_01_wgcna.html b/03-rnaseq/network-analysis_rnaseq_01_wgcna.html deleted file mode 100644 index f17c7f9b..00000000 --- a/03-rnaseq/network-analysis_rnaseq_01_wgcna.html +++ /dev/null @@ -1,4387 +0,0 @@ - - - - - - - - - - - - - - -Over-Representation Analysis - RNA-seq - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - -
-
-
-
-
- -
- - - - - - - - - -
-

1 Purpose of this analysis

-

In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules. WGCNA using a series of correlations to determine what genes are expressed together in your dataset.

-

⬇️ Jump to the analysis code ⬇️

-
-
-

2 How to run this example

-

For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.

-
-

2.1 Obtain the .Rmd file

-

To run this example yourself, download the .Rmd for this analysis by clicking this link.

-

Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd file to where you would like this example and its files to be stored.

-

You can open this .Rmd file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd files.)

-
-
-

2.2 Set up your analysis folders

-

Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!

-

If you have trouble running this chunk, see our introduction to using .Rmds for more resources and explanations.

-
# Create the data folder if it doesn't exist
-if (!dir.exists("data")) {
-  dir.create("data")
-}
-
-# Define the file path to the plots directory
-plots_dir <- "plots" # Can replace with path to desired output plots directory
-
-# Create the plots folder if it doesn't exist
-if (!dir.exists(plots_dir)) {
-  dir.create(plots_dir)
-}
-
-# Define the file path to the results directory
-results_dir <- "results" # Can replace with path to desired output results directory
-
-# Create the results folder if it doesn't exist
-if (!dir.exists(results_dir)) {
-  dir.create(results_dir)
-}
-

In the same place you put this .Rmd file, you should now have three new empty folders called data, plots, and results!

-
-
-

2.3 Obtain the dataset from refine.bio

-

For general information about downloading data for these examples, see our ‘Getting Started’ section.

-

Go to this dataset’s page on refine.bio.

-

Click the “Download Now” button on the right side of this screen.

-

-

Fill out the pop up window with your email and our Terms and Conditions:

-

-

We are going to use non-quantile normalized data for this analysis. To get this data, you will need to check the box that says “Skip quantile normalization for RNA-seq samples”. Note that this option will only be available for RNA-seq datasets.

-

-

It may take a few minutes for the dataset to process. You will get an email when it is ready.

-
-
-

2.4 About the dataset we are using for this example

-

For this example analysis, we will use this prostate cancer dataset. The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens.

-
-
-

2.5 Place the dataset in your new data/ folder

-

refine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip. Double clicking should unzip this for you and create a folder of the same name.

-

-

For more details on the contents of this folder see these docs on refine.bio.

-

The <experiment_accession_id> folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235 or SRP12345.

-

Copy and paste the SRP133573 folder into your newly created data/ folder.

-
-
-

2.6 Check out our file structure!

-

Your new analysis folder should contain:

-
    -
  • The example analysis .Rmd you downloaded
    -
  • -
  • A folder called “data” which contains: -
      -
    • The SRP133573 folder which contains: -
        -
      • The gene expression
        -
      • -
      • The metadata TSV
        -
      • -
    • -
  • -
  • A folder for plots (currently empty)
  • -
  • A folder for results (currently empty)
  • -
-

Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):

-

-

In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.

-

First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.

-
# Define the file path to the data directory
-data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in
-
-# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
-data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset
-
-# Declare the file path to the metadata file using the data directory saved as `data_dir`
-metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata
-

Now that our file paths are declared, we can use the file.exists() function to check that the files are where we specified above.

-
# Check if the gene expression matrix file is at the file path stored in `data_file`
-file.exists(data_file)
-
## [1] TRUE
-
# Check if the metadata file is at the file path stored in `metadata_file`
-file.exists(metadata_file)
-
## [1] TRUE
-

If the chunk above printed out FALSE to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.

-

If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.

-
-
-
-

3 Using a different refine.bio dataset with this analysis?

-

If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/ directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/ and results/ directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.

-

Keep in mind that WGCNA requires at least 15 samples to produce a meaningful result according to its authors. So you will need to make sure the dataset you switch this to is sufficiently large.

-
- -

 

-
-
-

4 Over-Representation Analysis with clusterProfiler - RNA-seq

-
-

4.1 Install libraries

-

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

-

In this analysis, we will be using WGCNA package. WGCNA also requires a package called impute that is sometimes has trouble installing so we recommend installing that first.

-
if (!("DESeq2" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  BiocManager::install("DESeq2", update = FALSE)
-}
-
-if (!("impute" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  BiocManager::install("impute")
-}
-
-if (!("WGCNA" %in% installed.packages())) {
-  # Install this package if it isn't installed yet
-  install.packages("WGCNA")
-}
-

Attach the packages we need for this analysis.

-
# Attach the DESeq2 library
-library(DESeq2)
-
## Loading required package: S4Vectors
-
## Loading required package: stats4
-
## Loading required package: BiocGenerics
-
## Loading required package: parallel
-
## 
-## Attaching package: 'BiocGenerics'
-
## The following objects are masked from 'package:parallel':
-## 
-##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
-##     clusterExport, clusterMap, parApply, parCapply, parLapply,
-##     parLapplyLB, parRapply, parSapply, parSapplyLB
-
## The following objects are masked from 'package:stats':
-## 
-##     IQR, mad, sd, var, xtabs
-
## The following objects are masked from 'package:base':
-## 
-##     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
-##     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
-##     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
-##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
-##     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
-##     union, unique, unsplit, which.max, which.min
-
## 
-## Attaching package: 'S4Vectors'
-
## The following object is masked from 'package:base':
-## 
-##     expand.grid
-
## Loading required package: IRanges
-
## Loading required package: GenomicRanges
-
## Loading required package: GenomeInfoDb
-
## Loading required package: SummarizedExperiment
-
## Loading required package: MatrixGenerics
-
## Loading required package: matrixStats
-
## 
-## Attaching package: 'MatrixGenerics'
-
## The following objects are masked from 'package:matrixStats':
-## 
-##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
-##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
-##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
-##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
-##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
-##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
-##     colWeightedMeans, colWeightedMedians, colWeightedSds,
-##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
-##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
-##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
-##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
-##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
-##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
-##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
-##     rowWeightedSds, rowWeightedVars
-
## Loading required package: Biobase
-
## Welcome to Bioconductor
-## 
-##     Vignettes contain introductory material; view with
-##     'browseVignettes()'. To cite Bioconductor, see
-##     'citation("Biobase")', and for packages 'citation("pkgname")'.
-
## 
-## Attaching package: 'Biobase'
-
## The following object is masked from 'package:MatrixGenerics':
-## 
-##     rowMedians
-
## The following objects are masked from 'package:matrixStats':
-## 
-##     anyMissing, rowMedians
-
# Homo sapiens annotation package we'll use for gene identifier conversion
-library(org.Hs.eg.db)
-
## Loading required package: AnnotationDbi
-
## 
-
# We will need this so we can use the pipe: %>%
-library(magrittr)
-
-# We'll need this for finding gene modules
-library(WGCNA)
-
## Loading required package: dynamicTreeCut
-
## Loading required package: fastcluster
-
## 
-## Attaching package: 'fastcluster'
-
## The following object is masked from 'package:stats':
-## 
-##     hclust
-
## 
-
## 
-## Attaching package: 'WGCNA'
-
## The following object is masked from 'package:IRanges':
-## 
-##     cor
-
## The following object is masked from 'package:S4Vectors':
-## 
-##     cor
-
## The following object is masked from 'package:stats':
-## 
-##     cor
-
-
-

4.2 Import and set up data

-

Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.

-

We stored our file paths as objects named metadata_file and data_file in this previous step.

-
# Read in metadata TSV file
-metadata <- readr::read_tsv(metadata_file)
-
## 
-## ── Column specification ────────────────────────────────────────────────────────────────────────────────────
-## cols(
-##   .default = col_character(),
-##   refinebio_age = col_logical(),
-##   refinebio_cell_line = col_logical(),
-##   refinebio_compound = col_logical(),
-##   refinebio_disease_stage = col_logical(),
-##   refinebio_genetic_information = col_logical(),
-##   refinebio_processed = col_logical(),
-##   refinebio_sex = col_logical(),
-##   refinebio_source_archive_url = col_logical(),
-##   refinebio_specimen_part = col_logical(),
-##   refinebio_time = col_logical()
-## )
-## ℹ Use `spec()` for the full column specifications.
-
# Read in data TSV file
-df <- readr::read_tsv(data_file) %>%
-  # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
-  tibble::column_to_rownames("Gene")
-
## 
-## ── Column specification ────────────────────────────────────────────────────────────────────────────────────
-## cols(
-##   .default = col_double(),
-##   Gene = col_character()
-## )
-## ℹ Use `spec()` for the full column specifications.
-

Let’s ensure that the metadata and data are in the same sample order.

-
# Make the data in the order of the metadata
-df <- df %>%
-  dplyr::select(metadata$refinebio_accession_code)
-
-# Check if this is in the same order
-all.equal(colnames(df), metadata$refinebio_accession_code)
-
## [1] TRUE
-
-

4.2.1 Prepare data for DESeq2

-

We need to make sure all of the values in our data are converted to integers as required by a DESeq2 function we will use later.

-
# The `DESeqDataSetFromMatrix()` function needs the values to be converted to integers
-df <- df %>%
-  # Bring out rownames and store as their own column
-  tibble::rownames_to_column("Gene") %>%
-  # Mutate numeric variables to be integers
-  dplyr::mutate_if(is.numeric, round)
-
-
-
-

4.3 Create a DESeqDataset

-

We will be using the DESeq2 package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet object. We turn the data frame (or matrix) into a DESeqDataSet object. ) and specify which variable labels our experimental groups using the design argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design argument because we are not performing a differential expression analysis.

-
# Create a `DESeqDataSet` object
-dds <- DESeqDataSetFromMatrix(
-  countData = df, # This is the data frame with the counts values for all replicates in our dataset
-  colData = metadata, # This is the data frame with the annotation data for the replicates in the counts data frame
-  design = ~1, # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis
-  tidy = TRUE
-)
-
## converting counts to integer mode
-
-
-

4.4 Define a minimum counts cutoff

-

We want to filter out the genes that have not been expressed or that have low expression counts. This is recommended by WGCNA docs for RNA-seq data. Removing low count genes can help improve your WGCNA results. We are going to do some pre-filtering to keep only genes with 10 or more reads in total across the samples.

-
# Define a minimum counts cutoff and filter `DESeqDataSet` object to include
-# only rows that have counts above the cutoff
-genes_to_keep <- rowSums(counts(dds)) >= 50
-dds <- dds[genes_to_keep, ]
-
-
-

4.5 Perform DESeq2 normalization and transformation

-

We often suggest normalizing and transforming your data for various applications and in this instance WGCNA’s authors suggest using variance stablizing transformation before running WGCNA.
-We are going to use the vst() function from the DESeq2 package to normalize and transform the data. For more information about these transformation methods, see here.

-
# Normalize and transform the data in the `DESeqDataSet` object using the `vst()` function from the `DESEq2` R package
-dds_norm <- vst(dds)
-

REVIEW

-
-
-

4.6 Format normalized data for WGCNA

-

Here we will use WGCNA to identify co-expressed gene modules. We’ll continue with the top gene module with ORA.

-

Extract the normalized counts to a matrix and transpose it so we can pass it to WGCNA.

-
# First we are going to retrieve the normalized data from the `DESeqDataSet` object using the `assay()` function
-normalized_counts <- assay(dds_norm) %>%
-  t() # We need to transpose this data in preparation for the `WGCNA::blockwiseModules()` function
-
-
-

4.7 Determine parameters for WGCNA

-

WGCNA has to decide what genes are included in which modules, but it might encounter some genes that are close to call between modules. By providing a soft-threshold power parameter, we can guide it on {{I still can’t figure out what power actually represents here}} The “soft” part of a soft threshold just means that this will not be an absolute cutoff and WGCNA will use this parameter with some flexibility.

-
sft <- pickSoftThreshold(normalized_counts,
-  dataIsExpr = TRUE,
-  corFnc = cor,
-  networkType = "signed"
-)
-
## Warning: executing %dopar% sequentially: no parallel backend registered
-
##    Power SFT.R.sq  slope truncated.R.sq mean.k. median.k. max.k.
-## 1      1  0.58200 12.200          0.957 13500.0   13600.0  15500
-## 2      2  0.44500  5.130          0.972  7630.0    7650.0   9910
-## 3      3  0.26300  2.570          0.985  4480.0    4450.0   6680
-## 4      4  0.06480  0.914          0.985  2730.0    2680.0   4720
-## 5      5  0.00662 -0.236          0.964  1720.0    1660.0   3450
-## 6      6  0.15900 -1.010          0.965  1120.0    1060.0   2580
-## 7      7  0.36500 -1.470          0.971   746.0     689.0   1980
-## 8      8  0.50000 -1.730          0.972   509.0     459.0   1550
-## 9      9  0.59700 -1.910          0.972   356.0     313.0   1220
-## 10    10  0.67000 -2.060          0.973   253.0     217.0    982
-## 11    12  0.74000 -2.260          0.970   135.0     110.0    651
-## 12    14  0.79400 -2.320          0.978    76.9      58.6    447
-## 13    16  0.82000 -2.350          0.981    45.9      32.7    315
-## 14    18  0.83800 -2.360          0.985    28.6      18.9    227
-## 15    20  0.84500 -2.350          0.987    18.5      11.2    167
-

This sft object has a lot of information, we will want to plot some of it to figure out what our power soft-threshold should be. We have to first calculate the model fit, the R squared and make that a new variable.

-
sft_df <- data.frame(sft$fitIndices) %>%
-  dplyr::mutate(model_fit = -sign(slope) * SFT.R.sq)
-

Now, let’s plot the model fitting by the power soft threshold so we can decide on a soft-threshold for power.

-
sft_df %>%
-  ggplot2::ggplot() +
-  ggplot2::geom_label(ggplot2::aes(x = Power, y = model_fit), label = sft_df$Power) +
-  # We will plot what WGCNA recommends as an R^2 cutoff
-  ggplot2::geom_hline(yintercept = 0.80, col = "red") +
-  # Just in case our values are low, we want to make sure we can still see the 0.80 level
-  ggplot2::ylim(c(min(sft_df$model_fit), 1)) +
-  # We can add more sensible labels for our axis
-  ggplot2::xlab("Soft Threshold (power)") +
-  ggplot2::ylab("Scale Free Topology Model Fit, signed R^2") +
-  ggplot2::ggtitle("Scale independence") +
-  # This adds some nicer aesthetics to our plot
-  ggplot2::theme_classic()
-

-

Using this plot we can decide on a power parameter. WGCNA’s authors recommend using a power that has an signed R^2 above 0.80, otherwise they warn your results may be too noisy to be meaningful.

-

If you have multiple power values with signed R^2 above 0.80, then picking the one at an inflection point, in other words where the R^2 values seem to have reached their saturation {{SOURCE}}. You want to a power that gives you a big enough R^2 but is not excessively large.

-

So using the plot above, going with a power soft-threshold of 14!

-

If you find you have all very low R^2 values this may be because there are too many genes with low values that are cluttering up the calculations. You can try returning to gene filtering step and choosing a more stringent cutoff (you’ll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped).

-
-
-

4.8 Run WGCNA!

-

We will use the blockwiseModules() function to build a gene coexpression network. The blockwise part refers to that these calculations will be done on chunks of your data at a time to help with computing resources.

-

This will take some time to run.

-

Note that for the power argument we are using 14 which we determined above.

-
bwnet <- blockwiseModules(normalized_counts,
-  TOMType = "signed", # topological overlap matrix
-  power = 14, # soft threshold for network construction
-  randomSeed = 1234 # there's some randomness associated with this calculation
-  # so we should set a seed
-)
-

There are a lot of other settings you can tweak – look at ?blockwiseModules help page to see more.

-

The TOMtype argument specifies what kind of topological overlap matrix (TOM) should be used to make gene modules. You can safely assume for most situations a signed network represents what you want – we want WGCNA to pay attention to directionality. However if you suspect you may benefit from an unsigned network, where positive/negative is ignored see this article to help you figure that out.

-

DRAFT

-
-
-

4.9 Explore our WGCNA results

-

This object, bwnet has a lot of information. We can pull out the parts we are most interested in and may want to use use for plotting.

-

In bwnet we have a data frame of eigengene module data for each sample in the MEs slot. These represent the collapsed and combined expression of the genes a part of each module.

-
module_eigengenes <- bwnet$MEs
-
-# Print out a preview
-head(module_eigengenes)
-
- -
-

If you want to know which of your genes make up which modules, you can look at the $colors slot. This is a named list which associates the genes with the module they are a part of.

-
gene_modules <- bwnet$colors
-
-# Print out a preview
-head(gene_modules)
-
## ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460 
-##          "grey"           "tan"          "grey"          "grey"          "grey" 
-## ENSG00000000938 
-##          "grey"
-

We can turn this into a data frame for handy use.

-
gene_modules_df <- data.frame(gene_modules) %>%
-  tibble::rownames_to_column("Gene")
-

Now we can print out how many genes there are per module.

-
gene_modules_df %>% dplyr::count(gene_modules)
-
- -
-

Note that the color assignments don’t have any particular biological meaning and just function as labels.

-
-
-

4.10 Take a look at eigengenes with your metadata

-

We can also see if our eigengenes relate to our metadata labels.

-

First we would have to make an annotation data frame that we can give to heatmap.

-
# Let's prepare the annotation data.frame for taking a look at our eigengenes
-annotation_df <- metadata %>%
-  # We want to select the variables that we want for annotating the technical replicates heatmap
-  dplyr::select(
-    refinebio_accession_code,
-    refinebio_treatment
-  ) %>%
-  # The `pheatmap()` function requires that the row names of our annotation object matches the column names of our `DESeaDataSet` object
-  tibble::column_to_rownames("refinebio_accession_code")
-

Now we can create a heatmap.

-
pheatmap::pheatmap(module_eigengenes,
-  show_rownames = FALSE,
-  annotation_row = annotation_df
-)
-

-
-
-

4.11 Visualize how correlated these eigengenes are

-
correlate_eigegenes <- cor(bwnet$MEs)
-
pheatmap::pheatmap(correlate_eigegenes,
-  show_rownames = FALSE,
-  show_colnames = FALSE
-)
-

-
-
-

4.12 Write results to file

-
readr::write_rds(bwnet,
-  file = file.path("results", "SRP133573_wgcna_results.RDS")
-)
-
-
-
-

5 Resources for further learning

- -
-
-

6 Session info

-

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

-
# Print session info
-sessionInfo()
-
## R version 4.0.2 (2020-06-22)
-## Platform: x86_64-pc-linux-gnu (64-bit)
-## Running under: Ubuntu 20.04 LTS
-## 
-## Matrix products: default
-## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
-## 
-## locale:
-##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
-##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
-##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
-##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
-##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
-## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
-## 
-## attached base packages:
-## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
-## [8] methods   base     
-## 
-## other attached packages:
-##  [1] WGCNA_1.69                  fastcluster_1.1.25         
-##  [3] dynamicTreeCut_1.63-1       magrittr_1.5               
-##  [5] org.Hs.eg.db_3.12.0         AnnotationDbi_1.52.0       
-##  [7] DESeq2_1.30.0               SummarizedExperiment_1.20.0
-##  [9] Biobase_2.50.0              MatrixGenerics_1.2.0       
-## [11] matrixStats_0.57.0          GenomicRanges_1.42.0       
-## [13] GenomeInfoDb_1.26.0         IRanges_2.24.0             
-## [15] S4Vectors_0.28.0            BiocGenerics_0.36.0        
-## [17] optparse_1.6.6             
-## 
-## loaded via a namespace (and not attached):
-##  [1] colorspace_1.4-1       ellipsis_0.3.1         htmlTable_2.1.0       
-##  [4] XVector_0.30.0         base64enc_0.1-3        rstudioapi_0.11       
-##  [7] farver_2.0.3           getopt_1.20.3          bit64_4.0.5           
-## [10] fansi_0.4.1            codetools_0.2-16       splines_4.0.2         
-## [13] R.methodsS3_1.8.1      doParallel_1.0.15      impute_1.64.0         
-## [16] geneplotter_1.68.0     knitr_1.30             jsonlite_1.7.1        
-## [19] Formula_1.2-3          annotate_1.68.0        cluster_2.1.0         
-## [22] GO.db_3.12.1           png_0.1-7              R.oo_1.24.0           
-## [25] pheatmap_1.0.12        readr_1.4.0            compiler_4.0.2        
-## [28] httr_1.4.2             backports_1.1.10       assertthat_0.2.1      
-## [31] Matrix_1.2-18          cli_2.1.0              htmltools_0.5.0       
-## [34] tools_4.0.2            gtable_0.3.0           glue_1.4.2            
-## [37] GenomeInfoDbData_1.2.4 dplyr_1.0.2            Rcpp_1.0.5            
-## [40] styler_1.3.2           vctrs_0.3.4            preprocessCore_1.52.0 
-## [43] iterators_1.0.12       xfun_0.18              stringr_1.4.0         
-## [46] ps_1.4.0               lifecycle_0.2.0        XML_3.99-0.5          
-## [49] zlibbioc_1.36.0        scales_1.1.1           hms_0.5.3             
-## [52] rematch2_2.1.2         RColorBrewer_1.1-2     yaml_2.2.1            
-## [55] memoise_1.1.0          gridExtra_2.3          ggplot2_3.3.2         
-## [58] rpart_4.1-15           latticeExtra_0.6-29    stringi_1.5.3         
-## [61] RSQLite_2.2.1          genefilter_1.72.0      foreach_1.5.0         
-## [64] checkmate_2.0.0        BiocParallel_1.24.1    rlang_0.4.8           
-## [67] pkgconfig_2.0.3        bitops_1.0-6           evaluate_0.14         
-## [70] lattice_0.20-41        purrr_0.3.4            htmlwidgets_1.5.2     
-## [73] labeling_0.3           bit_4.0.4              tidyselect_1.1.0      
-## [76] R6_2.4.1               generics_0.0.2         Hmisc_4.4-1           
-## [79] DelayedArray_0.16.0    DBI_1.1.0              pillar_1.4.6          
-## [82] foreign_0.8-80         survival_3.1-12        RCurl_1.98-1.2        
-## [85] nnet_7.3-14            tibble_3.0.4           crayon_1.3.4          
-## [88] rmarkdown_2.4          jpeg_0.1-8.1           locfit_1.5-9.4        
-## [91] grid_4.0.2             data.table_1.13.0      blob_1.2.1            
-## [94] digest_0.6.25          xtable_1.8-4           R.cache_0.14.0        
-## [97] R.utils_2.10.1         munsell_0.5.0
-
-
-

References

-
-
-

Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8

-
-
-
- - - - -
-
- -
- - - - - - - - - - - - - - - - From ab3df5b620a7627f082df2485521cf2da10a0a7c Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Thu, 12 Nov 2020 11:20:52 -0500 Subject: [PATCH 10/11] A set up with ANOVA --- .../network-analysis_rnaseq_01_wgcna.Rmd | 153 ++++++++++++------ 1 file changed, 103 insertions(+), 50 deletions(-) diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd index d4328f44..6f710c6d 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -12,7 +12,7 @@ output: # Purpose of this analysis In this example, we use [weighted gene co-expression network analysis (WGCNA)]() to identify co-expressed gene modules. -WGCNA using a series of correlations to determine what genes are expressed together in your dataset. +WGCNA using a series of correlations to determine what genes are expressed together in your data set. ⬇️ [**Jump to the analysis code**](#analysis) ⬇️ @@ -290,12 +290,9 @@ For more information about these transformation methods, [see here](https://alex dds_norm <- vst(dds) ``` -**REVIEW** - ## Format normalized data for WGCNA Here we will use WGCNA to identify co-expressed gene modules. -We'll continue with the top gene module with ORA. Extract the normalized counts to a matrix and transpose it so we can pass it to WGCNA. @@ -351,12 +348,11 @@ WGCNA's authors recommend using a `power` that has an signed R^2 above `0.80`, o If you have multiple power values with signed R^2 above `0.80`, then picking the one at an inflection point, in other words where the R^2 values seem to have reached their saturation [{{SOURCE}}](https://dibernardo.tigem.it/files/papers/2008/zhangbin-statappsgeneticsmolbio.pdf). You want to a `power` that gives you a big enough R^2 but is not excessively large. -So using the plot above, going with a power soft-threshold of `14`! +So using the plot above, going with a power soft-threshold of `16`! If you find you have all very low R^2 values this may be because there are too many genes with low values that are cluttering up the calculations. You can try returning to [gene filtering step](#define-a-minimum-counts-cutoff) and choosing a more stringent cutoff (you'll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped). - ## Run WGCNA! We will use the `blockwiseModules()` function to build a gene coexpression network. @@ -364,13 +360,14 @@ The `blockwise` part refers to that these calculations will be done on chunks of This will take some time to run. -Note that for the `power` argument we are using `14` which we determined above. +Note that for the `power` argument we are using `16` which we determined above. ```{r} bwnet <- blockwiseModules(normalized_counts, TOMType = "signed", # topological overlap matrix - power = 14, # soft threshold for network construction - randomSeed = 1234 # there's some randomness associated with this calculation + power = 16, # soft threshold for network construction + numericLabels = TRUE, # Let's use numbers instead of colors for module labels + randomSeed = 1234, # there's some randomness associated with this calculation # so we should set a seed ) ``` @@ -381,7 +378,13 @@ The `TOMtype` argument specifies what kind of topological overlap matrix (TOM) s You can safely assume for most situations a `signed` network represents what you want -- we want WGCNA to pay attention to directionality. However if you suspect you may benefit from an `unsigned` network, where positive/negative is ignored see [this article](https://peterlangfelder.com/2018/11/25/signed-or-unsigned-which-network-type-is-preferable/) to help you figure that out. -**DRAFT** +## Write main WGCNA results object to file + +```{r} +readr::write_rds(bwnet, + file = file.path("results", "SRP133573_wgcna_results.RDS") +) +``` ## Explore our WGCNA results @@ -397,77 +400,127 @@ module_eigengenes <- bwnet$MEs # Print out a preview head(module_eigengenes) ``` -If you want to know which of your genes make up which modules, you can look at the `$colors` slot. -This is a named list which associates the genes with the module they are a part of. -```{r} -gene_modules <- bwnet$colors +## Which modules have biggest differences across treatment groups -# Print out a preview -head(gene_modules) +We can also see if our eigengenes relate to our metadata labels. +First we would have to make an annotation data frame that we can give to heatmap. + +```{r} +# Let's prepare the annotation data.frame for taking a look at our eigengenes +labeled_eigengenes_df <- module_eigengenes %>% + tibble::rownames_to_column("sample") %>% + dplyr::inner_join(metadata %>% + dplyr::select(refinebio_accession_code, refinebio_treatment), + by = c("sample" = "refinebio_accession_code")) ``` -We can turn this into a data frame for handy use. +Set up a custom function for running ANOVA and extracting summary stats. ```{r} -gene_modules_df <- data.frame(gene_modules) %>% - tibble::rownames_to_column("Gene") +run_lineaar_model <- function(anova_df = labeled_eigengenes_df, + eigengene_col = "ME52", + group_variable_col = "refinebio_treatment") { + # Given a data frame with the variables you want to teest + # + # Args: + # anova_df: a data fram that contains the variables you want to use ANOVA to test + # eigengene_col: a string that is the column name for the eigengene you want to test + # group_variable_col: a string that is the column name for the grouping variable you would like # to look for differences across + # + # Returns: + # The summary statistics from running ANOVA + + # Set up a formula using the eigengene column name and grouping variable name + aov_formula <- formula(paste(eigengene_col, "~", group_variable_col)) + + # Run ANOVA based on that formula + aov_results <- lmFit(aov_formula, data = anova_df) + + # Extract the summary stats + summary_stats <- data.frame(summary(aov_results)$coefficients) %>% + tibble::rownames_to_column("group") %>% + dplyr::filter(variable != ) + + # Make a new column that has the eigengene module name + dplyr::mutate(module = eigengene_col) + + # Return the summary stats + return(summary_stats) +} ``` -Now we can print out how many genes there are per module. +Obtain a vector of the module names we can supply to the `run_anova()` function in the next step. ```{r} -gene_modules_df %>% dplyr::count(gene_modules) +# We need the module column names but not the other column names +module_names <- grep("ME", colnames(labeled_eigengenes_df), value = TRUE) ``` -Note that the color assignments don't have any particular biological meaning and just function as labels. - -## Take a look at eigengenes with your metadata +Run ANOVA on each eigengene module and bind results together. -We can also see if our eigengenes relate to our metadata labels. +```{r} +# Run ANOVA on each module +aov_results_df <- purrr::map(module_names, + ~ run_anova(anova_df = labeled_eigengenes_df, + eigengene_col = .x, + group_variable_col = "refinebio_treatment")) %>% + # Bind all the modules results into one data frame + dplyr::bind_rows() %>% + # The p value column has an annoying name, let's change it, let's also reearrange so module name comes first + dplyr::select(module, dplyr::everything(), p_val = Pr..F.) %>% + # Make a new column of adjusted p values based on multiple testing corrections + dplyr::mutate(p_adjust = p.adjust(p_val, method = "hochberg")) +``` -First we would have to make an annotation data frame that we can give to heatmap. +Let's arrange the results by the smallest adjusted p values. ```{r} -# Let's prepare the annotation data.frame for taking a look at our eigengenes -annotation_df <- metadata %>% - # We want to select the variables that we want for annotating the technical replicates heatmap - dplyr::select( - refinebio_accession_code, - refinebio_treatment - ) %>% - # The `pheatmap()` function requires that the row names of our annotation object matches the column names of our `DESeaDataSet` object - tibble::column_to_rownames("refinebio_accession_code") +aov_results_df %>% + dplyr::arrange(p_adjust) ``` -Now we can create a heatmap. +Module 52 seems to be the most differentially expressed across `refinebio_treatment` groups. +Now we can do some investigation into this module. + +## Let's make plot of module 52? + +Let's use `ggplot` to see what module 52's eigengene looks like between treatment groups. ```{r} -pheatmap::pheatmap(module_eigengenes, - show_rownames = FALSE, - annotation_row = annotation_df -) +labeled_eigengenes_df %>% + ggplot2::ggplot(ggplot2::aes(x = refinebio_treatment, + y = ME52, + color = refinebio_treatment)) + + ggforce::geom_sina() + + ggplot2::theme_classic() ``` -## Visualize how correlated these eigengenes are +This makes sense! + +## What genes are a part of module 52? + +If you want to know which of your genes make up a modules, you can look at the `$colors` slot. +This is a named list which associates the genes with the module they are a part of. +We can turn this into a data frame for handy use. ```{r} -correlate_eigegenes <- cor(bwnet$MEs) +gene_modules_df <- tibble::enframe(bwnet$colors, name = "gene", value = "module") %>% + # Let's add the `ME` part so its more clear what these numbers are and it matches elsewhere + dplyr::mutate(module = paste0("ME", module)) ``` +Now we can find what genes are a part of module 52. + ```{r} -pheatmap::pheatmap(correlate_eigegenes, - show_rownames = FALSE, - show_colnames = FALSE -) +module_52_genes <- gene_modules_df %>% + dplyr::filter(module == "ME52") ``` -## Write results to file +Let's save this gene to module key to a TSV file for future use. ```{r} -readr::write_rds(bwnet, - file = file.path("results", "SRP133573_wgcna_results.RDS") -) +readr::write_tsv(gene_modules_df, file = file.path(results, "SRP133573_wgcna_gene_to_module.tsv")) ``` # Resources for further learning From d16639aa40d7eab61ef4515d1be32a6ed0a0d211 Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Thu, 12 Nov 2020 11:42:03 -0500 Subject: [PATCH 11/11] Use limma instead. Its more straightforward and gives essentially the same results --- .../network-analysis_rnaseq_01_wgcna.Rmd | 82 ++++++------------- 1 file changed, 25 insertions(+), 57 deletions(-) diff --git a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd index 6f710c6d..30d99c20 100644 --- a/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd +++ b/04-advanced-topics/network-analysis_rnaseq_01_wgcna.Rmd @@ -401,83 +401,50 @@ module_eigengenes <- bwnet$MEs head(module_eigengenes) ``` -## Which modules have biggest differences across treatment groups +## Which modules have biggest differences across treatment groups? We can also see if our eigengenes relate to our metadata labels. -First we would have to make an annotation data frame that we can give to heatmap. +First we double check that our samples are still in order. ```{r} -# Let's prepare the annotation data.frame for taking a look at our eigengenes -labeled_eigengenes_df <- module_eigengenes %>% - tibble::rownames_to_column("sample") %>% - dplyr::inner_join(metadata %>% - dplyr::select(refinebio_accession_code, refinebio_treatment), - by = c("sample" = "refinebio_accession_code")) +all.equal(metadata$refinebio_accession_code, rownames(module_eigengenes)) ``` -Set up a custom function for running ANOVA and extracting summary stats. +```{r} +# Create the design matrix from the refinebio_treatment variable +des_mat <- model.matrix(~ metadata$refinebio_treatment) +``` + +Limma wants our tests to be per row, so we need to transpose so the eigengenes are rows ```{r} -run_lineaar_model <- function(anova_df = labeled_eigengenes_df, - eigengene_col = "ME52", - group_variable_col = "refinebio_treatment") { - # Given a data frame with the variables you want to teest - # - # Args: - # anova_df: a data fram that contains the variables you want to use ANOVA to test - # eigengene_col: a string that is the column name for the eigengene you want to test - # group_variable_col: a string that is the column name for the grouping variable you would like # to look for differences across - # - # Returns: - # The summary statistics from running ANOVA - - # Set up a formula using the eigengene column name and grouping variable name - aov_formula <- formula(paste(eigengene_col, "~", group_variable_col)) - - # Run ANOVA based on that formula - aov_results <- lmFit(aov_formula, data = anova_df) - - # Extract the summary stats - summary_stats <- data.frame(summary(aov_results)$coefficients) %>% - tibble::rownames_to_column("group") %>% - dplyr::filter(variable != ) - - # Make a new column that has the eigengene module name - dplyr::mutate(module = eigengene_col) - - # Return the summary stats - return(summary_stats) -} +# Transpose so the eigengenes are rows and samples are columns +module_eigengenes_t <- t(module_eigengenes) ``` -Obtain a vector of the module names we can supply to the `run_anova()` function in the next step. +Run linear model on each module. ```{r} -# We need the module column names but not the other column names -module_names <- grep("ME", colnames(labeled_eigengenes_df), value = TRUE) +# Apply linear model to data +fit <- limma::lmFit(module_eigengenes_t, design = des_mat) + +# Apply empirical Bayes to smooth standard errors +fit <- limma::eBayes(fit) ``` -Run ANOVA on each eigengene module and bind results together. +Apply multiple testing correction and obtain stats in a data frame. ```{r} -# Run ANOVA on each module -aov_results_df <- purrr::map(module_names, - ~ run_anova(anova_df = labeled_eigengenes_df, - eigengene_col = .x, - group_variable_col = "refinebio_treatment")) %>% - # Bind all the modules results into one data frame - dplyr::bind_rows() %>% - # The p value column has an annoying name, let's change it, let's also reearrange so module name comes first - dplyr::select(module, dplyr::everything(), p_val = Pr..F.) %>% - # Make a new column of adjusted p values based on multiple testing corrections - dplyr::mutate(p_adjust = p.adjust(p_val, method = "hochberg")) +# Apply multiple testing correction and obtain stats +stats_df <- limma::topTable(fit, number = nrow(module_eigengenes_t)) %>% + tibble::rownames_to_column("module") ``` Let's arrange the results by the smallest adjusted p values. ```{r} -aov_results_df %>% - dplyr::arrange(p_adjust) +stats_df %>% + dplyr::arrange(adj.P.Val) ``` Module 52 seems to be the most differentially expressed across `refinebio_treatment` groups. @@ -520,7 +487,8 @@ module_52_genes <- gene_modules_df %>% Let's save this gene to module key to a TSV file for future use. ```{r} -readr::write_tsv(gene_modules_df, file = file.path(results, "SRP133573_wgcna_gene_to_module.tsv")) +readr::write_tsv(gene_modules_df, + file = file.path("results", "SRP133573_wgcna_gene_to_module.tsv")) ``` # Resources for further learning