In this example, we use weighted gene co-expression network analysis (WGCNA) to identify co-expressed gene modules. WGCNA using a series of correlations to determine what genes are expressed together in your dataset.
+ +For general information about our tutorials and the basic software packages you will need, please see our ‘Getting Started’ section. We recommend taking a look at our Resources for Learning R if you have not written code in R before.
+.Rmd
fileTo run this example yourself, download the .Rmd
for this analysis by clicking this link.
Clicking this link will most likely send this to your downloads folder on your computer. Move this .Rmd
file to where you would like this example and its files to be stored.
You can open this .Rmd
file in RStudio and follow the rest of these steps from there. (See our section about getting started with R notebooks if you are unfamiliar with .Rmd
files.)
Good file organization is helpful for keeping your data analysis project on track! We have set up some code that will automatically set up a folder structure for you. Run this next chunk to set up your folders!
+If you have trouble running this chunk, see our introduction to using .Rmd
s for more resources and explanations.
# Create the data folder if it doesn't exist
+if (!dir.exists("data")) {
+ dir.create("data")
+}
+
+# Define the file path to the plots directory
+plots_dir <- "plots" # Can replace with path to desired output plots directory
+
+# Create the plots folder if it doesn't exist
+if (!dir.exists(plots_dir)) {
+ dir.create(plots_dir)
+}
+
+# Define the file path to the results directory
+results_dir <- "results" # Can replace with path to desired output results directory
+
+# Create the results folder if it doesn't exist
+if (!dir.exists(results_dir)) {
+ dir.create(results_dir)
+}
In the same place you put this .Rmd
file, you should now have three new empty folders called data
, plots
, and results
!
For general information about downloading data for these examples, see our ‘Getting Started’ section.
+Go to this dataset’s page on refine.bio.
+Click the “Download Now” button on the right side of this screen.
+ +Fill out the pop up window with your email and our Terms and Conditions:
+ +We are going to use non-quantile normalized data for this analysis. To get this data, you will need to check the box that says “Skip quantile normalization for RNA-seq samples”. Note that this option will only be available for RNA-seq datasets.
+ +It may take a few minutes for the dataset to process. You will get an email when it is ready.
+For this example analysis, we will use this prostate cancer dataset. The data that we downloaded from refine.bio for this analysis has 175 RNA-seq samples obtained from 20 patients with prostate cancer. Patients underwent androgen deprivation therapy (ADT) and RNA-seq samples include pre-ADT biopsies and post-ADT prostatectomy specimens.
+data/
folderrefine.bio will send you a download button in the email when it is ready. Follow the prompt to download a zip file that has a name with a series of letters and numbers and ends in .zip
. Double clicking should unzip this for you and create a folder of the same name.
For more details on the contents of this folder see these docs on refine.bio.
+The <experiment_accession_id>
folder has the data and metadata TSV files you will need for this example analysis. Experiment accession ids usually look something like GSE1235
or SRP12345
.
Copy and paste the SRP133573
folder into your newly created data/
folder.
Your new analysis folder should contain:
+.Rmd
you downloadedSRP133573
folder which contains:
+plots
(currently empty)results
(currently empty)Your example analysis folder should now look something like this (except with respective experiment accession ID and analysis notebook name you are using):
+ +In order for our example here to run without a hitch, we need these files to be in these locations so we’ve constructed a test to check before we get started with the analysis. These chunks will declare your file paths and double check that your files are in the right place.
+First we will declare our file paths to our data and metadata files, which should be in our data directory. This is handy to do because if we want to switch the dataset (see next section for more on this) we are using for this analysis, we will only have to change the file path here to get started.
+# Define the file path to the data directory
+data_dir <- file.path("data", "SRP133573") # Replace with accession number which will be the name of the folder the files will be in
+
+# Declare the file path to the gene expression matrix file using the data directory saved as `data_dir`
+data_file <- file.path(data_dir, "SRP133573.tsv") # Replace with file path to your dataset
+
+# Declare the file path to the metadata file using the data directory saved as `data_dir`
+metadata_file <- file.path(data_dir, "metadata_SRP133573.tsv") # Replace with file path to your metadata
Now that our file paths are declared, we can use the file.exists()
function to check that the files are where we specified above.
# Check if the gene expression matrix file is at the file path stored in `data_file`
+file.exists(data_file)
## [1] TRUE
+# Check if the metadata file is at the file path stored in `metadata_file`
+file.exists(metadata_file)
## [1] TRUE
+If the chunk above printed out FALSE
to either of those tests, you won’t be able to run this analysis as is until those files are in the appropriate place.
If the concept of a “file path” is unfamiliar to you; we recommend taking a look at our section about file paths.
+If you’d like to adapt an example analysis to use a different dataset from refine.bio, we recommend placing the files in the data/
directory you created and changing the filenames and paths in the notebook to match these files (we’ve put comments to signify where you would need to change the code). We suggest saving plots and results to plots/
and results/
directories, respectively, as these are automatically created by the notebook. From here you can customize this analysis example to fit your own scientific questions and preferences.
Keep in mind that WGCNA requires at least 15 samples to produce a meaningful result according to its authors. So you will need to make sure the dataset you switch this to is sufficiently large.
+clusterProfiler
- RNA-seqSee our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.
+In this analysis, we will be using WGCNA
package. WGCNA also requires a package called impute
that is sometimes has trouble installing so we recommend installing that first.
if (!("DESeq2" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("DESeq2", update = FALSE)
+}
+
+if (!("impute" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ BiocManager::install("impute")
+}
+
+if (!("WGCNA" %in% installed.packages())) {
+ # Install this package if it isn't installed yet
+ install.packages("WGCNA")
+}
Attach the packages we need for this analysis.
+ +## Loading required package: S4Vectors
+## Loading required package: stats4
+## Loading required package: BiocGenerics
+## Loading required package: parallel
+##
+## Attaching package: 'BiocGenerics'
+## The following objects are masked from 'package:parallel':
+##
+## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
+## clusterExport, clusterMap, parApply, parCapply, parLapply,
+## parLapplyLB, parRapply, parSapply, parSapplyLB
+## The following objects are masked from 'package:stats':
+##
+## IQR, mad, sd, var, xtabs
+## The following objects are masked from 'package:base':
+##
+## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
+## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
+## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
+## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
+## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
+## union, unique, unsplit, which.max, which.min
+##
+## Attaching package: 'S4Vectors'
+## The following object is masked from 'package:base':
+##
+## expand.grid
+## Loading required package: IRanges
+## Loading required package: GenomicRanges
+## Loading required package: GenomeInfoDb
+## Loading required package: SummarizedExperiment
+## Loading required package: MatrixGenerics
+## Loading required package: matrixStats
+##
+## Attaching package: 'MatrixGenerics'
+## The following objects are masked from 'package:matrixStats':
+##
+## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
+## colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
+## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
+## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
+## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
+## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
+## colWeightedMeans, colWeightedMedians, colWeightedSds,
+## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
+## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
+## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
+## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
+## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
+## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
+## rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
+## rowWeightedSds, rowWeightedVars
+## Loading required package: Biobase
+## Welcome to Bioconductor
+##
+## Vignettes contain introductory material; view with
+## 'browseVignettes()'. To cite Bioconductor, see
+## 'citation("Biobase")', and for packages 'citation("pkgname")'.
+##
+## Attaching package: 'Biobase'
+## The following object is masked from 'package:MatrixGenerics':
+##
+## rowMedians
+## The following objects are masked from 'package:matrixStats':
+##
+## anyMissing, rowMedians
+
+## Loading required package: AnnotationDbi
+##
+# We will need this so we can use the pipe: %>%
+library(magrittr)
+
+# We'll need this for finding gene modules
+library(WGCNA)
## Loading required package: dynamicTreeCut
+## Loading required package: fastcluster
+##
+## Attaching package: 'fastcluster'
+## The following object is masked from 'package:stats':
+##
+## hclust
+##
+##
+## Attaching package: 'WGCNA'
+## The following object is masked from 'package:IRanges':
+##
+## cor
+## The following object is masked from 'package:S4Vectors':
+##
+## cor
+## The following object is masked from 'package:stats':
+##
+## cor
+Data downloaded from refine.bio include a metadata tab separated values (TSV) file and a data TSV file. This chunk of code will read the both TSV files and add them as data frames to your environment.
+We stored our file paths as objects named metadata_file
and data_file
in this previous step.
##
+## ── Column specification ────────────────────────────────────────────────────────────────────────────────────
+## cols(
+## .default = col_character(),
+## refinebio_age = col_logical(),
+## refinebio_cell_line = col_logical(),
+## refinebio_compound = col_logical(),
+## refinebio_disease_stage = col_logical(),
+## refinebio_genetic_information = col_logical(),
+## refinebio_processed = col_logical(),
+## refinebio_sex = col_logical(),
+## refinebio_source_archive_url = col_logical(),
+## refinebio_specimen_part = col_logical(),
+## refinebio_time = col_logical()
+## )
+## ℹ Use `spec()` for the full column specifications.
+# Read in data TSV file
+df <- readr::read_tsv(data_file) %>%
+ # Here we are going to store the gene IDs as rownames so that we can have a numeric matrix to perform calculations on later
+ tibble::column_to_rownames("Gene")
##
+## ── Column specification ────────────────────────────────────────────────────────────────────────────────────
+## cols(
+## .default = col_double(),
+## Gene = col_character()
+## )
+## ℹ Use `spec()` for the full column specifications.
+Let’s ensure that the metadata and data are in the same sample order.
+# Make the data in the order of the metadata
+df <- df %>%
+ dplyr::select(metadata$refinebio_accession_code)
+
+# Check if this is in the same order
+all.equal(colnames(df), metadata$refinebio_accession_code)
## [1] TRUE
+DESeq2
We need to make sure all of the values in our data are converted to integers as required by a DESeq2
function we will use later.
We will be using the DESeq2
package for normalizing and transforming our data, which requires us to format our data into a DESeqDataSet
object. We turn the data frame (or matrix) into a DESeqDataSet
object. ) and specify which variable labels our experimental groups using the design
argument (Love et al. 2014). In this chunk of code, we will not provide a specific model to the design
argument because we are not performing a differential expression analysis.
# Create a `DESeqDataSet` object
+dds <- DESeqDataSetFromMatrix(
+ countData = df, # This is the data frame with the counts values for all replicates in our dataset
+ colData = metadata, # This is the data frame with the annotation data for the replicates in the counts data frame
+ design = ~1, # Here we are not specifying a model -- Replace with an appropriate design variable for your analysis
+ tidy = TRUE
+)
## converting counts to integer mode
+We want to filter out the genes that have not been expressed or that have low expression counts. This is recommended by WGCNA docs for RNA-seq data. Removing low count genes can help improve your WGCNA results. We are going to do some pre-filtering to keep only genes with 10 or more reads in total across the samples.
+ +We often suggest normalizing and transforming your data for various applications and in this instance WGCNA’s authors suggest using variance stablizing transformation before running WGCNA.
+We are going to use the vst()
function from the DESeq2
package to normalize and transform the data. For more information about these transformation methods, see here.
# Normalize and transform the data in the `DESeqDataSet` object using the `vst()` function from the `DESEq2` R package
+dds_norm <- vst(dds)
REVIEW
+Here we will use WGCNA to identify co-expressed gene modules. We’ll continue with the top gene module with ORA.
+Extract the normalized counts to a matrix and transpose it so we can pass it to WGCNA.
+ +WGCNA has to decide what genes are included in which modules, but it might encounter some genes that are close to call between modules. By providing a soft-threshold power parameter, we can guide it on {{I still can’t figure out what power actually represents here}} The “soft” part of a soft threshold just means that this will not be an absolute cutoff and WGCNA will use this parameter with some flexibility.
+sft <- pickSoftThreshold(normalized_counts,
+ dataIsExpr = TRUE,
+ corFnc = cor,
+ networkType = "signed"
+)
## Warning: executing %dopar% sequentially: no parallel backend registered
+## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
+## 1 1 0.58200 12.200 0.957 13500.0 13600.0 15500
+## 2 2 0.44500 5.130 0.972 7630.0 7650.0 9910
+## 3 3 0.26300 2.570 0.985 4480.0 4450.0 6680
+## 4 4 0.06480 0.914 0.985 2730.0 2680.0 4720
+## 5 5 0.00662 -0.236 0.964 1720.0 1660.0 3450
+## 6 6 0.15900 -1.010 0.965 1120.0 1060.0 2580
+## 7 7 0.36500 -1.470 0.971 746.0 689.0 1980
+## 8 8 0.50000 -1.730 0.972 509.0 459.0 1550
+## 9 9 0.59700 -1.910 0.972 356.0 313.0 1220
+## 10 10 0.67000 -2.060 0.973 253.0 217.0 982
+## 11 12 0.74000 -2.260 0.970 135.0 110.0 651
+## 12 14 0.79400 -2.320 0.978 76.9 58.6 447
+## 13 16 0.82000 -2.350 0.981 45.9 32.7 315
+## 14 18 0.83800 -2.360 0.985 28.6 18.9 227
+## 15 20 0.84500 -2.350 0.987 18.5 11.2 167
+This sft
object has a lot of information, we will want to plot some of it to figure out what our power
soft-threshold should be. We have to first calculate the model fit, the R squared and make that a new variable.
Now, let’s plot the model fitting by the power
soft threshold so we can decide on a soft-threshold for power.
sft_df %>%
+ ggplot2::ggplot() +
+ ggplot2::geom_label(ggplot2::aes(x = Power, y = model_fit), label = sft_df$Power) +
+ # We will plot what WGCNA recommends as an R^2 cutoff
+ ggplot2::geom_hline(yintercept = 0.80, col = "red") +
+ # Just in case our values are low, we want to make sure we can still see the 0.80 level
+ ggplot2::ylim(c(min(sft_df$model_fit), 1)) +
+ # We can add more sensible labels for our axis
+ ggplot2::xlab("Soft Threshold (power)") +
+ ggplot2::ylab("Scale Free Topology Model Fit, signed R^2") +
+ ggplot2::ggtitle("Scale independence") +
+ # This adds some nicer aesthetics to our plot
+ ggplot2::theme_classic()
Using this plot we can decide on a power parameter. WGCNA’s authors recommend using a power
that has an signed R^2 above 0.80
, otherwise they warn your results may be too noisy to be meaningful.
If you have multiple power values with signed R^2 above 0.80
, then picking the one at an inflection point, in other words where the R^2 values seem to have reached their saturation {{SOURCE}}. You want to a power
that gives you a big enough R^2 but is not excessively large.
So using the plot above, going with a power soft-threshold of 14
!
If you find you have all very low R^2 values this may be because there are too many genes with low values that are cluttering up the calculations. You can try returning to gene filtering step and choosing a more stringent cutoff (you’ll then need to re-run the transformation and subsequent steps to remake this plot to see if that helped).
+We will use the blockwiseModules()
function to build a gene coexpression network. The blockwise
part refers to that these calculations will be done on chunks of your data at a time to help with computing resources.
This will take some time to run.
+Note that for the power
argument we are using 14
which we determined above.
bwnet <- blockwiseModules(normalized_counts,
+ TOMType = "signed", # topological overlap matrix
+ power = 14, # soft threshold for network construction
+ randomSeed = 1234 # there's some randomness associated with this calculation
+ # so we should set a seed
+)
There are a lot of other settings you can tweak – look at ?blockwiseModules
help page to see more.
The TOMtype
argument specifies what kind of topological overlap matrix (TOM) should be used to make gene modules. You can safely assume for most situations a signed
network represents what you want – we want WGCNA to pay attention to directionality. However if you suspect you may benefit from an unsigned
network, where positive/negative is ignored see this article to help you figure that out.
DRAFT
+This object, bwnet
has a lot of information. We can pull out the parts we are most interested in and may want to use use for plotting.
In bwnet
we have a data frame of eigengene module data for each sample in the MEs
slot. These represent the collapsed and combined expression of the genes a part of each module.
If you want to know which of your genes make up which modules, you can look at the $colors
slot. This is a named list which associates the genes with the module they are a part of.
## ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460
+## "grey" "tan" "grey" "grey" "grey"
+## ENSG00000000938
+## "grey"
+We can turn this into a data frame for handy use.
+ +Now we can print out how many genes there are per module.
+ +Note that the color assignments don’t have any particular biological meaning and just function as labels.
+We can also see if our eigengenes relate to our metadata labels.
+First we would have to make an annotation data frame that we can give to heatmap.
+# Let's prepare the annotation data.frame for taking a look at our eigengenes
+annotation_df <- metadata %>%
+ # We want to select the variables that we want for annotating the technical replicates heatmap
+ dplyr::select(
+ refinebio_accession_code,
+ refinebio_treatment
+ ) %>%
+ # The `pheatmap()` function requires that the row names of our annotation object matches the column names of our `DESeaDataSet` object
+ tibble::column_to_rownames("refinebio_accession_code")
Now we can create a heatmap.
+ + +At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.
+ +## R version 4.0.2 (2020-06-22)
+## Platform: x86_64-pc-linux-gnu (64-bit)
+## Running under: Ubuntu 20.04 LTS
+##
+## Matrix products: default
+## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
+##
+## locale:
+## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
+## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
+## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
+## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
+## [9] LC_ADDRESS=C LC_TELEPHONE=C
+## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+##
+## attached base packages:
+## [1] parallel stats4 stats graphics grDevices utils datasets
+## [8] methods base
+##
+## other attached packages:
+## [1] WGCNA_1.69 fastcluster_1.1.25
+## [3] dynamicTreeCut_1.63-1 magrittr_1.5
+## [5] org.Hs.eg.db_3.12.0 AnnotationDbi_1.52.0
+## [7] DESeq2_1.30.0 SummarizedExperiment_1.20.0
+## [9] Biobase_2.50.0 MatrixGenerics_1.2.0
+## [11] matrixStats_0.57.0 GenomicRanges_1.42.0
+## [13] GenomeInfoDb_1.26.0 IRanges_2.24.0
+## [15] S4Vectors_0.28.0 BiocGenerics_0.36.0
+## [17] optparse_1.6.6
+##
+## loaded via a namespace (and not attached):
+## [1] colorspace_1.4-1 ellipsis_0.3.1 htmlTable_2.1.0
+## [4] XVector_0.30.0 base64enc_0.1-3 rstudioapi_0.11
+## [7] farver_2.0.3 getopt_1.20.3 bit64_4.0.5
+## [10] fansi_0.4.1 codetools_0.2-16 splines_4.0.2
+## [13] R.methodsS3_1.8.1 doParallel_1.0.15 impute_1.64.0
+## [16] geneplotter_1.68.0 knitr_1.30 jsonlite_1.7.1
+## [19] Formula_1.2-3 annotate_1.68.0 cluster_2.1.0
+## [22] GO.db_3.12.1 png_0.1-7 R.oo_1.24.0
+## [25] pheatmap_1.0.12 readr_1.4.0 compiler_4.0.2
+## [28] httr_1.4.2 backports_1.1.10 assertthat_0.2.1
+## [31] Matrix_1.2-18 cli_2.1.0 htmltools_0.5.0
+## [34] tools_4.0.2 gtable_0.3.0 glue_1.4.2
+## [37] GenomeInfoDbData_1.2.4 dplyr_1.0.2 Rcpp_1.0.5
+## [40] styler_1.3.2 vctrs_0.3.4 preprocessCore_1.52.0
+## [43] iterators_1.0.12 xfun_0.18 stringr_1.4.0
+## [46] ps_1.4.0 lifecycle_0.2.0 XML_3.99-0.5
+## [49] zlibbioc_1.36.0 scales_1.1.1 hms_0.5.3
+## [52] rematch2_2.1.2 RColorBrewer_1.1-2 yaml_2.2.1
+## [55] memoise_1.1.0 gridExtra_2.3 ggplot2_3.3.2
+## [58] rpart_4.1-15 latticeExtra_0.6-29 stringi_1.5.3
+## [61] RSQLite_2.2.1 genefilter_1.72.0 foreach_1.5.0
+## [64] checkmate_2.0.0 BiocParallel_1.24.1 rlang_0.4.8
+## [67] pkgconfig_2.0.3 bitops_1.0-6 evaluate_0.14
+## [70] lattice_0.20-41 purrr_0.3.4 htmlwidgets_1.5.2
+## [73] labeling_0.3 bit_4.0.4 tidyselect_1.1.0
+## [76] R6_2.4.1 generics_0.0.2 Hmisc_4.4-1
+## [79] DelayedArray_0.16.0 DBI_1.1.0 pillar_1.4.6
+## [82] foreign_0.8-80 survival_3.1-12 RCurl_1.98-1.2
+## [85] nnet_7.3-14 tibble_3.0.4 crayon_1.3.4
+## [88] rmarkdown_2.4 jpeg_0.1-8.1 locfit_1.5-9.4
+## [91] grid_4.0.2 data.table_1.13.0 blob_1.2.1
+## [94] digest_0.6.25 xtable_1.8-4 R.cache_0.14.0
+## [97] R.utils_2.10.1 munsell_0.5.0
+Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8
+