diff --git a/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd b/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd index c5977247..4b574bf1 100644 --- a/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd +++ b/03-rnaseq/pathway-analysis_rnaseq_02_gsea.Rmd @@ -13,10 +13,8 @@ output: This example is one of pathway analysis module set, we recommend looking at the [pathway analysis table below](#how-to-choose-a-pathway-analysis) to help you determine which pathway analysis method is best suited for your purposes. -This particular example analysis shows how you can use gene set enrichment analysis (GSEA) to detect situations where all genes in a predefined gene set change in a coordinated way, detecting even small statistical but coordinated changes between two biological states. -Genes are ranked from most highly positive to most highly negative, weighted according to their gene-level statistic, which is essential to the calculation of the enrichment score (ES), a pathway-level statistic, for each gene set. -More specifically, an ES is calculated by starting with the most highly ranked genes (based on the gene-level statistic selected for ranking) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. -Normalized enrichment scores (NES) are enrichment scores that are scaled to account for gene sets of different sizes [@Subramanian2005; @Korotkevich2019]. +This particular example analysis shows how you can use Gene Set Enrichment Analysis (GSEA) to detect situations where genes in a predefined gene set or pathway change in a coordinated way between two conditions [@Subramanian2005]. +Changes at the pathway-level may be statistically significant, and contribute to phenotypic differences, even if the changes in the expression level of individual genes are small. ⬇️ [**Jump to the analysis code**](#analysis) ⬇️ @@ -128,7 +126,7 @@ From here you can customize this analysis example to fit your own scientific que See our Getting Started page with [instructions for package installation](https://alexslemonade.github.io/refinebio-examples/01-getting-started/getting-started.html#what-you-need-to-install) for a list of the other software you will need, as well as more tips and resources. -In this analysis, we will be using [`clusterProfiler`](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) package to perform GSEA and the [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package which contains gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) already in the tidy format required by `clusterProfiler` [@Yu2012; @Igor2020; @Subramanian2005]. +In this analysis, we will be using [`clusterProfiler`](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) package to perform GSEA and the [`msigdbr`](https://cran.r-project.org/web/packages/msigdbr/index.html) package which contains gene sets from the [Molecular Signatures Database (MSigDB)](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp) already in the tidy format required by `clusterProfiler` [@Yu2012; @Igor2020; @Subramanian2005; @Liberzon2011]. We will also need the [`org.Mm.eg.db`](https://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html) package to perform gene identifier conversion [@Carlson2019]. @@ -165,13 +163,6 @@ library(org.Mm.eg.db) library(magrittr) ``` -The GSEA algorithm utilizes random sampling so we are going to set the seed to make our results reproducible. - -```{r} -# Set the seed so our results are reproducible: -set.seed(2020) -``` - ## Download data file We will read in the differential expression results we will download from online. @@ -231,33 +222,33 @@ Let's take a look at what the contrast results from the differential expression dge_df ``` -## Getting familiar with `clusterProfiler`'s options +## Getting familiar with MSigDB gene sets available via `msigdbr` + +The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses [@Subramanian2005; @Liberzon2011]. +We can use the `msigdbr` package to access these gene sets in a format compatible with the package we'll use for analysis, `clusterProfiler` [@Igor2020; @Yu2012]. + +The gene sets available directly from MSigDB are applicable to human studies. +`msigdbr` also supports commonly studied model organisms. -Let's take a look at what organisms the package supports. +Let's take a look at what organisms the package supports with `msigdbr_species()`. ```{r} msigdbr_species() ``` -MSigDB contains 8 different gene set collections [@Subramanian2005]. +MSigDB contains [8 different gene set collections](https://www.gsea-msigdb.org/gsea/msigdb/collections.jsp) [@Subramanian2005; @Liberzon2011] that are distinguished by how they are derived (e.g., computationally mined, curated). - H: hallmark gene sets - C1: positional gene sets - C2: curated gene sets - C3: motif gene sets - C4: computational gene sets - C5: GO gene sets - C6: oncogenic signatures - C7: immunologic signatures +In this example, we will use a collection called Hallmark gene sets for GSEA [@Liberzon2015]. +Here's an excerpt of [the collection description from MSigDB](https://www.gsea-msigdb.org/gsea/msigdb/collection_details.jsp#H): -MSigDB includes a collection called Hallmark gene sets. -Here's an excerpt of the collection description [@Liberzon2015]: +> Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. +> These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. +> The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA. -Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. -These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. -The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA. +Notably, there are only 50 gene sets included in this collection. +The fewer gene sets we test, the lower our multiple hypothesis testing burden. -The data we're interested in here comes from mouse samples, so we can obtain just the gene sets relevant to _M. musculus_ by specifying `species = "Mus musculus"` and only the Hallmark gene sets by specifying `category = "H"` to the `msigdbr()` function. +The data we're interested in here comes from mouse samples, so we can obtain only the Hallmarks gene sets relevant to _M. musculus_ by specifying `category = "H"` and `species = "Mus musculus"`, respectively, to the `msigdbr()` function. ```{r} mm_hallmark_sets <- msigdbr( @@ -271,7 +262,7 @@ See `?msigdbr` for more options. Let's preview what's in `mm_hallmark_sets`. -```{r} +```{r rownames.print=FALSE} head(mm_hallmark_sets) ``` @@ -295,19 +286,24 @@ keytypes(org.Mm.eg.db) Even though we'll use this package to convert from Ensembl gene IDs (`ENSEMBL`) to gene symbols (`SYMBOL`), we could just as easily use it to convert to and from any of these `keytypes()` listed above. -The function we will use to map from Ensembl gene IDs to gene symbols is called `mapIds()`. +The function we will use to map from Ensembl gene IDs to gene symbols is called `mapIds()` and comes from the `AnnotationDbi`. Let's create a data frame that shows the mapped gene symbols along with the differential expression stats for the respective Ensembl IDs. ```{r} -# First let's create a mapped data frame we can join to the differential expression stats +# First let's create a mapped data frame we can join to the differential +# expression stats dge_mapped_df <- data.frame( - "gene_symbol" = mapIds( - org.Mm.eg.db, # Replace with annotation package for the organism relevant to your data + gene_symbol = mapIds( + # Replace with annotation package for the organism relevant to your data + org.Mm.eg.db, keys = dge_df$Gene, - column = "SYMBOL", # Replace with the type of gene identifiers you would like to map to - keytype = "ENSEMBL", # Replace with the type of gene identifiers in your data - multiVals = "first" # This will keep only the first mapped value for each Ensembl ID + # Replace with the type of gene identifiers in your data + keytype = "ENSEMBL", + # Replace with the type of gene identifiers you would like to map to + column = "SYMBOL", + # This will keep only the first mapped value for each Ensembl ID + multiVals = "first" ) ) %>% # If an Ensembl gene identifier doesn't map to a gene symbol, drop that @@ -327,14 +323,29 @@ Take a look at our other gene identifier conversion examples for examples with d Let's see a preview of `dge_mapped_df`. -```{r} +```{r rownames.print=FALSE} head(dge_mapped_df) ``` -We will want to keep in mind that GSEA requires that genes are ranked from most highly positive to most highly negative and weighted according to their gene-level statistic, which is essential to the calculation of the enrichment score (ES), a pathway-level statistic. -GSEA also requires unique gene identifiers to produce the most accurate results. -This means if any duplicate gene identifiers are found in our dataset, we will want to retain the instance with the higher absolute value as this will be the instance most likely to be ranked most highly negative or most highly positive. -Otherwise the enrichment score results may be skewed and the `GSEA()` function will throw a warning. +## Perform gene set enrichment analysis (GSEA) + +The goal of GSEA is to detect situations where many genes in a gene set change in a coordinated way, even when individual changes are small in magnitude [@Subramanian2005]. + +GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic. +This score reflects whether or not a gene set or pathway is overrepresented at the top or bottom of the gene rankings [@Subramanian2005; @clusterProfiler-book]. +Specifically, genes are ranked from most positive to most negative based on their statistic and a running sum is calculated by starting with the most highly ranked genes and increasing the score when a gene is in the pathway and decreasing the score when a gene is not. +In this example, the enrichment score for a pathway is the running sum's maximum deviation from zero. +GSEA also assesses statistical significance of the scores for each pathway through permutation testing. +As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing [@Subramanian2005; @clusterProfiler-book]. + +The implementation of GSEA we use in this examples requires a gene list ordered by some statistic (here we'll use log2 fold changes calculated as part of differential gene expression analysis) and input gene sets (Hallmark collection). +When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked. + +### Determine our pre-ranked genes list + +The `GSEA()` function takes a pre-ranked and sorted named vector of statistics, where the names in the vector are gene identifiers. +It requires _unique gene identifiers_ to produce the most accurate results, so we will need to resolve any duplicates found in our dataset. +(The `GSEA()` function will throw a warning if we do not do this ahead of time.) Let's check to see if we have any gene symbols that mapped to multiple Ensembl IDs. @@ -343,7 +354,7 @@ any(duplicated(dge_mapped_df$gene_symbol)) ``` Looks like we do have duplicated gene symbols. -Let's find out which gene symbols have been duplicated. +Let's find out which ones. ```{r} dup_gene_symbols <- dge_mapped_df %>% @@ -361,25 +372,30 @@ dge_mapped_df %>% We can see that the associated values vary for each row. -As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the GSEA steps later, so let's keep the gene symbols associated with the higher absolute value of the log2 fold change. -GSEA relies on genes' rankings on the basis of this statistic. -Retaining the instance of the gene symbol with the higher absolute value means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list. +As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the `GSEA()` step. +Let's keep the Entrez IDs associated with the higher absolute value of the log2 fold change. +GSEA relies on genes' rankings on the basis of a gene-level statistic and the enrichment score that is calculated reflects the degree to which genes in a gene set are overrepresented in the top or bottom of the rankings [@Subramanian2005; @clusterProfiler-book]. + +Retaining the instance of the Entrez ID with the higher absolute value of a gene-level statistic means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list. We should keep this decision in mind when interpreting our results. -For example, if the duplicate identifiers tended to be enriched in a particular gene set, we may get an optimistic view of how perturbed that gene set is by preferentially selecting values that have a higher absolute value. +For example, if all the duplicate identifiers happened to be in a particular gene set, we may get an overly optimistic view of how perturbed that gene set is because we preferentially selected instances of the identifier that have a higher absolute value of the statistic used for ranking. + +We are removing values for 33 out of thousands of genes here, so it is unlikely to have a considerable impact on our results. In the next chunk, we are going to filter out the duplicated row using the `dplyr::distinct()` function This will keep the first row with the duplicated value thus keeping the row with the highest absolute value of the log2 fold change. ```{r} filtered_dge_mapped_df <- dge_mapped_df %>% - # Sort so that the highest absolute values of the log2 fold change are at the top + # Sort so that the highest absolute values of the log2 fold change are at the + # top dplyr::arrange(dplyr::desc(abs(log2FoldChange))) %>% # Filter out the duplicated rows using `dplyr::distinct()` dplyr::distinct(gene_symbol, .keep_all = TRUE) ``` -Note that the log2 fold change values we use here have been transformed by `DESeq2` to account for genes with low counts or highly variable counts. -See the [`DESeq2` package vignette](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) for more information on how `DESeq2` handles the log2 fold change values with the `lfcShrink()` function. +_Note that the log2 fold change estimates we use here have been subject to shrinkage to account for genes with low counts or highly variable counts. +See the [`DESeq2` package vignette](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) for more information on how `DESeq2` handles the log2 fold change values with the `lfcShrink()` function._ Let's check to see that we removed the duplicate gene symbols and kept the rows with the higher absolute value of the log2 fold change. @@ -389,16 +405,6 @@ any(duplicated(filtered_dge_mapped_df$gene_symbol)) Looks like we were able to successfully get rid of the duplicate gene identifiers and keep the observations with the higher absolute value of the log2 fold change! -## Perform gene set enrichment analysis (GSEA) - -The goal of GSEA is to detect situations where all genes in a gene set change in a coordinated way, detecting even small statistical but coordinated changes. -This is done by ranking genes within a gene set from most highly positive to most highly negative, weighting them according to their gene-level statistic, to calculate the enrichment score (ES), which is a pathway-level statistic [@clusterProfiler-book]. - -### Determine our pre-ranked genes list - -The `GSEA()` function takes a pre-ranked and sorted named vector of statistics, where the names in the vector are gene identifiers. -We will therefore need to create a gene list with statistics that GSEA will rank by. - In the next chunk, we will create a named vector ranked based on the gene-level log2 fold change values. ```{r} @@ -419,19 +425,29 @@ head(lfc_vector) ### Run GSEA using the `GSEA()` function -Genes were ranked from most highly positive to most highly negative, weighted according to their gene-level statistic, in the previous section. -In this section, we will implement GSEA to calculate the enrichment score (ES) for each gene set using our pre-ranked gene list. +Genes were ranked from most positive to most negative, weighted according to their gene-level statistic, in the previous section. +In this section, we will implement GSEA to calculate the enrichment score for each gene set using our pre-ranked gene list. + +The GSEA algorithm utilizes random sampling so we are going to set the seed to make our results reproducible. + +```{r} +# Set the seed so our results are reproducible: +set.seed(2020) +``` We can use the `GSEA()` function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., `gseKEGG()`). +Significance is assessed by permuting the gene labels of the pre-ranked gene list and recomputing the enrichment scores of the gene set for the permuted data, which generates a null distribution for the enrichment score. +The `pAdjustMethod` argument to `GSEA()` above specifies what method to use for adjusting the p-values to account for multiple hypothesis testing; the `pvalueCutoff` argument tells the function to only return pathways with adjusted p-values less than that threshold in the `results` slot. + ```{r} gsea_results <- GSEA( - geneList = lfc_vector, # ordered ranked gene list - minGSSize = 25, # minimum gene set size - maxGSSize = 500, # maximum gene set set - pvalueCutoff = 0.05, # p value cutoff - eps = 0, # boundary for calculating the p value - seed = TRUE, # set seed to make results reproducible + geneList = lfc_vector, # Ordered ranked gene list + minGSSize = 25, # Minimum gene set size + maxGSSize = 500, # Maximum gene set set + pvalueCutoff = 0.05, # p-value cutoff + eps = 0, # Boundary for calculating the p value + seed = TRUE, # Set seed to make results reproducible pAdjustMethod = "BH", # Benjamini-Hochberg correction TERM2GENE = dplyr::select( mm_hallmark_sets, @@ -442,27 +458,24 @@ gsea_results <- GSEA( ``` The warning message above tells us that there are few genes that have the same log2 fold change value and they are therefore ranked equally. -This means that `GSEA()` will arbitrarily choose which comes first in the ranked list [@GSEA-user-guide]. +`fgsea`, the method that underlies `GSEA()`, will arbitrarily choose which comes first in the ranked list [@gene-set-testing-rnaseq]. This percentage of `0.19` is small, so we are not concerned that it will significantly effect our results. -If the percentage was large on the other hand, we would be concerned about the log2 fold change results. +If the percentage much larger on the other hand, we would be concerned about the log2 fold change results. -Let's take a look at the results. +Let's take a look at the table in the `result` slot of `gsea_results`. -```{r} +```{r rownames.print=FALSE} # We can access the results from our `gsea_results` object using `@result` head(gsea_results@result) ``` -Significance is assessed by permuting the gene labels of the pre-ranked gene list and recomputing the ES of the gene set for the permuted data, which generates a null distribution for the ES. -The ES for each gene set is then normalized to account for the size of the set, resulting in a normalized enrichment score (NES), and an FDR (false discovery rate) value is calculated to account for multiple hypothesis testing. - -Looks like we have gene sets returned as significant at FDR of `0.05`, according to the `p.adjust` column in the table above. - -The information we're most likely interested in is in the `results` slot. -Note that if we didn't have any significant results, our `results` slot would be empty as nothing would have met our `pvalueCutoff` above. +Looks like we have gene sets returned as significant at FDR (false discovery rate) of `0.05`. +If we did not have results that met the `pvalueCutoff` condition, this table would be empty. If we wanted all results returned we would need to set the `pvalueCutoff = 1`. -Let's convert this into a data frame that we can use for further analysis and write to file. +The `NES` column contains the normalized enrichment score, which normalizes for the gene set size, for that pathway. + +Let's convert the contents of `result` into a data frame that we can use for further analysis and write to a file later. ```{r} gsea_result_df <- data.frame(gsea_results@result) @@ -475,11 +488,11 @@ Let's take a look at 2 different pathways -- one with a highly positive NES and ### Most Positive NES -Let's look for the gene set with the most positive NES. +Let's look at the 3 gene sets with the most positive NES. -```{r} +```{r rownames.print=FALSE} gsea_result_df %>% - # Use `slice_max()` to return the top n observation using `NES` as the ordering variable + # This returns the 3 rows with the largest NES values dplyr::slice_max(NES, n = 3) ``` @@ -495,9 +508,9 @@ most_positive_nes_plot <- enrichplot::gseaplot( most_positive_nes_plot ``` -The red dashed line indicates the peak ES score (the ES is the maximum deviation from zero). -As mentioned earlier, an ES is calculated by starting with the most highly ranked genes (according to the gene-level log2 fold change values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. -In this case, the most highly positive enrichment score's data are being displayed. +Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores. +The red dashed line indicates the enrichment score, which is the maximum deviation from zero. +As mentioned earlier, an enrichment is calculated by starting with the most highly ranked genes (according to the gene-level log2 fold changes values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. The plots returned by `enrichplot::gseaplot` are ggplots, so we can use `ggplot2::ggsave()` to save them to file. @@ -511,12 +524,11 @@ ggplot2::ggsave(file.path(plots_dir, "SRP123625_gsea_enrich_positive_plot.png"), ### Most Negative NES -Let's look for the gene set with the most negative NES. +Let's look for the 3 gene sets with the most negative NES. -```{r} +```{r rownames.print = FALSE} gsea_result_df %>% - # Use `slice_min()` to return the bottom n observation using `NES` as the ordering - # variable + # Return the 3 rows with the smallest (most negative) NES values dplyr::slice_min(NES, n = 3) ``` @@ -532,12 +544,11 @@ most_negative_nes_plot <- enrichplot::gseaplot( most_negative_nes_plot ``` -Again, the red dashed line here indicates the maximum deviation from zero, in other words, the most negative ES score. -As we know, the ES is calculated by starting with the most highly ranked genes (according to the gene-level log2 fold change values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. -A negative enrichment score will be returned when not many of the genes are found at the top of the list, as this would mean decreasing the score a great deal thus producing a negative value. -In this case, the most negative enrichment score's data are being displayed. +This gene set shows the opposite pattern -- genes in the pathway tend to be on the right side of the graph. +Again, the red dashed line here indicates the maximum deviation from zero, in other words, the enrichment score. +A _negative_ enrichment score will be returned when many genes are near the bottom of the ranked list. -Let's save this plot to PNG. +Let's save this plot to PNG as well. ```{r} ggplot2::ggsave(file.path(plots_dir, "SRP123625_gsea_enrich_negative_plot.png"), diff --git a/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html b/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html index a4201bf1..f37b1eb9 100644 --- a/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html +++ b/03-rnaseq/pathway-analysis_rnaseq_02_gsea.html @@ -3782,7 +3782,7 @@

December 2020

1 Purpose of this analysis

This example is one of pathway analysis module set, we recommend looking at the pathway analysis table below to help you determine which pathway analysis method is best suited for your purposes.

-

This particular example analysis shows how you can use gene set enrichment analysis (GSEA) to detect situations where all genes in a predefined gene set change in a coordinated way, detecting even small statistical but coordinated changes between two biological states. Genes are ranked from most highly positive to most highly negative, weighted according to their gene-level statistic, which is essential to the calculation of the enrichment score (ES), a pathway-level statistic, for each gene set. More specifically, an ES is calculated by starting with the most highly ranked genes (based on the gene-level statistic selected for ranking) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. Normalized enrichment scores (NES) are enrichment scores that are scaled to account for gene sets of different sizes (Korotkevich et al. 2019; Subramanian et al. 2005).

+

This particular example analysis shows how you can use Gene Set Enrichment Analysis (GSEA) to detect situations where genes in a predefined gene set or pathway change in a coordinated way between two conditions (Subramanian et al. 2005). Changes at the pathway-level may be statistically significant, and contribute to phenotypic differences, even if the changes in the expression level of individual genes are small.

⬇️ Jump to the analysis code ⬇️

1.0.1 What is pathway analysis?

@@ -3906,7 +3906,7 @@

4 Gene set enrichment analysis -

4.1 Install libraries

See our Getting Started page with instructions for package installation for a list of the other software you will need, as well as more tips and resources.

-

In this analysis, we will be using clusterProfiler package to perform GSEA and the msigdbr package which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler (Yu et al. 2012; Dolgalev 2020; Subramanian et al. 2005).

+

In this analysis, we will be using clusterProfiler package to perform GSEA and the msigdbr package which contains gene sets from the Molecular Signatures Database (MSigDB) already in the tidy format required by clusterProfiler (Yu et al. 2012; Dolgalev 2020; Subramanian et al. 2005; Liberzon et al. 2011).

We will also need the org.Mm.eg.db package to perform gene identifier conversion (Carlson 2019).

if (!("clusterProfiler" %in% installed.packages())) {
   # Install this package if it isn't installed yet
@@ -3934,39 +3934,36 @@ 

4.1 Install libraries

# We will need this so we can use the pipe: %>% library(magrittr)
-

The GSEA algorithm utilizes random sampling so we are going to set the seed to make our results reproducible.

-
# Set the seed so our results are reproducible:
-set.seed(2020)

4.2 Download data file

We will read in the differential expression results we will download from online. These results are from an acute lymphoblastic leukemia (ALL) mouse lymphoid cell model we used for differential expression analysis using DESeq2 (Love et al. 2014). The table contains summary statistics including Ensembl gene IDs, log2 fold change values, and adjusted p-values (FDR in this case). We can identify differentially regulated genes by filtering these results and use this list as input to GSEA.

Instead of using the URL below, you can use a file path to a TSV file with your desired gene list results. First we will assign the URL to its own variable called, dge_url.

-
# Define the url to your differential expression results file
-dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/03-rnaseq/results/SRP123625/SRP123625_differential_expression_results.tsv"
+
# Define the url to your differential expression results file
+dge_url <- "https://refinebio-examples.s3.us-east-2.amazonaws.com/03-rnaseq/results/SRP123625/SRP123625_differential_expression_results.tsv"

We will also declare a file path to where we want this file to be downloaded to and we can use the same file path later for reading the file into R.

-
dge_results_file <- file.path(
-  results_dir,
-  "SRP123625_differential_expression_results.tsv"
-)
+
dge_results_file <- file.path(
+  results_dir,
+  "SRP123625_differential_expression_results.tsv"
+)

Using the URL (dge_url) and file path (dge_results_file) we can download the file and use the destfile argument to specify where it should be saved.

-
download.file(
-  dge_url,
-  # The file will be saved to this location and with this name
-  destfile = dge_results_file
-)
+
download.file(
+  dge_url,
+  # The file will be saved to this location and with this name
+  destfile = dge_results_file
+)

Now let’s double check that the results file is in the right place.

-
# Check if the results file exists
-file.exists(dge_results_file)
+
# Check if the results file exists
+file.exists(dge_results_file)
## [1] TRUE

4.3 Import data

Read in the file that has differential expression results.

-
# Read in the contents of the differential expression results file
-dge_df <- readr::read_tsv(dge_results_file)
+
# Read in the contents of the differential expression results file
+dge_df <- readr::read_tsv(dge_results_file)
## 
-## ── Column specification ──────────────────────────────────────────────
+## ── Column specification ────────────────────────────────────────────────────────
 ## cols(
 ##   Gene = col_character(),
 ##   baseMean = col_double(),
@@ -3978,41 +3975,38 @@ 

4.3 Import data

## )

Note that read_tsv() can also read TSV files directly from a URL and doesn’t necessarily require you download the file first. If you wanted to use that feature, you could replace the call above with readr::read_tsv(dge_url) and skip the download steps.

Let’s take a look at what the contrast results from the differential expression analysis looks like.

-
dge_df
+
dge_df
-
-

4.4 Getting familiar with clusterProfiler’s options

-

Let’s take a look at what organisms the package supports.

-
msigdbr_species()
+
+

4.4 Getting familiar with MSigDB gene sets available via msigdbr

+

The Molecular Signatures Database (MSigDB) is a resource that contains annotated gene sets that can be used for pathway or gene set analyses (Subramanian et al. 2005; Liberzon et al. 2011). We can use the msigdbr package to access these gene sets in a format compatible with the package we’ll use for analysis, clusterProfiler (Yu et al. 2012; Dolgalev 2020).

+

The gene sets available directly from MSigDB are applicable to human studies. msigdbr also supports commonly studied model organisms.

+

Let’s take a look at what organisms the package supports with msigdbr_species().

+
msigdbr_species()
-

MSigDB contains 8 different gene set collections (Subramanian et al. 2005).

-
H: hallmark gene sets
-C1: positional gene sets
-C2: curated gene sets
-C3: motif gene sets
-C4: computational gene sets
-C5: GO gene sets
-C6: oncogenic signatures
-C7: immunologic signatures
-

MSigDB includes a collection called Hallmark gene sets. Here’s an excerpt of the collection description (Liberzon et al. 2015):

+

MSigDB contains 8 different gene set collections (Subramanian et al. 2005; Liberzon et al. 2011) that are distinguished by how they are derived (e.g., computationally mined, curated).

+

In this example, we will use a collection called Hallmark gene sets for GSEA (Liberzon et al. 2015). Here’s an excerpt of the collection description from MSigDB:

+

Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression. These gene sets were generated by a computational methodology based on identifying gene set overlaps and retaining genes that display coordinate expression. The hallmarks reduce noise and redundancy and provide a better delineated biological space for GSEA.

-

The data we’re interested in here comes from mouse samples, so we can obtain just the gene sets relevant to M. musculus by specifying species = "Mus musculus" and only the Hallmark gene sets by specifying category = "H" to the msigdbr() function.

-
mm_hallmark_sets <- msigdbr(
-  species = "Mus musculus", # Replace with species name relevant to your data
-  category = "H"
-)
+
+

Notably, there are only 50 gene sets included in this collection. The fewer gene sets we test, the lower our multiple hypothesis testing burden.

+

The data we’re interested in here comes from mouse samples, so we can obtain only the Hallmarks gene sets relevant to M. musculus by specifying category = "H" and species = "Mus musculus", respectively, to the msigdbr() function.

+
mm_hallmark_sets <- msigdbr(
+  species = "Mus musculus", # Replace with species name relevant to your data
+  category = "H"
+)

If you run the chunk above without specifying a category to the msigdbr() function, it will return all of the MSigDB gene sets for mouse. See ?msigdbr for more options.

Let’s preview what’s in mm_hallmark_sets.

-
head(mm_hallmark_sets)
+
head(mm_hallmark_sets)
-

We will want to keep in mind that GSEA requires that genes are ranked from most highly positive to most highly negative and weighted according to their gene-level statistic, which is essential to the calculation of the enrichment score (ES), a pathway-level statistic. GSEA also requires unique gene identifiers to produce the most accurate results. This means if any duplicate gene identifiers are found in our dataset, we will want to retain the instance with the higher absolute value as this will be the instance most likely to be ranked most highly negative or most highly positive. Otherwise the enrichment score results may be skewed and the GSEA() function will throw a warning.

+
+
+

4.6 Perform gene set enrichment analysis (GSEA)

+

The goal of GSEA is to detect situations where many genes in a gene set change in a coordinated way, even when individual changes are small in magnitude (Subramanian et al. 2005).

+

GSEA calculates a pathway-level metric, called an enrichment score (sometimes abbreviated as ES), by ranking genes by a gene-level statistic. This score reflects whether or not a gene set or pathway is overrepresented at the top or bottom of the gene rankings (Yu 2020; Subramanian et al. 2005). Specifically, genes are ranked from most positive to most negative based on their statistic and a running sum is calculated by starting with the most highly ranked genes and increasing the score when a gene is in the pathway and decreasing the score when a gene is not. In this example, the enrichment score for a pathway is the running sum’s maximum deviation from zero. GSEA also assesses statistical significance of the scores for each pathway through permutation testing. As a result, each input pathway will have a p-value associated with it that is then corrected for multiple hypothesis testing (Yu 2020; Subramanian et al. 2005).

+

The implementation of GSEA we use in this examples requires a gene list ordered by some statistic (here we’ll use log2 fold changes calculated as part of differential gene expression analysis) and input gene sets (Hallmark collection). When you use previously computed gene-level statistics with GSEA, it is called GSEA pre-ranked.

+
+

4.6.1 Determine our pre-ranked genes list

+

The GSEA() function takes a pre-ranked and sorted named vector of statistics, where the names in the vector are gene identifiers. It requires unique gene identifiers to produce the most accurate results, so we will need to resolve any duplicates found in our dataset. (The GSEA() function will throw a warning if we do not do this ahead of time.)

Let’s check to see if we have any gene symbols that mapped to multiple Ensembl IDs.

-
any(duplicated(dge_mapped_df$gene_symbol))
+
any(duplicated(dge_mapped_df$gene_symbol))
## [1] TRUE
-

Looks like we do have duplicated gene symbols. Let’s find out which gene symbols have been duplicated.

-
dup_gene_symbols <- dge_mapped_df %>%
-  dplyr::filter(duplicated(gene_symbol)) %>%
-  dplyr::pull(gene_symbol)
+

Looks like we do have duplicated gene symbols. Let’s find out which ones.

+
dup_gene_symbols <- dge_mapped_df %>%
+  dplyr::filter(duplicated(gene_symbol)) %>%
+  dplyr::pull(gene_symbol)

Now let’s take a look at the rows associated with the duplicated gene symbols.

-
dge_mapped_df %>%
-  dplyr::filter(gene_symbol %in% dup_gene_symbols) %>%
-  dplyr::arrange(gene_symbol)
+
dge_mapped_df %>%
+  dplyr::filter(gene_symbol %in% dup_gene_symbols) %>%
+  dplyr::arrange(gene_symbol)

We can see that the associated values vary for each row.

-

As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the GSEA steps later, so let’s keep the gene symbols associated with the higher absolute value of the log2 fold change. GSEA relies on genes’ rankings on the basis of this statistic. Retaining the instance of the gene symbol with the higher absolute value means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list. We should keep this decision in mind when interpreting our results. For example, if the duplicate identifiers tended to be enriched in a particular gene set, we may get an optimistic view of how perturbed that gene set is by preferentially selecting values that have a higher absolute value.

+

As we mentioned earlier, we will want to remove duplicated gene identifiers in preparation for the GSEA() step. Let’s keep the Entrez IDs associated with the higher absolute value of the log2 fold change. GSEA relies on genes’ rankings on the basis of a gene-level statistic and the enrichment score that is calculated reflects the degree to which genes in a gene set are overrepresented in the top or bottom of the rankings (Yu 2020; Subramanian et al. 2005).

+

Retaining the instance of the Entrez ID with the higher absolute value of a gene-level statistic means that we will retain the value that is likely to be more highly- or lowly-ranked or, put another way, the values less likely to be towards the middle of the ranked gene list. We should keep this decision in mind when interpreting our results. For example, if all the duplicate identifiers happened to be in a particular gene set, we may get an overly optimistic view of how perturbed that gene set is because we preferentially selected instances of the identifier that have a higher absolute value of the statistic used for ranking.

+

We are removing values for 33 out of thousands of genes here, so it is unlikely to have a considerable impact on our results.

In the next chunk, we are going to filter out the duplicated row using the dplyr::distinct() function This will keep the first row with the duplicated value thus keeping the row with the highest absolute value of the log2 fold change.

-
filtered_dge_mapped_df <- dge_mapped_df %>%
-  # Sort so that the highest absolute values of the log2 fold change are at the top
-  dplyr::arrange(dplyr::desc(abs(log2FoldChange))) %>%
-  # Filter out the duplicated rows using `dplyr::distinct()`
-  dplyr::distinct(gene_symbol, .keep_all = TRUE)
-

Note that the log2 fold change values we use here have been transformed by DESeq2 to account for genes with low counts or highly variable counts. See the DESeq2 package vignette for more information on how DESeq2 handles the log2 fold change values with the lfcShrink() function.

+
filtered_dge_mapped_df <- dge_mapped_df %>%
+  # Sort so that the highest absolute values of the log2 fold change are at the
+  # top
+  dplyr::arrange(dplyr::desc(abs(log2FoldChange))) %>%
+  # Filter out the duplicated rows using `dplyr::distinct()`
+  dplyr::distinct(gene_symbol, .keep_all = TRUE)
+

Note that the log2 fold change estimates we use here have been subject to shrinkage to account for genes with low counts or highly variable counts. See the DESeq2 package vignette for more information on how DESeq2 handles the log2 fold change values with the lfcShrink() function.

Let’s check to see that we removed the duplicate gene symbols and kept the rows with the higher absolute value of the log2 fold change.

-
any(duplicated(filtered_dge_mapped_df$gene_symbol))
+
any(duplicated(filtered_dge_mapped_df$gene_symbol))
## [1] FALSE

Looks like we were able to successfully get rid of the duplicate gene identifiers and keep the observations with the higher absolute value of the log2 fold change!

-
-
-

4.6 Perform gene set enrichment analysis (GSEA)

-

The goal of GSEA is to detect situations where all genes in a gene set change in a coordinated way, detecting even small statistical but coordinated changes. This is done by ranking genes within a gene set from most highly positive to most highly negative, weighting them according to their gene-level statistic, to calculate the enrichment score (ES), which is a pathway-level statistic (Yu).

-
-

4.6.1 Determine our pre-ranked genes list

-

The GSEA() function takes a pre-ranked and sorted named vector of statistics, where the names in the vector are gene identifiers. We will therefore need to create a gene list with statistics that GSEA will rank by.

In the next chunk, we will create a named vector ranked based on the gene-level log2 fold change values.

-
# Let's create a named vector ranked based on the log2 fold change values
-lfc_vector <- filtered_dge_mapped_df$log2FoldChange
-names(lfc_vector) <- filtered_dge_mapped_df$gene_symbol
-
-# We need to sort the log2 fold change values in descending order here
-lfc_vector <- sort(lfc_vector, decreasing = TRUE)
+
# Let's create a named vector ranked based on the log2 fold change values
+lfc_vector <- filtered_dge_mapped_df$log2FoldChange
+names(lfc_vector) <- filtered_dge_mapped_df$gene_symbol
+
+# We need to sort the log2 fold change values in descending order here
+lfc_vector <- sort(lfc_vector, decreasing = TRUE)

Let’s preview our pre-ranked named vector.

-
# Look at first entries of the ranked log2 fold change vector
-head(lfc_vector)
+
# Look at first entries of the ranked log2 fold change vector
+head(lfc_vector)
##   Lpgat1   Lgals7    Gm973     Bbs7     Clnk   Zfp575 
 ## 13.34941 12.64196 12.51824 12.19278 11.52481 10.20900

4.6.2 Run GSEA using the GSEA() function

-

Genes were ranked from most highly positive to most highly negative, weighted according to their gene-level statistic, in the previous section. In this section, we will implement GSEA to calculate the enrichment score (ES) for each gene set using our pre-ranked gene list.

+

Genes were ranked from most positive to most negative, weighted according to their gene-level statistic, in the previous section. In this section, we will implement GSEA to calculate the enrichment score for each gene set using our pre-ranked gene list.

+

The GSEA algorithm utilizes random sampling so we are going to set the seed to make our results reproducible.

+
# Set the seed so our results are reproducible:
+set.seed(2020)

We can use the GSEA() function to perform GSEA with any generic set of gene sets, but there are several functions for using specific, commonly used gene sets (e.g., gseKEGG()).

-
gsea_results <- GSEA(
-  geneList = lfc_vector, # ordered ranked gene list
-  minGSSize = 25, # minimum gene set size
-  maxGSSize = 500, # maximum gene set set
-  pvalueCutoff = 0.05, # p value cutoff
-  eps = 0, # boundary for calculating the p value
-  seed = TRUE, # set seed to make results reproducible
-  pAdjustMethod = "BH", # Benjamini-Hochberg correction
-  TERM2GENE = dplyr::select(
-    mm_hallmark_sets,
-    gs_name,
-    gene_symbol
-  )
-)
+

Significance is assessed by permuting the gene labels of the pre-ranked gene list and recomputing the enrichment scores of the gene set for the permuted data, which generates a null distribution for the enrichment score. The pAdjustMethod argument to GSEA() above specifies what method to use for adjusting the p-values to account for multiple hypothesis testing; the pvalueCutoff argument tells the function to only return pathways with adjusted p-values less than that threshold in the results slot.

+
gsea_results <- GSEA(
+  geneList = lfc_vector, # Ordered ranked gene list
+  minGSSize = 25, # Minimum gene set size
+  maxGSSize = 500, # Maximum gene set set
+  pvalueCutoff = 0.05, # p-value cutoff
+  eps = 0, # Boundary for calculating the p value
+  seed = TRUE, # Set seed to make results reproducible
+  pAdjustMethod = "BH", # Benjamini-Hochberg correction
+  TERM2GENE = dplyr::select(
+    mm_hallmark_sets,
+    gs_name,
+    gene_symbol
+  )
+)
## preparing geneSet collections...
## GSEA analysis...
## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.19% of the list).
 ## The order of those tied genes will be arbitrary, which may produce unexpected results.
## leading edge analysis...
## done...
-

The warning message above tells us that there are few genes that have the same log2 fold change value and they are therefore ranked equally. This means that GSEA() will arbitrarily choose which comes first in the ranked list (Broad Institute Team 2019). This percentage of 0.19 is small, so we are not concerned that it will significantly effect our results. If the percentage was large on the other hand, we would be concerned about the log2 fold change results.

-

Let’s take a look at the results.

-
# We can access the results from our `gsea_results` object using `@result`
-head(gsea_results@result)
+

The warning message above tells us that there are few genes that have the same log2 fold change value and they are therefore ranked equally. fgsea, the method that underlies GSEA(), will arbitrarily choose which comes first in the ranked list (Ballereau et al. 2018). This percentage of 0.19 is small, so we are not concerned that it will significantly effect our results. If the percentage much larger on the other hand, we would be concerned about the log2 fold change results.

+

Let’s take a look at the table in the result slot of gsea_results.

+
# We can access the results from our `gsea_results` object using `@result`
+head(gsea_results@result)
-

Significance is assessed by permuting the gene labels of the pre-ranked gene list and recomputing the ES of the gene set for the permuted data, which generates a null distribution for the ES. The ES for each gene set is then normalized to account for the size of the set, resulting in a normalized enrichment score (NES), and an FDR (false discovery rate) value is calculated to account for multiple hypothesis testing.

-

Looks like we have gene sets returned as significant at FDR of 0.05, according to the p.adjust column in the table above.

-

The information we’re most likely interested in is in the results slot. Note that if we didn’t have any significant results, our results slot would be empty as nothing would have met our pvalueCutoff above. If we wanted all results returned we would need to set the pvalueCutoff = 1.

-

Let’s convert this into a data frame that we can use for further analysis and write to file.

-
gsea_result_df <- data.frame(gsea_results@result)
+

Looks like we have gene sets returned as significant at FDR (false discovery rate) of 0.05. If we did not have results that met the pvalueCutoff condition, this table would be empty. If we wanted all results returned we would need to set the pvalueCutoff = 1.

+

The NES column contains the normalized enrichment score, which normalizes for the gene set size, for that pathway.

+

Let’s convert the contents of result into a data frame that we can use for further analysis and write to a file later.

+
gsea_result_df <- data.frame(gsea_results@result)
@@ -4157,86 +4163,85 @@

4.7 Visualizing results

We can visualize GSEA results for individual pathways or gene sets using enrichplot::gseaplot(). Let’s take a look at 2 different pathways – one with a highly positive NES and one with a highly negative NES – to get more insight into how ES are calculated.

4.7.1 Most Positive NES

-

Let’s look for the gene set with the most positive NES.

-
gsea_result_df %>%
-  # Use `slice_max()` to return the top n observation using `NES` as the ordering variable
-  dplyr::slice_max(NES, n = 3)
+

Let’s look at the 3 gene sets with the most positive NES.

+
gsea_result_df %>%
+  # This returns the 3 rows with the largest NES values
+  dplyr::slice_max(NES, n = 3)

The gene set HALLMARK_MYC_TARGETS_V2 has the most positive NES score.

-
most_positive_nes_plot <- enrichplot::gseaplot(
-  gsea_results,
-  geneSetID = "HALLMARK_MYC_TARGETS_V2",
-  title = "HALLMARK_MYC_TARGETS_V2",
-  color.line = "#0d76ff"
-)
-most_positive_nes_plot
+
most_positive_nes_plot <- enrichplot::gseaplot(
+  gsea_results,
+  geneSetID = "HALLMARK_MYC_TARGETS_V2",
+  title = "HALLMARK_MYC_TARGETS_V2",
+  color.line = "#0d76ff"
+)
+most_positive_nes_plot

-

The red dashed line indicates the peak ES score (the ES is the maximum deviation from zero). As mentioned earlier, an ES is calculated by starting with the most highly ranked genes (according to the gene-level log2 fold change values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. In this case, the most highly positive enrichment score’s data are being displayed.

+

Notice how the genes that are in the gene set, indicated by the black bars, tend to be on the left side of the graph indicating that they have positive gene-level scores. The red dashed line indicates the enrichment score, which is the maximum deviation from zero. As mentioned earlier, an enrichment is calculated by starting with the most highly ranked genes (according to the gene-level log2 fold changes values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway.

The plots returned by enrichplot::gseaplot are ggplots, so we can use ggplot2::ggsave() to save them to file.

Let’s save to PNG.

-
ggplot2::ggsave(file.path(plots_dir, "SRP123625_gsea_enrich_positive_plot.png"),
-  plot = most_positive_nes_plot
-)
+
ggplot2::ggsave(file.path(plots_dir, "SRP123625_gsea_enrich_positive_plot.png"),
+  plot = most_positive_nes_plot
+)
## Saving 7 x 5 in image

4.7.2 Most Negative NES

-

Let’s look for the gene set with the most negative NES.

-
gsea_result_df %>%
-  # Use `slice_min()` to return the bottom n observation using `NES` as the ordering
-  # variable
-  dplyr::slice_min(NES, n = 3)
+

Let’s look for the 3 gene sets with the most negative NES.

+
gsea_result_df %>%
+  # Return the 3 rows with the smallest (most negative) NES values
+  dplyr::slice_min(NES, n = 3)

The gene set HALLMARK_HYPOXIA has the most negative NES.

-
most_negative_nes_plot <- enrichplot::gseaplot(
-  gsea_results,
-  geneSetID = "HALLMARK_HYPOXIA",
-  title = "HALLMARK_HYPOXIA",
-  color.line = "#0d76ff"
-)
-most_negative_nes_plot
+
most_negative_nes_plot <- enrichplot::gseaplot(
+  gsea_results,
+  geneSetID = "HALLMARK_HYPOXIA",
+  title = "HALLMARK_HYPOXIA",
+  color.line = "#0d76ff"
+)
+most_negative_nes_plot

-

Again, the red dashed line here indicates the maximum deviation from zero, in other words, the most negative ES score. As we know, the ES is calculated by starting with the most highly ranked genes (according to the gene-level log2 fold change values) and increasing the score when a gene is in the pathway and decreasing the score when a gene is not in the pathway. A negative enrichment score will be returned when not many of the genes are found at the top of the list, as this would mean decreasing the score a great deal thus producing a negative value. In this case, the most negative enrichment score’s data are being displayed.

-

Let’s save this plot to PNG.

-
ggplot2::ggsave(file.path(plots_dir, "SRP123625_gsea_enrich_negative_plot.png"),
-  plot = most_negative_nes_plot
-)
+

This gene set shows the opposite pattern – genes in the pathway tend to be on the right side of the graph. Again, the red dashed line here indicates the maximum deviation from zero, in other words, the enrichment score. A negative enrichment score will be returned when many genes are near the bottom of the ranked list.

+

Let’s save this plot to PNG as well.

+
ggplot2::ggsave(file.path(plots_dir, "SRP123625_gsea_enrich_negative_plot.png"),
+  plot = most_negative_nes_plot
+)
## Saving 7 x 5 in image

4.8 Write results to file

-
readr::write_tsv(
-  gsea_result_df,
-  file.path(
-    results_dir,
-    "SRP123625_gsea_results.tsv"
-  )
-)
+
readr::write_tsv(
+  gsea_result_df,
+  file.path(
+    results_dir,
+    "SRP123625_gsea_results.tsv"
+  )
+)

5 Resources for further learning

6 Session info

At the end of every analysis, before saving your notebook, we recommend printing out your session info. This helps make your code more reproducible by recording what versions of software and packages you used to run this.

-
# Print session info
-sessioninfo::session_info()
+
# Print session info
+sessioninfo::session_info()
## ─ Session info ─────────────────────────────────────────────────────
 ##  setting  value                       
 ##  version  R version 4.0.2 (2020-06-22)
@@ -4247,7 +4252,7 @@ 

6 Session info

## collate en_US.UTF-8 ## ctype en_US.UTF-8 ## tz Etc/UTC -## date 2020-12-15 +## date 2020-12-18 ## ## ─ Packages ───────────────────────────────────────────────────────── ## package * version date lib source @@ -4269,7 +4274,7 @@

6 Session info

## data.table 1.13.0 2020-07-24 [1] RSPM (R 4.0.2) ## DBI 1.1.0 2019-12-15 [1] RSPM (R 4.0.0) ## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0) -## DO.db 2.9 2020-12-11 [1] Bioconductor +## DO.db 2.9 2020-12-01 [1] Bioconductor ## DOSE 3.16.0 2020-10-27 [1] Bioconductor ## downloader 0.4 2015-07-09 [1] RSPM (R 4.0.0) ## dplyr 1.0.2 2020-08-18 [1] RSPM (R 4.0.2) @@ -4287,7 +4292,7 @@

6 Session info

## ggraph 2.0.3 2020-05-20 [1] RSPM (R 4.0.2) ## ggrepel 0.8.2 2020-03-08 [1] RSPM (R 4.0.2) ## glue 1.4.2 2020-08-27 [1] RSPM (R 4.0.2) -## GO.db 3.12.1 2020-12-11 [1] Bioconductor +## GO.db 3.12.1 2020-12-01 [1] Bioconductor ## GOSemSim 2.16.1 2020-10-29 [1] Bioconductor ## graphlayouts 0.7.0 2020-04-25 [1] RSPM (R 4.0.2) ## gridExtra 2.3 2017-09-09 [1] RSPM (R 4.0.0) @@ -4308,7 +4313,7 @@

6 Session info

## msigdbr * 7.2.1 2020-10-02 [1] RSPM (R 4.0.2) ## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.0) ## optparse * 1.6.6 2020-04-16 [1] RSPM (R 4.0.0) -## org.Mm.eg.db * 3.12.0 2020-12-11 [1] Bioconductor +## org.Mm.eg.db * 3.12.0 2020-12-01 [1] Bioconductor ## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2) ## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0) ## plyr 1.8.6 2020-03-03 [1] RSPM (R 4.0.2) @@ -4331,7 +4336,7 @@

6 Session info

## RSQLite 2.2.1 2020-09-30 [1] RSPM (R 4.0.2) ## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0) ## rvcheck 0.1.8 2020-03-01 [1] RSPM (R 4.0.0) -## S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor +## S4Vectors * 0.28.0 2020-10-27 [1] Bioconductor ## scales 1.1.1 2020-05-11 [1] RSPM (R 4.0.0) ## scatterpie 0.1.5 2020-09-09 [1] RSPM (R 4.0.2) ## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0) @@ -4357,15 +4362,12 @@

6 Session info

References

-
-

Broad Institute Team, 2019 Gene set enrichment analysis (gsea) user guide. https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideFrame.html

+
+

Ballereau S., M. Dunning, A. Edwards, O. Rueda, and A. Sawle, 2018 RNA-seq analysis in R: Gene set testing for RNA-seq. https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2018/RNASeq2018/html/06_Gene_set_testing.nb.html

-
-

Diego U. S., and B. I. Team, GSEA: Gene set enrichment analysis. https://www.gsea-msigdb.org/gsea/index.jsp

-

Dolgalev I., 2020 Msigdbr: MSigDB gene sets for multiple organisms in a tidy data format. https://cran.r-project.org/web/packages/msigdbr/index.html

@@ -4375,23 +4377,26 @@

References

Khatri P., M. Sirota, and A. J. Butte, 2012 Ten years of pathway analysis: Current approaches and outstanding challenges. PLOS Computational Biology 8: e1002375. https://doi.org/10.1371/journal.pcbi.1002375

-
-

Korotkevich G., V. Sukhov, and A. Sergushichev, 2019 Fast gene set enrichment analysis. bioRxiv. https://doi.org/10.1101/060012

-

Liberzon A., C. Birger, H. Thorvaldsdóttir, M. Ghandi, and J. P. Mesirov et al., 2015 The molecular signatures database hallmark gene set collection. Cell Systems 1. https://doi.org/10.1016/j.cels.2015.12.004

+
+

Liberzon A., A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, and P. Tamayo et al., 2011 Molecular signatures database (MSigDB) 3.0. Bioinformatics 27: 1739–1740. https://doi.org/10.1093/bioinformatics/btr260

+

Love M. I., W. Huber, and S. Anders, 2014 Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology 15. https://doi.org/10.1186/s13059-014-0550-8

Subramanian A., P. Tamayo, V. K. Mootha, S. Mukherjee, and B. L. Ebert et al., 2005 Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102: 15545–15550. https://doi.org/10.1073/pnas.0506580102

+
+

UC San Diego and Broad Institute Team, GSEA: Gene set enrichment analysis. https://www.gsea-msigdb.org/gsea/index.jsp

+

Yu G., L.-G. Wang, Y. Han, and Q.-Y. He, 2012 clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology 16: 284–287. https://doi.org/10.1089/omi.2011.0118

-

Yu G., clusterProfiler: Universal enrichment tool for functional and comparative study. http://yulab-smu.top/clusterProfiler-book/index.html

+

Yu G., 2020 clusterProfiler: Universal enrichment tool for functional and comparative study. http://yulab-smu.top/clusterProfiler-book/index.html

diff --git a/components/references.bib b/components/references.bib index b86ea70f..2ba441f8 100644 --- a/components/references.bib +++ b/components/references.bib @@ -180,6 +180,13 @@ @manual{gene-id-annotation-rna-seq url = {https://github.com/AlexsLemonade/refinebio-examples/blob/master/03-rnaseq/gene-id-convert_rnaseq_01_ensembl.html} } +@website{gene-set-testing-rnaseq, + author = {Stephane Ballereau and Mark Dunning and Abbi Edwards and Oscar Rueda and Ashley Sawle}, + title = {{RNA-seq} analysis in {R}: Gene Set Testing for {RNA-seq}}, + year = {2018}, + url = {https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2018/RNASeq2018/html/06_Gene_set_testing.nb.html} +} + @manual{Gonzalez2014, title = {Statistical analysis of {RNA-Seq} data}, author = {Ignacio Gonzalez},