Skip to content

Gene Set Enrichment Analysis

kevincjnixon edited this page Jan 12, 2021 · 4 revisions

#Gene Set Enrichment Analysis

One useful tool with transcriptome data is a Gene Set Enrichment Analysis (GSEA). In this analysis, all genes from the differential expression analysis are ranked and input into a Pre-ranked GSEA. There are some R tools that will run a GSEA, but I prefer to use the GSEA software from https://www.gsea-msigdb.org. With this software, there is a nice, friendly GUI that can be run on all operating systems, or you can run it from the command-line to help automate some of the processes. In brief, GSEA uses the gene rankings to apply an 'enrichment score' for a given gene set (a group of genes annotated to a specific pathway, biological function, disease phenotype, etc) to determine if genes within the gene set are commonly (dys)regulated in certain conditions. Read more about GSEA here.

GenerateGSEA()

BinfTools' GenerateGSEA() is a function to rank the genes after a differential expression analysis for input into a GSEA analysis. The output of this function is a .rnk file with two columns:

  • NAME The gene name - to simplify the GSEA analysis, this should be the gene symbol, whenever possible
  • Rank The assigned gene ranking

The input for GenerateGSEA() is the results object. The function can rank genes using two different metrics:

  • Statistics (bystat=T)
  • Fold-Change (byFC=T)

If using statistics, the function will look for the column named stat. In the DESeq2 results object, this represents the Wald statistic and will be used for ranking genes using the absolute value of the Wald statistic * log2FoldChange sign. Essentially, the more significant the change in gene expression is, the higher, the ranking of the gene, however, if the gene shows a negative fold-change in expression, the gene receives a stronger negative ranking. If there is no column named stat in the results object, GenerateGSEA() with use -log10(p-value) in place of the Wald Statistic.

If using Fold-Change, the function will use the log2FoldChange value from the results object. The more change in expression a gene exhibits, the stronger the ranking. Since the log2FoldChange already includes the positive and negative rankings, there are no calculations performed when generating the ranked list. Keep in mind that if you use the raw log2FoldChange values, you will likely have some genes that show a large change in expression, but are not significant due to high variation in gene expression within the conditions. It is recommended that if you are using fold-change to rank the genes, that you use a results object with shrunken log-fold change results (see DESeq2's lfcshrink()).

Both statistics and fold-change can be used if you would like to incoporate both into the ranking system. Just set bystat=T and byFC=T in the arguments.

#Rank genes by significance:
GenerateGSEA(res, filename="GSEA_stat.rnk", bystat=T, byFC=F) #Default settings
#Rank genes by fold-change:
GenerateGSEA(res, filename="GSEA_FC.rnk", bystat=F, byFC=T)
#Rank genes using both:
GenerateGSEA(res, filename="GSEA_both.rnk", bystat=T, byFC=T)

Running GSEA

Once you have your .rnk file, you can run the GSEA software. It will generate useful reports indicating which gene sets are enriched positively and negatively. You can also use the Cytoscape App EnrichmentMap and AutoAnnotate to visualize the output from GSEA. A simple step-by-step description on all of this can be found here. Useful, manually-curated gene sets can also be downloaded here.

If you follow the protocol outlined in the above link, you may find that there is a group of similar pathways clustered in the enrichment map that are related to the phenotype that you see in your organism. You can subset the gene sets that define those pathways out of the original .gmt files and make your own smaller gmt file containing only the pathways in the cluster and use this to perform a more focused analysis.

gsva_plot()

Once you have your focused .gmt gene set file, we can run gsva_plot(). This function runs a single sample gene set enrichment analysis (ssGSEA) using the normalized count data and the custom .gmt file through the gsva() function from the GSVA package. This function uses the normalized count data for each sample to determine an enrichment score for each gene set from the .gmt file for each individual sample. The normalized enrichment scores are then grouped based on each condition and plotted in a violin plot (from ggplot2) where pairwise t-tests (corrected for multiple comparisons) are performed to compare the mean normalized enrichment scores between samples. gsva_plot() has the following arguments:

  • counts The normalized gene counts (rows=genes, columns=samples)
  • geneset A list object where each entry is a gene set. Gene nomenclature should match that of the rownames in counts
  • method The method for gsva to run. Default is 'ssgsea'. see ?gsva_plot() for more options.
  • condition The character vector describing the biological condition for each column in the normalized gene counts
  • title The title of your plot
  • compare A list of character vectors describing specific pairwise comparisons to make. These should match the strings defining condition. If left NULL, all possible comparisons will be made.
#You can read in a .gmt file into the proper format for this function using:
library(qusage)
geneSet<-read.gmt("geneSet.gmt")
#Run the gsva_plot()
gsva_plot(counts=norm_counts, geneset=geneSet, method="ssgsea", condition=cond, title="ssGSEA Results", compare=list(c("WT","KO")))

gsva plot