NOTE: This section is based on the code provided in the DESeq2 vignette, which can be checked for extra information.
A basic task in the analysis of expression data is the detection of differentially expressed genes. The DESeq2 package provides a method to test for differential expression by using a generalised linear model in which counts are modeled with a negative binomial distribution. It expects a matrix of count values where each column corresponds to a sample and each line to a feature (e.g. a gene). Typically, a DESeq2 analysis is performed in three steps: count normalisation, dispersion estimation and differential expression test, although the authors also provide a wrapper function for those steps.
Since we have already generated a matrix with the normalised counts in the previous section (see Normalising counts with DESeq2), we will use it directly as input for the next step.
An important step in differential expression analysis is to figure out how much variability we can expect in the expression measurements within the same condition. Unless this is known, we cannot make inferences about whether the change in expression observed for a given gene is big enough to be considered significant, or whether it corresponds to the variability that we would expect by chance. This is why it is so important to have replicates: they show how much variation occurs without a difference in the condition.
In DESeq2, in order to estimate the dispersion for each gene, we can use the function estimateDispersions:
dds=estimateDispersions(dds)
The result of the estimation can be visualised with the plotDispEsts function:
pdf(file="./de_dispersion.pdf")
plotDispEsts(dds)
dev.off()
Finally, we will use the function nbinomWaldTest to contrast the two studied conditions:
dds=nbinomWaldTest(dds)
results=results(dds)
The padj column in the table dds
contains the p-values adjusted for multiple testing with the Benjamini-Hochberg procedure (i.e. FDR). This is the information that we will use to decide whether the expression of a given gene differs significantly across conditions (e.g. we can arbitrarily decide that genes with an FDR<0.10 are differentially expressed).
Exercise: How would you select those genes that pass a given FDR threshold (e.g. FDR<0.10)? Which are the most significant? Solution
Let us generate an MA plot to evaluate the results of the differential expression analysis:
pdf(file="./de_ma.pdf")
plotMA(dds,ylim=c(-2,2))
dev.off()
The three steps detailed above can be performed with just one single function, which takes as input a DESeqDataSet object like the one we have generated in the previous section (see Normalising counts with DESeq2).
# first create dds object
dds=DESeq(dds)
results=results(dds)