This repository has been archived by the owner on Apr 8, 2021. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
/
quality_control.Rmd
215 lines (129 loc) · 7.19 KB
/
quality_control.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
---
title: "Quality Control"
author: "`r getOption('author')`"
date: "`r Sys.Date()`"
bibliography: bibliography.bib
params:
bcb_file: "data/2018-04-25/bcb.rda"
data_dir: !r file.path("data", Sys.Date())
results_dir: "results/counts"
---
```{r setup, message=FALSE}
# Last modified 2018-04-25
bcbioRNASeq::prepareRNASeqTemplate()
source("_setup.R")
# Directory paths ==============================================================
invisible(mapply(
FUN = dir.create,
path = c(params$data_dir, params$results_dir),
MoreArgs = list(showWarnings = FALSE, recursive = TRUE)
))
# Load object ==================================================================
bcb_name <- load(params$bcb_file)
bcb <- get(bcb_name, inherits = FALSE)
stopifnot(is(bcb, "bcbioRNASeq"))
invisible(validObject(bcb))
interestingGroups(bcb) <- "genotype"
```
```{r header, child="_header.Rmd"}
```
```{r sample_data}
sampleData(bcb, return = "data.frame")
```
[bcbio][] run data was imported from **`r metadata(bcb)$uploadDir`**.
# Count matrices
```{r counts}
raw_counts <- counts(bcb, normalized = FALSE)
# DESeq2 normalized counts
normalized_counts <- counts(bcb, normalized = TRUE)
# Transcripts per million
tpm <- counts(bcb, normalized = "tpm")
saveData(raw_counts, normalized_counts, tpm, dir = params$data_dir)
writeCounts(raw_counts, normalized_counts, tpm, dir = params$results_dir)
```
The results are saved as gzip-compressed comma separated values (CSV). Gzip compression is natively supported on [macOS][] and Linux-based operating systems. If you're running Windows, we recommend installing [7-Zip][]. CSV files can be opened in [Excel][] or [RStudio][].
- [`normalized_counts.csv.gz`](`r file.path(params$results_dir, "normalized_counts.csv.gz")`): Use to evaluate individual genes and/or generate plots. These counts are normalized for the variation in sequencing depth across samples.
- [`tpm.csv.gz`](`r file.path(params$results_dir, "tpm.csv.gz")`): Transcripts per million, scaled by length and also suitable for plotting.
- [`raw_counts.csv.gz`](`r file.path(params$results_dir, "raw_counts.csv.gz")`): Only use to perform a new differential expression analysis. These counts will vary across samples due to differences in sequencing depth, and have not been normalized. Do not use this file for plotting genes.
# Read metrics {.tabset}
## Total reads
```{r plot_total_reads}
plotTotalReads(bcb)
```
## Mapped reads
The number of mapped reads should correspond to the number of total reads.
```{r plot_mapped_reads}
plotMappedReads(bcb)
```
## Mapping rate
The genomic mapping rate represents the percentage of reads mapping to the reference genome. Low mapping rates are indicative of sample contamination, poor sequencing quality or other artifacts.
```{r plot_mapping_rate}
plotMappingRate(bcb)
```
## Number of genes detected
```{r plot_genes_detected}
plotGenesDetected(bcb)
```
## Gene detection saturation
We should observe a linear trend in the number of genes detected with the number of mapped reads, which indicates that the sample input was not overloaded.
```{r plot_gene_saturation}
plotGeneSaturation(bcb)
```
## Exonic mapping rate
Ideally, at least 60% of total reads should map to exons.
```{r plot_exonic_mapping_rate}
plotExonicMappingRate(bcb)
```
## Intronic mapping rate
The majority of reads should map to exons and not introns.
```{r plot_intronic_mapping_rate}
plotIntronicMappingRate(bcb)
```
## rRNA mapping rate
Samples should have a ribosomal RNA (rRNA) contamination rate below 10%.
```{r plot_rrna_mapping_rate}
plotRRNAMappingRate(bcb)
```
## 5'->3' bias
```{r plot_5prime_3prime_bias}
plot5Prime3PrimeBias(bcb)
```
## Counts per gene
Generally, we expect similar count spreads for all genes between samples unless the library sizes or total RNA expression are different. The log10 TMM-normalized counts per gene normalization method [@Robinson:2010dd] equates the overall expression levels of genes between samples under the assumption that the majority of them are not differentially expressed. Therefore, by normalizing for total RNA expression by sample, we expect the spread of the log10 TMM-normalized counts per gene to be similar for every sample.
```{r plot_counts_per_gene}
plotCountsPerGene(bcb, normalized = "tmm")
```
## Count density
Generally, we expect similar count spreads for all genes between samples unless the total expressed RNA per sample is different.
```{r plot_count_density}
plotCountDensity(bcb, normalized = "tmm", style = "line")
```
# Fit modeling
Several quality metrics are first assessed to explore the fit of the model, before differential expression analysis is performed.
## Variance stabilization
The plots below show the standard deviation of normalized counts (`normalized_counts`) using `log2()`, `rlog()`, and variance stabilizing (`vst()`) transformations by `rank(mean)`. The transformations greatly reduce the standard deviation, with `rlog()` stabilizing the variance best across the mean.
```{r plot_mean_sd, fig.height=18, fig.width=6}
plotMeanSD(bcb, orientation = "vertical")
```
## Dispersion
The following plot shows the dispersion by mean of normalized counts. We expect the dispersion to decrease as the mean of normalized counts increases.
```{r plot_disp_ests}
plotDispEsts(bcb)
```
# Sample similarity analysis
Before performing similarity analysis, we transform counts to log2, which acts to minimize large differences in sequencing depth and helps normalize all samples to a similar dynamic range. For RNA-seq count data, variance increases with the mean. Logarithmic transformation of normalized count values with a small pseudocount will account for large variations seen between the highest expressing genes so that these genes won't dominate the PCA plots. However, due to the strong noise among low count values due to Poisson, the general log2 transformation will amplify this noise, and instead, low count genes will now dominate the PCA plots. So instead, we use a regularized log ("rlog"; base 2) transformation that gives similar results for high counts as a log2 transformation but also shrinks the values of low counts towards the genes’ average across samples. We do this with the `rlog()` function in the [DESeq2][] package [@DESeq2], which we will later use for differential gene expression analysis.
## Principal component analysis (PCA) {.tabset}
PCA [@Jolliffe:2002wx] is a multivariate technique that allows us to summarize the systematic patterns of variations in the data. PCA takes the expression levels for genes and transforms it in principal component space, reducing each sample into one point. Thereby, we can separate samples by expression variation, and identify potential sample outliers. The PCA plot is a way to look at how samples are clustering.
```{r plot_pca}
markdownHeader("unlabeled", level = 3)
plotPCA(bcb, label = FALSE)
markdownHeader("labeled", level = 3)
plotPCA(bcb, label = TRUE)
```
## Hierarchical clustering
Inter-correlation analysis (ICA) is another way to look at how well samples cluster by plotting the correlation between the expression profiles of the samples.
```{r plot_correlation_heatmap}
plotCorrelationHeatmap(bcb, normalized = "rlog", method = "pearson")
```
```{r footer, child="_footer.Rmd"}
```