forked from iaradsouza1/intro-single-cell
-
Notifications
You must be signed in to change notification settings - Fork 0
/
01_import_qc.qmd
183 lines (137 loc) · 6.27 KB
/
01_import_qc.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
---
title: "Importing data and performing quality control"
format: html
---
## Motivation
Single-cell quality control is a crucial step in single-cell analysis. In this step, we identify and filter out poor-quality cells, which may arise due to technical artifacts, biological variability, or experimental conditions. Effective quality control helps to improve downstream analysis accuracy and interpretation by ensuring that only high-quality cells are included in the next steps. It is important to note that quality check is an interactive process integrated into the entire scRNA-seq analysis workflow. As data processing steps may influence quality metrics, re-assessment of quality after each processing step is essential.
There are some essential things to check on your data before doing more robust analyses:
- **Gene Detection Rate**: The number of genes detected in each cell, indicating the cell's transcriptional activity. Low gene detection rates may suggest poor-quality cells or technical issues.
- **Unique Molecular Identifier (UMI) Counts**: UMI counts reflect the depth of sequencing and can indicate the quality of library preparation.
- **Mitochondrial Gene Expression**: Higher expression of mitochondrial genes relative to nuclear genes can indicate stress or cell damage.
- **Library Size**: The total number of UMIs or reads per cell, providing an overall measure of data quality and sequencing depth.
- **Doublet Detection**: Identification of potential cell doublets, where two cells are erroneously captured and sequenced together as one. Doublets can skew downstream analyses.
## Create Seurat object from scratch
```{r}
library(Seurat)
library(Matrix)
library(SingleCellExperiment)
library(scDblFinder)
library(tidyverse)
```
```{r}
samples <- c("GSM5102900", "GSM5102901", "GSM5102902", "GSM5102903")
group <- c("healthy", "healthy", "non_survival_sepsis", "non_survival_sepsis")
# Matrices directories
dir <- paste0("data/sepsis/", samples)
names(dir) <- samples
# Read each matrix into a separate object
data <- lapply(dir, Read10X)
# Create an SeuratObject for each matrix
ls_sc <- pmap(list(data, samples, group), function(x, y, z) {
CreateSeuratObject(x, project = y, min.cells = 3, min.features = 200,
meta.data = data.frame(cells = colnames(x),
group = z) %>%
tibble::column_to_rownames("cells"))
})
# Merge them all
sc <- merge(ls_sc[[1]], y = ls_sc[2:length(ls_sc)], project = "all_samples")
# Save object
save(sc, file = "results/sc_noqc_unmerged_layers.rda")
```
## Quality check
### Remove doublets with `scDblFinder`
```{r}
# This step was not run as the datasets was shared with a already preprocessed version of
# the count table. Standard Seurat workflow for single-cell analysis was performed.
# The SingleCellExperiment does not handle the layers attribute implemented on Seurat v5.
# We need to join the layers to perform QC analysis of all samples as a whole
sc <- JoinLayers(sc)
# Convert Seurat object into SingleCellExperiment
sce <- as.SingleCellExperiment(sc)
# Remove doublets using the scDblFinder
set.seed(123)
results <- scDblFinder(sce, returnType = 'table') %>%
as.data.frame() %>%
dplyr::filter(type == "real")
# Count how many doublets are
results %>%
dplyr::count(class)
# Keep only the singlets
keep <- results %>%
dplyr::filter(class == "singlet") %>%
rownames()
sc <- sc[, keep]
```
### Calculate the percentage of mitochondrial genes
```{r}
# Identfy the percentage of mitocondrial genes in each cell
sc[["percent.mt"]] <- PercentageFeatureSet(sc, pattern = "^MT-")
# Violin plot of feature counts and molecules counts
VlnPlot(sc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
```
### Compare the number of features (genes) and the number of molecules by cell
```{r}
# Use the FetchData to retrieve a dataframe with counts information
sc.qc <- FetchData(sc, vars = c("nFeature_RNA", "nCount_RNA", "percent.mt"))
# Distribution of the number of molecules
sc.qc %>%
ggplot() +
geom_vline(aes(xintercept = median(sc.qc$nCount_RNA)), color = "red") +
geom_histogram(aes(x = nCount_RNA), bins = 100)
# Mean number of RNA molecules per cell
summary(sc.qc$nCount_RNA)
# Plot the distribution of features (genes) - all samples
sc.qc %>%
ggplot() +
geom_vline(aes(xintercept = median(sc.qc$nFeature_RNA)), color = "red") +
geom_histogram(aes(x = nFeature_RNA), bins = 200)
# Mean number of genes per cell
summary(sc.qc$nFeature_RNA)
# Plot the distribution of features (genes) - by samples
sc.qc %>%
mutate(group = sc@meta.data$orig.ident) %>%
ggplot() +
geom_histogram(aes(x = nFeature_RNA, fill = group), bins = 200) +
scale_x_log10()
# Plot the distribution of mitochondrial genes
sc.qc %>%
ggplot() +
geom_histogram(aes(x = percent.mt), bins = 100) +
geom_vline(xintercept = 10, color = "red")
summary(sc.qc$percent.mt)
# Scatter plot of the relationship of number of molecules and the percent of MT genes
FeatureScatter(sc, feature1 = "nCount_RNA", feature2 = "percent.mt", group.by = "orig.ident")
# Scatter plot of the relationship of number of molecules and the number of features
FeatureScatter(sc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA", group.by = "orig.ident")
sc.qc %>%
ggplot() +
geom_point(aes(nCount_RNA, nFeature_RNA, colour = percent.mt), alpha = .50) +
scale_x_log10() +
scale_y_log10()
```
### Filter cells
```{r}
# Keep only the cells with nCount_RNA > 500 & nFeature_RNA < 4000 & percent.mt < 10
sc.qc <- sc.qc %>%
mutate(keep = if_else(nCount_RNA > 500 & nFeature_RNA < 5000 & percent.mt < 10, "keep", "remove")) %>%
mutate(er = ifelse(nCount_RNA > 3000 & nFeature_RNA < 600 & percent.mt < 10, "eritro", "no_eritro"))
sc.qc %>%
dplyr::count(keep)
sc.qc %>%
ggplot() +
geom_point(aes(nCount_RNA, nFeature_RNA, colour = keep), alpha = .30) +
scale_x_log10() +
scale_y_log10()
sc.qc %>%
ggplot() +
geom_point(aes(nCount_RNA, nFeature_RNA, colour = er), alpha = .30) +
scale_x_log10() +
scale_y_log10()
sc_qc <- subset(sc, nCount_RNA > 500 & nFeature_RNA < 5000 & percent.mt < 10)
save(sc_qc, file = "results/sc_qc.rda")
```
### Split the layers (for integration step)
```{r}
sc_qc_split <- split(sc_qc, f = sc_qc$orig.ident)
save(sc_qc_split, file = "results/sc_qc_split.rda")
```