-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathscRNA-seq_zebrafish.Rmd
359 lines (247 loc) · 12.5 KB
/
scRNA-seq_zebrafish.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
---
title: "scRNA-seq analysis"
author: "Hsiao-Yu Peng"
date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`'
output:
pdf_document: default
html_document:
df_print: kable
---
The dataset has been downloaded from [GSE276251] (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE276251).
```{r setup, include=FALSE}
all_times <- list() # store the time for each chunk
knitr::knit_hooks$set(time_it = local({
now <- NULL
function(before, options) {
if (before) {
now <<- Sys.time()
} else {
res <- difftime(Sys.time(), now, units = "secs")
all_times[[options$label]] <<- res
}
}
}))
knitr::opts_chunk$set(
tidy = TRUE,
tidy.opts = list(width.cutoff = 95),
message = FALSE,
warning = FALSE,
time_it = TRUE
)
```
```{r}
#install.packages('Seurat')
library(Seurat)
library(tidyverse)
library(patchwork)
# load control and high-glucose data set
ctl.data <- Read10X(data.dir = "GSE276251_RAW/control/")
glucose.data <- Read10X(data.dir = "GSE276251_RAW/glucose/")
# Initialize the Seurat object with the raw (non-normalized data).
ctrl <- CreateSeuratObject(counts = ctl.data, project= 'ctrl'
, min.cells = 3, min.features = 200)
glucose <- CreateSeuratObject(counts = glucose.data, project = 'glucose',
, min.cells = 3, min.features = 200)
ctrl
glucose
```
```{r}
# Explore the metadata
head(ctrl@meta.data)
head(glucose@meta.data)
```
- `orig.ident`: this often contains the sample identity if known, but will default to “SeuratProject”
- `nCount_RNA`: number of UMIs per cell
- `nFeature_RNA`: number of genes detected per cell
## 1. Filter out low-quality cells
- Low-quality cells or empty droplets often have very few genes (low nFeature_RNA and nCount_RNA)
- Cell doublets or multiplets have high values of nFeature_RNA & nCount_RNA
- Low-quality/dying cells often have high percentage of mitochondrial genes (percent.mt)
**Percentage of mitochondrial genes**
Seurat's `PercentageFeatureSet()` function calculates the proportion of transcripts mapping to mitochondrial genes. For each cell, it sums the counts across all genes (features) belonging to the mitochondrial set, which is identified by the "MT-" or "mt-" prefix.
NOTE: The pattern "^MT-" is used for human gene names, while "^mt-" is used for non-human gene names.
```{r}
# Extract percentage of mitochondrial genes
ctrl$percent.mt <- PercentageFeatureSet(object = ctrl, pattern = "^mt-")
glucose$percent.mt <- PercentageFeatureSet(object = glucose, pattern = "^mt-")
```
### Selecting cells for further analysis
**QC Metrics: "nFeature_RNA", "nCount_RNA", "percent.mt" **
```{r}
# Merge control and glucose-treated Seurat objects
combined_seurat <- merge(ctrl, y = glucose, add.cell.ids = c("Control", "Glucose"))
# Visualize QC metrics as a violin plot
VlnPlot(combined_seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
#FeatureScatter is typically used to visualize feature-feature relationships
plot1 <- FeatureScatter(combined_seurat, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(combined_seurat, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
```
**1. nFeature_RNA (Number of genes detected per cell)**
- **Threshold considerations:**
- **Lower threshold:** Cell with `nFeature_RNA < 200` are often considered low quality, as those typically are dying cells or empty droplets.
- **Upper threshold:** Cells with unusually high numbers of detected genes may indicate doublets or multiplets, which confounding the clustering and can form small clusters that are confusing to interpret.
**2. nCount_RNA (Total number of RNA detected per cell)**
- **Threshold considerations:**
- **Lower threshold:** Cells with very few counts may have failed during library preparation or sequencing, or they may be dying cells.
- **Upper threshold:** Cells with very high RNA counts could be doublets or multiplets.
*3. Percentage of mitochondrial gene expression:*
- *Threshold considerations:* A high percentage of mitochondrial gene expression can indicate stressed or dying cells. Typically, a threshold of 5–10% mitochondrial gene expression is used to filter out cells.
These thresholds are guidelines and can vary depending on the nature of the experiment, tissue type, and data quality. We need to tweak these values after visual inspection of the plots.
From the plot, it seems that the percentage of mitochondrial genes is higher in some cells. In this case, the percentage of mitochondrial gene in control samples falls on about 20%, which may indicate the quality of this experiment is not very good.
We follow Wang's paper to filter the cells by these two criteria:
1. Selecting single cells with the number of expressed genes ranging between 500 and 3000.
2. Requiring the percentage of sequencing reads on mitochondrial genome to be less than 5 %.
```{r}
# Filter out cells
ctrl <- subset(ctrl, subset = nFeature_RNA > 500 & nFeature_RNA < 3000 & percent.mt < 5)
glucose <- subset(glucose, subset = nFeature_RNA > 500 & nFeature_RNA < 3000 & percent.mt < 5)
```
Visualize QC metrics again after filtering cells.
```{r}
# Merge control and glucose-treated Seurat objects
combined_seurat <- merge(ctrl, y = glucose, add.cell.ids = c("Control", "Glucose"))
# Visualize QC metrics as a violin plot
VlnPlot(combined_seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
#FeatureScatter is typically used to visualize feature-feature relationships
plot1 <- FeatureScatter(combined_seurat, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(combined_seurat, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
```
The correlation between `percent.mt` and `nCount_RNA` improved (R squared -0.26 vs. -0.06) ater filtering high mitochonrial content cells, which represeting dying cells.
### Normalize, High Variable Features, Scaling Data
Single-cell RNA-seq data often exhibit variability in sequencing depth across cells. To prepare for clustering analysis, we need to perform several pre-processing steps:
1. Normalize the data to account for differences in sequencing depth.
2. Identify highly variable features (genes) that capture biological variation.
3. Scale the data to ensure all genes contribute equally to downstream analyses.
```{r}
# Normalize the data
ctrl <- NormalizeData(ctrl)
glucose <- NormalizeData(glucose)
# Find variable features for each dataset
ctrl <- FindVariableFeatures(ctrl, selection.method = "vst", nfeature = 2000)
glucose <- FindVariableFeatures(glucose, selection.method = "vst", nfeature = 2000)
# Scale the data
ctrl <- ScaleData(ctrl) # by default 2000 identified variable features
all.genes <- rownames(ctrl)
ctrl <- ScaleData(ctrl, features = all.genes)
glucose <- ScaleData(glucose)
all.genes <- rownames(glucose)
glucose <- ScaleData(glucose, features = all.genes)
```
## 2. Dimension Reduction and Cell Clustering
### PCA
**PCA (Principal Component Analysis)** is performed after the pre-processing workflow in scRNA-seq analysis to reduce dimensionality and identify key sources of variability in the dataset.
```{r}
# Perform PCA on the scaled data
ctrl <- RunPCA(ctrl, features = VariableFeatures(object = ctrl))
glucose <- RunPCA(glucose, features = VariableFeatures(object = glucose))
```
###Determine the "Dimension" of the dataset
```{r}
ElbowPlot(ctrl)
ElbowPlot(glucose)
```
In the elbow plots above, and "elbow" around PC10-15. Here I chose PC10 for downstream analysis.
### Cluster the cells
- **Low resolution (0.1):** Produces fewer clusters, meaning the algorithm combines more cells into larger groups. These clusters represent broader, coarser cell communities with fewer distinctions.
- **Higher resolution (0.5):** Results in more clusters, meaning the algorithm splits cells into smaller, more fine-grained clusters. This increases the granularity and helps capture subtle differences between cell types or states.
The Louvain algorithm detects communities (clusters) based on cell similarity, so at different resolutions, it identifies different numbers of clusters.
```{r}
# Clustering for control dataset
ctrl <- FindNeighbors(ctrl, dims = 1:10)
ctrl <- FindClusters(ctrl, resolution = 0.5)
ctrl <- RunUMAP(ctrl, dims = 1:10)
# Clustering for glucose-treated dataset
glucose <- FindNeighbors(glucose, dims = 1:10)
glucose <- FindClusters(glucose, resolution = 0.5)
glucose <- RunUMAP(glucose, dims = 1:10)
```
```{r}
# Visualize clusters
DimPlot(ctrl, reduction = "umap", label = TRUE) + ggtitle("Control")
DimPlot(glucose, reduction = "umap", label = TRUE) + ggtitle("High Glucose")
```
## 3. Integrate the contrl and high glucose-treated datasets
Data integration in single-cell RNA-seq analysis is performed to reduce **batch effects** and ensure that cells from different conditions, samples, or experiments can be compared on the same scale.
*Anchor-based integration*: Anchors between datasets are identified to integrate them based on shared biological variation.
```{r}
# Create list of Seurat objects
zebrafish.list <- list(ctrl, glucose)
# Normalize the data for each dataset independently
zebrafish.list <- lapply(X = zebrafish.list, FUN = function(x) {
x <- NormalizeData(x)
x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
return(x)
})
#Select features that are repeatedly variable across datasets for integration
features <- SelectIntegrationFeatures(object.list = zebrafish.list)
# Identify integration anchors between control and glucose datasets
anchors <- FindIntegrationAnchors(object.list = zebrafish.list,
anchor.features = features)
# Integrate the datasets based on the identified anchors
zebrafish.integrated <- IntegrateData(anchorset = anchors)
# Specify that we will perform downstream analysis on the integrated data
DefaultAssay(zebrafish.integrated) <- "integrated"
# Run PCA on the integrated dataset
zebrafish.integrated <- ScaleData(zebrafish.integrated)
zebrafish.integrated <- RunPCA(zebrafish.integrated)
ElbowPlot(zebrafish.integrated, ndims = 50)
# Perform clustering on the integrated data
zebrafish.integrated <- FindNeighbors(zebrafish.integrated, reduction = "pca", dims = 1:30)
zebrafish.integrated <- FindClusters(zebrafish.integrated, resolution = 0.5)
zebrafish.integrated <- RunUMAP(zebrafish.integrated, dims = 1:30)
# Plot UMAP for the integrated data
DimPlot(zebrafish.integrated, reduction = "umap", label = TRUE, repel = TRUE)
```
```{r}
# Compare
DimPlot(zebrafish.integrated, reduction = "umap", split.by = 'orig.ident')
```
## 4. Identify differentially expressed genes (DEGs) between the control and treated zebrafishes for each cell cluster (0, 1, 2, etc.)
**Differential Expression Analysis**
```{r}
# pseudobulk cells by stimulation condition AND cell type
bulk <- AggregateExpression(zebrafish.integrated, group.by = c("orig.ident", "seurat_clusters"), return.seurat = TRUE)
Cells(bulk)
```
```{r}
# Perform differential expression analysis (using DESeq2) on pseudobulk data
# Extract count data and metadata from the pseudobulk Seurat object
counts <- GetAssayData(bulk, assay= "RNA", slot = "counts")
counts_dense <- as.matrix(counts)
# Extract columnn names
column_names <- colnames(counts_dense)
# Split the column names to create metadata for treatment and cluster
meta_data <- data.frame(
orig.ident = sapply(strsplit(column_names, "_"), '[', 1), # First word before control or glucose
seurat_clusters = sapply(strsplit(column_names, "_"), '[', 2) # Second part is the cluster number
)
# Ensure that row names of the meatadata match the col names of counts matrix
rownames(meta_data) <- column_names
```
```{r}
library(DESeq2)
# Convert seurat_clusters to a factor
meta_data$seurat_clusters <- as.factor(meta_data$seurat_clusters)
# Create DESeq2 dataset
dds <- DESeqDataSetFromMatrix(countData = counts_dense,
colData = meta_data,
design = ~ orig.ident + seurat_clusters)
# filter
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]
# Run DESeq2 differential expression analysis
dds <- DESeq(dds)
# Perform pairwise comparison for control vs glucose within each cluster
# Loop over unique clusters to extract DEGs per cluster
for (cluster in levels(meta_data$seurat_clusters)) {
# Subset the DESeq2 dataset by cluster
dds_cluster <- dds[ , meta_data$seurat_clusters == cluster]
# Perform differential expression analysis for this cluster
res <- results(dds_cluster, contrast = c("orig.ident", "ctrl", "glucose"))
# Print the DEGs for this cluster
cat(paste("Cluster:", cluster, "\n"))
print(head(res))
}
```