-
Notifications
You must be signed in to change notification settings - Fork 0
/
scRNAseq_long_read_part_1_with_questions.Rmd
584 lines (409 loc) · 21.1 KB
/
scRNAseq_long_read_part_1_with_questions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
---
title: "scRNAseq long read analysis - Part 1"
author: "Morgane Fierville & Ali Hamraoui, SincellTE 2024"
date: "2024-10-24"
output:
html_document:
code_folding: show
code_download: true
toc: true
toc_float: true
number_sections: false
---
<style>
body {
text-align: justify}
</style>
<!-- Set default parameters for all chunks -->
```{r, setup, include = FALSE}
set.seed(1337L)
knitr::opts_chunk$set(echo = TRUE,
message = FALSE,
warning = FALSE,
fold_output = FALSE,
fold_plot = FALSE,
fig.align = 'center',
fig.width = 15,
fig.height = 8)
```
```{r message=FALSE, include=FALSE}
data_dir <- "/shared/projects/sincellte2024/workshop_long_read_analysis/data/"
# Create directory if not exist
checkDir <- function(output_dir){
if (! dir.exists(output_dir)){
dir.create(output_dir, recursive = TRUE)
cat(paste0(output_dir, " CREATED\n"))
}
}
```
# Transcript isoform diversity in embryonic mouse brain.
Droplet-based high throughput single cell sequencing techniques tremendously advanced our insight into cell-to-cell heterogeneity. However, those approaches only allow analysis of one extremity of the transcript after short read sequencing. In consequence, information on splicing and sequence heterogeneity is lost. To overcome this limitation, several approaches that use long-read sequencing were introduced recently. Yet, those techniques are limited by low sequencing depth and/or lacking or inaccurate assignment of unique molecular identifiers (UMIs), which are critical for elimination of PCR bias and artifacts. We introduce ScNaUmi-seq, an approach that combines the high throughput of Oxford Nanopore sequencing with an accurate cell barcode and UMI assignment strategy. UMI guided error correction allows to generate high accuracy full length sequence information with the 10x Genomics single cell isolation system at high sequencing depths. We analyzed transcript isoform diversity in embryonic mouse brain and show that ScNaUmi-seq allows defining splicing and SNVs (RNA editing) at a single cell level.
Lebrigand, K., Magnone, V., Barbry, P. et al. High throughput error corrected Nanopore single cell transcriptome sequencing.
Nat Commun 11, 4025 (2020). https://doi.org/10.1038/s41467-020-17800-6
## 1.1 Introduction
The aim of these materials is to demonstrate how to use the `Seurat` and `Isoswitch` R packages to process long read scRNA-seq data. Here we use the Lebrigand et al. datasets and follow the same general steps used in the scRNA-seq course, but using `Isoswitch` package to perform DE isoforms. In this first part we will show how to preprocess scRNA-seq isoform datasets and create a multi Assay Seurat object. Then, in a second part, we'll use this Seurat object to perform differential isoform expression analysis and explore pertinent biological questions.
## 1.2 Dataset
Some context for this dataset is given in the Lebrigand et al. introduction. This dataset includes two independent technical replicates: the 951-cell dataset and the 190-cell dataset. For each replicate, we have both a Gene and an Isoform expression matrix.
## 1.3 Reading in the data
We start the analysis by loading the necessary packages:
```{r message=FALSE, include=FALSE}
library(Seurat)
library(dplyr)
library(isoswitch)
library(scCustomize)
options(Seurat.object.assay.version = "v5")
```
```{r message=FALSE, include=FALSE}
source('/shared/projects/sincellte2024/workshop_long_read_analysis/imports/cell_annot_custom.R')
rename_iso <- function(matrix){
matrix$transcriptId <- sub("\\..*", "", matrix$transcriptId)
matrix$iso_id <-apply(matrix[, c("geneId","transcriptId")], 1 , paste , collapse = ".." )
rownames(matrix) <- matrix$iso_id
matrix <- matrix[, !colnames(matrix)%in%c("geneId","transcriptId","iso_id")]
}
annotate <- function (ScObject, cell_markers) {
if ("cell_type" %in% colnames(ScObject@meta.data)) {
ScObject$cell_type = NULL
}
ScObject = cell_annot_custom(ScObject,
assay = "RNA", # which assay to use to perform annotation ?
newname = "cell_type", # new column name in meta.data
markers = cell_markers) # markers for cell type
return(ScObject)
}
```
We will load in two replicate from each data type (gene and isoform counts):
### 1.3.1 Gene-level matrix count
We then load the gene counts data, using a Seurat function that reads cellranger output, for 190c and 951c replicates:
```{r load_genes_cm}
c190_gene <- Seurat::Read10X(data.dir = paste0(data_dir, "GSM3748086_190c/"))
c951_gene <- Seurat::Read10X(data.dir = paste0(data_dir, "GSM3748088_951c/"))
```
**Questions:**
How many cells and genes do we have in this both datasets ? Are these genes all common between the two replicates ?
```{r}
# Write your code here
# ...
```
### 1.3.2 Isoform-level matrix count
We load and filter the Sicelore CSV file of isoform count matrix for 190c and 951c replicates:
There are an important component of the Isoswitch to be aware of:
- Isoswich expects the row names of the isoform count matrix to follow the format `[gene_name]..[transcript_id]`,
example: `Eva1c..ENSMUST00000231280`.
- We use the rename_iso function described on top to rename Isoform-level matrix rownames.
```{r}
c190_iso <- read.csv(paste0(data_dir,'GSM3748087_190c.isoforms.matrix.txt'), sep = '\t')
c951_iso <- read.csv(paste0(data_dir,'GSM3748089_951c.isoforms.matrix.txt'), sep = '\t')
```
**Exercise: **
Apply the function `rename_iso` for each replicates and try to explain it.
```{r}
# Write your code here
# ...
```
### 1.3.3 Homogeneization
To homogenize both 190c matrices (genes + isoforms), we ensure to have the same barcodes in the columns of both matrices:
```{r homogenize_190c}
colnames(c190_gene) = gsub("-1$", "", colnames(c190_gene)) # remove the "-1" for each cellular barcodes
com.cells = intersect(colnames(c190_iso), colnames(c190_gene))
length(com.cells) # 190
c190_gene = c190_gene[, colnames(c190_gene) %in% com.cells]
c190_iso = c190_iso[, colnames(c190_iso) %in% com.cells]
# dim(c190_gene)[2]
# dim(c190_iso)[2]
```
**Exercise: **
Do the same for 951c matrices:
```{r homogenize_951c_a}
# Write your code here
# ...
```
## 1.4 The Seurat Object
We can use `CreateAssay5Object` to make Seurat assay for each dataset. Then, we use the gene assay to create a Seurat Object that will be used in the analysis:
### 1.4.1 Create Seurat assay:
We create a gene and isoform assays for replicate 190c:
```{r create_assays_190c}
c190_gene_assay = CreateAssay5Object(counts = c190_gene)
c190_isoform_assay = CreateAssay5Object(counts = c190_iso)
```
**Exercise: **
Create the assay objects for the replicate 951c:
```{r create_assays_951c_a}
# Write your code here
# ...
```
### 1.4.2 Create Seurat object
Now, we initialize a Seurat object with gene assay. Then, we add the isoform assay to the same Seurat object. We will name our Seurat objects **c190_Sobj** and **c951_Sobj**
```{r init_sobj}
assay = "RNA"
sample_name = "190c"
c190_Sobj = Seurat::CreateSeuratObject(counts = c190_gene_assay,
assay = assay,
project = sample_name)
c190_Sobj[["ISO"]] = c190_isoform_assay
c190_Sobj[[paste0('log_nCount_', assay)]] = log(c190_Sobj[[paste0('nCount_', assay)]])
c190_Sobj
head(rownames(c190_Sobj@assays$RNA$counts))
head(rownames(c190_Sobj@assays$ISO$counts))
head(c190_Sobj@meta.data)
```
**Exercise: **
Create a Seurat object 'c951_Sobj' for replicate 951c:
```{r init_sobj_951c_a}
# Write your code here
# ...
```
**Questions:**
How many cells and features are there within each sample ?
```{r}
# Write your code here
# ...
```
## 1.5 Merge, normalise and dimension reduction
### 1.5.1 Merge
Now, we merge our datasets (with batch correction) and see whether there still a batch effect using UMAP:
We use the `Seurat::merge` function to merge the c190_Sobj and c951_Sobj objects. We will name the merged Seurat objects **seurat_obj**
```{r}
seurat_obj <- merge(c951_Sobj, c190_Sobj, add.cell.ids = c("c951", "c190"))
# seurat_obj <- merge(c951_Sobj, c190_Sobj) # Avis : Some cell names are duplicated across objects provided. Renaming to enforce unique cell names.
seurat_obj
```
**Questions:**
How many assays and layers you have within the object ? and what are they?
### 1.5.2 Normalization
Next, to generate a projection based on the genes count. We first normalize the genes count matrix using the `LogNormalize` method.
**Warning:** Being aware of the active assay is important when doing different types of analysis because tools will try to use the active assay by default if they can.
```{r set_assay_before}
Seurat::DefaultAssay(seurat_obj) = "RNA"
```
```{r normalization}
seurat_obj = Seurat::NormalizeData(seurat_obj,
normalization.method = "LogNormalize",
assay = "RNA")
seurat_obj = Seurat::FindVariableFeatures(seurat_obj,
assay = "RNA",
normalization.method = "vst",
nfeatures = 3000)
seurat_obj
```
### 1.5.3 Dimentianality reduction
Let's generate a tSNE plot to visualize cells.
First, we perform a PCA:
```{r pca_before}
seurat_obj = Seurat::ScaleData(seurat_obj,
features = rownames(seurat_obj), verbose = F)
var_features = Seurat::VariableFeatures(object = seurat_obj)
seurat_obj = Seurat::RunPCA(seurat_obj, features = var_features,
verbose = F, reduction.name = "RNA_pca", max_dims = 100)
```
It’s often useful to know how many PCs can be used without losing a lot of information. So we need to pick a number of PCs which capture most of the variance in the data. These will be used to generate a UMAP plot. The elbow plot shows the part of the variance captured by each PC.
**Questions:**
How many PCs should we use to describe the data ?
```{r elbowPlot, fig.width = 12, fig.height = 4}
Seurat::ElbowPlot(seurat_obj, ndims=100, reduction = "RNA_pca")
```
**Answer:**
A large proportion of the variation is captured by 10 components. In our case a big drop happens at 10, so seems like a good initial choice:
Non-linear dimension reduction methods such as UMAP and TSNE take the PCA data as a starting point. We generate a tSNE with 10 principal components :
```{r tsne_before}
ndims = 10
seurat_obj = Seurat::RunTSNE(seurat_obj,
reduction = "RNA_pca",
dims = 1:ndims,
seed.use = 1337L,
reduction.name = paste0("RNA_pca_", ndims, "_tsne"))
seurat_obj
```
```{r replicates, fig.height=4, fig.width=8}
DimPlot(seurat_obj, reduction = "RNA_pca_10_tsne") +
ggplot2::theme(aspect.ratio = 1)
```
**Questions:**
Are the technical replicates overlap well ? Do you see any sample bias ?
### 1.5.4 Clustering
Now, We can do clustering. Incease resolution parameter leads to more clusters (default is 0.8). It's very important to find the correct cluster resolution, since cell type markers depends on cluster definition.
```{r}
seurat_obj <- FindNeighbors(seurat_obj, dims = 1:ndims, verbose = F, reduction = 'RNA_pca')
seurat_obj <- FindClusters(seurat_obj, resolution = 0.7, verbose = F)
```
```{r clusteres, fig.height=4, fig.width=8}
DimPlot(seurat_obj, reduction = "RNA_pca_10_tsne") +
ggplot2::theme(aspect.ratio = 1)
```
## 1.6 Identifying cell types
Here, we will use a set of markers based on Lebriband et al. paper to annotate each cell. `Seurat::AddModuleScore` function allows to give a cell type score to each cell. Then, cell type class is assigned to a cell according to the highest score. Finally, we will use cell annotation to determine which cell type is in each cluster.
### 1.6.1 Cell markers:
```{r markers}
cell_markers <- list(Cajal_Retzius = c('Lhx5', 'Reln', 'Snhg11', 'Meg3'),
GABAergic = c('Arx', 'Maf', 'Dlx2'),
imature_GABAergic = c('Gad2', 'Meis2', 'Gad2', 'Dlx6os1'),
Glutamatergic = c('Grin2b', 'Opcml', 'Camk2b', 'Mef2c'),
imature_Glutamatergic = c('Sox11', 'Pou3f3', 'Pou3f2', 'Neurod6', 'Neurod2'),
intermediate_progenitors = c('Snhg11', 'Neurog2', 'Eomes', 'Neurod1', 'Rbfox3', 'Tcf4', 'Meis2', 'H3f3b', 'Neurod6', 'Neurod2', 'Sox11', 'Pax6'),
cyclin_radial_glia = c('Top2a', 'Ccnd2', 'Cenpf', 'Mki67'),
radial_glia = c('Fabp7', 'Vim', 'Dbi' , 'Pax6'))
color_markers <- list(Cajal_Retzius = "#EFE64C",
GABAergic = "#E88578",
imature_GABAergic = "#ABA73F",
Glutamatergic = "#79C488",
imature_Glutamatergic = "#4A9EC0",
intermediate_progenitors = "#4978B4",
cyclin_radial_glia = "#825B8D",
radial_glia = "#D04C3A")
```
```{r fig.height=1, fig.width=15}
data.frame(cell_type = names(color_markers),
color = unlist(color_markers)) %>%
ggplot2::ggplot(., ggplot2::aes(x = cell_type, y = 0, fill = cell_type)) +
ggplot2::geom_point(pch = 21, size = 5) +
ggplot2::scale_fill_manual(values = unlist(color_markers), breaks = names(color_markers)) +
ggplot2::theme_classic() +
ggplot2::theme(legend.position = "none",
axis.line = ggplot2::element_blank(),
axis.title = ggplot2::element_blank(),
axis.ticks = ggplot2::element_blank(),
axis.text.x = ggplot2::element_text(size=12),
axis.text.y = ggplot2::element_blank())
```
### 1.6.2 Annotate cells:
```{r}
seurat_obj <- JoinLayers(seurat_obj)
# Create join matrix
# seurat_obj@assays$RNA@layers$data
# seurat_obj@assays$RNA@layers$counts
# seurat_obj@assays$RNA@layers$scale.data
seurat_obj <- annotate(seurat_obj, cell_markers)
table(seurat_obj$cell_type)
```
```{r cell_type, fig.height=4, fig.width=8}
DimPlot(seurat_obj, reduction = "RNA_pca_10_tsne", group.by = 'cell_type')+
ggplot2::scale_color_manual(values = unlist(color_markers),
breaks = names(color_markers)) +
Seurat::NoAxes() +
ggplot2::theme(aspect.ratio = 1)
```
### 1.6.3 Annotate clusters:
We have our individual cell type, now let’s annotate clusters based on major cell type:
```{r}
cell_data <- data.frame(
cluster = Idents(seurat_obj),
cell_type = seurat_obj$cell_type
)
major_cell_type_per_cluster <- cell_data %>%
group_by(cluster) %>%
summarise(cluster_cell_type = names(which.max(table(cell_type))))
seurat_obj$cluster_cell_type <- major_cell_type_per_cluster$cluster_cell_type[match(Idents(seurat_obj), major_cell_type_per_cluster$cluster)]
table(seurat_obj$seurat_clusters, seurat_obj$cell_type)
```
```{r cluster_cell_type, fig.height=5, fig.width=8}
DimPlot(seurat_obj, reduction = "RNA_pca_10_tsne", group.by = 'cluster_cell_type', label = F) +
ggplot2::scale_color_manual(values = unlist(color_markers),
breaks = names(color_markers)) +
Seurat::NoAxes() +
ggplot2::theme(aspect.ratio = 1)
```
## 1.7 Differentially expressed isoforms
Now that we have our annotated cell clusters, we can ask questions about differences between cell types we identified at the transcript level.
Here, we will look for events where two isoforms of the same gene are considered markers of different clusters.
### 1.7.1 Visualising gene and isoform counts
We can use the `VlnPlot()` function to visualize data stored in our metadata slot. This function will generate violin plots with clusters on the x-axis and the variable of interest on the y-axis.
For example, to plot raw counts per cell type:
```{r ncount, fig.height=5, fig.width=10}
VlnPlot(seurat_obj, features = c("nCount_RNA", "nCount_ISO"), group.by = 'cluster_cell_type', log=TRUE)
```
**Questions:**
Does the gene counts correlate well with the isoform count? To do this, you can use the `FeatureScatter()` function.
```{r featurescatter, fig.height=5, fig.width=12}
# Write your code here
# ...
```
### 1.7.2 Convert Seurat assay to V4
Now, to use `Isoswitch`, we have to make sure that we have a **Seurat V4 object**. If not, we have to convert a V5 object using `Convert_Assay` function from `scCustomize` package.
We name the new object "seurat_obj4"
```{r}
seurat_obj4 <- scCustomize::Convert_Assay(seurat_object = seurat_obj, convert_to = "assay", assay = Assays(seurat_obj))
seurat_obj4
```
### 1.7.3 Normalize isoform counts
To work on the isoform level, We will use the **ISO assay**. We normalize and scale the isoform count matrix:
```{r process_iso}
seurat_obj4 = Seurat::NormalizeData(seurat_obj4, assay = "ISO",
normalization.method = "LogNormalize", verbose = F)
seurat_obj4 = Seurat::ScaleData(seurat_obj4, assay = "ISO", verbose = FALSE)
seurat_obj4
```
### 1.7.4 Generate a "multi" assay
We generate a "multi" assay using `isoswitch` package, the method `iso_preprocess()` prepares the isoform matrix by removing low-expression transcripts as well as removing single-isoform genes which are irrelevant for the isoform switch analysis. The threshold for filtering transcripts is 5, this means that we remove the transcript if it represents less than 5% of the gene counts.
The resulting "multi-isoform" matrix is stored as a new assay on the input Seurat object.
```{r isoswitch_multi}
seurat_obj4 = isoswitch::iso_preprocess(seurat_obj4, assay = "ISO", new_assay = "multi", filter_threshold = 5)
head(rownames(seurat_obj4@assays$multi@counts))
```
**Questions:**
How many transcripts did we remove ?
We finally have three assays with distinct dimensions:
```{r dim_assays}
dim(seurat_obj4@assays$RNA@counts)
dim(seurat_obj4@assays$ISO@counts)
dim(seurat_obj4@assays$multi@counts)
```
### 1.7.5 Isoform characterization
The method `iso_compute_stats()` parses the isoform raw count matrix returning a data frame with stats on the expression of each transcript
- `feature`, `gene_id`, `transcript_id` gene/transcript identifiers
- `sum` Total UMI counts for the transcript
- `total_gene` Total UMI counts for the gene
- `n_isofs` Number of distinct isoforms detected for the gene
- `max_sum` Max sum value for the isoform with the highest expression
- `perc` Relative percentage of isoform UMI count vs gene total.
- `is_major` (Boolean) is this considered a major isoform
- `is_top` (Boolean) is this highest expressed isoform
```{r iso_compute_stats}
Idents(seurat_obj4) <- 'cluster_cell_type'
stats = isoswitch::iso_compute_stats(seurat_obj4@assays$multi@counts) %>%
dplyr::arrange(gene_id)
head(stats, n=4)
```
**Exercise:**
Which gene have the highest number of isoforms ? Find the gene name and the number of isoforms.
```{r}
# Write your code here
# ...
```
**Exercise:**
Which isoform have the highest number of counts ? Find this isoform, its corresponding gene name and its number of counts.
```{r}
# Write your code here
# ...
```
Now, we plot a summary of the data. The method `plot_assay_stats()` plot a summary with the number of genes, the number of transcripts, the distribution of isoforms and the number of genes per cell type that can be used to describe succintly the isoform distribution in the dataset.
```{r plot_assay_stats, fig.height=6, fig.width=15}
isoswitch::plot_assay_stats(seurat_obj4, "ISO")
```
**Questions:**
Which cell type does express the largest number of genes ?
### 1.7.6 Isoform switch search
The term “isoform switch” refers to an event where two isoforms of the same gene are considered markers of different clusters. The marker search is implemented on the method `ISO_SWITCH_ALL()`. Any extra parameters are passed on to the underlying Seurat’s FindMarkers call to fine tune the search space. `ISO_SWITCH_ALL()` returns a data frame of transcripts identified as markers of a given cluster for a given gene, one transcript per row.
**Warnings:** This function may take time to run, so you can skip this step and you find the output here : `/shared/projects/sincellte2024/workshop_long_read_analysis/output/datasets/switch_markers_combined_all.rds`
```{r}
clusters <- levels(seurat_obj4@active.ident)
switch_markers <- ISO_SWITCH_ALL(seurat_obj4, clusters, assay="multi",
min.pct=0, logfc.threshold=0.40)
switch_markers
```
## 1.8 Save
We save the annotated and filtered Seurat object and also the dataframe of the switch markers :
```{r save_seurat_obj_filtered_annotated}
out_dir <- "./output/datasets/"
checkDir(out_dir)
saveRDS(seurat_obj4, file = paste0(out_dir, "seurat_obj_annotated.rds"))
saveRDS(color_markers, file = paste0(out_dir, "color_markers.rds"))
saveRDS(switch_markers, file = paste0(out_dir, "switch_markers_combined_all.rds"))
```
## 1.9 Ressources
- [Seurat tutorials](https://satijalab.org/seurat/)
- [Isoswitch R package](https://github.com/ucagenomix/isoswitch)
- [Read more about Single-cell long-read](https://www.isomics.eu/index.html)
# R session
```{r sessioninfo, echo = FALSE, fold_output = TRUE}
sessionInfo()
```