01_import_qc.qmd

---
title: "Importing data and performing quality control"
format: html
---

## Motivation

Single-cell quality control is a crucial step in single-cell analysis. In this step, we identify and filter out poor-quality cells, which may arise due to technical artifacts, biological variability, or experimental conditions. Effective quality control helps to improve downstream analysis accuracy and interpretation by ensuring that only high-quality cells are included in the next steps. It is important to note that quality check is an interactive process integrated into the entire scRNA-seq analysis workflow. As data processing steps may influence quality metrics, re-assessment of quality after each processing step is essential.

There are some essential things to check on your data before doing more robust analyses:

 - **Gene Detection Rate**: The number of genes detected in each cell, indicating the cell's transcriptional activity. Low gene detection rates may suggest poor-quality cells or technical issues.
 - **Unique Molecular Identifier (UMI) Counts**: UMI counts reflect the depth of sequencing and can indicate the quality of library preparation.
 - **Mitochondrial Gene Expression**: Higher expression of mitochondrial genes relative to nuclear genes can indicate stress or cell damage.
 - **Library Size**: The total number of UMIs or reads per cell, providing an overall measure of data quality and sequencing depth.
 - **Doublet Detection**: Identification of potential cell doublets, where two cells are erroneously captured and sequenced together as one. Doublets can skew downstream analyses.

## Create Seurat object from scratch

```{r}
library(Seurat)
library(Matrix)
library(SingleCellExperiment)
library(scDblFinder)
library(tidyverse)
```

```{r}
samples <- c("GSM5102900", "GSM5102901", "GSM5102902", "GSM5102903")
group <- c("healthy", "healthy", "non_survival_sepsis", "non_survival_sepsis")

# Matrices directories
dir <- paste0("data/sepsis/", samples)
names(dir) <- samples

# Read each matrix into a separate object
data <- lapply(dir, Read10X)

# Create an SeuratObject for each matrix
ls_sc <- pmap(list(data, samples, group), function(x, y, z) {
  CreateSeuratObject(x, project = y, min.cells = 3, min.features = 200, 
                     meta.data = data.frame(cells = colnames(x), 
                                            group = z) %>%
                       tibble::column_to_rownames("cells"))
})

# Merge them all
sc <- merge(ls_sc[[1]], y = ls_sc[2:length(ls_sc)], project = "all_samples")

# Save object
save(sc, file = "results/sc_noqc_unmerged_layers.rda")
```

## Quality check

### Remove doublets with `scDblFinder`

```{r}
# This step was not run as the datasets was shared with a already preprocessed version of
# the count table. Standard Seurat workflow for single-cell analysis was performed. 

# The SingleCellExperiment does not handle the layers attribute implemented on Seurat v5. 
# We need to join the layers to perform QC analysis of all samples as a whole
sc <- JoinLayers(sc)

# Convert Seurat object into SingleCellExperiment
sce <- as.SingleCellExperiment(sc)

# Remove doublets using the scDblFinder
set.seed(123)
results <- scDblFinder(sce, returnType = 'table') %>% 
  as.data.frame() %>%
  dplyr::filter(type == "real")

# Count how many doublets are
results %>% 
  dplyr::count(class)

# Keep only the singlets
keep <- results %>% 
  dplyr::filter(class == "singlet") %>% 
  rownames()

sc <- sc[, keep]
```

### Calculate the percentage of mitochondrial genes

```{r}
# Identfy the percentage of mitocondrial genes in each cell
sc[["percent.mt"]] <- PercentageFeatureSet(sc, pattern = "^MT-")

# Violin plot of feature counts and molecules counts
VlnPlot(sc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
```

### Compare the number of features (genes) and the number of molecules by cell

```{r}
# Use the FetchData to retrieve a dataframe with counts information
sc.qc <- FetchData(sc, vars = c("nFeature_RNA", "nCount_RNA", "percent.mt"))

# Distribution of the number of molecules
sc.qc %>%
  ggplot() +
  geom_vline(aes(xintercept = median(sc.qc$nCount_RNA)), color = "red") +
  geom_histogram(aes(x = nCount_RNA), bins = 100)

# Mean number of RNA molecules per cell
summary(sc.qc$nCount_RNA)

# Plot the distribution of features (genes) - all samples
sc.qc %>%
  ggplot() +
  geom_vline(aes(xintercept = median(sc.qc$nFeature_RNA)), color = "red") +
  geom_histogram(aes(x = nFeature_RNA), bins = 200)

# Mean number of genes per cell
summary(sc.qc$nFeature_RNA)

# Plot the distribution of features (genes) - by samples
sc.qc %>%
  mutate(group = sc@meta.data$orig.ident) %>% 
  ggplot() +
  geom_histogram(aes(x = nFeature_RNA, fill = group), bins = 200) +
  scale_x_log10()

# Plot the distribution of mitochondrial genes 
sc.qc %>%
  ggplot() +
  geom_histogram(aes(x = percent.mt), bins = 100) +
  geom_vline(xintercept = 10, color = "red")

summary(sc.qc$percent.mt)

# Scatter plot of the relationship of number of molecules and the percent of MT genes
FeatureScatter(sc, feature1 = "nCount_RNA", feature2 = "percent.mt", group.by = "orig.ident")

# Scatter plot of the relationship of number of molecules and the number of features
FeatureScatter(sc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA", group.by = "orig.ident")

sc.qc %>%
  ggplot() +
  geom_point(aes(nCount_RNA, nFeature_RNA, colour = percent.mt), alpha = .50) +
  scale_x_log10() +
  scale_y_log10()
```

### Filter cells  

```{r}
# Keep only the cells with nCount_RNA > 500 & nFeature_RNA < 4000 & percent.mt < 10
sc.qc <- sc.qc %>%
  mutate(keep = if_else(nCount_RNA > 500 & nFeature_RNA < 5000 & percent.mt < 10, "keep", "remove")) %>% 
  mutate(er = ifelse(nCount_RNA > 3000 & nFeature_RNA < 600 & percent.mt < 10, "eritro", "no_eritro"))

sc.qc %>% 
  dplyr::count(keep)
  
sc.qc %>% 
  ggplot() +
  geom_point(aes(nCount_RNA, nFeature_RNA, colour = keep), alpha = .30) +
  scale_x_log10() +
  scale_y_log10()

sc.qc %>% 
  ggplot() +
  geom_point(aes(nCount_RNA, nFeature_RNA, colour = er), alpha = .30) +
  scale_x_log10() +
  scale_y_log10()

sc_qc <- subset(sc, nCount_RNA > 500 & nFeature_RNA < 5000 & percent.mt < 10)

save(sc_qc, file = "results/sc_qc.rda")
```

### Split the layers (for integration step)

```{r}
sc_qc_split <- split(sc_qc, f = sc_qc$orig.ident)
save(sc_qc_split, file = "results/sc_qc_split.rda")
```