scRNA-seq_zebrafish.Rmd

---
title: "scRNA-seq analysis"
author: "Hsiao-Yu Peng"
date: 'Compiled: `r format(Sys.Date(), "%B %d, %Y")`'
output:
  pdf_document: default
  html_document:
    df_print: kable
---

The dataset has been downloaded from [GSE276251] (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE276251). 


```{r setup, include=FALSE}
all_times <- list()  # store the time for each chunk
knitr::knit_hooks$set(time_it = local({
  now <- NULL
  function(before, options) {
    if (before) {
      now <<- Sys.time()
    } else {
      res <- difftime(Sys.time(), now, units = "secs")
      all_times[[options$label]] <<- res
    }
  }
}))
knitr::opts_chunk$set(
  tidy = TRUE,
  tidy.opts = list(width.cutoff = 95),
  message = FALSE,
  warning = FALSE,
  time_it = TRUE
)
```

```{r}
#install.packages('Seurat')
library(Seurat)
library(tidyverse)
library(patchwork)

# load control and high-glucose data set
ctl.data <- Read10X(data.dir = "GSE276251_RAW/control/")
glucose.data <- Read10X(data.dir = "GSE276251_RAW/glucose/")

# Initialize the Seurat object with the raw (non-normalized data).
ctrl <- CreateSeuratObject(counts = ctl.data, project= 'ctrl'
                         , min.cells = 3, min.features = 200)
glucose <- CreateSeuratObject(counts = glucose.data, project = 'glucose',
                         , min.cells = 3, min.features = 200)

ctrl
glucose

```


```{r}
# Explore the metadata
head(ctrl@meta.data)
head(glucose@meta.data)
```

- `orig.ident`: this often contains the sample identity if known, but will default to “SeuratProject”
- `nCount_RNA`: number of UMIs per cell
- `nFeature_RNA`: number of genes detected per cell


## 1. Filter out low-quality cells

- Low-quality cells or empty droplets often have very few genes (low nFeature_RNA and nCount_RNA)
- Cell doublets or multiplets have high values of nFeature_RNA & nCount_RNA
- Low-quality/dying cells often have high percentage of mitochondrial genes (percent.mt)


**Percentage of mitochondrial genes**

Seurat's `PercentageFeatureSet()` function calculates the proportion of transcripts mapping to mitochondrial genes. For each cell, it sums the counts across all genes (features) belonging to the mitochondrial set, which is identified by the "MT-" or "mt-" prefix.

NOTE: The pattern "^MT-" is used for human gene names, while "^mt-" is used for non-human gene names.

```{r}
# Extract percentage of mitochondrial genes 
ctrl$percent.mt <- PercentageFeatureSet(object = ctrl, pattern = "^mt-")
glucose$percent.mt <- PercentageFeatureSet(object = glucose, pattern = "^mt-")
```

### Selecting cells for further analysis

**QC Metrics: "nFeature_RNA", "nCount_RNA", "percent.mt" **

```{r}
# Merge control and glucose-treated Seurat objects
combined_seurat <- merge(ctrl, y = glucose, add.cell.ids = c("Control", "Glucose"))


# Visualize QC metrics as a violin plot
VlnPlot(combined_seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

#FeatureScatter is typically used to visualize feature-feature relationships
plot1 <- FeatureScatter(combined_seurat, feature1 = "nCount_RNA", feature2 = "percent.mt") 
plot2 <- FeatureScatter(combined_seurat, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") 
plot1 + plot2

```

**1. nFeature_RNA (Number of genes detected per cell)**
- **Threshold considerations:**
  - **Lower threshold:** Cell with `nFeature_RNA < 200` are often considered low quality, as those typically are dying cells or empty droplets.
  - **Upper threshold:** Cells with unusually high numbers of detected genes may indicate doublets or multiplets, which confounding the clustering and can form small clusters that are confusing to interpret. 

**2. nCount_RNA (Total number of RNA detected per cell)**
- **Threshold considerations:**
  - **Lower threshold:** Cells with very few counts may have failed during library preparation or sequencing, or they may be dying cells.
  - **Upper threshold:** Cells with very high RNA counts could be doublets or multiplets.

*3. Percentage of mitochondrial gene expression:*
- *Threshold considerations:* A high percentage of mitochondrial gene expression can indicate stressed or dying cells. Typically, a threshold of 5–10% mitochondrial gene expression is used to filter out cells.

These thresholds are guidelines and can vary depending on the nature of the experiment, tissue type, and data quality. We need to tweak these values after visual inspection of the plots. 

From the plot, it seems that the percentage of mitochondrial genes is higher in some cells. In this case, the percentage of mitochondrial gene in control samples falls on about 20%, which may indicate the quality of this experiment is not very good.

We follow Wang's paper to filter the cells by these two criteria: 
1. Selecting single cells with the number of expressed genes ranging between 500 and 3000.
2. Requiring the percentage of sequencing reads on mitochondrial genome to be less than 5 %.

```{r}
# Filter out cells
ctrl <- subset(ctrl, subset = nFeature_RNA > 500 & nFeature_RNA < 3000 & percent.mt < 5)
glucose <- subset(glucose, subset = nFeature_RNA > 500 & nFeature_RNA < 3000 & percent.mt < 5)
```

Visualize QC metrics again after filtering cells.

```{r}
# Merge control and glucose-treated Seurat objects
combined_seurat <- merge(ctrl, y = glucose, add.cell.ids = c("Control", "Glucose"))

# Visualize QC metrics as a violin plot
VlnPlot(combined_seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

#FeatureScatter is typically used to visualize feature-feature relationships
plot1 <- FeatureScatter(combined_seurat, feature1 = "nCount_RNA", feature2 = "percent.mt") 
plot2 <- FeatureScatter(combined_seurat, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") 
plot1 + plot2

```
 
 The correlation between `percent.mt` and `nCount_RNA` improved (R squared -0.26 vs. -0.06) ater filtering high mitochonrial content cells, which represeting dying cells.


### Normalize, High Variable Features, Scaling Data
Single-cell RNA-seq data often exhibit variability in sequencing depth across cells. To prepare for clustering analysis, we need to perform several pre-processing steps:
1. Normalize the data to account for differences in sequencing depth.  
2. Identify highly variable features (genes) that capture biological variation.  
3. Scale the data to ensure all genes contribute equally to downstream analyses.  

```{r}
# Normalize the data
ctrl <- NormalizeData(ctrl)
glucose <- NormalizeData(glucose)

# Find variable features for each dataset
ctrl <- FindVariableFeatures(ctrl, selection.method = "vst", nfeature = 2000)
glucose <- FindVariableFeatures(glucose, selection.method = "vst", nfeature = 2000)

# Scale the data
ctrl <- ScaleData(ctrl) # by default 2000 identified variable features
all.genes <- rownames(ctrl)
ctrl <- ScaleData(ctrl, features = all.genes)

glucose <- ScaleData(glucose)
all.genes <- rownames(glucose)
glucose <- ScaleData(glucose, features = all.genes)

```


## 2. Dimension Reduction and Cell Clustering

### PCA

**PCA (Principal Component Analysis)** is performed after the pre-processing workflow in scRNA-seq analysis to reduce dimensionality and identify key sources of variability in the dataset. 

```{r}
# Perform PCA on the scaled data
ctrl <- RunPCA(ctrl, features = VariableFeatures(object = ctrl))
glucose <- RunPCA(glucose, features = VariableFeatures(object = glucose))

```

###Determine the "Dimension" of the dataset

```{r}
ElbowPlot(ctrl)
ElbowPlot(glucose)

```

In the elbow plots above, and "elbow" around PC10-15. Here I chose PC10 for downstream analysis.


### Cluster the cells

- **Low resolution (0.1):** Produces fewer clusters, meaning the algorithm combines more cells into larger groups. These clusters represent broader, coarser cell communities with fewer distinctions.
- **Higher resolution (0.5):** Results in more clusters, meaning the algorithm splits cells into smaller, more fine-grained clusters. This increases the granularity and helps capture subtle differences between cell types or states.

The Louvain algorithm detects communities (clusters) based on cell similarity, so at different resolutions, it identifies different numbers of clusters.


```{r}
# Clustering for control dataset
ctrl <- FindNeighbors(ctrl, dims = 1:10)
ctrl <- FindClusters(ctrl, resolution = 0.5)
ctrl <- RunUMAP(ctrl, dims = 1:10)

# Clustering for glucose-treated dataset
glucose <- FindNeighbors(glucose, dims = 1:10)
glucose <- FindClusters(glucose, resolution = 0.5)
glucose <- RunUMAP(glucose, dims = 1:10)

```

```{r}
# Visualize clusters
DimPlot(ctrl, reduction = "umap", label = TRUE) + ggtitle("Control")
DimPlot(glucose, reduction = "umap", label = TRUE) + ggtitle("High Glucose")

```


## 3. Integrate the contrl and high glucose-treated datasets

Data integration in single-cell RNA-seq analysis is performed to reduce **batch effects** and ensure that cells from different conditions, samples, or experiments can be compared on the same scale.

*Anchor-based integration*: Anchors between datasets are identified to integrate them based on shared biological variation.

```{r}
# Create list of Seurat objects
zebrafish.list <- list(ctrl, glucose)

# Normalize the data for each dataset independently
zebrafish.list <- lapply(X = zebrafish.list, FUN = function(x) {
  x <- NormalizeData(x)
  x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
  return(x)
})


#Select features that are repeatedly variable across datasets for integration
features <- SelectIntegrationFeatures(object.list = zebrafish.list)

# Identify integration anchors between control and glucose datasets
anchors <- FindIntegrationAnchors(object.list = zebrafish.list, 
                                  anchor.features = features)

# Integrate the datasets based on the identified anchors
zebrafish.integrated <- IntegrateData(anchorset = anchors)

# Specify that we will perform downstream analysis on the integrated data 
DefaultAssay(zebrafish.integrated) <- "integrated"

# Run PCA on the integrated dataset
zebrafish.integrated <- ScaleData(zebrafish.integrated)
zebrafish.integrated <- RunPCA(zebrafish.integrated)
ElbowPlot(zebrafish.integrated, ndims = 50)


# Perform clustering on the integrated data
zebrafish.integrated <- FindNeighbors(zebrafish.integrated, reduction = "pca", dims = 1:30)
zebrafish.integrated <- FindClusters(zebrafish.integrated, resolution = 0.5)
zebrafish.integrated <- RunUMAP(zebrafish.integrated, dims = 1:30)


# Plot UMAP for the integrated data
DimPlot(zebrafish.integrated, reduction = "umap", label = TRUE, repel = TRUE)


```

```{r}
# Compare
DimPlot(zebrafish.integrated, reduction = "umap", split.by = 'orig.ident')

```


## 4. Identify differentially expressed genes (DEGs) between the control and treated zebrafishes for each cell cluster (0, 1, 2, etc.)

**Differential Expression Analysis**

```{r}
# pseudobulk cells by stimulation condition AND cell type
bulk <- AggregateExpression(zebrafish.integrated, group.by = c("orig.ident", "seurat_clusters"), return.seurat = TRUE)
Cells(bulk)

```

```{r}
# Perform differential expression analysis (using DESeq2) on pseudobulk data 

# Extract count data and metadata from the pseudobulk Seurat object
counts <- GetAssayData(bulk, assay= "RNA", slot = "counts")
counts_dense <- as.matrix(counts)

# Extract columnn names
column_names <- colnames(counts_dense)

# Split the column names to create metadata for treatment and cluster
meta_data <- data.frame(
  orig.ident = sapply(strsplit(column_names, "_"), '[', 1), # First word before control or glucose
  seurat_clusters = sapply(strsplit(column_names, "_"), '[', 2) # Second part is the cluster number
)

# Ensure that row names of the meatadata match the col names of counts matrix
rownames(meta_data) <-  column_names

```


```{r}
library(DESeq2)

# Convert seurat_clusters to a factor
meta_data$seurat_clusters <- as.factor(meta_data$seurat_clusters)

# Create DESeq2 dataset
dds <- DESeqDataSetFromMatrix(countData = counts_dense, 
                              colData = meta_data,
                              design = ~ orig.ident + seurat_clusters)


# filter
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]

# Run DESeq2 differential expression analysis
dds <- DESeq(dds)

# Perform pairwise comparison for control vs glucose within each cluster
# Loop over unique clusters to extract DEGs per cluster
for (cluster in levels(meta_data$seurat_clusters)) {
  # Subset the DESeq2 dataset by cluster
  dds_cluster <- dds[ , meta_data$seurat_clusters == cluster]
  
  # Perform differential expression analysis for this cluster
  res <- results(dds_cluster, contrast = c("orig.ident", "ctrl", "glucose"))
  
  # Print the DEGs for this cluster
  cat(paste("Cluster:", cluster, "\n"))
  print(head(res))
}


```