scRNA sequencing analysis

This repo presents steps needed to make sense of single-cell RNA sequencing (scRNA) data. I used a scRNA dataset coming from Zhaohui Chen et al. paper published in Nature Communications 11, Article number: 5077 (2020). The cohort consisted of eight primary bladder tumor tissues (2 low-grade bladder urothelial tumors, and six high-grade bladder urothelial tumors) along with 3 adjacent normal mucosae. In SRA datasets are under BioProject PRJNA662018 and SRA-explorer can be used to download the data. For practical scRNA-seq analysis, I followed this elegant tutorial from Harvard Chan Bioinformatics Core.

NOTE If you like to work with cellranger count outputs and start analysis using Seurat you can download them from this link.

Contents:

Data download on ComputeCanada
Generating feature-sample expression matrix
Background noise removal
Loading single-cell RNA-seq count data
Quality control
Normalization, regressing out unwanted variation and clustering
Marker identification and cell type assignment
Comparing muscle invasive BLCA vs. Non-muscle invasive BLCA
Analyzing epithelial (EPCAM +) cells
DE and enrichment analysis
Trajectory inference
RNA velocity analysis. Materials related to RNA velocity analysis have moved here

1) Data download on ComputeCanada

#!/usr/bin/env bash

#!/bin/bash
#SBATCH --account=#
#SBATCH --job-name=scRNA__DL
#SBATCH --qos=privileged
#SBATCH --nodes=1                # number of Nodes
#SBATCH --tasks-per-node=4        # number of MPI processes per node
#SBATCH --mem 8g
#SBATCH --time 12:00:00
#SBATCH --output=scRNA_fastq_DL.%J.out
#SBATCH --error=scRNA_fastq.%J.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=#


curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/089/SRR12603789/SRR12603789_1.fastq.gz -o SRR12603789_urinary_bladder_cancer_fresh_sample_of_patient_number_02_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/089/SRR12603789/SRR12603789_2.fastq.gz -o SRR12603789_urinary_bladder_cancer_fresh_sample_of_patient_number_02_2.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/087/SRR12603787/SRR12603787_1.fastq.gz -o SRR12603787_urinary_bladder_cancer_fresh_sample_of_patient_number_03_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/087/SRR12603787/SRR12603787_2.fastq.gz -o SRR12603787_urinary_bladder_cancer_fresh_sample_of_patient_number_03_2.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/086/SRR12603786/SRR12603786_1.fastq.gz -o SRR12603786_urinary_bladder_cancer_fresh_sample_of_patient_number_04_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/086/SRR12603786/SRR12603786_2.fastq.gz -o SRR12603786_urinary_bladder_cancer_fresh_sample_of_patient_number_04_2.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/090/SRR12603790/SRR12603790_1.fastq.gz -o SRR12603790_urinary_bladder_cancer_fresh_sample_of_patient_number_01_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/090/SRR12603790/SRR12603790_2.fastq.gz -o SRR12603790_urinary_bladder_cancer_fresh_sample_of_patient_number_01_2.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/088/SRR12603788/SRR12603788_1.fastq.gz -o SRR12603788_normal_bladder_mucosa_3_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/088/SRR12603788/SRR12603788_2.fastq.gz -o SRR12603788_normal_bladder_mucosa_3_2.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/084/SRR12603784/SRR12603784_1.fastq.gz -o SRR12603784_urinary_bladder_cancer_fresh_sample_of_patient_number_06_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/084/SRR12603784/SRR12603784_2.fastq.gz -o SRR12603784_urinary_bladder_cancer_fresh_sample_of_patient_number_06_2.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/085/SRR12603785/SRR12603785_1.fastq.gz -o SRR12603785_urinary_bladder_cancer_fresh_sample_of_patient_number_05_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/085/SRR12603785/SRR12603785_2.fastq.gz -o SRR12603785_urinary_bladder_cancer_fresh_sample_of_patient_number_05_2.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/083/SRR12603783/SRR12603783_1.fastq.gz -o SRR12603783_urinary_bladder_cancer_fresh_sample_of_patient_number_07_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/083/SRR12603783/SRR12603783_2.fastq.gz -o SRR12603783_urinary_bladder_cancer_fresh_sample_of_patient_number_07_2.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/082/SRR12603782/SRR12603782_1.fastq.gz -o SRR12603782_urinary_bladder_cancer_fresh_sample_of_patient_number_08_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/082/SRR12603782/SRR12603782_2.fastq.gz -o SRR12603782_urinary_bladder_cancer_fresh_sample_of_patient_number_08_2.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/081/SRR12603781/SRR12603781_1.fastq.gz -o SRR12603781_normal_bladder_mucosa_1_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/081/SRR12603781/SRR12603781_2.fastq.gz -o SRR12603781_normal_bladder_mucosa_1_2.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/080/SRR12603780/SRR12603780_1.fastq.gz -o SRR12603780_normal_bladder_mucosa_2_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR126/080/SRR12603780/SRR12603780_2.fastq.gz -o SRR12603780_normal_bladder_mucosa_2_2.fastq.gz

It is always recommended to run QC on samples; so using the following code we can run fastqc on all samples

#!/bin/bash
#SBATCH --account= # add your account name 
#SBATCH --job-name=fastqc
#SBATCH --qos=privileged
#SBATCH --nodes=12                # number of Nodes
#SBATCH --tasks-per-node=5        # number of MPI processes per node
#SBATCH --mem 16g
#SBATCH --time 24:00:00
#SBATCH --output=fastqc.%J.out
#SBATCH --error=fastqc.%J.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=# add your email address 


module load fastqc

fastqc ./fastq/*.fastq.gz --outdir ./qc_raw -f fastq -t 12 --nogroup

There are differences between the sequence length of read 1 and read 2 for each sample; read 1 provides data on Cell barcode & UMI and read 2 has the insert sequence. Files coming from SRA usually come with names like something_1_fastq.gz and something_2_fastq.gz . This blog is helpful to see what are the naming requirements of fastq files for cellranger tool. Briefly;

incompatible file name: SRR9291388_1.fastq.gz
compatible file name: SRR9291388_S1_L001_R1_001.fastq.g

So we need to change file names and also set up the directories for cellranger count;

for file in *.fastq.gz
do
  dir=${file%_*}
  echo $dir
  mkdir $dir
  mv $file ./$dir
done

2) Generating feature-sample expression matrix

The next step is to run cellranger count on each sample. There are different types of scRNA libs with different fastq files. This blog provides details on the type of libs from 10X genomics.

Running cellranger on a cluster with SLURM as job scheduler is not an easy task. The following is what I came up with working best for me working on ComputeCanada cluster:

#!/bin/bash
#SBATCH --account=#
#SBATCH -J test_cellranger
#SBATCH --export=ALL
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --signal=2
#SBATCH --no-requeue
### Alternatively: --ntasks=1 --cpus-per-task={NUM_THREADS}
###   Consult with your cluster administrators to find the combination that
###   works best for single-node, multi-threaded applications on your system.
#SBATCH --mem=250G
#SBATCH --time 48:00:00
#SBATCH --output=test_cellranger.%J.out
#SBATCH --error=test_cellranger.%J.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=#


module load cellranger

cd /home/ghaedi/projects/def-gooding-ab/ghaedi/sc/raw
for d in *
do
echo $d
ID=${d:0:11}
cellranger count --id=$ID \
                 --transcriptome=/home/ghaedi/projects/def-gooding-ab/ghaedi/sc/refdata-gex-GRCh38-2020-A \
                 --fastqs=$d
                 --chemistry=SC3Pv3
done

The next steps are mainly based on Harvard Chan Bioinformatics Core materials on scRNA-seq analysis training.

3) Background noise removal

Background removal is crucial in single-cell RNA sequencing (scRNA-seq) analysis due to the presence of contaminants such as cell-free "ambient" RNA and chimeric cDNA molecules resulting from barcode swapping. Barcode swapping can happen when cDNA molecules from different beads, which carry different barcodes, mix together during the amplification step. This can occur due to unremoved oligonucleotides from other beads or incomplete extension of PCR products. As a result, chimeric cDNA molecules are formed, where the barcode and UMI sequences are "swapped" between different beads. When these chimeric molecules are sequenced, the assigned barcode does not match the original bead, leading to the incorrect assignment of the cDNA to the corresponding cell.

These contaminants contribute to background noise, which can negatively impact data analysis. Background noise reduces the separability of cell-type clusters, interferes with differential expression analysis, and confounds comparisons between samples. To address this, algorithms like SoupX, DecontX, and CellBender estimate and correct for background noise using marker genes, empty droplets, and modeling of barcode swapping. Evaluating method performance involves assessing if the model effectively removes expression from other cell types using datasets with known cell type profiles and exclusive marker genes. However, the impact of background correction on expression level shifts induced by background noise remains a challenge in scRNA-seq analysis.

As recommended by Philipp Janssen et al,2023:

In summary, for marker gene analysis, we would always recommend background removal, but for classification, clustering and pseudotime analyses, we would only recommend background removal when background noise levels are high

In this repo, I won't be using the cellbender filtered matrix, as the main focus of the analysis is cell clustering, rather than marker identification. The following is a fully functional cellbender code for the given samples

# Running cellbender image on co pute canada 
##pulling the image (creating sif file):
module load apptainer
apptainer pull cellbender.sif docker://us.gcr.io/broad-dsde-methods/cellbender:0.2.2

Sbatch script to run cellbender on Graham with GPU

#!/bin/bash
#SBATCH --account=def-gooding-ab
#SBATCH --job-name=cellBender
#SBATCH --nodes=1                # number of Nodes
#SBATCH --gpus-per-node=p100:2
#SBATCH --ntasks-per-node=32       
#SBATCH --mem=127000M
#SBATCH --time 08:00:00
#SBATCH --output=cellBender.%J.out
#SBATCH --error=cellBender.%J.err
#SBATCH --mail-type=ALL
#SBATCH --mail-user=qaedi.65@gmail.com

module load apptainer

# Define the input and output directories
input_dir="/home/ghaedi/projects/def-gooding-ab/ghaedi/sc/filtered"
output_dir="/home/ghaedi/projects/def-gooding-ab/ghaedi/sc/cellBender_filtered"

# Iterate over the sample directories
for sample_dir in "$input_dir"/*; do
    if [[ -d "$sample_dir" ]]; then
        sample_name=$(basename "$sample_dir")
        echo "Processing sample: $sample_name"

        # Create a directory for the current sample within the output directory
        sample_output_dir="$output_dir/$sample_name"
        mkdir -p "$sample_output_dir"

        # Run cellbender remove-background command for the current sample
        apptainer run -C -W ${SLURM_TMPDIR} --nv --rocm \
        --bind "$input_dir":"/mnt/filtered" \
        --bind "$sample_output_dir":"/mnt/sample_output" cellbender.sif \
        cellbender remove-background \
        --input "/mnt/filtered/$sample_name" \
        --output "/mnt/sample_output/$sample_name.h5" \
        --expected-cells 7000 \
        --total-droplets-included 20000 \
        --fpr 0.01 \
        --epochs 150 \
        --cuda

        echo "Completed processing sample: $sample_name"
        echo
    fi
done

4) Loading single-cell RNA-seq count data

# Setup the Seurat Object

library(tidyverse)
library(Seurat)
library(patchwork)
library(cowplot)

# create list of samples
samples <- list.files("~/scRNA/filtered/")
#samples <- samples[grepl('^filtered',samples,perl=T)]

# read files inot Seurat objects
for (file in samples){
  print(paste0(file))
  seurat_data <- Read10X(data.dir = paste0("~/scRNA/filtered/", file))
  seurat_obj <- CreateSeuratObject(counts = seurat_data, 
                                   min.features = 100, 
                                   project = file)
  assign(file, seurat_obj)
}

# now merging all objects inot one Seurat obj

merged_seurat <- merge(x = SRR12603780, 
                       y = c(SRR12603781,
                             SRR12603782,
                             SRR12603783,
                             SRR12603784,
                             SRR12603785,
                             SRR12603786,
                             SRR12603787,
                             SRR12603788,
                             SRR12603789,
                             SRR12603790),
                       #Because the same cell IDs can be used for different samples, we add a sample-specific prefix 
                       # to each of our cell IDs using the add.cell.id argument.
                       add.cell.id = samples)

5) Quality control

There are columns in the metadata:

orig.ident: this column will contain the sample identity if known. It will default to the value we provided for the project argument when loading in the data
nCount_RNA: represents the number of UMIs per cell. UMI (unique molecular identifiers) is used to determine whether a read is a biological or technical duplicate (PCR duplicate). Reads with different UMIs mapping to the same transcript were derived from different molecules and are biological duplicates - each read should be counted. Reads with the same UMI originated from the same molecule and are technical duplicates - the UMIs should be collapsed to be counted as a single read.
nFeature_RNA: represents the number of genes detected per cell

Recommended features to add to metadata:

number of genes detected per UMI (or novelty score): more genes detected per UMI, more complex our data
mitochondrial ratio: this metric will give us a percentage of cell reads originating from the mitochondrial genes (coming from dying cells)

# Explore merged metadata
View(merged_seurat@meta.data)


#Add number of genes per UMI for each cell to metadata
merged_seurat$log10GenesPerUMI <- log10(merged_seurat$nFeature_RNA) / log10(merged_seurat$nCount_RNA)

# Compute percent mito ratio
merged_seurat$mitoRatio <- PercentageFeatureSet(object = merged_seurat, pattern = "^MT-")
merged_seurat$mitoRatio <- merged_seurat@meta.data$mitoRatio / 100

# Create metadata dataframe
metadata <- merged_seurat@meta.data
# Add cell IDs to metadata
metadata$cells <- rownames(metadata)
# adding sample type to metadata. The orginal file could be download from SRA explorer

SampleType <- c("BLCA", "BLCA", "Normal", "BLCA", "BLCA", "BLCA", "BLCA", "BLCA", "BLCA", "Normal", "Normal")
# sample type with grade (Not tested)
#SampleType <- c("BLCA_LG", "BLCA_LG", "Normal", "BLCA_HG", "BLCA_HG", "BLCA_HG", "BLCA_HG", "BLCA_HG", "BLCA_HG", "Normal", "Normal")

names(SampleType) <- c("SRR12603789", "SRR12603790", "SRR12603788", "SRR12603787", "SRR12603786", "SRR12603785", "SRR12603784", "SRR12603783", "SRR12603782", "SRR12603781", "SRR12603780")

metadata$sampleType <- stringr::str_replace_all(metadata$orig.ident, SampleType)


# Rename columns
metadata <- metadata %>%
  dplyr::rename(seq_folder = orig.ident,
                nUMI = nCount_RNA,
                nGene = nFeature_RNA,
                sample = sampleType)

# Add metadata back to Seurat object
merged_seurat@meta.data <- metadata

# Create .RData object to load at any time
save(merged_seurat, file="~/scRNA/merged_filtered_seurat.RData")

#

Cell counts per sample

# Visualize the number of cell counts per sample
bqcc <- metadata %>% 
  ggplot(aes(x=seq_folder, fill=sample)) + 
  geom_bar() +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  theme(plot.title = element_text(hjust=0.5, face="bold")) +
  ggtitle("NCells before QC")

UMI per cell

Typically, we expect the UMI counts per cell to be higher than 500, which is the lower limit of the expected range. If the UMI counts range between 500-1000, the data is still usable, but deeper sequencing may have been beneficial for these cells.

# Visualize the number UMIs/transcripts per cell
metadata %>% 
  ggplot(aes(x=nUMI, fill= sample)) + 
  geom_density(alpha = 0.5) + 
  scale_x_log10() + 
  theme_classic() +
  ylab("Cell density") +
  facet_wrap(~seq_folder) +
  geom_vline(xintercept = 1000) +
  labs(fill = "Sample")

So the cells have way more than 1K UMI!

Genes detected per cell

In scRNA-seq, the number of genes detected per cell is a crucial quality metric that we expect to be similar to the UMI detection, albeit slightly lower. For high-quality data, the proportional histogram of genes detected per cell should show a single peak that represents encapsulated cells. However, if there is a small shoulder or a bimodal distribution to the left of the main peak, this could indicate a few things. It could be due to some failed cells or biologically different cell types, such as quiescent cell populations or less complex cells of interest. For instance, larger cells or different cell types may have higher gene counts.

# Visualize the distribution of genes detected per cell via histogram
metadata %>% 
  ggplot(aes(x=nGene, fill= sample)) + 
  geom_density(alpha = 0.5) + 
  scale_x_log10() + 
  theme_classic() +
  ylab("Cell density") +
  facet_wrap(~seq_folder) +
  geom_vline(xintercept = 500) +
  labs(fill = "Sample")

Novelty score

The novelty score, computed as the ratio of nGenes over nUMI, measures the complexity of RNA species in each cell. A low number of genes detected in a cell with many captured transcripts (high nUMI) indicates low complexity or novelty. This could be due to an artifact, contamination, or a specific cell type (e.g. red blood cells). A good-quality cell typically has a novelty score above 0.80.

# Visualize the overall complexity of the gene expression by visualizing the genes detected per UMI (novelty score)
metadata %>%
  ggplot(aes(x=log10GenesPerUMI, fill=sample)) +
  geom_density(alpha = 0.5) +
  theme_classic() +
  facet_wrap(~seq_folder) +
  xlab("Novelty Score") +
  geom_vline(xintercept = 0.8)

Mitochondrial gene expression detected per cell

High levels of expression from mitochondria indicate dying or dead cells. Basically, quality samples are those that surpass 0.2 mitochondria ratio mark.

# Visualize the distribution of mitochondrial gene expression detected per cell
metadata %>%
  ggplot(aes(x=mitoRatio, fill=sample)) + 
  geom_density(alpha = 0.5) + 
  scale_x_log10() + 
  scale_x_continuous(labels = function(x) sprintf("%.1f", x)) + 
  theme_classic() +
  facet_wrap(~seq_folder) +
  geom_vline(xintercept = 0.2)

Joint filtering: nUMI, nGene and mitoRatio

# Visualize the correlation between genes detected and number of UMIs and determine whether strong presence of cells with low numbers of genes/UMIs

metadata %>% 
  	ggplot(aes(x=nUMI, y=nGene, color=mitoRatio)) + 
  	geom_point() + 
  	scale_colour_gradient(low = "gray90", high = "black") +
  	stat_smooth(method=lm) +
  	scale_x_log10() + 
  	scale_y_log10() + 
  	theme_classic() +
  	geom_vline(xintercept = 1000) +
  	geom_hline(yintercept = 500) +
  	facet_wrap(~seq_folder)

There are samples that shows high-quality cells; high nUMI, high nGene, low number of cells with high mitoRatio and also there are some samples that would clearly benefit from filtering, as they have low-quality cells. We expect to see that dying cells to show high level of mitoRatio and low nUMI and nGene .

Basically, it is not uncommon to observe cells with high numbers of UMIs and nGene, but also high mitoRatio. These cells may be stressed or damaged, but they could also represent a heterogeneous population of cells with distinct metabolic states.

To investigate the potential cause of high mitochondrial expression ratios, it is important to examine the expression of specific mitochondrial genes and compare them to other genes in the cell. If the expression of mitochondrial genes is elevated relative to other genes, this could suggest mitochondrial dysfunction. Additionally, examining the expression of other stress or damage markers, such as heat shock proteins or cell cycle genes, can also provide insight into the health and state of the cell.

Filtering

Cell-level filtering

-nUMI > 1000

-nGene > 500

-log10GenesPerUMI > 0.8

-mitoRatio < 0.2

# Filter out low-quality cells using selected thresholds - these will change with the experiment
filtered_seurat <- subset(merged_seurat, 
                          subset= nUMI >= 1000 &
                          nGene >= 500 &
                          nGene <= 6000 & 
                          log10GenesPerUMI > 0.80 & 
                          mitoRatio < 0.10)

Gene-level filtering

Keep only genes that are expressed in 100 or more cells (usually that is 10)

# Extract counts
counts <- GetAssayData(object = filtered_seurat, slot = "counts")

# Output a logical matrix specifying for each gene whether or not there are more than zero counts per cell
nonzero <- counts > 0
# Sums all TRUE values and returns TRUE if more than 100 TRUE values per gene
keep_genes <- Matrix::rowSums(nonzero) >= 100

# Only keeping those genes expressed in more than 100 cells
filtered_counts <- counts[keep_genes, ]
# Reassign to filtered Seurat object
filtered_seurat <- CreateSeuratObject(filtered_counts, meta.data = filtered_seurat@meta.data)

# Create.RData object to load at any time
save(filtered_seurat, file="seurat_filtered.RData")

Re-assess QC metrics

# Save filtered subset to new metadata
metadata_clean <- filtered_seurat@meta.data

# to see a drop in filtering cells:

met_before <- data.frame(unclass(table(metadata$seq_folder)))
met_before$QCgroup <- "before"
met_before$cell<- rownames(met_before)
names(met_before)[1] <- 'count'

met_after <- data.frame(unclass(table(metadata_clean$seq_folder)))
met_after$QCgroup <- "after"
met_after$cell<- rownames(met_after)
names(met_after)[1] <- 'count'
# count
cell_count <- data.frame(rbind(met_before, met_after))

                                
# visualization :
cell_count %>% ggplot(aes(x=cell, y=count, fill=QCgroup)) + 
  geom_bar(stat="identity", position=position_dodge()) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  theme(plot.title = element_text(hjust=0.5, face="bold")) +
  scale_fill_manual(values = c("#808080", "#FFBF00")) +
  xlab("samples") +
  ggtitle("nCells count before and after QC")

# Visualize the correlation between genes detected and the number of UMIs and determine whether the strong presence of cells with low numbers of genes/UMIs

metadata_clean %>% 
  	ggplot(aes(x=nUMI, y=nGene, color=mitoRatio)) + 
  	geom_point() + 
  	scale_colour_gradient(low = "gray90", high = "black") +
  	stat_smooth(method=lm) +
  	scale_x_log10() + 
  	scale_y_log10() + 
  	theme_classic() +
  	geom_vline(xintercept = 1000) +
  	geom_hline(yintercept = 500) +
  	facet_wrap(~seq_folder)

6) Normalization, regressing out unwanted variation and clustering

The ultimate goal is to define clusters of cells and identify cell types in the samples. To achieve this, there are several steps:

1-Identify unwanted variability by exploring data and covariates such as cell cycle and mitochondrial gene expression. Both biological sources of variation (e.g. effect of cell cycle on transcriptome) and technical sources should be explored and accounted for. In an early version of Seurat, one needs to normalize data, find variable features, and then scale data while setting a variable like mitochondrial contamination or cell cycle stage to be regressed out. So here is the code for doing these steps to mitigate cell cycle stage effects on the dataset, however, a newer function in Seurat has automated all of these steps (SCTtansform()).

2-Normalize and remove unwanted variability using Seurat's sctransform function. The normalization step is necessary to make expression counts comparable across genes and/or samples. The counts of mapped reads for each gene are proportional to the expression of RNA (“interesting”) in addition to many other factors (“uninteresting” such as sequencing depth and gene length). Normalization is the process of adjusting raw count values to account for the “uninteresting” factors. For simplicity, normalization is assumed as two a two-step process: scaling and transforming. In scaling the goal is to multiply each UMI count by a cell-specific factor to get all cells to have the same UMI counts. For transformation, simple approaches like log-transformation showed to be not that useful, especially in the case of genes with high expression but showed decent performance for low/intermediate expressed genes. So we cannot treat all genes the same. The proposed solution for data transformation is Pearson residuals (implemented in Seurat's SCTransform function), which applies a gene-specific weight to each measurement based on the evidence of non-uniform expression across cells. This weight is higher for genes expressed in a smaller fraction of cells, making it useful for detecting rare cell populations. The weight takes into account not just the expression level but also the distribution of expression.

3- Integrate data using Seurat's method to compare cell type expression between groups.

4-Cluster cells based on similarity of gene expression profiles using Seurat's PCA scores.

5-Evaluate cluster quality by checking for sources of uninteresting variation, and principal component influence, and exploring cell type identities using known markers.

Exploring sources of unwanted variation

Evaluating effects of cell cycle and mitochondrial expression

We will store the cells for cell cycle genes, and then determine whether cell cycle is a major source of variation in our dataset using PCA.

# Normalize the counts
# This normalization method is solely for the purpose of exploring the sources of variation in our data.
seurat_phase <- NormalizeData(filtered_seurat, normalization.method = "LogNormalize", scale.factor = 10000)

# Load cell cycle markers
load("C:/Users/qaedi/OneDrive - Queen's University/Documents/scRNA/cycle.rda")

# Score cells for cell cycle
seurat_phase <- CellCycleScoring(seurat_phase, 
                                 g2m.features = g2m_genes, 
                                 s.features = s_genes)

# View cell cycle scores and phases assigned to cells                                 
#View(seurat_phase@meta.data) 
table(seurat_phase$Phase)

Cells in different cell cycle stages:

G1	G2M	S
54106	12680	25441

So most of the cells are in G1 and then S, which make sense.

# Identify the most variable genes and scale them
seurat_phase <- FindVariableFeatures(seurat_phase, 
                     selection.method = "vst",
                     nfeatures = 2000, 
                     verbose = TRUE)
                     
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(seurat_phase), 10)

# plot variable features with and without labels
plot1 <- VariableFeaturePlot(seurat_phase)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2

# Check quartile values for mitoRatio, we will use this variable later to mitigate the unwanted sources of variation in the dataset
summary(seurat_phase@meta.data$mitoRatio)

# Turn mitoRatio into categorical factor vector based on quartile values
seurat_phase@meta.data$mitoFr <- cut(seurat_phase@meta.data$mitoRatio, 
                   breaks=c(-Inf, 0.015, 0.025, 0.045, Inf), 
                   labels=c("Low","Medium","Medium high", "High"))



# Scale the counts
# This step is essential for PCA , clustering and heatmap generation
seurat_phase <- ScaleData(seurat_phase)
#saveRDS(seurat_phase, "seurat_phase.rds")

# Perform PCA
seurat_phase <- RunPCA(seurat_phase)

# Plot the PCA colored by cell cycle phase
no_split <- DimPlot(seurat_phase,
        reduction = "pca",
        group.by= "Phase")
        
with_split <- DimPlot(seurat_phase,
        reduction = "pca",
        group.by= "Phase",
        split.by= "Phase")
        
no_split + with_split

For mitochondrial expression:

# Plot the PCA colored by mitochondrial expression
no_split <- DimPlot(seurat_phase,
        reduction = "pca",
        group.by= "mitoFr")
        
with_split <- DimPlot(seurat_phase,
        reduction = "pca",
        group.by= "mitoFr",
        split.by= "mitoFr")
        
no_split + with_split

Based on the above plots, we can see that cells are scattered regardless of their cell cycle phase and mitochondrial genes expression level. So there is no need to regress out the effect of cell cycle and mitochondrial expression in this dataset.

SCTransform

This function is useful for normalization and regressing sources of unwanted variation at the same time. The method constructs a generalized linear model (GLM) for each gene, using UMI counts as the response variable and sequencing depth as the explanatory variable. To handle the fact that different genes have different levels of expression, information is pooled across genes with similar abundances, resulting in more accurate parameter estimates.

This regularization process yields residuals, which represent effectively normalized data values that are no longer correlated with sequencing depth.

This method is a more accurate method of normalizing, estimating the variance of the raw filtered data, and identifying the most variable genes. In practice SCTransform single command replaces NormalizeData(), ScaleData(), and FindVariableFeatures(). Since we have two groups of samples we will run SCTransform on each group after doing "integration".

Integration

To improve clustering and downstream analyses, it can be beneficial to integrate or align samples across groups using shared highly variable genes. If cells cluster by sample, condition, batch, dataset, or modalities(scRNA, scATAC-seq), integration can help to remove these unwanted sources of variation. For example, if we want to integrate normal samples together and BLCA samples together, we should keep each sample as a separate object and transform them accordingly for integration. This is necessary to ensure that the samples are properly aligned and that downstream analyses are meaningful. If cell types are present in one dataset, but not the other, then the cells will still appear as a separate sample-specific cluster.

# adjust the limit for allowable object sizes within R
options(future.globals.maxSize = 4000 * 1024^2)

# Split seurat object by group
split_seurat <- SplitObject(seurat_phase, split.by = "sample")

# then normalize by SCTansform
for (i in 1:length(split_seurat)) {
    split_seurat[[i]] <- SCTransform(split_seurat[[i]], vars.to.regress = c("mitoRatio", "S.Score", "G2M.Score"))
    }

# to see what the component of the object are. 

split_seurat    
#$Normal
#An object of class Seurat
#47302 features across 21519 samples within 2 assays
#Active assay: SCT (23608 features, 3000 variable features)
# 1 other assay present: RNA
# 1 dimensional reduction calculated: pca

#$BLCA
#An object of class Seurat
#47388 features across 70708 samples within 2 assays
#Active assay: SCT (23694 features, 3000 variable features)
# 1 other assay present: RNA
# 1 dimensional reduction calculated: pca

    
# Select the most variable features to use for integration
integ_features <- SelectIntegrationFeatures(object.list = split_seurat, 
                                            nfeatures = 3000) 


# Prepare the SCT list object for integration
split_seurat <- PrepSCTIntegration(object.list = split_seurat, 
                                   anchor.features = integ_features)
                                   
# Find best buddies (using canonical correlation analysis or CCA) - can take a while to run
integ_anchors <- FindIntegrationAnchors(object.list = split_seurat, 
                                        normalization.method = "SCT", 
                                        anchor.features = integ_features)
# Integrate across conditions
seurat_integrated <- IntegrateData(anchorset = integ_anchors, 
                                   normalization.method = "SCT")
                                   
# Check assays in the object:
split_seurat$Normal@assays

Integration check

After normalization and integration, we can proceed to PCA and UMAP/t-SNE to see effect of integration.

# Run PCA
seurat_integrated <- RunPCA(object = seurat_integrated, verbose = TRUE)
# PC_ 1
# Positive:  IGFBP7, MGP, SPARC, VIM, SPARCL1, CALD1, IFITM3, A2M, COL4A1, COL1A2
#            COL4A2, TAGLN, COL6A2, BGN, TCF4, COL3A1, CCL2, MT2A, NNMT, COL1A1
#            MYL9, LGALS1, DCN, SELENOM, CCN1, ADAMTS9, CCN2, TIMP1, IGFBP4, LUM
# Negative:  KRT19, KRT13, CD52, S100P, RPS19, CLDN4, FXYD3, PSCA, TACSTD2, CSTB
#            PTPRC, KRT7, RPS18, KRT17, LYPD3, RPL41, CD3D, CCL5, RPS27, SFN
#            FABP5, SPINK1, GDF15, AQP3, HCST, TRAC, SNCG, ADIRF, ELF3, TRBC2
# PC_ 2
# Positive:  PLVAP, CALCRL, PCAT19, MCTP1, AQP1, VWF, CD74, PECAM1, RAMP2, LDB2
#            RAMP3, FLT1, ZNF385D, TCF4, CLDN5, ACKR1, HSPG2, SPARCL1, EMCN, ADGRL4
#            SLCO2A1, HLA-DRA, SELE, DOCK4, ECSCR, CCL14, HLA-DRB1, ERG, PCDH17, GNG11
# Negative:  COL1A2, COL3A1, COL1A1, TAGLN, DCN, LUM, BGN, COL6A2, C1R, SOD3
#            MFAP4, RARRES2, C1S, MYL9, TPM2, PRKG1, CRYAB, ACTA2, COL6A3, LGALS1
#            CALD1, COL6A1, TIMP1, SERPINF1, PCOLCE, AEBP1, C11orf96, MEG3, FBLN1, GPC6
# PC_ 3
# Positive:  CCL5, CD52, B2M, PTPRC, IL32, SRGN, CD3D, HSPA1A, NKG7, GZMA
#            TRAC, ARHGAP15, HCST, RGS1, CXCR4, SAMSN1, CCL4, CD7, RGS2, CORO1A
#            CD2, CRIP1, CST7, CD69, STAT4, FYN, TMSB4X, DNAJB1, PTPN22, CD3E
# Negative:  ADIRF, SPINK1, IFI27, CSTB, S100P, KRT19, FXYD3, CLDN4, CCT2, CCND1
#            PSCA, S100A6, SNCG, UCA1, KRT17, YEATS4, KRT7, KRT13, TACSTD2, GDF15
#            RAB3IP, KRT18, HES1, GAPDH, S100A2, PLVAP, RAMP2, S100A14, FABP5, TM4SF1
# PC_ 4
# Positive:  IL32, CD3D, CCL5, CRIP1, TRAC, FYN, COL4A1, COL4A2, CD2, GZMA
#            CD7, CALD1, IGFBP7, TRBC2, MCAM, NDUFA4L2, CD3E, NKG7, COL18A1, RGS5
#            SKAP1, MYL9, ITGA1, CYTOR, CD247, SPARC, TRBC1, MAP1B, ACTA2, CACNA1C
# Negative:  HLA-DRA, CD74, TYROBP, HLA-DRB1, HLA-DPB1, FCER1G, HLA-DPA1, AIF1, HLA-DQA1, FTL
#            LYZ, IFI30, C1QA, HLA-DQB1, C1QB, C1QC, MS4A6A, LST1, APOE, CD14
#            TMEM176B, S100A9, HLA-DMA, FCGR2A, CXCL8, SPI1, FTH1, CD68, PSAP, CST3
# PC_ 5
# Positive:  LUM, MMP2, PTGDS, EMP1, DCN, RARRES2, KRT13, SERPINF1, CXCL1, CXCL8
#            LSAMP, COL8A1, TM4SF1, AREG, C1S, PDPN, CFD, APOD, SOD2, CTSK
#            CLMP, KRT17, MFAP4, LYPD3, VCAN, PLAUR, TSHZ2, PLAT, PDGFRA, C1R
# Negative:  RGS5, NDUFA4L2, ACTA2, PPP1R14A, MYL9, TAGLN, FRZB, CRIP1, CALD1, GJA4
#            MCAM, COL18A1, MYH11, TYROBP, TPPP3, COX4I2, PRKG1, IGFBP7, COL4A1, COL4A2
#            CDH6, HIGD1B, AIF1, TPM2, PTP4A3, FCER1G, WFDC1, HEYL, MYLK, HLA-DRA



# Plot PCA
png(filename = "PCA_integrated.png", width = 16, height = 8.135, units = "in", res = 300)
PCAPlot(seurat_integrated,
        split.by = "sample")
dev.off()

# Run UMAP
seurat_integrated <- RunUMAP(seurat_integrated, 
                             dims = 1:40,
                             reduction = "pca",
                             verbose = TRUE)

# Plot UMAP 
png(filename = "UMAP_integrated.png", width = 16, height = 8.135, units = "in", res = 300)
DimPlot(seurat_integrated, split.by = "sample")
dev.off()

For the future: considering tumor grade and using Harmony for integration

Clustering cells based on top PCs (metagenes)

Identify significant PCs

For new methods like SCTransform it is not needed to calculate the number of PCs for clustering. However, older methods could not efficiently remove technical biases, so using them it was necessary to have some idea about the number of PCs that can capture most of the information in the dataset.

# Explore the heatmap of PCs
png(filename = "heatmap_integrated_2.png", width = 16, height = 8.135, units = "in", res = 300)
DimHeatmap(seurat_integrated, 
           dims = 1:9, 
           cells = 500, 
           balanced = TRUE)
dev.off()

# Printing out the most variable genes driving PCs
print(x = seurat_integrated[["pca"]], 
      dims = 1:10, 
      nfeatures = 5)

PC_ 1
Positive:  KRT19, CD52, KRT13, S100P, RPS19
Negative:  IGFBP7, MGP, SPARC, VIM, SPARCL1
PC_ 2
Positive:  PLVAP, PCAT19, CALCRL, MCTP1, AQP1
Negative:  COL1A2, COL3A1, COL1A1, TAGLN, DCN
PC_ 3
Positive:  CCL5, CD52, PTPRC, B2M, SRGN
Negative:  ADIRF, SPINK1, CSTB, S100P, KRT19
PC_ 4
Positive:  HLA-DRA, CD74, TYROBP, HLA-DRB1, HLA-DPB1
Negative:  CD3D, IL32, CCL5, CRIP1, TRAC
PC_ 5
Positive:  LUM, MMP2, PTGDS, RARRES2, DCN
Negative:  RGS5, NDUFA4L2, ACTA2, PPP1R14A, MYL9
PC_ 6
Positive:  SPINK1, CCT2, LCN15, UCA1, PLA2G2A
Negative:  KRT13, PLAUR, LYPD3, OLFM4, EMP1
PC_ 7
Positive:  FABP5, PLA2G2A, LCN15, FABP4, RPS19
Negative:  CCT2, ADIRF, SPINK1, HSPA1A, CCND1
PC_ 8
Positive:  FOS, HSPA1A, JUN, ZFP36, DNAJB1
Negative:  SPARC, COL4A1, INSR, CCL5, COL4A2
PC_ 9
Positive:  LCN15, PLA2G2A, FABP4, CRTAC1, LINC01088
Negative:  H19, RPS19, CRH, AP005230.1, PSCA
PC_ 10
Positive:  HSPA1A, HSPA1B, DNAJB1, HSP90AA1, MALAT1
Negative:  CRH, CCL5, LY6D, ACKR1, RPS19

# To determine how many Pcs should be considered for clustering:
# Plot the elbow plot
png(filename = "elbow.png", width = 16, height = 8.135, units = "in", res = 300)
ElbowPlot(object = seurat_integrated, 
          ndims = 40)
dev.off()

# to make it more quantitative :
# Determine percent of variation associated with each PC
pct <- seurat_integrated[["pca"]]@stdev / sum(seurat_integrated[["pca"]]@stdev) * 100

#pct
# [1] 7.9623240 6.3833277 5.7806955 4.4000340 3.9930564 3.6021757 3.5527516
# [8] 3.0283830 2.9108397 2.6331792 2.4437350 2.3524902 2.2441047 2.1753633
#[15] 2.0485619 2.0020779 1.9396254 1.8168114 1.7548723 1.6539166 1.6410801
#[22] 1.5884590 1.5223992 1.4950181 1.4627909 1.3921263 1.3815652 1.3046432
#[29] 1.2605721 1.2382860 1.2051052 1.1972413 1.1821032 1.1741361 1.1577815
#[36] 1.1294799 1.1190593 1.0968174 1.0591574 1.0519859 1.0396189 1.0211281
#[43] 1.0034793 0.9976230 0.9786685 0.9497183 0.9415772 0.9171901 0.9103623
#[50] 0.9025012


# Calculate cumulative percents for each PC
cumu <- cumsum(pct)

#cumu
# [1]   7.962324  14.345652  20.126347  24.526381  28.519438  32.121613
# [7]  35.674365  38.702748  41.613588  44.246767  46.690502  49.042992
#[13]  51.287097  53.462460  55.511022  57.513100  59.452725  61.269537
#[19]  63.024409  64.678326  66.319406  67.907865  69.430264  70.925282
#[25]  72.388073  73.780199  75.161764  76.466408  77.726980  78.965266
#[31]  80.170371  81.367612  82.549715  83.723852  84.881633  86.011113
#[37]  87.130172  88.226990  89.286147  90.338133  91.377752  92.398880
#[43]  93.402359  94.399982  95.378651  96.328369  97.269946  98.187137


# Determine which PC exhibits cumulative percent greater than 90% and % variation associated with the PC as less than 5
co1 <- which(cumu > 90 & pct < 5)[1]

co1

# Determine the difference between variation of PC and subsequent PC
co2 <- sort(which((pct[1:length(pct) - 1] - pct[2:length(pct)]) > 0.1), decreasing = T)[1] + 1

# last point where change of % of variation is more than 0.1%.
co2

# Minimum of the two calculation is the optimal number of PC to pick.
pcs <- min(co1, co2)

pcs

Cluster the cells

# to check what is active assay
DefaultAssay(object = seurat_integrated)

# Determine the K-nearest neighbor graph
seurat_integrated <- FindNeighbors(object = seurat_integrated, 
                                dims = 1:18)
                                
Find clusters
# Determine the clusters for various resolutions                                
seurat_integrated <- FindClusters(object = seurat_integrated,
                               resolution = c(0.4, 0.6, 0.8, 1.0, 1.4))
                               

# Explore resolutions
head(seurat_integrated@meta.data)

# Assign the identity of clusters
Idents(object = seurat_integrated) <- "integrated_snn_res.0.4"

# Plot the UMAP
png(filename = "umap_cluster_with_label.png", width = 16, height = 8.135, units = "in", res = 300)
DimPlot(seurat_integrated,
        reduction = "umap",
        label = TRUE,
        label.size = 6)
dev.off()

Clustering quality control

After clustering, we need to make sure that the assigned clusters are true representative of biological clusters (cell clusters) not due to technical or unwanted source of variation (like cell cycle stages). Also, in this step, we need to identify cell type for each cluster based on the known cell type markers.

Segregation of clusters by sample

# Extract identity and sample information from Seurat object to determine the number of cells per cluster per sample

library(dplyr)
library(tidyr)

# n_cells <- FetchData(seurat_integrated, 
#                      vars = c("ident", "orig.ident")) %>%
#         dplyr::count(ident, orig.ident) %>%
#         tidyr::spread(ident, n)

n_cells <- FetchData(seurat_integrated, 
                      vars = c("ident", "orig.ident"))
n_cells <- dplyr::count(n_cells, ident, orig.ident)
n_cells <- tidyr::spread(n_cells, ident, n)



#Ading sample data from paper; we expect to see samples from the same group have more or less similar number of cells in each cluster. 
#So normal samples should show similar patterns: SRR12603780, SRR12603781, and SRR12603788.



sampleData<- data.frame(tibble::tribble(
     ~sample_id, ~gender, ~age, ~Grade, ~Invasiveness, ~Surgery_Type, ~Tumor_size_cm,
  "SRR12603790",     "M",  67L,  "low", "Noninvasive",       "TURBT",          "1.9",
  "SRR12603789",     "M",  70L,  "low", "Noninvasive",       "TURBT",          "2.5",
  "SRR12603787",     "M",  63L, "high", "Noninvasive",  "Cystectomy",          "3.5",
  "SRR12603786",     "F",  59L, "high", "Noninvasive",  "Cystectomy",          "4.7",
  "SRR12603785",     "M",  57L, "high",    "Invasive",  "Cystectomy",          "5.1",
  "SRR12603784",     "M",  75L, "high",    "Invasive",  "Cystectomy",          "4.3",
  "SRR12603783",     "M",  77L, "high",    "Invasive",  "Cystectomy",          "4.5",
  "SRR12603782",     "F",  72L, "high",    "Invasive",  "Cystectomy",          "4.1",
  "SRR12603781",     "M",  67L, "normal",   "normal",       "TURBT",            "-",
  "SRR12603788",     "M",  75L, "normal",   "normal",  "Cystectomy",            "-",
  "SRR12603780",     "M",  63L, "normal",   "normal",  "Cystectomy",            "-"
  ))


# View table
head(n_cells)
# saving objects (to mark where and when we stored the file)
#seurat_cluster <- seurat_integrated
#saveRDS(seurat_cluster, "seurat_cluster.RDS")


# UMAP of cells in each cluster by sample
# This would allow us to see condition specific clusters
png(filename = "umap_cluster_sample.png", width = 16, height = 8.135, units = "in", res = 300)
DimPlot(seurat_integrated, 
        label = TRUE, 
        split.by = "sample")  + NoLegend()
dev.off()

# Segregation of clusters by cell cycle phase (unwanted source of variation) 
# Explore whether clusters segregate by cell cycle phase

png(filename = "umap_cluster_cell_cucle.png", width = 16, height = 8.135, units = "in", res = 300)
DimPlot(seurat_integrated,
        label = TRUE, 
        split.by = "Phase")  + NoLegend()
dev.off()

Segregation of clusters by various sources of uninteresting variation

We expect to see a uniform coloring for all variables in all clusters. Sometimes this is not the case. For here nUMI and nGene show higher values is some clusters. We have to watch these clusters and inspect them in terms of the type of cell therein. So that may explain some of the variation that we are seeing.

# Determine metrics to plot present in seurat_integrated@meta.data
metrics <-  c("nUMI", "nGene", "S.Score", "G2M.Score", "mitoRatio")
png(filename = "umap_unwanted_source_clustering.png", width = 16, height = 8.135, units = "in", res = 300)
FeaturePlot(seurat_integrated, 
            reduction = "umap", 
            features = metrics,
            pt.size = 0.4, 
            order = TRUE,
            min.cutoff = 'q10',
            label = TRUE)
dev.off()

Exploration of the PCs driving the different clusters

We hope that the defined PCs could separate clusters well.We can see how the clusters are represented by the different PCs.Then we could look back at our genes driving this PC to get an idea of what the cell types might be in each cluster.

# Defining the information in the seurat object of interest
columns <- c(paste0("PC_", 1:18),
            "ident",
            "UMAP_1", "UMAP_2")

# Extracting this data from the seurat object
pc_data <- FetchData(seurat_integrated, 
                     vars = columns)
                     
                     
# Adding cluster label to center of cluster on UMAP
umap_label <- FetchData(seurat_integrated, 
                        vars = c("ident", "UMAP_1", "UMAP_2"))  %>%
  group_by(ident) %>%
  summarise(x=mean(UMAP_1), y=mean(UMAP_2))



# Plotting a UMAP plot for each of the PCs
library(cowplot)
library(tidyverse)
library(HGNChelper)

png(filename = "umap_on_pcs.png", width = 16, height = 8.135, units = "in", res = 300)
map(paste0("PC_", 1:18), function(pc){
        ggplot(pc_data, 
               aes(UMAP_1, UMAP_2)) +
                geom_point(aes_string(color=pc), 
                           alpha = 0.7) +
                scale_color_gradient(guide = FALSE, 
                                     low = "grey90", 
                                     high = "blue")  +
                geom_text(data=umap_label, 
                          aes(label=ident, x, y)) +
                ggtitle(pc)
}) %>% 
        plot_grid(plotlist = .)
dev.off()

# Examine PCA results 
print(seurat_integrated[["pca"]], dims = 1:5, nfeatures = 5)

# Normalize RNA data for visualization purposes
seurat_integrated <- NormalizeData(seurat_integrated, verbose = FALSE)

png(filename = "umap_fibroblast.png", width = 16, height = 8.135, units = "in", res = 300)
FeaturePlot(seurat_integrated, 
            reduction = "umap", 
            features = c("IGFBP7", "MGP"), 
            order = TRUE,
            min.cutoff = 'q10', 
            label = TRUE)
dev.off()

png(filename = "umap_endothelial.png", width = 16, height = 8.135, units = "in", res = 300)
FeaturePlot(seurat_integrated, 
            reduction = "umap", 
            features = c("PLVAP", "CALCRL"), 
            order = TRUE,
            min.cutoff = 'q10', 
            label = TRUE)
dev.off()

png(filename = "umap_t_cells.png", width = 16, height = 8.135, units = "in", res = 300)
FeaturePlot(seurat_integrated, 
            reduction = "umap", 
            features = c("CCL5", "CD52", "IL32"), 
            order = TRUE,
            min.cutoff = 'q10', 
            label = TRUE)
dev.off()

7) Marker identification and cell type assignment

This can be the last step in our pipeline which aims to determine the gene markers for each of the clusters and identify cell types of each cluster using markers. Also, this step helps to determine whether there's a need to re-cluster based on cell type markers, or maybe clusters need to be merged or split.

For marker identification, there are three functions in the Seurat package, each with a different application:

FindAllMarkers(): It should only be used when comparing a cluster against other clusters belonging to the same group. i.e. this function should be used only when we have one group/condition.

FindConservedMarkers(): When we have two groups like tumor vs normal or invasive vs. noninvasive identifying conserved markers is the best approach. In this way, we find DE genes for a given cluster in one condition (e.g. invasive) comparing the cluster against the rest of the cluster in the same condition group. We do the same for that given cluster in the other condition (non-invasive). Finally, the two lists will be merged to give us the conserved marker for a given cluster.

FindMarkers(): This is helpful with identifying gene markers for each cluster. In practice sometimes the list of markers returned doesn’t sufficiently separate some of the clusters. We can use this function to further differentiate between those clusters.

#______________________________ NOT TO BE RUN________________________________
# Find markers for every cluster compared to all remaining cells, report only the positive ones
markers <- FindAllMarkers(object = seurat_integrated, 
                          only.pos = TRUE,
                          logfc.threshold = 0.25)

When we have two groups like tumor vs normal or invasive vs. non-invasive identifying conserved markers is the best approach. In this way, we find DE genes for a given cluster in one condition (e.g. invasive) by comparing the cluster against the rest of the cluster in the same condition group. We do the same for that given cluster in the other condition (non-invasive). Finally, the two lists will be merged to give us the conserved marker for a given cluster.

# explecity set the default object to normalized values
DefaultAssay(seurat_integrated) <- "RNA"


cluster0_conserved_markers <- FindConservedMarkers(seurat_integrated,
                              ident.1 = 0,
                              grouping.var = "Invasiveness",
                                only.pos = TRUE,
                             logfc.threshold = 0.60)

To add more annotation to the results

# Connect to AnnotationHub
ah <- AnnotationHub()

# Access the Ensembl database for organism
ahDb <- query(ah, 
              pattern = c("Homo sapiens", "EnsDb"), 
              ignore.case = TRUE)

# Acquire the latest annotation files
id <- ahDb %>%
        mcols() %>%
        rownames() %>%
        tail(n = 1)

# Download the appropriate Ensembldb database
edb <- ah[[id]]

# Extract gene-level information from database
annotations <- genes(edb, 
                     return.type = "data.frame")

# Select annotations of interest
annotations <- annotations %>%
        dplyr::select(gene_id, gene_name, seq_name, gene_biotype, description)

To find conserved markers for all clusters:

# Create function to get conserved markers for any given cluster
get_conserved <- function(cluster){
  tryCatch({
    FindConservedMarkers(seurat_integrated,
                         ident.1 = cluster,
                         grouping.var = "Invasiveness",
                         only.pos = TRUE,
                         logfc.threshold = 0.60) %>%
      rownames_to_column(var = "gene") %>%
      left_join(y = unique(annotations[, c("gene_name", "description")]),
                 by = c("gene" = "gene_name")) %>%
      cbind(cluster_id = cluster, .)
    },
    error = function(e) {
      message(paste0("Error: ", e$message))
      return(NULL)
    }
  )
}
# this function can be an argument for 'map_dfr' function :
# Iterate function across desired clusters
conserved_markers <- map_dfr(c(0:20), get_conserved)

# Extract top 10 markers per cluster
top10 <- conserved_markers %>% 
  mutate(avg_fc = (Noninvasive_avg_log2FC + Invasive_avg_log2FC) /2) %>% 
  group_by(cluster_id) %>% 
  top_n(n = 10, 
        wt = avg_fc)
        
head(top10)

data.table::fwrite(top10, "blca_top10_conserved_markers.csv")

There are a number of tools that one may use to assign cell types to a cluster. However almost none of them at the time of writing can help with bladder tissue. So I have to look up genes in the PanglaoDB database manually and assign cell type to each cluster. So to do this job in a more efficient way, let's first identify which markers are associated with more clusters, then assign cell type to those clusters.

top10_mod <- data.frame(unclass(table(top10$gene, top10$cluster_id)))


data.table::fwrite(top10_mod, "top10_mod_conserved_markers.csv")

# markers with highest frequency
M <- c("KRT7", "KRT19", "FCER1G", "AIF1", "AQP3", "CCL5", "CD24", "CD3D", "CD52", "CLDN4", "COL1A1", "COL1A2", "CRTAC1", "CXCL8", "DCN", "FABP4", "FABP5", "FXYD3", "GZMA", "HLA-DRA", "IGHA1", "IGHG1", "IGHG3", "IGHG4", "IGKC", "IGLC1", "IGLC2", "IGLC3", "JCHAIN")

So cell type for top markers:

cell type	genes
Basal cells	KRT7, KRT19, AQP3, CD24,CXCL8,FXYD3
Dendritic cells	FCER1G,AIF1,FABP4
Gamma delta T cells	CCL5,GZMA
NK cell	CCL5,GZMA
T cells	CD3D, CD52
Macrophages	CD52
Luminal epithelial cells	CLDN4
Fibroblasts	COLA1, COLA2,CXCL8, DCN
Epithelial cells	CRTAC1,FXYD3
Endothelial cells	FABP4, FABP5
Plasma cell	IGHA1,IGHG1,IGHG3,IGHG4,IGKC,IGLC1,IGLC3,IGLC3,JCHAIN

Also, there is a table from John P. Sfakianos et al , which summarizes the tumor cell markers:

CellType	Genes
Luminal	CYP2J2,ERBB2,ERBB3,FGFR3,FOXA1,GATA3,GPX2,KRT18,KRT19,KRT20,KRT7,KRT8,PPARG,XBP1,UPK1A,UPK2
EMT and smooth muscle	PGM5,DES,C7,SRFP4,COMP,SGCD
EMT and claudin	ZEB1,ZEB2,VIM,SNAI1,TWIST1,FOXC2,CDH2,CLDN3,CLDN7,CLDN4,CDH1,SNAI2
Basal	CD44,CDH3,KRT1,KRT14,KRT16,KRT5,KRT6A,KRT6B,KRT6C
P53-like	ACTG2,CNN1,MYH11,MFAP4,PGM5,FLNC,ACTC1,DES,PCP4
Squamous	DSC1,DSC2,DSC3,DSG1,DSG2,DSG3,S100A7,S100A8
Immune	CD274,PDCD1LG2,IDO1,CXCL11,L1CAM,SAA1
Neuroendocrine	CHGA,CHGB,SCG2,ENO2,SYP,NCAM1
Neuronal differentiation	MSI1,PLEKHG4B,GNG4,PEG10,RND2,APLP1,SOX2,TUBB2B
Downregulated CIS	CRTAC1,CTSE,PADI3
Upregulated CIS	MSN,NR3C1
Cancer stem cell	CD44,KRT5,RPSA,ALDH1A1

Let's visualize some of the genes and see in which cluster they show expression.

# Plot interesting marker gene expression 
png(filename = "umap_high_freq_basal_cells.png", width = 16, height = 8.135, units = "in", res = 300)
FeaturePlot(object = seurat_integrated, 
                        features = c("KRT7", "KRT19", "AQP3", "CD24", "FXYD3", "CXCL8"),
                         sort.cell = TRUE,
                         min.cutoff = 'q10', 
                         label = TRUE,
                         repel = TRUE)
dev.off()

# Vln plot - cluster 0
png(filename = "violin_high_freq_basal_cells.png", width = 16, height = 8.135, units = "in", res = 300)
VlnPlot(object = seurat_integrated, 
        features = c("KRT7", "KRT19", "AQP3", "CD24", "FXYD3", "CXCL8"))
dev.off()

So according to "basal cells" visualization, the following clusters may show clusters of basal cells: 0, 2,3,9,10,14,17,18,19(?),20 and 21.

# Plot interesting marker gene expression 
png(filename = "umap_high_freq_Plasma_cells.png", width = 26, height = 15.135, units = "in", res = 600)
FeaturePlot(object = seurat_integrated, 
                        features = c("IGHA1","IGHG1","IGHG3","IGHG4","IGKC","IGLC1","IGLC3","IGLC3","JCHAIN"),
                         sort.cell = TRUE,
                         min.cutoff = 'q10', 
                         label = TRUE,
                         repel = TRUE)
dev.off()

# Vln plot - cluster 0
png(filename = "violin_high_freq_basal_cells.png", width = 26, height = 10.135, units = "in", res = 600)
VlnPlot(object = seurat_integrated, 
        features = c("IGHA1","IGHG1","IGHG3","IGHG4","IGKC","IGLC1","IGLC3","IGLC3","JCHAIN"))
dev.off()

Cluster 19 looks to be hard to assign it a cell type. Will keep an eye on it.

# Plot interesting marker gene expression 
png(filename = "umap_high_freq_dc.png", width = 16, height = 8.135, units = "in", res = 300)
FeaturePlot(object = seurat_integrated, 
                        features = c("FCER1G","AIF1","FABP4"),
                         sort.cell = TRUE,
                         min.cutoff = 'q10', 
                         label = TRUE,
                         repel = TRUE)
dev.off()

# Vln plot - cluster 0
png(filename = "violin_high_freq_dc.png", width = 16, height = 8.135, units = "in", res = 300)
VlnPlot(object = seurat_integrated, 
        features = c("FCER1G","AIF1","FABP4"))
dev.off()

For cluster1 and 4

d <- data.table::fread("top10_conserved_markers.csv")

png(filename = "umap_cluster4_markers.png", width = 16, height = 8.135, units = "in", res = 300)
FeaturePlot(object = seurat_integrated, 
                        features = d$gene[d$cluster_id == "4"],
                         sort.cell = TRUE,
                         min.cutoff = 'q10', 
                         label = TRUE,
                         repel = TRUE)
dev.off()

# Vln plot - cluster 0
png(filename = "violin_cluster4_markers.png", width = 16, height = 8.135, units = "in", res = 300)
VlnPlot(object = seurat_integrated, 
        features = d$gene[d$cluster_id == "4"])
dev.off()

png(filename = "umap_cluster12_markers.png", width = 16, height = 8.135, units = "in", res = 300)
FeaturePlot(object = seurat_integrated, 
                        features = d$gene[d$cluster_id == "12"],
                         sort.cell = TRUE,
                         min.cutoff = 'q10', 
                         label = TRUE,
                         repel = TRUE)
dev.off()

So it seems clusters 1,4,8,12,13 and 16 are the same cell type with minor subtypes. However cluster 5 and 7 show some level of expression for markers in cluster 1 + 4, but they need separate inspection. Conserved markers in these two clusters are as follows

[1] "LDB2"    "SPARCL1" "GNG11"   "FLT1"    "IFI27"   "RAMP2"   "PECAM1"
[8] "TCF4"    "PLVAP"   "PCAT19"  "ACKR1"   "SELE"    "CALCRL"  "ZNF385D"
[15] "ADAMTS9" "MCTP1"   "AQP1"    "VWF"     "CCL14"   "PCAT19"

cell type	genes
Endothelial cells	LDB2, SPARCL1, FLT1,IFI27,RAMP2,PECAM1,TCF4,PLVAP,PCAT19,ACKR1, SELE,CALCRL,ZNF385D,ADAMTS9,MCTP1,AQP1,VWF,
Fibroblast	SPARCL1,TCF4,PLVAP,ADAMTS9

png(filename = "umap_cluster5_7_markers.png", width = 26, height = 18.135, units = "in", res = 300)
FeaturePlot(object = seurat_integrated, 
                        features = d$gene[d$cluster_id %in% c("5", "7")],
                         sort.cell = TRUE,
                         min.cutoff = 'q10', 
                         label = TRUE,
                         repel = TRUE)
dev.off()

# Vln plot - cluster 0
png(filename = "violin_cluster5_7_markers.png", width = 16, height = 8.135, units = "in", res = 300)
VlnPlot(object = seurat_integrated, 
        features = d$gene[d$cluster_id == "4"])
dev.off()

So as expected clusters 7 and 5 represent the same cell types. In almost all cases cluster 17 showed some level of expression for EC markers.

Let's now have a look at markers for clusters 6,15 and 11:

png(filename = "umap_cluster11_15_6_markers.png", width = 26, height = 18.135, units = "in", res = 300)
FeaturePlot(object = seurat_integrated, 
                        features = unique(d$gene[d$cluster_id %in% c("6", "11","15")]),
                         sort.cell = TRUE,
                         min.cutoff = 'q10', 
                         label = TRUE,
                         repel = TRUE)
dev.off()

According to the markers such as different collagen, clusters 11 and 15 seem to be fibroblasts while cluster 6 shows a mix of fibroblast and non-fibroblast markers.

Now we can generate UMAP with cell type as labels;

# Rename all identities
seurat_integrated <- RenameIdents(object = seurat_integrated, 
                               "0" = "Basal cells",
                               "1" = "Gamma delta T cell", # NKG7 and GZMK
                               "2" = "Basal cells",
                               "3" = "Basal cells",
                               "4" = "T cells",
                               "5" = "Endothelial cells",
                               "6" = "Fibroblast",
                               "7" = "Endothelial cells",
                               "8" = "DC",
                               "9" = "Basal cells",
                               "10" = "Basal cells",
                               "11" = "Fibroblast",
                               "12" = "T cells",
                               "13" = "T cells",
                               "14" = "Basal cells",
                               "15" = "Fibroblast",
                               "16" = "T cells",
                               "17" = "?", 
                               "18" = "Basal cells", 
                               "19" = "Plasma cells",
                               "20" = "Basal cells",
                               "21" = "Basal cells")

# Plot the UMAP withy new labells
png(filename = "umap_with_label.png", width = 16, height = 8.135, units = "in", res = 600)
DimPlot(object = seurat_integrated, 
        reduction = "umap", 
        label = TRUE,
        label.size = 3,
        repel = TRUE)
dev.off()

OK then it's time to write the Seurat object for later analysis, if any:

# Save final R object
write_rds(seurat_integrated,
          path = "seurat_labelled.rds")

As a final note I tried to assign cell types to clusters using the scType tool, however, there was no bladder tissue profile in their database. So the generated labels using this approach should be only considered as a guide.

# Cell maker assignemnt using scType

# load libraries
library(HGNChelper)


# load gene set preparation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R")
# load cell type annotation function
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")

db_ = "https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/ScTypeDB_full.xlsx";
#tissue = c("Immune system") # e.g. Immune system,Pancreas,Liver,Eye,Kidney,Brain,Lung,Adrenal,Heart,Intestine,Muscle,Placenta,Spleen,Stomach,Thymus 
tissue = c("Immune system", "Pancreas", "Liver","Kidney","Intestine","Placenta","Spleen",
           "Stomach") 

# prepare gene sets
gs_list = gene_sets_prepare(db_, tissue)

# get cell-type by cell matrix
es.max = sctype_score(scRNAseqData = seurat_integrated[["integrated"]]@scale.data, scaled = TRUE, 
                      gs = gs_list$gs_positive, gs2 = gs_list$gs_negative) 

# merge by cluster
cL_resutls = do.call(
  "rbind", lapply(unique(seurat_integrated@meta.data$seurat_clusters), function(cl){
  es.max.cl = sort(rowSums(es.max[ ,rownames(seurat_integrated@meta.data[seurat_integrated@meta.data$seurat_clusters==cl, ])]), decreasing = !0)
  head(data.frame(cluster = cl, type = names(es.max.cl), scores = es.max.cl, ncells = sum(seurat_integrated@meta.data$seurat_clusters==cl)), 10)
}))


sctype_scores = cL_resutls %>% group_by(cluster) %>% top_n(n = 1, wt = scores)

# set low-confident (low ScType score) clusters to "unknown"
sctype_scores$type[as.numeric(as.character(sctype_scores$scores)) < sctype_scores$ncells/4] = "Unknown"
print(sctype_scores[,1:3])



# overlay the identified cell types on UMAP plot:

seurat_integrated@meta.data$customclassif = ""
for(j in unique(sctype_scores$cluster)){
  cl_type = sctype_scores[sctype_scores$cluster==j,]; 
  seurat_integrated@meta.data$customclassif[seurat_integrated@meta.data$seurat_clusters == j] = as.character(cl_type$type[1])
}

png(filename = "umap_with_label_scType.png", width = 16, height = 8.135, units = "in", res = 600)
DimPlot(seurat_integrated, 
        reduction = "umap", 
        label = TRUE, 
        repel = TRUE, 
        group.by = 'customclassif')        

dev.off()

8) MIBC vs. NMIBC

There are four NMIBC samples(SRR12603790,SRR12603789,SRR12603787,SRR12603786) and four MIBC samples(SRR12603785,SRR12603784,SRR12603783,SRR12603782) in the dataset. Comparing these two group against each other could have given a clue on the invasion cell type signature.

# load object
load("~/scRNA/seurat_filtered.RData")


# reading sample information
sampleInformation <- read.csv("~/scRNA/sampleInfo.csv")

filtered_seurat@meta.data <- left_join(filtered_seurat@meta.data, sampleInformation, by = "orig.ident")
# save meta data as a df
metaData <- filtered_seurat@meta.data


# setting ident
Idents(filtered_seurat) <- "sample"

# keeping only cancer samples
blca_seurat <- subset(filtered_seurat, idents = "BLCA")
rm(filtered_seurat)

# Repair the meta.data
blca_seurat@meta.data$cells <- blca_seurat@assays$RNA@data@Dimnames[[2]]
blca_seurat@meta.data$orig.ident <- substr(blca_seurat@assays$RNA@data@Dimnames[[2]],1,11)
blca_seurat@meta.data$seq_folder <- substr(blca_seurat@assays$RNA@data@Dimnames[[2]],1,11)
blca_seurat@meta.data$sample <- "BLCA"

# set rownames
rownames(blca_seurat@meta.data) <- blca_seurat@meta.data$cells


# remove columns fillied by NA
col_remove <- c("nUMI", "nGene", "log10GenesPerUMI", "gender", "age", "Grade", "Invasiveness", "Surgery_Type", "Tumor_size_cm", "mitoRatio")

# 
blca_seurat@meta.data <- blca_seurat@meta.data[, !(colnames(blca_seurat@meta.data) %in% col_remove)]

# adding more meta.data again
blca_seurat@meta.data <- left_join(blca_seurat@meta.data, sampleInformation, by = "orig.ident")


# Setting metadata rownames as the column names of expression matrix
rownames(blca_seurat@meta.data) <- blca_seurat@meta.data$cells

# SCTransform

# Split seurat object by group

# setting ident
Idents(blca_seurat) <- "Invasiveness"
blca_split <- SplitObject(blca_seurat)


blca_split 
#$Invasive
#An object of class Seurat 
#29686 features across 41242 samples within 1 assay 
#Active assay: RNA (29686 features, 0 variable features)
#
#$Noninvasive
#An object of class Seurat 
#29686 features across 35891 samples within 1 assay 
#Active assay: RNA (29686 features, 0 variable features)


# Repairing meta.data for each object in the splitted obj:

for(i in 1:length(blca_split)){
     print(blca_split[[i]])
     obj = blca_split[[i]]
     obj@meta.data$cells <- obj@assays$RNA@data@Dimnames[[2]]
     obj@meta.data <- obj@meta.data[, c(2,3,5)]
     obj@meta.data$orig.ident <- substr(obj@assays$RNA@data@Dimnames[[2]],1,11)
     obj@meta.data$seq_folder <- substr(obj@assays$RNA@data@Dimnames[[2]],1,11)
     obj@meta.data <- left_join(obj@meta.data, sampleInformation, by = "orig.ident")
     obj@meta.data <- left_join(obj@meta.data, metaData[,c(8,9)], by = "cells")
     # repairng metadata rownames
     if(all(obj@meta.data$cells == colnames(obj@assays$RNA@counts))){
     rownames(obj@meta.data) <- obj@meta.data$cells
     }
     blca_split[[i]] = obj
 }

#saveRDS(blca_split, "blca_split.rds")
# then normalize by SCTansform
# orig.ident is slected ti be regressed out because each sample was sequenced in seperate batch

for (i in 1:length(blca_split)) {
    blca_split[[i]] <- SCTransform(blca_split[[i]], vars.to.regress = c("orig.ident"), vst.flavor = "v2")
    }

# Integration : integrating samples belong to one group

# Select the most variable features to use for integration
integ_features <- SelectIntegrationFeatures(object.list = blca_split, 
                                            nfeatures = 3000) 


# Prepare the SCT list object for integration
split_seurat <- PrepSCTIntegration(object.list = blca_split, 
                                   anchor.features = integ_features)
                                   
# Find best buddies (using canonical correlation analysis or CCA) - can take a while to run
integ_anchors <- FindIntegrationAnchors(object.list = split_seurat, 
                                        normalization.method = "SCT", 
                                        anchor.features = integ_features)
# Integrate across conditions
seurat_integrated <- IntegrateData(anchorset = integ_anchors, 
                                   normalization.method = "SCT")
                                   

# Integration check
# Run PCA
seurat_integrated <- RunPCA(object = seurat_integrated, verbose = TRUE)

# Plot PCA
png(filename = "blca_PCA_integrated.png", width = 16, height = 8.135, units = "in", res = 300)
PCAPlot(seurat_integrated,
        split.by = "Invasiveness")
dev.off()

# Run UMAP
seurat_integrated <- RunUMAP(seurat_integrated, 
                             dims = 1:40,
                             reduction = "pca",
                             verbose = TRUE)

# Plot UMAP 
png(filename = "blca_UMAP_integrated.png", width = 16, height = 8.135, units = "in", res = 300)
DimPlot(seurat_integrated, split.by = "Invasiveness")
dev.off()

# Cluster the cells

# to check what is active assay
DefaultAssay(object = seurat_integrated)

# Determine the K-nearest neighbor graph
seurat_integrated <- FindNeighbors(object = seurat_integrated, 
                                dims = 1:18)
                                
#Find clusters
# Determine the clusters for various resolutions                                
seurat_integrated <- FindClusters(object = seurat_integrated,
                               resolution = c(0.2,0.4, 0.6, 0.8, 1.0, 1.4))
                               

# Explore resolutions
head(seurat_integrated@meta.data)

# Assign identity of clusters
Idents(object = seurat_integrated) <- "integrated_snn_res.0.2"

# Plot the UMAP
png(filename = "blca_umap_cluster_with_label.png", width = 16, height = 8.135, units = "in", res = 300)
DimPlot(seurat_integrated,
        reduction = "umap",
        label = TRUE,
        label.size = 6)
dev.off()

# UMAP of cells in each cluster by invasiveness group
# This would allow us to see condition specefic clusters
png(filename = "blca_umap_cluster_sample.png", width = 16, height = 8.135, units = "in", res = 300)
DimPlot(seurat_integrated, 
        label = TRUE, 
        split.by = "Invasiveness")  + NoLegend()
dev.off()

# Explore whether clusters segregate by cell cycle phase

png(filename = "grade_umap_cluster_cell_cucle.png", width = 16, height = 8.135, units = "in", res = 300)
DimPlot(seurat_integrated,
        label = TRUE, 
        split.by = "Grade")  + NoLegend()
dev.off()

#Marker identification
                  

# adding annotation to the genes:
# Connect to AnnotationHub
ah <- AnnotationHub()

# Access the Ensembl database for organism
ahDb <- query(ah, 
              pattern = c("Homo sapiens", "EnsDb"), 
              ignore.case = TRUE)

# Acquire the latest annotation files
id <- ahDb %>%
        mcols() %>%
        rownames() %>%
        tail(n = 1)

# Download the appropriate Ensembldb database
edb <- ah[[id]]

# Extract gene-level information from database
annotations <- genes(edb, 
                     return.type = "data.frame")

# Select annotations of interest
annotations <- annotations %>%
        dplyr::select(gene_id, gene_name, seq_name, gene_biotype, description)

Finding markers for all clusters:

# Create function to get conserved markers for any given cluster
get_conserved <- function(cluster){
  tryCatch({
  FindConservedMarkers(seurat_integrated,
                       ident.1 = cluster,
                       grouping.var = "Invasiveness",
                       only.pos = TRUE,
                       logfc.threshold = 0.60) %>%
    rownames_to_column(var = "gene") %>%
    left_join(y = unique(annotations[, c("gene_name", "description")]),
               by = c("gene" = "gene_name")) %>%
    cbind(cluster_id = cluster, .)
  },
  error = function(e){
  message(paste0("Error: ", e$message))
    }
   )
  }
  
# this function can be an argument for 'map_dfr' function :
# Iterate function across desired clusters
conserved_markers <- map_dfr(c(0:20), get_conserved)

# Extract top 10 markers per cluster
top10 <- conserved_markers %>% 
  mutate(avg_fc = (Normal_avg_log2FC + BLCA_avg_log2FC) /2) %>% 
  group_by(cluster_id) %>% 
  top_n(n = 10, 
        wt = avg_fc)
        
head(top10)

#data.table::fwrite(top10, "top10_conserved_markers.csv")
#group based on the clsuter id to define cell types:
data_grouped <- top10 %>%
  group_by(cluster_id) %>%
  summarize(gene = paste(gene, collapse = ","))

cluster_id	genes	cell_type (PanglaoDB + ChatGPT)
0	SFN,S100A2,CXCL8,CLDN4,AQP3,PHLDA2,EMP1,KRT13,KRT17,SLPI	basal cells
1	SPINK1,CD24,UBE2C,UCA1,ADIRF,C15orf48,CSTB,S100A9,CCND1,HIST1H4C	CCND1 + cells
2	CD52,CXCR4,CD3D,CD69,IFNG,TRAC,IL32,CCL5,CCL4,NKG7	T cells/NK cells
3	PSCA,LY6D,KRT7,DHRS2,KRT13,HES1,KRT20,CRH,GCLC,RPS19	basal/squamous-like cells
4	HSPG2,LDB2,IGFBP7,SPARCL1,GNG11,VWF,FLT1,RAMP2,INSR,PLVAP	ND
5	CD52,CD2,LINC01871,GZMA,TRBC2,CD3D,TRAC,CORO1A,CCL5,NKG7	cytotoxic T cell
6	FABP4,CTSE,AKR1C3,PLA2G2A,FABP5,CRTAC1,GSTM3,KRT7,S100A6,CLU	CAFs
7	SNTG1,ZFPM2-AS1,RHEX,MT-ATP6,MT-ND5,CCSER1,FRY,SPINK1	ND
8	RGS5,RGS5,IGFBP7,CALD1,ACTA2,TAGLN,THY1,NDUFA4L2,MYL9,BGN	fibroblasts/myofibroblasts
10	C1QA,C1QB,CD74,HLA-DRA,HLA-DRB1,HLA-DPA1,HLA-DPB1,LYZ,TYROBP,APOE	APCs
11	COL3A1,CXCL14,COL1A2,RARRES2,PTGDS,MGP,LUM,DCN,MT2A,COL1A1	Fbroblasts(?)
13	CPA3,HPGDS,LTC4S,MS4A2,CTSG,TPSB2,TPSAB1,CD69,SRGN,RGS1	Mast cells
14	AFF3,BANK1,LTB,HLA-DRA,CD79A,CD37,CD74,CD52,CXCR4,CD83	B cells
17	JCHAIN,IGHG1,IGHG2,IGKC,IGHG3,IGLC3,IGLC2,IGHA1,IGHM,IGHA2	B cells

# Plot interesting marker gene expression 
png(filename = "cd24_cells.png", width = 16, height = 8.135, units = "in", res = 300)
FeaturePlot(object = seurat_integrated,
            features = c('SPINK1','CD24','UBE2C','UCA1','ADIRF','C15orf48','CSTB','S100A9','CCND1','HIST1H4C'),
            order = TRUE,
            min.cutoff = "q10",
            label = TRUE,
            repel = TRUE)

dev.off()

png(filename = "lyz_cells.png", width = 16, height = 8.135, units = "in", res = 300)
FeaturePlot(object = seurat_integrated,
            features = c('C1QA','C1QB','CD74','HLA-DRA','HLA-DRB1','HLA-DPA1','HLA-DPB1','LYZ','TYROBP','APOE'),
            order = TRUE,
            min.cutoff = "q10",
            label = TRUE,
            repel = TRUE)

dev.off()

# Rename all identities
seurat_integrated <- RenameIdents(object = seurat_integrated, 
                               "0" = "Basal cells",
                               "1" = "CCND1 + cells", # NKG7 and GZMK
                               "2" = "T cells/NK cells",
                               "3" = "basal/squamous-like cells",
                               "4" = "ND",
                               "5" = "cytotoxic T cell",
                               "6" = "CAFs",
                               "7" = "ND",
                               "8" = "fibroblasts/myofibroblasts",
                               "9" = "ND",
                               "10" = "fibroblasts/myofibroblasts",
                               "11" = "Fibroblast(?)",
                               "12" = "ND",
                               "13" = "Mast cells",
                               "14" = "B cells",
                               "15" = "ND",
                               "16" = "ND",
                               "17" = "B cells", 
                               "18" = "ND", 
                               "19" = "ND",
                               "20" = "ND")

# Plot the UMAP withy new labells
png(filename = "blca_umap_with_label.png", width = 16, height = 8.135, units = "in", res = 600)
DimPlot(object = seurat_integrated, 
        reduction = "umap", 
        label = TRUE,
        label.size = 3,
        repel = TRUE,
        split.by = "Invasiveness")
dev.off()

9) Analyzing epithelial (EPCAM +) cells

This section is based on what I can get from the paper, but I stay focused on cancer samples and will use harmony to do integration. Finally, I do trajectory analysis for all cells which identified as epithelial cells.

"Cells with UMI numbers <1000 or with over 10% mitochondrial-derived UMI counts were considered low-quality cells and were removed. To eliminate potential doublets, single cells with over 6000 genes detected were also filtered out. Finally, 52721 single cells remained, and they were applied in downstream analyses."

"Since samples from eight patients were processed and sequenced in batches, the patient number was used to remove the potential batch effect."

"epithelial (EPCAM+) cells; endothelial (CD31+) cells; two types of fibroblasts (COL1A1+)—iCAFs (PDGFRA+) and myo-CAFs (mCAFs) (RGS5+); B cells (CD79A+); myeloid cells (LYZ+); T cells (CD3D+); and mast cells (TPSAB1+)"

#________________________Reading the files______________________#
# create a list of samples
samples <- list.files("./filtered/")
#samples <- samples[grepl('^filtered',samples,perl=T)]

# read files into Seurat objects
for (file in samples){
  print(paste0(file))
  seurat_data <- Read10X(data.dir = paste0("./filtered/", file))
  seurat_obj <- CreateSeuratObject(counts = seurat_data, 
                                   min.features = 100, 
                                   project = file)
  assign(file, seurat_obj)
}

# updated sample name 
samples_blca <- samples[-c(1,2,9)]

# Now merging all objects into one Seurat obj

merged_seurat <- merge(x = SRR12603782,
                       y = c(SRR12603783,
                             SRR12603784,
                             SRR12603785,
                             SRR12603786,
                             SRR12603787,
                             SRR12603789,
                             SRR12603790),
                       add.cell.id = samples_blca)
                       
#________________________Filteration____________________________#
# reading sampleInformation:
sampleInformation <- read.csv("./sampleInfo.csv")

# Compute percent mito ratio
merged_seurat$mitoRatio <- PercentageFeatureSet(object = merged_seurat, pattern = "^MT-")
merged_seurat$mitoRatio <- merged_seurat@meta.data$mitoRatio / 100

# adding cell column
merged_seurat$cells <- rownames(merged_seurat@meta.data)
# merging with sample information
merged_seurat@meta.data <- merge(merged_seurat@meta.data, sampleInformation)
# re-setting the rownames
rownames(merged_seurat@meta.data) <- merged_seurat@meta.data$cells


# Filteration

filtered_seurat <- subset(merged_seurat, 
                          subset= nCount_RNA >= 1000 &
                          nFeature_RNA <= 6000 & 
                          mitoRatio < 0.10)

#________________________Integration using Harmony____________________________#
#integration using harmony need several steps to be undertaken:

# Perform log-normalization and feature selection, as well as SCT normalization on the global object
merged_seurat <- filtered_seurat %>%
    NormalizeData() %>%
    FindVariableFeatures(selection.method = "vst", nfeatures = 3000) %>% 
    ScaleData() %>%
    SCTransform(vars.to.regress = c("mitoRatio", "orig.ident"))

# Calculate PCs using variable features determined by SCTransform (3000 by default)
merged_seurat <- RunPCA(merged_seurat, assay = "SCT", npcs = 50)
merged_seurat <- RunTSNE(merged_seurat, assay = "SCT", npcs = 50)

# Integration
#install.packages("harmony")

library(harmony)

harmonized_seurat <- RunHarmony(merged_seurat, 
				group.by.vars = c("orig.ident", "gender", "Surgery_Type"), 
				reduction = "pca", assay.use = "SCT", reduction.save = "harmony")

The above code incorporates an additional reduction of 50 "harmony components" (i.e., corrected principal components) to our Seurat object, which is stored in the harmonized_seurat@reductions$harmony variable.

However, to ensure that the Harmony integration is accurately represented in the data visualization, we must generate a UMAP that is derived from these harmony embeddings instead of the PCs.

harmonized_seurat <- RunUMAP(harmonized_seurat, reduction = "harmony", assay = "SCT", dims = 1:40)
#harmonized_seurat <- RunUTSNE(harmonized_seurat, reduction = "harmony", assay = "SCT", dims = 1:40)

#________________________Cluster identification and Inspect the effects of Harmony batch removel ____________#

# to set reduction to harmony and find the clusters
harmonized_seurat <- FindNeighbors(object = harmonized_seurat, reduction = "harmony")
harmonized_seurat <- FindClusters(harmonized_seurat, resolution = c(0.1, 0.2, 0.4, 0.6, 0.8))

# visualization
Idents(harmonized_seurat) <- harmonized_seurat@meta.data$SCT_snn_res.0.1

# Color cells based on the sample name
# Plot UMAP 
png(filename = "harmony_UMAP_y_sample.png", width = 16, height = 8.135, units = "in", res = 300)
DimPlot(harmonized_seurat,
        group.by = "orig.ident",
        reduction = "umap")
dev.off()

As the above figure suggests, Harmony did a great job in terms of removing the batch effects.

#________________________SuperCluster Identification____________#

png(filename = "harmony_umap_cluster_with_label.png", width = 16, height = 8.135, units = "in", res = 300)
DimPlot(harmonized_seurat,
        reduction = "umap",
        label = TRUE,
        label.size = 6)
dev.off()

# let's visualize cells expressing supercluster markers:
# CD31: PECAM1
markers <- c("EPCAM", "PECAM1", "COL1A1", "PDGFRA", "RGS5", "CD79A", "LYZ", "CD3D", "TPSAB1")

png(filename = "umap_superCluster_cells.png", width = 16, height = 8.135, units = "in", res = 300)
FeaturePlot(object = harmonized_seurat,
            features = markers,
            order = TRUE,
            min.cutoff = "q10",
            reduction = "umap",
            label = TRUE,
            repel = TRUE)

dev.off()

Marker identification for the 8 superclusters:

#______________________________ All markers________________________________
# Find markers for every cluster compared to all remaining cells, and report only the positive ones
markers <- FindAllMarkers(object = harmonized_seurat, 
                          only.pos = TRUE,
                          logfc.threshold = 0.25) 

saveRDS(markers, "harmony_markers.RDS")

# mutate the markers dataframe
# Extract top 10 markers per cluster

top10 <- markers %>%
  mutate(delta_pct = (pct.1 - pct.2)) %>%
  #filter(avg_log2FC > 1.5) %>%  # only keep rows where avg_log2FC > 1.5
  group_by(cluster) %>%
  top_n(n = 10, wt = delta_pct)

data.table::fwrite(top10, "harmony_blca_top10_all_markers.csv")

Visualization of top markers in each cluster:

Cluster markers:

cluster_markers_10 <- top10 %>% 
                   group_by(cluster) %>% 
                   summarize(genes = paste(gene, collapse = ","))
data.table::fwrite(cluster_markers_10, "cluster_markers_10.csv")

# Feature plot for top markers
plotList = list()

for(cluster in 1:nrow(cluster_markers_10)){
    mkr = unlist(strsplit(cluster_markers_10$genes[cluster], ","))
    plotList[[cluster]] = FeaturePlot(object = harmonized_seurat,
            features = mkr,
            order = TRUE,
            min.cutoff = "q10",
            reduction = "umap",
            label = TRUE,
            repel = TRUE)
}

# Iterate over all clusters
png(filename = "harmony_blca_clsuter_markers_cluster8.png", width = 16, height = 8.135, units = "in", res = 300)
plotList[[9]]
dev.off()
#

Expression of cluster-specific markers:

# LYZ cells
png(filename = "LYZ_harmony_blca_clsuter_marker.png", width = 16, height = 8.135, units = "in", res = 300)
FeaturePlot(object = harmonized_seurat,
            features = c("LYZ"),
            order = TRUE,
            min.cutoff = "q10",
            reduction = "umap",
            label = TRUE,
            repel = TRUE)
dev.off()

To see where there myeloid cells are.

Markers and cell types

cluster_id	genes	cell_type (PanglaoDB + ChatGPT)
0	AGR2,C10orf99,SDC1,SERINC2,SMIM22,CLDN7,TNFRSF21,FAM160A1,CXADR,RAB25	epithelial cell
1	CD52,CD3D,PTPRC,TRAC,CD7,CD2,SKAP1,ARHGAP15,HCST,CD3E	T-cells
2	PLVAP,SPARCL1,HSPG2,VWF,TCF4,LDB2,CALCRL,RAMP2,PCAT19,PECAM1	Endothelial cells
3	HLA-DRA,HLA-DPB1,TYROBP,HLA-DQA1,HLA-DRB1,HLA-DPA1,HLA-DQB1,FCER1G	APCs (Macrophages, B-cells)
4	IGLC2,IGHG1,IGLC1,IGHG3,IGHA1,IGHG4,JCHAIN,IGHGP,MZB1,DERL3	B-cells
5	RGS5,TAGLN,ACTA2,MYL9,CALD1,PPP1R14A,BGN,PRKG1,SOD3,COL6A2	myo-CAFs
6	LUM,DCN,COL1A2,COL3A1,RARRES2,MFAP4,C1S,C1R,COL6A2,COL6A3	i-CAFS
7	TPSB2,TPSAB1,CPA3,HPGDS,LTC4S,MS4A2,SAMSN1,RGS13,FCER1G,TYROBP	Mast cells
8	MIR205HG,COL4A5,IGFBP2,LINC00511,FOSL1,BCAM,RAP2B,SLC35F1,AQP3,DENND2C	Epithelial cells + Mesenchymal cells

# renaming clusters

# Rename all identities
harmonized_seurat <- RenameIdents(object = harmonized_seurat, 
                               "0" = "Epithelial cells",
                               "1" = "T-cells", # impureity with epithelial cells
                               "2" = "Endothelial cells",
                               "3" = "APCs(Macrophages, B-cells)",
                               "4" = "B-cells",
                               "5" = "myo-CAF",
                               "6" = "i-CAF",
                               "7" = "Mast cells",
                               "8" = "Epithelial cells+Mesenchymal cells")
                               
                               
                               


# Plot the UMAP with new labels
png(filename = "harmont_blca_umap_with_label.png", width = 16, height = 8.135, units = "in", res = 600)
DimPlot(object = harmonized_seurat, 
        reduction = "umap", 
        label = TRUE,
        label.size = 3,
        repel = TRUE,
        split.by = "Invasiveness")
dev.off()

Cluster 1, T-cells show some level of impurity by cluster 0, epithelial cells. This Impurity is especially evident in invasive samples.

# Obtaining clusters with epithelial cells

# To see the number of cells in each cluster

table(Idents(harmonized_seurat))

#
#                  epithelial cells                            T-cells
#                             24135                              19850
#                 Endothelial cells         APCs(Macrophages, B-cells)
#                              6699                               5233
#                           B-cells                            myo-CAF
#                              4448                               2877
#                             i-CAF                         Mast cells
#                              2337                               1126
#Epithelial cells+Mesenchymal cells
#                               725
#
epi_cell_ids <- rownames(harmonized_seurat@meta.data)[harmonized_seurat@meta.data$SCT_snn_res.0.1 == '0']

epi_cell_ids <- c(epi_cell_ids,rownames(harmonized_seurat@meta.data)[harmonized_seurat@meta.data$SCT_snn_res.0.1 == '8'])


epi_seurat <- subset(filtered_seurat, subset = cells %in% epi_cell_ids)

# Perform log-normalization and feature selection, as well as SCT normalization on the global object
epi_seurat <- epi_seurat %>%
    NormalizeData() %>%
    FindVariableFeatures(selection.method = "vst", nfeatures = 3000) %>% 
    ScaleData() %>%
    SCTransform(vars.to.regress = c("mitoRatio", "orig.ident"))

# Calculate PCs using variable features determined by SCTransform (3000 by default)
epi_seurat <- RunPCA(epi_seurat, assay = "SCT", npcs = 50)
#epi_seurat <- RunTSNE(epi_seurat, assay = "SCT", npcs = 50)

# Integration
#install.packages("harmony")

library(harmony)

epi_seurat <- RunHarmony(epi_seurat, 
				group.by.vars = c("orig.ident", "gender", "Surgery_Type"), 
				reduction = "pca", assay.use = "SCT", reduction.save = "harmony")

epi_seurat <- RunUMAP(epi_seurat, reduction = "harmony", assay = "SCT", dims = 1:40)
#harmonized_seurat <- RunUTSNE(harmonized_seurat, reduction = "harmony", assay = "SCT", dims = 1:40)

#Cluster identification and Inspect the effects of Harmony batch removel#

# to set reduction to harmony and find the clusters
epi_seurat <- FindNeighbors(object = epi_seurat, reduction = "harmony")
epi_seurat <- FindClusters(epi_seurat, resolution = c(0.1, 0.2, 0.4, 0.6, 0.8))

# visualization
Idents(epi_seurat) <- epi_seurat@meta.data$SCT_snn_res.0.1

# Color cells based on the sample name
# Plot UMAP 
png(filename = "epi_harmony_UMAP_y_sample.png", width = 16, height = 8.135, units = "in", res = 300)
DimPlot(epi_seurat,
        group.by = "orig.ident",
        reduction = "umap")
dev.off()
#

# color cells based on the cluster
# Plot UMAP 
png(filename = "cluster_epi_harmony_UMAP.png", width = 16, height = 8.135, units = "in", res = 300)
DimPlot(epi_seurat,
        reduction = "umap",
        label = TRUE,
        label.size = 6)
dev.off()

# Marker identification

epi_markers <- FindAllMarkers(object = epi_seurat, 
                          only.pos = TRUE,
                          logfc.threshold = 0.25) 

# mutate the markers dataframe
# Extract top 10 markers per cluster

epi_top10 <- epi_markers %>%
  mutate(delta_pct = (pct.1 - pct.2)) %>%
  #filter(avg_log2FC > 1.5) %>%  # only keep rows where avg_log2FC > 1.5
  group_by(cluster) %>%
  top_n(n = 10, wt = delta_pct)


epi_cluster_markers_10 <- epi_top10 %>% 
                   group_by(cluster) %>% 
                   summarize(genes = paste(gene, collapse = ","))

cluster	genes	cell_type (PanglaoDB + ChatGPT)
0	KRT15,KRT5,BCAM,IGFBP2,IGFBP7,CDH13,FOSL1,LAMB3,COL4A5,SERPINB5,SELENOP	basal cell
1	UCA1,LGALS1,CA9,SPINK1,GJB2,PTPRR,FCRLB,PLA2G2F,GJB6,FBLN1	cancer-associated luminal cells
2	NDUFA4L2,CRH,UPK2,SNX31,KRT20,AC019117.2,UPK1B,LINC02163,H19,TESC	luminal differentiated
3	LY6D,TNNT3,IGF2BP2,GMNN,SYT8,AC068587.4,ITM2C,SGPP2,ZNF750,TNFSF10	unique luminal cells
4	S100A9,C15orf48,MYO16,MMP7,MYEOV,AC025159.1,LTO1,IFI27,IRS2,SLPI	immunomodulatory luminal
5	ALCAM,ATXN1,PCDH7,COBLL1,LIPH,TRIM31,NAALADL2,ITGA2,SERPINB5,ITGA6,TM4SF1	adhesion and signaling luminal cells

# Reading seurat object
epi_seurat <- readRDS("epi_seurat.RDS")

# Rename all identities
epi_seurat <- RenameIdents(object = epi_seurat, 
                               "0" = "basal_cell",
                               "1" = "cancer_associated_luminal_cell",
                               "2" = "differentiated_luminal_cell",
                               "3" = "unique_luminal_cell",
                               "4" = "immunomodulatory_luminal_cell",
                               "5" = "adhesion_signaling_luminal_cell")

# Plot the UMAP with new labels
png(filename = "labelled_epi.png", width = 16, height = 8.135, units = "in", res = 600)
DimPlot(object = epi_seurat, 
        reduction = "umap", 
        label = TRUE,
        label.size = 3,
        repel = TRUE)
dev.off()

Cluster 5 can benefit from re-clustering. It consists of several patches of cells. If we increase the clustering resolution we should get each patches as a cluster.

Idents(epi_seurat) <- epi_seurat$SCT_snn_res.0.2
png(filename = "cluster_epi_harmony_UMAP_ptsize_1_res0.2.png", width = 16, height = 8.135, units = "in", res = 300)
DimPlot(epi_seurat,
        reduction = "umap",
        label = TRUE,
        label.size = 6)
dev.off()

So based on the above graph, cells in cluster 5 are now clustered in 8,9,11, and 12. The current cluster 8 can be split into more clusters.

Idents(epi_seurat) <- epi_seurat$SCT_snn_res.0.6
png(filename = "cluster_epi_harmony_UMAP_ptsize_1_res0.6.png", width = 16, height = 8.135, units = "in", res = 300)
DimPlot(epi_seurat,
        reduction = "umap",
        label = TRUE,
        label.size = 6)
dev.off()

Using the above resolution we can get two different clusters of cells for cluster 8. Now let's generate a customized Ident for the data set:

idt <- epi_seurat@meta.data[, c(14,15,17)]
# convert to character
idt <- data.frame(apply(idt, 2, as.character))
# to see what to do:
# table(epi_seurat$SCT_snn_res.0.2, epi_seurat$SCT_snn_res.0.6)

# Replacing values for cluster 8 with clusters 18 and 19 in res 0.6
idt$SCT_snn_res.0.2[idt$SCT_snn_res.0.6 == "15"] <- "8_a"
idt$SCT_snn_res.0.2[idt$SCT_snn_res.0.6 == "1"] <- "8_a"
idt$SCT_snn_res.0.2[idt$SCT_snn_res.0.6 == "2"] <- "8_a"
idt$SCT_snn_res.0.2[idt$SCT_snn_res.0.6 == "16"] <- "8_b"
idt$SCT_snn_res.0.2[idt$SCT_snn_res.0.6 == "18"] <- "8_b"
idt$SCT_snn_res.0.2[idt$SCT_snn_res.0.6 == "19"] <- "8_b"

# Updating SCT_snn_res.0.1 
# to see what to do:
# table(idt$SCT_snn_res.0.1, idt$SCT_snn_res.0.2)

idt$SCT_snn_res.0.1[idt$SCT_snn_res.0.2 == "9"] <- "5_a"
idt$SCT_snn_res.0.1[idt$SCT_snn_res.0.2 == "8_a"] <- "5_b"
idt$SCT_snn_res.0.1[idt$SCT_snn_res.0.2 == "3"] <- "5_b"
idt$SCT_snn_res.0.1[idt$SCT_snn_res.0.2 == "4"] <- "5_b"
idt$SCT_snn_res.0.1[idt$SCT_snn_res.0.2 == "6"] <- "5_a"
idt$SCT_snn_res.0.1[idt$SCT_snn_res.0.2 == "8_b"] <- "5_c"
idt$SCT_snn_res.0.1[idt$SCT_snn_res.0.2 == "11"] <- "5_d"
idt$SCT_snn_res.0.1[idt$SCT_snn_res.0.2 == "12"] <- "5_e"
# adding new idents to Seurat obj
epi_seurat$new_idents <- idt$SCT_snn_res.0.1
# setting ident for plotting
Idents(epi_seurat) <- epi_seurat$new_idents
# plotting
png(filename = "cluster_epi_harmony_UMAP_new_idents.png", width = 16, height = 8.135, units = "in", res = 300)
DimPlot(epi_seurat,
        reduction = "umap",
        label = TRUE,
        label.size = 6)
dev.off()

10) DE and GSEA

library(SCP)


# DE and Enrichement analysis

#Differential expression analysis; this takes a while, to upload the object containing DE result:
epi_seurat <- readRDS("updated_epi_seurat.RDS")

epi_seurat <- RunDEtest(srt = epi_seurat, group_by = "clusters", fc.threshold = 1.5, only.pos = FALSE)

#
png(filename = "vplcano_DE.png", width = 16, height = 8.135, units = "in", res = 300)
VolcanoPlot(srt = epi_seurat, group_by = "clusters")
dev.off()

#
DEGs <- epi_seurat@tools$DEtest_clusters$AllMarkers_wilcox
DEGs <- DEGs[with(DEGs, avg_log2FC > 1.5 & p_val_adj < 0.05), ]
# Annotate features with transcription factors and surface proteins
#epi_seurat <- AnnotateFeatures(epi_seurat, species = "Homo_sapiens", db = c("TF", "SP"))

# png(filename = "heatmap_DE.png", width = 16, height = 8.135, units = "in", res = 300)
# ht <- FeatureHeatmap(
#   srt = epi_seurat, group.by = "clusters", features = DEGs$gene, feature_split = DEGs$group1,
#   species = "Homo_sapiens", db = c("GO_BP", "KEGG", "WikiPathway"), anno_terms = TRUE)
# print(ht$plot)
# dev.off()


# Enrichment analysis
## over-representation analysis

epi_seurat <- RunEnrichment(
  srt = epi_seurat, group_by = "clusters", db = "GO_BP", species = "Homo_sapiens",
  DE_threshold = "avg_log2FC > 1 & p_val_adj < 0.05"
)

png(filename = "enrichment_1.png", width = 16, height = 8.135, units = "in", res = 300)
EnrichmentPlot(
  srt = epi_seurat, group_by = "clusters", group_use = c("basal_cell", "differentiated_luminal_cell"),
  plot_type = "bar"
)
dev.off()

png(filename = "enrichment_2.png", width = 16, height = 8.135, units = "in", res = 300)
EnrichmentPlot(srt = epi_seurat, group_by = "clusters", plot_type = "comparison")
dev.off()

## Enrichment analysis(GSEA)
epi_seurat <- RunGSEA(
  srt = epi_seurat, group_by = "clusters", db = "GO_BP", species = "Homo_sapiens",
  DE_threshold = "p_val_adj < 0.05"
)

# plotting for hypoxia
png(filename = "enrichment_4.png", width = 16, height = 8.135, units = "in", res = 300)
GSEAPlot(srt = epi_seurat, group_by = "clusters", group_use = "differentiated_luminal_cell", geneSetID = "GO:0001666")
dev.off()

png(filename = "enrichment_5.png", width = 16, height = 8.135, units = "in", res = 300)
GSEAPlot(srt = epi_seurat, group_by = "clusters", group_use = "cancer_associated_luminal_cell", geneSetID = "GO:0001666")
dev.off()

png(filename = "enrichment_6.png", width = 16, height = 8.135, units = "in", res = 300)
GSEAPlot(srt = epi_seurat, group_by = "clusters", group_use = "basal_cell", geneSetID = "GO:0001666")
dev.off()

# comparison plot
png(filename = "enrichment_3.png", width = 16, height = 8.135, units = "in", res = 300)
GSEAPlot(srt = epi_seurat, group_by = "clusters", plot_type = "comparison")
dev.off()

11) Trajectory inference

Slingshot is a computational method designed for cell lineage inference and pseudotime estimation in single-cell transcriptomics data. It operates by leveraging gene expression dynamics to reconstruct the developmental trajectories of individual cells and assign them pseudotime values. The method begins by clustering cells based on their gene expression profiles, allowing for the identification of distinct cell populations. It then performs differential expression analysis to identify genes that are dynamically expressed across these populations, capturing the transitions between different cell types or states. Slingshot uses this information to infer lineage relationships between the cell populations, revealing the hierarchical structure of the developmental trajectory. By assigning pseudotime values, Slingshot estimates the relative progression of cells through different stages of development or biological processes. The inferred trajectories and pseudotime values provide insights into the temporal ordering of cellular events and gene expression dynamics. Slingshot's visualization tools facilitate the exploration and interpretation of these results, enabling researchers to gain a deeper understanding of the underlying processes driving cellular differentiation and disease progression in single-cell transcriptomics data.

png(filename = "trajectory.png", width = 16, height = 8.135, units = "in", res = 300)
epi_seurat <- RunSlingshot(srt = epi_seurat, group.by = "clusters", reduction = "UMAP")
dev.off()

Identify temporally dynamic genes

After running slingshot, I am going to explore gene expression changes during development. I aim to discover genes that undergo significant expression shifts. The method involves fitting a General Additive Model (GAM) for each gene, employing a negative binomial noise distribution to capture potential nonlinear connections between gene expression and pseudotime. Then conduct association tests to unveil meaningful links between expression and pseudotime.

library(slingshot)
library(tradeSeq)
library(RColorBrewer)

# convert Seurat to sce
sce <- as.SingleCellExperiment(epi_seurat)
# Running slingshot
sce <- slingshot(sce, clusterLabels = 'clusters', reducedDim = 'UMAP')

## The following analysis is computationally intensive; so only a subset of genes will be passed to the function
# Gene filtration: filter out genes with lower than 5 reads expressed in < 500 cells
geneFilter <- apply(assays(sce)$counts,1,function(x){
  sum(x >= 5) >= 500
})
tdg <- rownames(sce[geneFilter, ])

# Variable feature
vf <- VariableFeatures(epi_seurat)
# Combining the genes
selected_feature <- tdg[tdg %in% vf]
## Using the tradeSeq package
# Number of knot determined by ....
icMat <- evaluateK(counts = sce, sds = crv, k = 3:7, nGenes = 50,
                   verbose = TRUE, plot = TRUE)
# 5 seems to be the right number for knot

# fit negative binomial GAM
tradeseq_res <- fitGAM(counts = cse, nknots = 5, genes = selected_feature, verbose = TRUE)
# test for dynamic expression
ATres <- associationTest(tradeseq_res)

# Visualizing heat map
pst.ord <- order(sce$slingPseudotime_1, na.last = NA)
heatdata <- as.matrix(assays(sce)$counts[topgenes, pst.ord])
heatclus <- sce$clusters[pst.ord]

png("tdg.png", width=10, height = 10, unit = "in", res = 300)
heatmap(log1p(heatdata), Colv = NA,
        ColSideColors = brewer.pal(9,"Set1")[heatclus])
dev.off()

Exploring normal bladder mouse samples

#________________________Reading the files______________________#

samples <- c("SRR12603780", "SRR12603781", "SRR12603788")

# read files inot Seurat objects
for (file in samples){
  print(paste0(file))
  seurat_data <- Read10X(data.dir = paste0("../filtered/", file))
  seurat_obj <- CreateSeuratObject(counts = seurat_data, 
                                   min.features = 100, 
                                   project = file)
  assign(file, seurat_obj)
}


# now merging all objects inot one Seurat obj

merged_seurat <- merge(x = SRR12603780,
                       y = c(SRR12603781,
                             SRR12603788),
                       add.cell.id = samples)
                       
#________________________Filteration____________________________#
# reading sampleInformation:
sampleInformation <- read.csv("../sampleInfo.csv")

# Compute percent mito ratio
merged_seurat$mitoRatio <- PercentageFeatureSet(object = merged_seurat, pattern = "^MT-")
merged_seurat$mitoRatio <- merged_seurat@meta.data$mitoRatio / 100

# adding cell column
merged_seurat$cells <- rownames(merged_seurat@meta.data)
# merging with sample information
merged_seurat@meta.data <- merge(merged_seurat@meta.data, sampleInformation)
# re-setting the rownames
rownames(merged_seurat@meta.data) <- merged_seurat@meta.data$cells


# Filteration

filtered_seurat <- subset(merged_seurat, 
                          subset= nCount_RNA >= 1000 &
                          nFeature_RNA <= 6000 & 
                          mitoRatio < 0.10)

#________________________Integration using Harmony____________________________#
#integration using harmony need sevral steps to be undertaken:

# Perform log-normalization and feature selection, as well as SCT normalization on global object
merged_seurat <- filtered_seurat %>%
    NormalizeData() %>%
    FindVariableFeatures(selection.method = "vst", nfeatures = 3000) %>% 
    ScaleData() %>%
    SCTransform(vars.to.regress = c("mitoRatio", "orig.ident"))

# Calculate PCs using variable features determined by SCTransform (3000 by default)
merged_seurat <- RunPCA(merged_seurat, assay = "SCT", npcs = 50)

# Integration
#install.packages("harmony")

library(harmony)

harmonized_seurat <- RunHarmony(merged_seurat, 
				group.by.vars = c("orig.ident", "Surgery_Type"), 
				reduction = "pca", assay.use = "SCT", reduction.save = "harmony")

harmonized_seurat <- RunUMAP(harmonized_seurat, reduction = "harmony", assay = "SCT", dims = 1:40)

#Cluster identification and Inspect the effects of Harmony batch removel ____________#

# to set reduction to harmony and find the clusters
harmonized_seurat <- FindNeighbors(object = harmonized_seurat, reduction = "harmony")
harmonized_seurat <- FindClusters(harmonized_seurat, resolution = c(0.1, 0.2, 0.4, 0.6, 0.8))

# visualization
Idents(harmonized_seurat) <- harmonized_seurat@meta.data$SCT_snn_res.0.1

# Color cells based on the sample name
# Plot UMAP 
png(filename = "normal_harmony_UMAP_y_sample.png", width = 16, height = 8.135, units = "in", res = 300)
DimPlot(harmonized_seurat,
        group.by = "orig.ident",
        reduction = "umap")
dev.off()

#________________________SuperCluster Identification____________#

png(filename = "normal_harmony_umap_cluster_with_label.png", width = 16, height = 8.135, units = "in", res = 300)
DimPlot(harmonized_seurat,
        reduction = "umap",
        label = TRUE,
        label.size = 6)
dev.off()

# let's visualize cells expressing supercluster markers:
# CD31: PECAM1
markers <- c("EPCAM", "PECAM1", "COL1A1", "PDGFRA", "RGS5", "CD79A", "LYZ", "CD3D", "TPSAB1")

png(filename = "normal_umap_superCluster_cells.png", width = 16, height = 8.135, units = "in", res = 300)
FeaturePlot(object = harmonized_seurat,
            features = markers,
            order = TRUE,
            min.cutoff = "q10",
            reduction = "umap",
            label = TRUE,
            repel = TRUE)

dev.off()

It seems the epithelial cells are clustered together in cluster 4. So for further analysis, I will be focusing on these cells.

epi_cell_ids <- rownames(harmonized_seurat@meta.data)[harmonized_seurat@meta.data$SCT_snn_res.0.1 == '4']

epi_seurat <- subset(filtered_seurat, subset = cells %in% epi_cell_ids)

# Perform log-normalization and feature selection, as well as SCT normalization on the global object
epi_seurat <- epi_seurat %>%
    NormalizeData() %>%
    FindVariableFeatures(selection.method = "vst", nfeatures = 3000) %>% 
    ScaleData() %>%
    SCTransform(vars.to.regress = c("mitoRatio", "orig.ident"))

# Calculate PCs using variable features determined by SCTransform (3000 by default)
epi_seurat <- RunPCA(epi_seurat, assay = "SCT", npcs = 50)

epi_seurat <- RunHarmony(epi_seurat, 
				group.by.vars = c("orig.ident", "Surgery_Type"), 
				reduction = "pca", assay.use = "SCT", reduction.save = "harmony")

epi_seurat <- RunUMAP(epi_seurat, reduction = "harmony", assay = "SCT", dims = 1:40)


#Cluster identification and Inspect the effects of Harmony batch removel#

# to set reduction to harmony and find the clusters
epi_seurat <- FindNeighbors(object = epi_seurat, reduction = "harmony")
epi_seurat <- FindClusters(epi_seurat, resolution = c(0.1, 0.2, 0.4, 0.6, 0.8))

# visualization
Idents(epi_seurat) <- epi_seurat@meta.data$SCT_snn_res.0.1


# color cells based on the cluster
# Plot UMAP 
png(filename = "normal_cluster_epi_harmony_UMAP.png", width = 16, height = 8.135, units = "in", res = 300)
DimPlot(epi_seurat,
        reduction = "umap",
        label = TRUE,
        label.size = 6)
dev.off()

# Marker identification

epi_markers <- FindAllMarkers(object = epi_seurat, 
                          only.pos = TRUE,
                          logfc.threshold = 0.25) 

# mutate the markers dataframe
# Extract top 10 markers per cluster

epi_top10 <- epi_markers %>%
  mutate(delta_pct = (pct.1 - pct.2)) %>%
  #filter(avg_log2FC > 1.5) %>%  # only keep rows where avg_log2FC > 1.5
  group_by(cluster) %>%
  top_n(n = 10, wt = delta_pct)


epi_cluster_markers_10 <- epi_top10 %>% 
                   group_by(cluster) %>% 
                   summarize(genes = paste(gene, collapse = ","))
                   
                   
# Insepecting the expression patterns
png(filename = "normal_markers_cells.png", width = 16, height = 8.135, units = "in", res = 300)
FeaturePlot(object = epi_seurat,
            features = unlist(strsplit(epi_cluster_markers_10[2,]$genes, split = ",")),
            order = TRUE,
            min.cutoff = "q10",
            reduction = "umap",
            label = TRUE,
            repel = TRUE)

dev.off()

cluster	genes	cluster name
0	HS6ST3,SLC9A2,GSTM3,CLMN,GRK3,GRHL1,ZNF710,ERBB3,SLC17A5,ASTN2	luminal_intermediate_stage
1	GMNN,SLC7A11,ANXA10,NRARP,SLC12A6,TFF1,PTGS2,ODC1,DSP,MIR31HG	stromal_associated
2	EGR1,DUSP2,DDIT4,DDIT3,KLF10,GATA2,TXNIP,AC005920.2,AL691403.1,PLA2G6	early_stage_basal_cell
3	NBAT1,BICC1,ZNF350-AS1,SULT1E1,CASC15,KRT20,PTN,B4GALT3,RHEX,SCHLAP1	intermediated_luminal_cell
4	PALLD,SPOCK3,NRG1,MYO1B,NAV2,ETS1,CAV1,KRT5,FLNA,TUBB6	intermediate_stage_basal_cell
5	BCAT1,UPK2,MAL,ARMT1,SDHAF4,PTN,KRT20,SNX31,C4orf48,ISG15	luminal_differentiated
6	ADAMTS9-AS2,EBF1,SPARCL1,PLCB4,CAVIN1,ZEB1,SLC2A3,CAV1,TCF4,KLF2	early_stage_basal_cells

Reason for nomenclature:

cluster 0

This cluster includes genes that are expressed in the luminal layer of bladder tissue. However, the specific combination of markers suggests a relatively early or intermediate stage of luminal cell differentiation.

cluster 1

Based on the gene expression profile of GMNN, SLC7A11, ANXA10, NRARP, SLC12A6, TFF1, PTGS2, ODC1, DSP, and MIR31HG, a possible name for the group of cells could be "Stromal-Associated Cluster" or "Mesenchymal-Enriched Cluster". This suggestion is based on the presence of genes like ANXA10 and DSP, which are typically associated with mesenchymal or stromal cell types. Additionally, genes like PTGS2 and SLC7A11 are commonly expressed in stromal or immune cells and may suggest an association with the tumor microenvironment or immune response.

cluster 2

This cluster consists of genes associated with early differentiation processes and transcriptional regulation. The presence of genes like EGR1 and KLF10 suggests an early stage in the differentiation process.

cluster 3 This cluster also contains markers associated with both basal and luminal cell characteristics. While KRT20 suggests a luminal presence, the combination of markers does not indicate a well-defined stage of luminal differentiation.

cluster 4

This cluster shows a mix of genes associated with both basal and luminal cell characteristics. While it may not represent the most differentiated luminal cells, it could be positioned in an intermediate stage of differentiation.

cluster 5

Luminal Differentiation Cluster: This name is suggested based on the presence of markers such as UPK2 and KRT20. UPK2 (Uroplakin 2) is a marker of luminal/umbrella cells in the urothelium, which lines the bladder and urethra. KRT20 is also a luminal marker and is associated with differentiated epithelial cells in various tissues. This suggests that the cluster may represent a group of cells undergoing luminal differentiation or with a mature luminal phenotype.

cluster 6

Basal-like: This name emphasizes the presence of markers like ZEB1, TCF4, and KLF2, which have been associated with basal-like features in various epithelial tissues. It suggests that the cluster may represent cells with a basal-like differentiation program.

# Rename all identities
epi_seurat <- RenameIdents(object = epi_seurat,
                                "0" = "intermediate_luminal",
                                "1" = "stromal_associated",
                                "2" = "early_basal_1",
                                "3" = "luminal[KRT20]_basal",
                                "4" = "luminal_basal[KRT5]",
                                "5" = "luminal_differentiated",
                                "6" = "early_basal_0")
# Plot the UMAP withy new labells
png(filename = "normal_labelled_epi.png", width = 16, height = 8.135, units = "in", res = 600)
DimPlot(object = epi_seurat, 
        reduction = "umap", 
        label = TRUE,
        label.size = 3,
        repel = TRUE)
dev.off()

Trajectories of normal epithelial cells

epi_seurat$clusters <- Idents(epi_seurat)
png(filename = "normal_trajectory.png", width = 16, height = 8.135, units = "in", res = 300)
epi_seurat <- RunSlingshot(srt = epi_seurat, group.by = "clusters", reduction = "UMAP")
dev.off()

Name		Name	Last commit message	Last commit date
Latest commit History 102 Commits
images		images
README.md		README.md

hamidghaedi/scRNA_seq-analysis

Folders and files

Latest commit

History

Repository files navigation