Skip to content

Commit

Permalink
Clean up old vignette
Browse files Browse the repository at this point in the history
  • Loading branch information
mvfki committed Feb 29, 2024
1 parent 96a7f48 commit d77c904
Show file tree
Hide file tree
Showing 3 changed files with 36 additions and 159 deletions.
2 changes: 2 additions & 0 deletions .Rbuildignore
Original file line number Diff line number Diff line change
Expand Up @@ -21,3 +21,5 @@ LICENSE
^devdata
^tests/testthat/*\.h5$
^vignettes/articles$
^doc$
^Meta$
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -20,3 +20,5 @@ src/CMakeCache.txt
src/Makefile
build/*
vignettes/articles/*_cache/
/doc/
/Meta/
191 changes: 32 additions & 159 deletions vignettes/liger-vignette.Rmd
Original file line number Diff line number Diff line change
@@ -1,181 +1,54 @@
---
title: "Comparing and contrasting heterogeneous single cell profiles using liger"
author: "Joshua D. Welch and Velina Kozareva"
title: "Data Integration with LIGER"
author: "Joshua D. Welch"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Comparing and contrasting heterogeneous single cell profiles using liger}
%\VignetteIndexEntry{Data Integration with LIGER}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
knitr::opts_chunk$set()
```

## Data preprocessing
The algorithm takes a list of two or more digital gene expression (DGE) matrices as input. Genes should be in rows and cells in columns.
Before running the factorization, we need to normalize the data to account for different numbers of UMIs per cell,
select variable genes, and scale the data. Note that we do not center the data because nonnegative matrix factorization accepts only
positive values. The selectGenes function performs variable gene selection on each of the datasets separately, then takes the union. Note that corresponding genes in each dataset need to have the same names (though the genes do not need to be in the same order in each dataset). For cross-species analysis, it may be convenient to convert all gene names to uppercase; you can do this using the capitalize=T option of the selectGenes function.
```r
dge1 = readRDS("dge1.RDS") #genes in rows, cells in columns, rownames and colnames included. Sparse matrix format is
recommended.
dge2 = readRDS("dge2.RDS")
ligerex = createLiger(list(name1 = dge1, name2 = dge2)) #Can also pass in more than 2 datasets
ligerex = normalize(ligerex)
ligerex = selectGenes(ligerex, var.thresh = 0.1)
ligerex = scaleNotCenter(ligerex)
## Introduction

```

### Loading 10X Data
`liger` also has functions for reading data generated by 10X's `cellranger count` pipeline (including
from 10X V3). We can merge data by data type (most commonly Gene Expression) across multiple samples and
then use this as a single dataset in a new object for integration.
```r
# 10X data to be combined for sample 1
sample1_data1 = "/path/to/10X/output/dir1"
sample1_data2 = "/path/to/10X/output/dir2"
sample1_dge = read10X(sample.dirs = list(sample1_data1, sample1_data2),
sample.names = c("s1_data1", "s1_data2"), min.umis = 500)
# 10X data for sample 2
sample2_data = "/path/to/10X/output/dir3"
sample2_dge = read10X(sample.dirs = list(sample2_data),
sample.names = c("s2_data"), min.umis = 500)
liger10X = createLiger(list(sample1 = sample1_dge, sample2 = sample2_dge))
# continue with other preprocessing steps

```

## Performing the factorization
Next we perform the factorization using an alternating least squares algorithm. After performing the factorization,
we identify cells that load on corresponding cell factors and quantile normalize their factor loadings across datasets.
The key parameters here are the number of factors (k), the penalty parameter (lambda), and the clustering resolution.
In most cases, the default settings of lambda=5.0 and resolution=1.0 provide reasonable results.
```r
ligerex = optimizeALS(ligerex, k = 20)
ligerex = quantileAlignSNF(ligerex) #SNF clustering and quantile alignment
```

## Visualizing the results
We can visualize the results by using dimensionality reduction techniques like t-SNE or UMAP (recommended
for larger datasets). Visualizations can be colored by dataset of origin, cluster assignment, or
any feature included in or added to the cell metadata (`cell.data`).

`plotWordClouds` and `plotGeneLoadings` are useful ways to visualize the most highly loading genes (both shared and dataset specific) for each factor, in conjunction with the factor loadings across cells.
These functions are very similar, but `plotGeneLoadings` displays the top loading genes in ordered scatter plots instead of word clouds.
```r
ligerex = runTSNE(ligerex)
# for larger datasets, may want to use UMAP instead
ligerex = runUMAP(ligerex)
plotByDatasetAndCluster(ligerex) #Can also pass in different set of cluster labels to plot
plotFeature(ligerex, "nUMI")

pdf("word_clouds.pdf")
plotWordClouds(ligerex)
dev.off()
LIGER was initially introduced in [Welch et al. 2019](https://doi.org/10.1016/j.cell.2019.05.006) as a method for integrating single-cell RNA-seq data across multiple technologies, species, and conditions. The method relies on integrative nonnegative matrix factorization (iNMF) to identify shared and dataset-specific factors.

pdf("gene_loadings.pdf")
plotGeneLoadings(ligerex)
dev.off()
LIGER can be used to compare and contrast experimental datasets in a variety of contexts, for instance:

```
- Across experimental batches
- Across individuals
- Across sex
- Across tissues
- Across species (e.g., mouse and human)
- Across modalities (e.g., scRNAseq and spatial transcriptomics data, scMethylation, or scATAC-seq)

## Exploring factors and clusters
Another way to examine factor loadings across cells, and to help visualize the alignment process is
through `plotFactors`; this can be helpful for seeing significant scale differences between the
datasets. We can also visualize the correspondence between clusters and factors in the data with
`plotClusterProportions` and `plotClusterFactors`. These can be especially useful for identifying
clusters associated with certain factors and corresponding marker genes.
```r
plotFactors(ligerex)
plotClusterProportions(ligerex)
plotClusterFactors(ligerex, use.aligned = T)
```
Once multiple datasets are integrated, the package provides functionality for further data exploration, analysis, and visualization. Users can:

## Finding and visualizing marker genes
We can use the factorization to more explicitly identify shared and dataset-specific markers. The function `getFactorMarkers`
returns a list, where the first element contains dataset-specific markers for dataset 1, the second
element contains sharedmarkers, the third element contains dataset-specific markers for dataset 2,
and the last 2 elements indicate the number of factors in which each marker is found. This information
allows the identification of ubiquitous vs. cell-type-specific dataset differences.
```r
markers = getFactorMarkers(ligerex, num.genes = 10)
plotGene(ligerex, gene = "Malat1")
plotGeneViolin(ligerex, gene = "Malat1")
```
- Identify clusters
- Find significant shared (and dataset-specific) gene markers
- Compare clusters with previously identified cell types
- Visualize clusters and gene expression using t-SNE and UMAP

## Comparing different cluster assignments
We can compare and visualize `liger` cluster assignments with existing clusterings (if cluster assignments
are available for the individual datasets).
```r
# published cluster assignments for all cells in dataset 1 and 2
clusters_published1 = readRDS("clusters1.RDS")
clusters_published2 = readRDS("clusters2.RDS")
# calculate adjusted Rand Index between liger cluster assignments and another assignment
calcARI(ligerex, c(clusters_published1, clusters_published2))
# calculate purity between liger cluster assignments and another assignment for just dataset 1
calcPurity(ligerex, clusters_published1)
# visualize joint cluster assignment as related to individual dataset cluster assignments
makeRiverplot(ligerex, clusters_published1, clusters_published2)
```
## Usage

## Checking dataset alignment and individual dataset distortion
`liger` includes methods for estimating the level of integration (alignment) between datasets and
the level of distortion of structure for each individual dataset after factorization and alignment
(compared to before). In general, datasets which have been factorized and aligned in a meaningful way
should show a high degree of integration (high alignment metric), while maintaining a low degree of distortion
(high agreement metric).
```r
calcAlignment(ligerex)
calcAgreement(ligerex)
# see if certain clusters are more integrated than others
calcAlignmentPerCluster(ligerex)
```
We have now made a [documentation website for rliger 2.0.0](https://mvfki.github.io/liger/). Please check it out for detailed introduction.

## Selecting k and lambda
The `suggestK` and `suggestLambda` functions can aid in selecting k and lambda. We want to find the smallest
k for which the increase in entropy metric begins to level off (an "elbow" in the plot). Similarly, we want the
smallest lambda for which the alignment metric stabilizes.
```r
suggestK(ligerex) # plot entropy metric to find an elbow that can be used to select the number of factors
suggestLambda(ligerex, k) # plot alignment metric to find an elbow that can be used to select the value of lambda
```
We have made a number of vignettes for typical types of analysis that can be performed with LIGER.

## Updating the factorization
If we want to add new data, change k or lambda, or re-analyze a subset of the data, the functions
below provide an efficient method of updating. This is much faster than the naive approach of
simply re-running the optimizeALS algorithm.
```r
ligerex = optimizeNewK(ligerex, k = 15) #Can also decrease K
#Add new batches from the same condition/technology/species/protocol
ligerex = optimizeNewData(ligerex, newdata = list(name1 = dge1.new, name2 = dge2.new),
which.datasets = list(name1, name2), add.to.existing = T)
#Add completely new datasets. Specify which existing datasets are most similar.
ligerex = optimizeNewData(ligerex, newdata = list(name3 = dge3, name4 = dge4),
which.datasets = list(name1, name2), add.to.existing = F)
#cell.subset is a list of cells to retain from each dataset
ligerex = optimizeSubset(ligerex, cell.subset)
```
* [Integrating Multiple Single-Cell RNA-seq Datasets](https://mvfki.github.io/liger/articles/Integrating_multi_scRNA_data.html)
* [Jointly Defining Cell Types from scRNA-seq and scATAC-seq](https://mvfki.github.io/liger/articles/Integrating_scRNA_and_scATAC_data.html)
* [Iterative Single-Cell Multi-Omic Integration Using Online iNMF](https://mvfki.github.io/liger/articles/online_iNMF_tutorial.html)
* [Integrating unshared features with UINMF](https://mvfki.github.io/liger/articles/UINMF_vignette.html)
* [Integrating spatial transcriptomic and transcriptomic datasets using UINMF](https://mvfki.github.io/liger/articles/STARmap_dropviz_vig.html)
* [scATAC and scRNA Integration using unshared features (UINMF)](https://mvfki.github.io/liger/articles/SNAREseq_walkthrough.html)
* [Cross-species Analysis with UINMF](https://mvfki.github.io/liger/articles/cross_species_vig.html)
* [Jointly Defining Cell Types from Single-Cell RNA-seq and DNA Methylation](https://mvfki.github.io/liger/articles/rna_methylation.html)

## Integration between liger and Seurat
We can easily create `liger` objects from Seurat (V2 or V3) objects (and vice versa), while keeping
calculated features from the original objects.
```r
# Create liger object from two separate Seurat objects, keeping union of top 2000 highly variable
# genes from each object
ligerex = seuratToLiger(list(seurat1, seurat2), names = c('name1', 'name2'), num.hvg.info = 2000)
# Create liger object from single integrated Seurat V2 object, keeping CCA factorization,
# splitting datasets by Seurat meta.var column "original"
ligerex = seuratToLiger(seurat_obj, combined.seurat = T, meta.var = 'original', cca.to.H = T)
# Create liger object from single integrated Seurat V3 object, splitting datasets by two
# available assays in Seurat
ligerex = seuratToLiger(seurat_obj, combined.seurat = T, assays.use = c('RNA', 'ADT'))
# Create Seurat object from liger object, keeping liger highly variable genes
seurat_obj = ligerToSeurat(ligerex, use.liger.genes = T)
Meanwhile, since version 2.0.0, LIGER is massively updated for usability and interoperability with other packages. Below are links to the introduction of new features.

```
* [Introduction to new liger object and other related classes](https://mvfki.github.io/liger/articles/liger_object.html)
* [Running Liger directly on Seurat objects](https://mvfki.github.io/liger/articles/liger_with_seurat.html)

0 comments on commit d77c904

Please sign in to comment.