diff --git a/.Rbuildignore b/.Rbuildignore index ab0cbcfa..cdf1d320 100644 --- a/.Rbuildignore +++ b/.Rbuildignore @@ -21,3 +21,5 @@ LICENSE ^devdata ^tests/testthat/*\.h5$ ^vignettes/articles$ +^doc$ +^Meta$ diff --git a/.gitignore b/.gitignore index 5abf2689..887c6806 100644 --- a/.gitignore +++ b/.gitignore @@ -20,3 +20,5 @@ src/CMakeCache.txt src/Makefile build/* vignettes/articles/*_cache/ +/doc/ +/Meta/ diff --git a/vignettes/liger-vignette.Rmd b/vignettes/liger-vignette.Rmd index a9601e5d..8b7ea5b4 100644 --- a/vignettes/liger-vignette.Rmd +++ b/vignettes/liger-vignette.Rmd @@ -1,181 +1,54 @@ --- -title: "Comparing and contrasting heterogeneous single cell profiles using liger" -author: "Joshua D. Welch and Velina Kozareva" +title: "Data Integration with LIGER" +author: "Joshua D. Welch" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > - %\VignetteIndexEntry{Comparing and contrasting heterogeneous single cell profiles using liger} + %\VignetteIndexEntry{Data Integration with LIGER} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} -knitr::opts_chunk$set( - collapse = TRUE, - comment = "#>" -) +knitr::opts_chunk$set() ``` -## Data preprocessing -The algorithm takes a list of two or more digital gene expression (DGE) matrices as input. Genes should be in rows and cells in columns. -Before running the factorization, we need to normalize the data to account for different numbers of UMIs per cell, -select variable genes, and scale the data. Note that we do not center the data because nonnegative matrix factorization accepts only -positive values. The selectGenes function performs variable gene selection on each of the datasets separately, then takes the union. Note that corresponding genes in each dataset need to have the same names (though the genes do not need to be in the same order in each dataset). For cross-species analysis, it may be convenient to convert all gene names to uppercase; you can do this using the capitalize=T option of the selectGenes function. -```r -dge1 = readRDS("dge1.RDS") #genes in rows, cells in columns, rownames and colnames included. Sparse matrix format is -recommended. -dge2 = readRDS("dge2.RDS") -ligerex = createLiger(list(name1 = dge1, name2 = dge2)) #Can also pass in more than 2 datasets -ligerex = normalize(ligerex) -ligerex = selectGenes(ligerex, var.thresh = 0.1) -ligerex = scaleNotCenter(ligerex) +## Introduction -``` - -### Loading 10X Data -`liger` also has functions for reading data generated by 10X's `cellranger count` pipeline (including -from 10X V3). We can merge data by data type (most commonly Gene Expression) across multiple samples and -then use this as a single dataset in a new object for integration. -```r -# 10X data to be combined for sample 1 -sample1_data1 = "/path/to/10X/output/dir1" -sample1_data2 = "/path/to/10X/output/dir2" -sample1_dge = read10X(sample.dirs = list(sample1_data1, sample1_data2), - sample.names = c("s1_data1", "s1_data2"), min.umis = 500) -# 10X data for sample 2 -sample2_data = "/path/to/10X/output/dir3" -sample2_dge = read10X(sample.dirs = list(sample2_data), - sample.names = c("s2_data"), min.umis = 500) -liger10X = createLiger(list(sample1 = sample1_dge, sample2 = sample2_dge)) -# continue with other preprocessing steps - -``` - -## Performing the factorization -Next we perform the factorization using an alternating least squares algorithm. After performing the factorization, -we identify cells that load on corresponding cell factors and quantile normalize their factor loadings across datasets. -The key parameters here are the number of factors (k), the penalty parameter (lambda), and the clustering resolution. -In most cases, the default settings of lambda=5.0 and resolution=1.0 provide reasonable results. -```r -ligerex = optimizeALS(ligerex, k = 20) -ligerex = quantileAlignSNF(ligerex) #SNF clustering and quantile alignment -``` - -## Visualizing the results -We can visualize the results by using dimensionality reduction techniques like t-SNE or UMAP (recommended -for larger datasets). Visualizations can be colored by dataset of origin, cluster assignment, or -any feature included in or added to the cell metadata (`cell.data`). - -`plotWordClouds` and `plotGeneLoadings` are useful ways to visualize the most highly loading genes (both shared and dataset specific) for each factor, in conjunction with the factor loadings across cells. -These functions are very similar, but `plotGeneLoadings` displays the top loading genes in ordered scatter plots instead of word clouds. -```r -ligerex = runTSNE(ligerex) -# for larger datasets, may want to use UMAP instead -ligerex = runUMAP(ligerex) -plotByDatasetAndCluster(ligerex) #Can also pass in different set of cluster labels to plot -plotFeature(ligerex, "nUMI") - -pdf("word_clouds.pdf") -plotWordClouds(ligerex) -dev.off() +LIGER was initially introduced in [Welch et al. 2019](https://doi.org/10.1016/j.cell.2019.05.006) as a method for integrating single-cell RNA-seq data across multiple technologies, species, and conditions. The method relies on integrative nonnegative matrix factorization (iNMF) to identify shared and dataset-specific factors. -pdf("gene_loadings.pdf") -plotGeneLoadings(ligerex) -dev.off() +LIGER can be used to compare and contrast experimental datasets in a variety of contexts, for instance: -``` +- Across experimental batches +- Across individuals +- Across sex +- Across tissues +- Across species (e.g., mouse and human) +- Across modalities (e.g., scRNAseq and spatial transcriptomics data, scMethylation, or scATAC-seq) -## Exploring factors and clusters -Another way to examine factor loadings across cells, and to help visualize the alignment process is -through `plotFactors`; this can be helpful for seeing significant scale differences between the -datasets. We can also visualize the correspondence between clusters and factors in the data with -`plotClusterProportions` and `plotClusterFactors`. These can be especially useful for identifying -clusters associated with certain factors and corresponding marker genes. -```r -plotFactors(ligerex) -plotClusterProportions(ligerex) -plotClusterFactors(ligerex, use.aligned = T) -``` +Once multiple datasets are integrated, the package provides functionality for further data exploration, analysis, and visualization. Users can: -## Finding and visualizing marker genes -We can use the factorization to more explicitly identify shared and dataset-specific markers. The function `getFactorMarkers` -returns a list, where the first element contains dataset-specific markers for dataset 1, the second -element contains sharedmarkers, the third element contains dataset-specific markers for dataset 2, -and the last 2 elements indicate the number of factors in which each marker is found. This information -allows the identification of ubiquitous vs. cell-type-specific dataset differences. -```r -markers = getFactorMarkers(ligerex, num.genes = 10) -plotGene(ligerex, gene = "Malat1") -plotGeneViolin(ligerex, gene = "Malat1") -``` +- Identify clusters +- Find significant shared (and dataset-specific) gene markers +- Compare clusters with previously identified cell types +- Visualize clusters and gene expression using t-SNE and UMAP -## Comparing different cluster assignments -We can compare and visualize `liger` cluster assignments with existing clusterings (if cluster assignments -are available for the individual datasets). -```r -# published cluster assignments for all cells in dataset 1 and 2 -clusters_published1 = readRDS("clusters1.RDS") -clusters_published2 = readRDS("clusters2.RDS") -# calculate adjusted Rand Index between liger cluster assignments and another assignment -calcARI(ligerex, c(clusters_published1, clusters_published2)) -# calculate purity between liger cluster assignments and another assignment for just dataset 1 -calcPurity(ligerex, clusters_published1) -# visualize joint cluster assignment as related to individual dataset cluster assignments -makeRiverplot(ligerex, clusters_published1, clusters_published2) -``` +## Usage -## Checking dataset alignment and individual dataset distortion -`liger` includes methods for estimating the level of integration (alignment) between datasets and -the level of distortion of structure for each individual dataset after factorization and alignment -(compared to before). In general, datasets which have been factorized and aligned in a meaningful way -should show a high degree of integration (high alignment metric), while maintaining a low degree of distortion -(high agreement metric). -```r -calcAlignment(ligerex) -calcAgreement(ligerex) -# see if certain clusters are more integrated than others -calcAlignmentPerCluster(ligerex) -``` +We have now made a [documentation website for rliger 2.0.0](https://mvfki.github.io/liger/). Please check it out for detailed introduction. -## Selecting k and lambda -The `suggestK` and `suggestLambda` functions can aid in selecting k and lambda. We want to find the smallest -k for which the increase in entropy metric begins to level off (an "elbow" in the plot). Similarly, we want the -smallest lambda for which the alignment metric stabilizes. -```r -suggestK(ligerex) # plot entropy metric to find an elbow that can be used to select the number of factors -suggestLambda(ligerex, k) # plot alignment metric to find an elbow that can be used to select the value of lambda -``` +We have made a number of vignettes for typical types of analysis that can be performed with LIGER. -## Updating the factorization -If we want to add new data, change k or lambda, or re-analyze a subset of the data, the functions -below provide an efficient method of updating. This is much faster than the naive approach of -simply re-running the optimizeALS algorithm. -```r -ligerex = optimizeNewK(ligerex, k = 15) #Can also decrease K -#Add new batches from the same condition/technology/species/protocol -ligerex = optimizeNewData(ligerex, newdata = list(name1 = dge1.new, name2 = dge2.new), - which.datasets = list(name1, name2), add.to.existing = T) -#Add completely new datasets. Specify which existing datasets are most similar. -ligerex = optimizeNewData(ligerex, newdata = list(name3 = dge3, name4 = dge4), - which.datasets = list(name1, name2), add.to.existing = F) -#cell.subset is a list of cells to retain from each dataset -ligerex = optimizeSubset(ligerex, cell.subset) -``` +* [Integrating Multiple Single-Cell RNA-seq Datasets](https://mvfki.github.io/liger/articles/Integrating_multi_scRNA_data.html) +* [Jointly Defining Cell Types from scRNA-seq and scATAC-seq](https://mvfki.github.io/liger/articles/Integrating_scRNA_and_scATAC_data.html) +* [Iterative Single-Cell Multi-Omic Integration Using Online iNMF](https://mvfki.github.io/liger/articles/online_iNMF_tutorial.html) +* [Integrating unshared features with UINMF](https://mvfki.github.io/liger/articles/UINMF_vignette.html) +* [Integrating spatial transcriptomic and transcriptomic datasets using UINMF](https://mvfki.github.io/liger/articles/STARmap_dropviz_vig.html) +* [scATAC and scRNA Integration using unshared features (UINMF)](https://mvfki.github.io/liger/articles/SNAREseq_walkthrough.html) +* [Cross-species Analysis with UINMF](https://mvfki.github.io/liger/articles/cross_species_vig.html) +* [Jointly Defining Cell Types from Single-Cell RNA-seq and DNA Methylation](https://mvfki.github.io/liger/articles/rna_methylation.html) -## Integration between liger and Seurat -We can easily create `liger` objects from Seurat (V2 or V3) objects (and vice versa), while keeping -calculated features from the original objects. -```r -# Create liger object from two separate Seurat objects, keeping union of top 2000 highly variable -# genes from each object -ligerex = seuratToLiger(list(seurat1, seurat2), names = c('name1', 'name2'), num.hvg.info = 2000) -# Create liger object from single integrated Seurat V2 object, keeping CCA factorization, -# splitting datasets by Seurat meta.var column "original" -ligerex = seuratToLiger(seurat_obj, combined.seurat = T, meta.var = 'original', cca.to.H = T) -# Create liger object from single integrated Seurat V3 object, splitting datasets by two -# available assays in Seurat -ligerex = seuratToLiger(seurat_obj, combined.seurat = T, assays.use = c('RNA', 'ADT')) -# Create Seurat object from liger object, keeping liger highly variable genes -seurat_obj = ligerToSeurat(ligerex, use.liger.genes = T) +Meanwhile, since version 2.0.0, LIGER is massively updated for usability and interoperability with other packages. Below are links to the introduction of new features. -``` +* [Introduction to new liger object and other related classes](https://mvfki.github.io/liger/articles/liger_object.html) +* [Running Liger directly on Seurat objects](https://mvfki.github.io/liger/articles/liger_with_seurat.html)