---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/dsb)](https://CRAN.R-project.org/package=dsb)
<!-- badges: end -->
# <a href='https://CRAN.R-project.org/package=dsb/'><img src='man/figures/sticker2.png' align="right" width="150" /></a> dsb: a method for normalizing and denoising antibody derived tag data from CITE-seq, ASAP-seq, TEA-seq and related assays.
```{r, include = FALSE}
library(here)
knitr::opts_chunk$set(
#tidy = TRUE,
#tidy.opts = list(width.cutoff = 95),
warning = FALSE,
eval = FALSE,
root.dir = here()
)
```
The dsb R package is available on [**CRAN: latest dsb release**](https://CRAN.R-project.org/package=dsb). <br>
To install in R, use `install.packages('dsb')`. <br>
[**Mulè, Martins, and Tsang, Nature Communications (2022)**](https://www.nature.com/articles/s41467-022-29356-8) describes our deconvolution of ADT noise sources and development of dsb. <br>
#### Vignettes:
1. [**Using dsb in an end to end CITE-seq workflow including multimodal clustering in Seurat**](https://CRAN.R-project.org/package=dsb/vignettes/end_to_end_workflow.html)
2. [**How the dsb method works**](https://CRAN.R-project.org/package=dsb/vignettes/understanding_dsb.html)
3. [**Normalizing ADTs if empty drops are not available**](https://CRAN.R-project.org/package=dsb/vignettes/no_empty_drops.html)
4. [**Python users: use dsb in Python with *scverse* software *muon***](https://muon.readthedocs.io/en/latest/omics/citeseq.html)
5. [**FAQ etc.**](https://CRAN.R-project.org/package=dsb/vignettes/additional_topics.html) <br>
[**Recent Publications**](#pubications): check out recent publications that used dsb for ADT normalization.
In the first "end to end" vignette, we demonstrate basic CITE-seq analysis starting from UMI count alignment output files from Cell Ranger. Note that dsb is compatible with any alignment tool (see [**using other alignment tools**](#otheraligners)). We load unfiltered UMI data containing cells and empty droplets, perform QC on cells and background droplets, normalize with dsb, and demonstrate protein-based clustering and multimodal RNA+Protein joint clustering using dsb normalized values with Seurat's Weighted Nearest Neighbor method.
## Background and motivation <a name="background_motivation"></a>
Protein data derived from sequencing antibody derived tags (ADTs) in CITE-seq and other related assays has substantial background noise. [**Our paper**](https://www.nature.com/articles/s41467-022-29356-8) outlines the experiments and analysis we used to dissect sources of noise in ADT data and to develop our method. We found that all experiments measuring ADTs capture protein-specific background noise: ADT reads in empty / background drops (which outnumbered cell-containing droplets > 10-fold in all experiments) were highly concordant with ADT levels in unstained spike-in cells. We therefore use background droplets, which capture the *ambient component* of protein background noise, to correct values in cells. We also remove technical cell-to-cell variation by defining each cell's dsb "technical component", a conservative adjustment factor derived by combining isotype control levels with each cell's specific background level fitted with a single cell model.
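As a rough illustration of the ambient correction idea described above, the base R sketch below log transforms toy counts and rescales each protein in cells by the mean and standard deviation of the same protein in empty droplets. The simulated matrices, the protein names, and the pseudocount of 10 are assumptions made for this example. It is not the package's implementation, and it leaves out the technical component step entirely.

```r
set.seed(1)
# toy ADT count matrices: proteins in rows, droplets in columns (hypothetical data)
proteins <- c("CD3", "CD19", "IgG1_isotype")
cells <- matrix(rpois(3 * 50, lambda = 200), nrow = 3,
                dimnames = list(proteins, paste0("cell", 1:50)))
empty <- matrix(rpois(3 * 500, lambda = 20), nrow = 3,
                dimnames = list(proteins, paste0("empty", 1:500)))

pseudocount <- 10  # assumed value for this sketch
cells_log <- log(cells + pseudocount)
empty_log <- log(empty + pseudocount)

# per-protein mean and sd of the ambient signal in empty droplets
mu_empty <- rowMeans(empty_log)
sd_empty <- apply(empty_log, 1, sd)

# each value becomes the number of background sds above the background mean
ambient_corrected <- (cells_log - mu_empty) / sd_empty
```

On this scale, a value near 0 means a protein is indistinguishable from ambient background in that cell, which is what makes the corrected values easier to interpret than raw counts.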
## Installation and quick overview <a name="installation"></a>
The method is carried out in a single step with a call to the `DSBNormalizeProtein()` function. The inputs and options used in the example are: <br>
`cell_protein_matrix = cells_citeseq_mtx` - a raw ADT count matrix from cell-containing droplets. <br>
`empty_drop_matrix = empty_drop_citeseq_mtx` - a raw ADT count matrix from non-cell containing empty / background droplets. <br>
`denoise.counts = TRUE` - implement step II to define and remove the 'technical component' of each cell's protein library. <br>
`use.isotype.control = TRUE` - include isotype controls in the modeled dsb technical component. <br>
```{r, eval = FALSE}
# install.packages('dsb')
library(dsb)

adt_norm = DSBNormalizeProtein(
  cell_protein_matrix = cells_citeseq_mtx,
  empty_drop_matrix = empty_drop_citeseq_mtx,
  denoise.counts = TRUE,
  use.isotype.control = TRUE,
  # rows 67:70 hold the isotype controls in this example panel
  isotype.control.name.vec = rownames(cells_citeseq_mtx)[67:70]
)
```
<img src="man/figures/multimodal_heatmap.png" />
Please see [**the main vignette on CRAN**](https://CRAN.R-project.org/package=dsb/vignettes/end_to_end_workflow.html) for more details. <br>
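The quick overview call selects isotype controls by row position, which depends on the order of the antibody panel. A small sketch of selecting them by name instead is below. The protein names are hypothetical, and the pattern `"isotype"` is an assumption about how controls are labeled in your panel.

```r
# hypothetical protein names; real panels will differ
protein_names <- c("CD3_PROT", "CD4_PROT",
                   "MouseIgG1kappaIsotype_PROT", "RatIgG2bkIsotype_PROT")

# keep every name containing "isotype", ignoring case
isotype_controls <- grep("isotype", protein_names,
                         ignore.case = TRUE, value = TRUE)
isotype_controls
```

The resulting character vector can then be passed as `isotype.control.name.vec`. It is worth printing it first to confirm that only the intended controls matched.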
### Selected publications using dsb <a name="pubications"></a>
Publications from other investigators <br>
[Izzo et al. *Nature* 2024](https://doi.org/10.1038/s41586-024-07388-y) <br>
[Arieta et al. *Cell* 2023](https://doi.org/10.1016/j.cell.2023.04.007) <br>
[Magen et al. *Nature Medicine* 2023](https://doi.org/10.1038/s41591-023-02345-0) <br>
[COMBAT consortium *Cell* 2022](https://doi.org/10.1016/j.cell.2022.01.012) <br>
[Jardine et al. *Nature* 2021](https://doi.org/10.1038/s41586-021-03929-x) <br>
[Mimitou et al. *Nature Biotechnology* 2021](https://doi.org/10.1038/s41587-021-00927-2) <br>
Publications from the Tsang lab <br>
[Mulè et al. *Immunity* 2024](https://mattpm.net/man/pdf/natural_adjuvant_immunity_2024.pdf) <br>
[Sparks et al. *Nature* 2023](https://doi.org/10.1038/s41586-022-05670-5) <br>
[Liu et al. *Cell* 2021](https://doi.org/10.1016/j.cell.2021.02.018) <br>
[Kotliarov et al. *Nature Medicine* 2020](https://doi.org/10.1038/s41591-020-0769-8) <br>
### Using other alignment algorithms <a name="otheraligners"></a>
dsb was developed prior to 10X Genomics supporting CITE-seq or hashing data, and we routinely use other alignment pipelines.
A note on alignment and how to use dsb with Cell Ranger is detailed in the main vignette. By default, dsb uses both cells and empty droplets.
<img src="man/figures/readme_cheatsheet.png" />
To use dsb properly with CITE-seq-Count, you need to align background droplets as well as cells. One way to do this is to set the `-cells` argument to ~ 200000. That will align the top 200000 barcodes in terms of ADT library size, making sure you capture the background. Please refer to the [**CITE-seq-Count documentation**](https://hoohm.github.io/CITE-seq-Count/Running-the-script/).
```{bash, eval = FALSE}
CITE-seq-Count -R1 TAGS_R1.fastq.gz -R2 TAGS_R2.fastq.gz \
-t TAG_LIST.csv -cbf X1 -cbl X2 -umif Y1 -umil Y2 \
-cells 200000 -o OUTFOLDER
```
If you already aligned your mRNA with Cell Ranger or something else but wish to use a different tool like kallisto or CITE-seq-Count for ADT alignment, you can provide the latter with a whitelist of cell barcodes to align. A simple way to do this is to extract all barcodes with at least k detected mRNA features, where we set k to a tiny number to retain cells *and* empty droplets capturing ambient ADT reads:
```{r, eval = FALSE}
library(Seurat)
# for multimodal Cell Ranger output, Read10X returns a list; in that case
# use the mRNA matrix, e.g. umi = umi$`Gene Expression`
umi = Read10X(data.dir = 'data/raw_feature_bc_matrix/')
k = 3
barcode.whitelist =
  rownames(
    CreateSeuratObject(counts = umi,
                       min.features = k, # retain barcodes with at least k detected genes
                       min.cells = 800   # this just speeds up the function by removing genes.
    )@meta.data
  )
# your_save_path is a placeholder for your output directory
write.table(barcode.whitelist,
            file = file.path(your_save_path, "barcode.whitelist.tsv"),
            sep = '\t', quote = FALSE, col.names = FALSE, row.names = FALSE)
```
With the example dataset in the vignette this retains about 150,000 barcodes.
Now you can provide that file as the argument to `-wl` in CITE-seq-Count to align the ADTs, and then proceed with the dsb analysis example.
```{bash, eval = FALSE}
CITE-seq-Count -R1 TAGS_R1.fastq.gz -R2 TAGS_R2.fastq.gz \
-t TAG_LIST.csv -cbf X1 -cbl X2 -umif Y1 -umil Y2 \
-wl path_to_barcode.whitelist.tsv -o OUTFOLDER
```
This whitelist can also be provided to kallisto; see the [kallisto bustools documentation](https://www.kallistobus.tools/tutorials/kb_kite/python/kb_kite/).
```{bash, eval = FALSE}
kb count -i index_file -g gtf_file.t2g -x 10xv3 \
-t n_cores -w path_to_barcode.whitelist.tsv -o output_dir \
input.R1.fastq.gz input.R2.fastq.gz
```
Next, one can similarly define cells and background droplets empirically with protein and mRNA based thresholding, as outlined in the main tutorial.
### A note on Cell Ranger --expect-cells <a name="cellranger"></a>
*Whether or not you use dsb*, if you want to define cells using the `filtered_feature_bc_matrix` file, set the Cell Ranger `--expect-cells` argument roughly equal to the estimated cell recovery per lane, based on the number of cells you loaded in the experiment; see [the note from 10X about this](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/algorithms/overview#cell_calling). The default value of 3000 is relatively low for modern experiments. Cells and empty droplets can also be defined directly from the `raw_feature_bc_matrix`, which contains all droplets, using any method, including simple protein and mRNA library size based thresholding.
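A minimal sketch of library size based thresholding on a raw matrix is shown below in base R. The simulated counts, barcode names, and cutoffs are all assumptions for illustration. With real data, cutoffs would be chosen by inspecting the distribution of library sizes, and mRNA based criteria would typically be applied as well.

```r
set.seed(1)
# toy raw protein matrix: 5 proteins x 1100 barcodes (hypothetical data)
n_cells <- 100
n_empty <- 1000
raw_prot <- cbind(
  matrix(rpois(5 * n_cells, lambda = 300), nrow = 5),  # cell-containing droplets
  matrix(rpois(5 * n_empty, lambda = 10), nrow = 5)    # empty droplets
)
rownames(raw_prot) <- paste0("prot", 1:5)
colnames(raw_prot) <- paste0("bc", seq_len(ncol(raw_prot)))

# log10 protein library size per barcode
prot_size <- log10(colSums(raw_prot) + 1)

# illustrative cutoffs; in practice choose them from a histogram of prot_size
cell_barcodes  <- names(prot_size)[prot_size >= 3]
empty_barcodes <- names(prot_size)[prot_size > 1 & prot_size < 2.5]

# split the raw matrix into the two inputs dsb expects
cells_mtx <- raw_prot[, cell_barcodes]
empty_mtx <- raw_prot[, empty_barcodes]
```

The two resulting matrices play the roles of `cells_citeseq_mtx` and `empty_drop_citeseq_mtx` in the quick overview call.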
Topics covered in other vignettes on CRAN: **Integrating dsb with Bioconductor, integrating dsb with python/Scanpy, Using dsb with data lacking isotype controls, integrating dsb with sample multiplexing experiments, using dsb on data with multiple batches, advanced usage - using a different scale / standardization based on empty droplet levels, returning internal stats used by dsb, outlier clipping with the quantile.clipping argument, other FAQ.**