The package can be installed from this GitHub repository:
# Install devtools for GitHub installation if not present
if (!require("devtools", quietly = TRUE))
  install.packages("devtools")
# Download required packages from Bioconductor if needed for first install
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(c("zellkonverter", "scater", "ShortRead", "DropletUtils"))
# Install package from github repo
devtools::install_github("https://github.com/TAPE-Lab/splitRtools")
The splitRtools package is a collection of tools for processing
SPLiT-seq scRNA-seq data, a method first described in Rosenberg et al.,
2019.
The splitRtools package takes as input the output files from the zUMIs
package (paper) for scRNA-seq cell barcode mapping and alignment. The
zUMIs package takes raw FASTQ output and cell barcoding information,
then assigns reads to barcodes and filters them. It then maps the cDNA
reads to a reference genome using STAR, producing a Digital Gene
Expression (DGE) matrix along with some reporting information about the
pipeline run. A sample zUMIs pipeline with configuration to work with
the Rosenberg-2019 barcode setup is available here.
The splitRtools pipeline depends on the naming of the zUMIs pipeline
output, which is set by the project: parameter in the zUMIs .yaml
config file when you run the zUMIs pipeline. All zUMIs outputs for each
sublibrary must be contained within a folder with the same name as the
zUMIs project name, because the project name is embedded into each
zUMIs output file name.
From the zUMIs pipeline outputs (found in the location specified by the
out_dir: parameter in the .yaml config file) you need the zUMIs_output
folder, which contains the expression, stats and barcodes.txt files, as
well as the project.BCstats.txt file. These files need to be organised
in the structure outlined below.
The folders for each individual sublibrary must be contained within the
data_folder, and this folder's absolute path must be specified in the
run_split_pipe() arguments.
|
|–data_folder
| |
| |-sub_lib_1
| | |-sub_lib_1.BCstats.txt
| | |-zUMIs_output
| |
| |-sub_lib_2
| |-sub_lib_n
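The layout above can be checked before running the pipeline. The following is a minimal sketch in base R, not part of splitRtools: check_sublib_layout() is a hypothetical helper name, and the check assumes that each sublibrary folder holds a file named after the folder with the .BCstats.txt suffix plus a zUMIs_output directory, as in the tree above.

```r
# Hypothetical helper (not a splitRtools function): report, for every
# sublibrary folder inside data_folder, whether the expected entries exist.
check_sublib_layout <- function(data_folder) {
  sublibs <- list.dirs(data_folder, full.names = FALSE, recursive = FALSE)
  data.frame(
    sl_name = sublibs,
    has_bcstats = file.exists(
      file.path(data_folder, sublibs, paste0(sublibs, ".BCstats.txt"))
    ),
    has_zumis_output = dir.exists(
      file.path(data_folder, sublibs, "zUMIs_output")
    ),
    stringsAsFactors = FALSE
  )
}

# Demonstration on a throwaway directory tree with two sublibraries,
# only one of which has its BCstats file in place.
demo_folder <- tempfile("data_folder")
for (sl in c("sub_lib_1", "sub_lib_2")) {
  dir.create(file.path(demo_folder, sl, "zUMIs_output"), recursive = TRUE)
}
file.create(file.path(demo_folder, "sub_lib_1", "sub_lib_1.BCstats.txt"))

layout_check <- check_sublib_layout(demo_folder)
print(layout_check)
```

Any row with a FALSE value points to a sublibrary folder that still needs a file copied across from the zUMIs out_dir.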
The experiment barcoding layout must be provided as a csv file with two
columns: well position (numeric, 1-96) and the barcode sequence in each
well. Currently splitRtools supports one barcoding layout for the RT
plate (argument rt_bc) and another for the two subsequent ligation
rounds (argument lig_bc). An example of the barcoding layout sheet
(Rosenberg 2019 format) is located in this repository in
data/barcodes_v1.csv.
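A barcode layout file can be sanity-checked before it is passed to rt_bc or lig_bc. This is a sketch under assumptions: validate_barcode_layout() is a hypothetical helper, the columns are checked by position because the header names are not specified here, and the three barcode sequences in the demonstration are made up.

```r
# Hypothetical validator (not a splitRtools function) for a two-column
# barcode layout: column 1 = well position, column 2 = barcode sequence.
validate_barcode_layout <- function(path) {
  bc <- read.csv(path, stringsAsFactors = FALSE)
  stopifnot(ncol(bc) == 2)
  wells <- bc[[1]]
  seqs <- bc[[2]]
  stopifnot(
    is.numeric(wells),
    all(wells >= 1 & wells <= 96),   # well positions must lie in 1-96
    !anyDuplicated(wells),           # each well listed once
    all(grepl("^[ACGTN]+$", seqs)),  # sequences use nucleotide letters only
    !anyDuplicated(seqs)             # each barcode used once
  )
  invisible(bc)
}

# Demonstration with made-up barcodes and assumed column names.
demo_csv <- tempfile(fileext = ".csv")
write.csv(
  data.frame(well = 1:3, barcode = c("AACGTGAT", "AAACATCG", "ATGCCTAA")),
  demo_csv,
  row.names = FALSE
)
demo_layout <- validate_barcode_layout(demo_csv)
```

The function stops with an error on the first failed check, so a silent return means the file has the expected shape.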
Similar to the barcoding layout, the sample layout for the RT barcode
sample indexing needs to be provided in .xlsx format, with the columns
well_position and sample_id. This enables the labeling of each cell
with its sample of origin based on its well position in the RT plate,
and is specified in the argument sample_map. An example of the sample
map layout sheet is located in this repository in
data/cell_metadata.xlsx.
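The idea behind the sample map is a lookup from RT well to sample. The sketch below illustrates that lookup with in-memory data frames in base R; it is not the package's implementation. The well_position and sample_id columns come from the description above, while cell_barcode, rt_well and all values are hypothetical.

```r
# Sample map with the two documented columns (values are made up).
sample_map <- data.frame(
  well_position = 1:4,
  sample_id = c("ctrl", "ctrl", "treated", "treated"),
  stringsAsFactors = FALSE
)

# Hypothetical per-cell table: each cell carries the RT well decoded
# from its barcode.
cells <- data.frame(
  cell_barcode = c("bc_a", "bc_b", "bc_c"),
  rt_well = c(3, 1, 4),
  stringsAsFactors = FALSE
)

# Look up the sample of origin for each cell from its RT well.
cells$sample_id <- sample_map$sample_id[
  match(cells$rt_well, sample_map$well_position)
]
print(cells)
```

A cell whose RT well is absent from the sample map would receive NA here, which is a quick way to spot wells missing from the layout sheet.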
You need to specify the read counts for each sublibrary so that the
pipeline can determine some of the sublibrary barcode-mapping stats.
This must be provided as a data frame with one column, sl_name,
identifying the sublibrary name (the zUMIs project name) and a second
column, reads, specifying the number of reads per sublibrary. The
format is shown in the example below.
The splitRtools pipeline is run through the run_split_pipe() function,
which acts as a wrapper to execute the pipeline. A basic setup for the
pipeline is as follows (for more information on pipeline arguments use
?run_split_pipe):
# Raw read count per sublibrary, keyed by the zUMIs project name
reads_df <- data.frame(sl_name = c('exp013_p27_s4', 'exp013_p27_s5'),
                       reads = c(1041593427, 1083652637))
# Run the splitRtools pipeline
# Each sublibrary is contained within its own folder in the data_folder
# folder and must contain zUMIs output, named by sublibrary name.
run_split_pipe(mode = 'single', # Process each sublibrary separately
               n_sublibs = 2, # How many sublibraries are present
               data_folder = "~/path/to/data_folder", # Location of zUMIs data directory
               output_folder = "~/path/to/output_folder", # Output folder path
               filtering_mode = "manual", # Filter by 'knee' (standard) or by a 'manual' UMI threshold (default 1000)
               filter_value = 500, # If filtering_mode = "manual", the UMI count to filter at
               count_reads = FALSE, # Count reads from FASTQ files; if TRUE you must provide a path to FASTQ files (only works with single sublibraries!)
               total_reads = reads_df, # Data frame of raw read count per sublibrary
               fastq_path = NA, # Path to folder containing sublibrary raw FASTQ if count_reads = TRUE
               rt_bc = "~/path/to_RT_barcode_map/barcodes_v2_48.csv", # RT barcode map
               lig_bc = "~/path/to_ligation_barcode_map/barcodes_v1.csv", # Ligation barcode map
               sample_map = "~/path/to_RT_sample_layout_map/exp013_cell_metadata.xlsx" # RT sample-well mapping plate layout file
)
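Because total_reads is matched to sublibraries by name, it can help to verify the data frame before a long run. The following base R sketch uses a hypothetical helper, check_total_reads(), which is not part of splitRtools; it only assumes the sl_name and reads columns described above.

```r
# Hypothetical pre-flight check (not a splitRtools function): confirm the
# read count table has the documented columns and covers every sublibrary.
check_total_reads <- function(total_reads, sublib_names) {
  stopifnot(
    is.data.frame(total_reads),
    all(c("sl_name", "reads") %in% names(total_reads)),
    is.numeric(total_reads$reads),
    all(total_reads$reads > 0)
  )
  missing_sl <- setdiff(sublib_names, total_reads$sl_name)
  if (length(missing_sl) > 0) {
    stop("No read count supplied for: ", paste(missing_sl, collapse = ", "))
  }
  invisible(TRUE)
}

reads_df <- data.frame(
  sl_name = c("exp013_p27_s4", "exp013_p27_s5"),
  reads = c(1041593427, 1083652637),
  stringsAsFactors = FALSE
)

# In practice sublib_names could come from the folder names in data_folder.
reads_ok <- check_total_reads(reads_df, c("exp013_p27_s4", "exp013_p27_s5"))
```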
|
|–output_folder
|
|-sub_lib_1
| |-unfiltered_sce_h5ad_objects
| |-filtered_sce_h5ad_ojects
| |-ggplot_outputs
| |-report_data_outputs
|
|-sub_lib_2
|-sub_lib_n
|-merged_sublibrary_data
The first stage of the pipeline converts the DGE count matrix into a
SingleCellExperiment object and labels each cell with various colData,
interpreting the cell barcode as a series of well IDs based on each
stage of the barcoding process and the correspondence between the RT
well IDs and the sample_map .xlsx file provided. This data is then
stored as an SCE or an .h5ad object in the unfiltered/ output folder
for each sublibrary.
The SingleCellExperiment object is then filtered based on either a
manual cutoff of UMI per cell or the DropletUtils package knee
filtering threshold, depending on the setting of the filtering_mode and
filter_value (only used for manual filtering) arguments. The SCE and a
corresponding .h5ad object are stored in the filtered/ output folder
for each sublibrary.
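The manual filtering mode amounts to a threshold on total UMI per cell. The sketch below shows that idea on a small made-up count matrix in base R. It is an illustration rather than the package's code: in particular, whether the pipeline keeps cells at exactly the threshold (>= rather than >) is an assumption, and the knee mode, which relies on DropletUtils, is not shown.

```r
# Made-up genes x cells count matrix (filled column by column, so each
# group of three values is one cell).
counts <- matrix(
  c(300, 250, 100,   # cell_1: 650 UMIs
    10,  5,   2,     # cell_2: 17 UMIs
    400, 100, 50,    # cell_3: 550 UMIs
    200, 200, 50),   # cell_4: 450 UMIs
  nrow = 3,
  dimnames = list(paste0("gene_", 1:3), paste0("cell_", 1:4))
)

filter_value <- 500  # same role as the filter_value argument

# Total UMI per cell, then keep cells at or above the cutoff
# (the >= comparison is an assumption).
umi_per_cell <- colSums(counts)
filtered_counts <- counts[, umi_per_cell >= filter_value, drop = FALSE]
print(colnames(filtered_counts))
```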
The splitRtools pipeline will generate a set of diagnostic plots in
order to evaluate the initial quality of the SPLiT-seq scRNA-seq data
and barcoding process. These are saved in the gplots/ output folder.
After labeling, the data is filtered using either the DropletUtils
package spline-fitting functionality or a user-specified manual cutoff
of transcripts. This produces a waterfall plot along with
quantification of the cell types recovered by sample.
The cell barcoding data is then mapped to the respective plate
locations across the 3 barcoding rounds to provide a series of heatmaps
displaying cells recovered per well and median UMI per cell per well
across the RT1, L2 and L3 plates.
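The per-well summaries behind such heatmaps can be sketched in base R as follows. This is an illustration on made-up data, not the package's plotting code: the column names rt_well and umi_count are hypothetical, and arranging wells 1-96 row by row on an 8 x 12 plate is an assumption about the numbering.

```r
# Made-up per-cell table: RT well of each cell and its total UMI count.
cell_data <- data.frame(
  rt_well = c(1, 1, 2, 2, 2, 5),
  umi_count = c(800, 1200, 500, 700, 900, 1500)
)

# Treat the well as a factor over all 96 positions so that empty wells
# still appear in the summaries.
wells <- factor(cell_data$rt_well, levels = 1:96)

cells_per_well <- as.vector(table(wells))
# Empty wells get NA because there are no cells to take a median of.
median_umi_per_well <- as.vector(tapply(cell_data$umi_count, wells, median))

# Arrange the counts as an 8 x 12 plate (row-wise numbering assumed).
plate_cells <- matrix(cells_per_well, nrow = 8, ncol = 12, byrow = TRUE)
print(plate_cells[1, 1:6])
```

The same reshaping applied to median_umi_per_well gives the matrix for the median UMI heatmap.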