scChromHMM
provides a suite of tools for rapid processing of single-cell histone modification data to perform chromatin state analysis of the genome within each single cell. It is an extension of the bulk ChromHMM framework: it consumes the HMM model learned by ChromHMM and performs chromatin state analysis by running the forward-backward algorithm for each single cell.
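To make the forward-backward step concrete, here is a minimal Python sketch of how posterior state probabilities can be computed for one sequence of observations. It is not scChromHMM's Rust implementation. The two-state model, its parameter values and the function name `forward_backward` are all illustrative assumptions.

```python
def forward_backward(init, trans, emit, obs):
    """Return per-position posterior state probabilities for one sequence.

    init[s]     : probability of starting in state s
    trans[s][t] : probability of moving from state s to state t
    emit[s][o]  : probability of observing symbol o in state s
    obs         : list of observed symbols (here 0/1 for absent/present signal)
    """
    n_states, n_obs = len(init), len(obs)

    # Forward pass, normalised at every step to avoid numerical underflow.
    prev = [init[s] * emit[s][obs[0]] for s in range(n_states)]
    total = sum(prev)
    prev = [p / total for p in prev]
    fwd = [prev]
    for i in range(1, n_obs):
        cur = [
            emit[t][obs[i]] * sum(prev[s] * trans[s][t] for s in range(n_states))
            for t in range(n_states)
        ]
        total = sum(cur)
        prev = [c / total for c in cur]
        fwd.append(prev)

    # Backward pass, also normalised at every step.
    bwd = [[1.0] * n_states for _ in range(n_obs)]
    for i in range(n_obs - 2, -1, -1):
        cur = [
            sum(trans[s][t] * emit[t][obs[i + 1]] * bwd[i + 1][t]
                for t in range(n_states))
            for s in range(n_states)
        ]
        total = sum(cur)
        bwd[i] = [c / total for c in cur]

    # The posterior at each position is proportional to forward * backward.
    posteriors = []
    for f, b in zip(fwd, bwd):
        joint = [fs * bs for fs, bs in zip(f, b)]
        total = sum(joint)
        posteriors.append([j / total for j in joint])
    return posteriors


# Toy two-state model (illustrative values, not taken from a ChromHMM model file).
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.9, 0.1], [0.3, 0.7]]  # emit[state][symbol]
obs = [0, 0, 1, 1, 1, 0]
post = forward_backward(init, trans, emit, obs)
```

Each row of `post` sums to one and gives the probability of every hidden state at that position, which is the kind of per-region posterior the tool reports for each cell.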
scChromHMM primarily requires four kinds of files, defined as follows:
- fragment files: Fragment files contain the mapping locations of the sequencing read fragments on the genome. The basic format is similar to the one described by 10x: it is primarily a BED file with the cellular barcode added for each mapped fragment. toy example:
h3k27ac_fragments.tsv.gz
- NOTE: the tabix index of each fragment file is also needed; for a block-zipped (bgzip) fragment file it can be generated using the command
tabix -f -p bed <fragment_file.gz>
toy example: example/h3k27ac_fragments.tsv.gz.tbi
- hmm_model: A tsv file containing the HMM model parameters. The default schema of this file is similar to the one generated by ChromHMM. toy example:
example/model_2.txt
- anchors: A tsv file listing the anchors from the query data onto the reference data, along with their anchoring scores. toy example:
example/k27ac.txt
- reference_cells: A list of all the cellular barcodes (one per line) present in the reference dataset. toy example:
example/cells.txt
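As an illustration of the fragment file layout, the sketch below parses tab-separated fragment records and counts fragments per cell in 200bp bins. It rests on assumptions: the column order (chromosome, start, end, barcode, followed by optional extra columns) follows the 10x convention, a fragment is counted in every bin it overlaps, and the function names are hypothetical. How scChromHMM itself assigns fragments to regions is not described here.

```python
from collections import defaultdict

BIN_SIZE = 200  # assumption: same width as the 200bp regions of the output


def parse_fragment_line(line):
    """Split one tab-separated record into (chrom, start, end, barcode)."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1]), int(fields[2]), fields[3]


def count_fragments_per_bin(lines):
    """Count fragments per (barcode, chrom, bin index)."""
    counts = defaultdict(int)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        chrom, start, end, barcode = parse_fragment_line(line)
        # BED intervals are half-open [start, end): the last covered base is end - 1.
        for bin_index in range(start // BIN_SIZE, (end - 1) // BIN_SIZE + 1):
            counts[(barcode, chrom, bin_index)] += 1
    return counts


# Made-up records for illustration only.
toy_lines = [
    "chr1\t150\t420\tAAAC\t1\n",
    "chr1\t390\t410\tAAAC\t2\n",
    "chr1\t1000\t1100\tGGGT\t1\n",
]
toy_counts = count_fragments_per_bin(toy_lines)
```

For a real bgzip-compressed file, the lines could be read with Python's `gzip.open(path, "rt")` and passed to the same function.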
scChromHMM has been tested with stable release 1.52.1 of Rust, and the program can be compiled by using the command:
$ cargo build --release
Once compiled, the scChromHMM program can be run to generate the posterior probability distribution across the hidden states using the command:
$ target/release/schrom hmm -f <fragment_files> -m <hmm_model> -a <anchor_files> -c <reference_cells> -t <number_of_threads> -o <output_folder>
Note: The fragment files must be given in the same order as their corresponding anchor files. A toy example can be run on the data in the example folder using the following command (an extra flag, --onlyone,
has been added to run the toy example on a subsequence of chromosome 1):
RUST_BACKTRACE=full RUST_LOG="trace" /usr/bin/time target/release/schrom hmm -f example/h3k27ac_fragments.tsv.gz example/h3k27me3_fragments.tsv.gz example/h3k4me1_fragments.tsv.gz -m example/model_2.txt -a example/k27ac.txt example/k27me3.txt example/k4me1.txt -c example/cells.txt -t 10 -o output --onlyone
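Because the fragment files and anchor files must be listed in matching order, one way to avoid mismatches is to keep each pair together and build the argument list from the pairs. The Python sketch below does this using only the flags shown above; the helper `build_hmm_command` is hypothetical and not part of scChromHMM.

```python
import subprocess

# Each histone mark keeps its fragment file and anchor file together,
# so the -f and -a lists cannot drift out of order.
marks = [
    ("example/h3k27ac_fragments.tsv.gz", "example/k27ac.txt"),
    ("example/h3k27me3_fragments.tsv.gz", "example/k27me3.txt"),
    ("example/h3k4me1_fragments.tsv.gz", "example/k4me1.txt"),
]


def build_hmm_command(marks, model, cells, threads, output, only_one=False):
    """Assemble the `schrom hmm` argument list from (fragment, anchor) pairs."""
    fragments = [fragment for fragment, _ in marks]
    anchors = [anchor for _, anchor in marks]
    cmd = ["target/release/schrom", "hmm", "-f", *fragments, "-m", model,
           "-a", *anchors, "-c", cells, "-t", str(threads), "-o", output]
    if only_one:
        cmd.append("--onlyone")
    return cmd


cmd = build_hmm_command(marks, "example/model_2.txt", "example/cells.txt",
                        10, "output", only_one=True)
print(" ".join(cmd))

# To actually run it (requires the compiled binary):
# subprocess.run(cmd, check=True)
```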
The hmm
subcommand of the scChromHMM tool generates cell-wise posterior probabilities for every reference cell across the genome. The probabilities are stored for each cell in a binary format, i.e. a 200bp-region by state matrix with integer values in the range [0-100]. toy example: output/chr1/L1_CCTCTAGTCGCTAAAC.bin
Depending on the number of reference cells, the size of the output posterior probabilities can grow significantly, and some downstream analyses are faster to run on a region by cell matrix (one per state) than on region by state matrices (one per cell). Hence, the scChromHMM subcommand transform
can be used to convert the data into the "short" representation of region by cell. The command to do that is as follows:
$ target/release/schrom transform -c <reference_cells> -i <input_folder> -o <output_folder>
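Conceptually, the change of layout amounts to regrouping per-cell region by state matrices into per-state region by cell matrices. The Python sketch below shows this on small in-memory lists. It does not read or write the tool's binary files, whose exact layout is not described here. The function names and toy values are illustrative assumptions.

```python
def quantize(prob):
    """Map a probability in [0, 1] to an integer in [0, 100]."""
    return round(prob * 100)


def to_region_by_cell(per_cell, n_states):
    """Turn {cell: region-by-state rows} into one region-by-cell matrix per state."""
    cells = sorted(per_cell)
    n_regions = len(per_cell[cells[0]])
    per_state = []
    for state in range(n_states):
        matrix = [
            [per_cell[cell][region][state] for cell in cells]
            for region in range(n_regions)
        ]
        per_state.append(matrix)
    return cells, per_state


# Hypothetical posteriors: 2 cells, 3 regions, 2 states.
per_cell = {
    "cellA": [[quantize(p) for p in row]
              for row in [[0.9, 0.1], [0.4, 0.6], [0.05, 0.95]]],
    "cellB": [[quantize(p) for p in row]
              for row in [[0.2, 0.8], [0.7, 0.3], [0.5, 0.5]]],
}
cells, per_state = to_region_by_cell(per_cell, n_states=2)
```

Here `per_state[0]` holds the state-0 values with one row per region and one column per cell, in the order given by `cells`.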
The toy example can be run using the following commands. NOTE: An extra flag --onlyone
has been added to run the toy example on a subsequence of chromosome 1.
$ mkdir short_output
$ RUST_BACKTRACE=full RUST_LOG="trace" /usr/bin/time target/release/schrom transform -c example/cells.txt -i output -o short_output --onlyone
The state-wise, region by cell posterior probabilities of the toy example can be imported into the R environment using the following script:
library(Rcpp)
sourceCpp("src-R/parse.cpp")
mat <- get_state("short_output/chr1/1.bin", "chr1", "short_output/chr1/cells.txt")
dim(mat)
# [1] 5001 7201