This repository contains a Snakemake workflow to identify and quantify hypomethylated regions within centromeres, or Centromere Dip Regions (CDRs; Altemose et al., Science, 2022).
Original pipeline constructed by @arozanski97 with help from Glennis Logsdon
Adapted and distributed by @fkmastrorosa, @wharvey31, and @koisland.
This is done by:
- Extracting the sequence of interest.
- Intersect the sequence with its methylation data.
- Bin the region of interest into 5 kbp windows and calculate the mean methylation percentage.
- Run RepeatMasker on the sequence of interest to identify regions containing alpha-satellite repeats (ALR/Alpha)
- For each alpha-satellite containing region:
- Merge consecutive bins.
- Detect valleys in average methylation percentage based on relative prominence.
- Check sides of each found CDR, filtering other CDRs, to calculate its relative height.
- Optionally, extend edges to mean signal and some number of standard deviations to get missed peaks.
- Optionally, merge adjacent detected CDRs.
git clone https://github.com/EichlerLab/CDR-Finder.git
cd CDR-Finder
To setup with conda
:
conda env create --name cdr_finder -f env.yaml
conda activate cdr_finder
If using singularity
, only snakemake
is required. This will create a python virtual environment and install snakemake
.
python -m venv venv && source venv/bin/activate
pip install snakemake
To run with conda
.
snakemake -np --sdm conda -c 4
To run with singularity
. This will use logsdonlab/cdr-finder
to run the workflow.
snakemake -np --sdm apptainer conda -c 4
Or alternatively to add as a workflow to an existing Snakefile.
git submodule add https://github.com/koisland/CDR-Finder
# Pass CDR config here.
CDR_CONFIG = {}
module CDRFinder:
snakefile:
"CDR-Finder/workflow/Snakefile"
config:
CDR_CONFIG
use rule * from CDRFinder as cdr_*
rule all:
input:
rules.cdr_all.input
Multiple samples can be provided via the configfile. Each sample should contain the following:
fasta
- Genome assembly.
bamfile
- Alignment BAM file with methylation tags.
- Requires aligned reads with the Mm and Ml tags (MM and ML also supported), and the reference sequence used for alignment.
- PacBio data requires that the alignment does not hard-clip supplementary alignment using
pbmm2
or -Y flag forminimap2
/winnowmap
. - https://github.com/epi2me-labs/modbam2bed?tab=readme-ov-file#usage
regions
- BED file of target region coordinates.
titles
- Optional
- Two column TSV file with chrom name and plot title name.
cdr_bed
- CDR regions.
{config.output_dir}/bed/{sample}_CDR.bed
cdr_plot
- CDR regions plotted with RepeatMasker annotations.
{config.output_dir}/plot/{sample}_CDR.png
parameter | description | default |
---|---|---|
species |
Species to use for RepeatMasker Dfam database. CDR-Finder was designed for human centromeres so performance in other species is untested. Use with caution. See https://www.repeatmasker.org/genomicDatasets/RMGenomicDatasets.html. | human |
restrict_alr |
Restrict CDR search to sequence annotated as ALR/Alpha by RepeatMasker. Disabling this will cause false positives and should only be set to false if the input region is well curated. |
true |
window_size |
Size of the methylation windows to average over. | 5000 |
alr_threshold |
Size of ALR repeat stretches to include in search of CDR. | 100,000 |
bp_merge |
Distance in bases to merge adjacent CDRs. Can be omitted. | 1 |
bp_alr_merge |
Distance in bases to merge adjacent alpha-satellite regions. | 1,000 |
bp_edge |
Distance in bases to check CDR edges. Used to determine height of dip. Large values give better estimates of true height. | 500,000 |
height_perc_valley_threshold |
Threshold percent of the median methylation percentage needed as the minimal height of a valley from the median. Larger values filter for deeper valleys. | 0.34 |
prom_perc_valley_threshold |
Threshold percent of the median methylation percentage needed as the minimal prominence of a valley from the median. Larger values filter for more prominent valleys. Helps in removing low-confidence CDRs. | 0.3 |
edge_height_heuristic |
Heuristic used when determining edge height of CDR. Either min, max, or avg. | min |
extend_edges_std |
Extend edges of CDR until the mean average methylation signal and some number of standard deviations is reached. May fix smaller, less prominent CDRs being missed. A value of 0 is the mean while +1/-1 is one stdev above/below the mean. | -1 |
See docs/ISSUES.md
for edge-cases and effects of parameter tuning.
Set up the conda environment and pull test data with git-lfs
.
conda env create --name cdr_finder -f env.yaml
conda activate cdr_finder
git lfs install && git lfs pull
To run the test case on chr8 and chr21.
snakemake -c 1 -p --sdm conda --configfile test/config/config.yaml
To run integration tests.
pytest -vvv
Requires root
user and docker
.
make docker
make singularity
- Glennis A Logsdon, F Kumara Mastrorosa, Keisuke K Oshima, Allison N Rozanski, William T Harvey, Evan E Eichler, Identification and annotation of centromeric hypomethylated regions with Centromere Dip Region (CDR)-Finder, Bioinformatics, 2024;, btae733, https://doi.org/10.1093/bioinformatics/btae733*