Skip to content

Workflow to identify the centromere dip regions (CDR) from methyl BAMs

Notifications You must be signed in to change notification settings

EichlerLab/CDR-Finder

Repository files navigation

CDR-Finder

CI

This repository contains a Snakemake workflow to identify and quantify hypomethylated regions within centromeres, or Centromere Dip Regions (CDRs; Altemose et al., Science, 2022).

Original pipeline constructed by @arozanski97 with help from Glennis Logsdon

Adapted and distributed by @fkmastrorosa, @wharvey31, and @koisland.

This is done by:

  • Extracting the sequence of interest.
  • Intersect the sequence with its methylation data.
  • Bin the region of interest into 5 kbp windows and calculate the mean methylation percentage.
  • Run RepeatMasker on the sequence of interest to identify regions containing alpha-satellite repeats (ALR/Alpha)
  • For each alpha-satellite containing region:
    • Merge consecutive bins.
    • Detect valleys in average methylation percentage based on relative prominence.
    • Check sides of each found CDR, filtering other CDRs, to calculate its relative height.
    • Optionally, extend edges to mean signal and some number of standard deviations to get missed peaks.
    • Optionally, merge adjacent detected CDRs.

Getting Started

git clone https://github.com/EichlerLab/CDR-Finder.git
cd CDR-Finder

To setup with conda:

conda env create --name cdr_finder -f env.yaml
conda activate cdr_finder

If using singularity, only snakemake is required. This will create a python virtual environment and install snakemake.

python -m venv venv && source venv/bin/activate
pip install snakemake

Usage

To run with conda.

snakemake -np --sdm conda -c 4

To run with singularity. This will use logsdonlab/cdr-finder to run the workflow.

snakemake -np --sdm apptainer conda  -c 4

Or alternatively to add as a workflow to an existing Snakefile.

git submodule add https://github.com/koisland/CDR-Finder
# Pass CDR config here.
CDR_CONFIG = {}

module CDRFinder:
    snakefile:
        "CDR-Finder/workflow/Snakefile"
    config:
        CDR_CONFIG

use rule * from CDRFinder as cdr_*

rule all:
    input:
        rules.cdr_all.input

Input

Multiple samples can be provided via the configfile. Each sample should contain the following:

  • fasta
    • Genome assembly.
  • bamfile
    • Alignment BAM file with methylation tags.
    • Requires aligned reads with the Mm and Ml tags (MM and ML also supported), and the reference sequence used for alignment.
    • PacBio data requires that the alignment does not hard-clip supplementary alignment using pbmm2 or -Y flag for minimap2/winnowmap.
    • https://github.com/epi2me-labs/modbam2bed?tab=readme-ov-file#usage
  • regions
    • BED file of target region coordinates.
  • titles
    • Optional
    • Two column TSV file with chrom name and plot title name.

Output

  • cdr_bed
    • CDR regions.
    • {config.output_dir}/bed/{sample}_CDR.bed
  • cdr_plot
    • CDR regions plotted with RepeatMasker annotations.
    • {config.output_dir}/plot/{sample}_CDR.png

Parameters

parameter description default
species Species to use for RepeatMasker Dfam database. CDR-Finder was designed for human centromeres so performance in other species is untested. Use with caution. See https://www.repeatmasker.org/genomicDatasets/RMGenomicDatasets.html. human
restrict_alr Restrict CDR search to sequence annotated as ALR/Alpha by RepeatMasker. Disabling this will cause false positives and should only be set to false if the input region is well curated. true
window_size Size of the methylation windows to average over. 5000
alr_threshold Size of ALR repeat stretches to include in search of CDR. 100,000
bp_merge Distance in bases to merge adjacent CDRs. Can be omitted. 1
bp_alr_merge Distance in bases to merge adjacent alpha-satellite regions. 1,000
bp_edge Distance in bases to check CDR edges. Used to determine height of dip. Large values give better estimates of true height. 500,000
height_perc_valley_threshold Threshold percent of the median methylation percentage needed as the minimal height of a valley from the median. Larger values filter for deeper valleys. 0.34
prom_perc_valley_threshold Threshold percent of the median methylation percentage needed as the minimal prominence of a valley from the median. Larger values filter for more prominent valleys. Helps in removing low-confidence CDRs. 0.3
edge_height_heuristic Heuristic used when determining edge height of CDR. Either min, max, or avg. min
extend_edges_std Extend edges of CDR until the mean average methylation signal and some number of standard deviations is reached. May fix smaller, less prominent CDRs being missed. A value of 0 is the mean while +1/-1 is one stdev above/below the mean. -1

See docs/ISSUES.md for edge-cases and effects of parameter tuning.

Testing

Set up the conda environment and pull test data with git-lfs.

conda env create --name cdr_finder -f env.yaml
conda activate cdr_finder
git lfs install && git lfs pull

To run the test case on chr8 and chr21.

snakemake -c 1 -p --sdm conda --configfile test/config/config.yaml

To run integration tests.

pytest -vvv

Containerization

Requires root user and docker.

make docker
make singularity

Cite

  • Glennis A Logsdon, F Kumara Mastrorosa, Keisuke K Oshima, Allison N Rozanski, William T Harvey, Evan E Eichler, Identification and annotation of centromeric hypomethylated regions with Centromere Dip Region (CDR)-Finder, Bioinformatics, 2024;, btae733, https://doi.org/10.1093/bioinformatics/btae733*

About

Workflow to identify the centromere dip regions (CDR) from methyl BAMs

Resources

Stars

Watchers

Forks

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •