Skip to content

Latest commit

 

History

History
283 lines (214 loc) · 16 KB

README.md

File metadata and controls

283 lines (214 loc) · 16 KB

SV-Pop Pipeline

SV-Pop is an on-demand toolkit for parsing variants into tables (BED files), annotating variants, and merging across tools and/or samples.

Several annotations are built into SV-Pop, such as gene intersect or nearest gene (RefSeq), regulatory element intersects (ENCODE cCRE, histone marks, or DHS), annotated reference regions (chromosome band, simple repeats, segmental duplications, homopolymer intersects, and dinucleotide intersects), and features run on variant sequences (RepeatMasker, TRF, GC content, insertion sequence mapping location). Many of these annotations are automatically pulled from the UCSC browser, generated by SV-Pop itself (e.g. homopolymer sites), or SV-Pop runs an external tool (RepeatMasker, TRF, minimap2).

A single analysis directory can accept variants from many sources (e.g. caller, such as PAV, pbsv, DeepVariant, most VCFs, or any pre-formatted BED table), and each source can have any number of samples. Variants are transformed to a consistent BED+6 format that SV-Pop recognizes (#CHROM, POS, END, ID, SVTYPE, SVLEN, + any number of other fields). Variants can be annotated (e.g. RefSeq intersects), filtered (e.g. confident loci), and merged.

Supported variant types are:

vartype svtype description
snv snv SNPs, SNVs
indel ins, del size < 50 bp
sv ins, del, inv size ≥ 50 bp
dup dup Duplication
sub sub Substitution (ref bases replaced)
rgn rgn Any region

There are several combined svtypes:

  1. insdel: ins and del
  2. insdelinv: ins, del, and inv

There are two types of merges, callerset and sampleset. A callerset merge takes multiple callers for a single sample and merges them into a consensus callset. A sampleset merges a single source (single caller or single callerset) across multiple samples (e.g. nonredundant callset across samples). The merging process is flexible and is capable of using a combination of size, position, matching REF/ALT fields (SNVs), and sequence identity (defaults to 80% identical). Merges of merges are possible, for example, callerset merge followed by a sampleset merge, but are generally not advisable because it severely complicates analysis. SV-Pop also supports intersecting variant sources, which can be used to provide support for a callset without explicitly merging support into the callset itself. Merges are always done on the same variant type (vartype & svtype).

The variant ID for any source (sample + caller, or merge) is unique. IDs are tracks so that any variant can be traced back through to its source. SV-Pop uses this to pull annotations from the original source through a merge. It can also be used by downstream analysis.

Along with the variant BED files, there is also a FASTA file for variant sequences. These sequences are optional except for tasks than explicitly need them (e.g. RepeatMasker on variant sequences or merging by sequence identity).

SV-Pop makes extensive use of wildcards in paths, for example "{sample}" embedded in a path will be replaced with a sample name. This is used to locate input and to request annotations for a specific variant source.

All data files in SV-Pop are gzipped to conserve space. Multiple samples, annotations, and merges can use a surprising amount of disk space.

Install

Clone the repository and submodules:
git clone --recursive https://github.com/EichlerLab/svpop.git

Requires Python 3 with these packages:

  1. BioPython
  2. numpy
  3. pandas
  4. pysam
  5. snakemake
  6. intervaltree
  7. matplotlib (optional)

External tools SV-Pop may call:

  1. samtools: Input variant processing, variant sequence FASTA indexes, and alt-mapping insertions.
  2. tabix: Needed if variants are transformed back to VCF (optional).
  3. bcftools: VCFs are parsed with bcftools into tables for SV-Pop.
  4. minimap2: Used for re-mapping insertion sequence (optional).
  5. bedtools: Mainly used for processing annotations from UCSC.
  6. UCSC toolkit: Generating tracks for UCSC (optional.
  7. RepeatMasker: Annotating SV sequences (optional)
  8. TRF: Annotating SV sequneces (optional).

The easiest way to install these is to use conda. If distributing over Slurm, include snakemake-executor-plugin-slurm.

Configuration

The main configuration file config/config.json is in JSON format. It contains paths to the reference and defines rules for merging. Input sources are contained in config/samples.tsv.

config/config.json

If there are no merges defined, then the configuration file only needs to define a path to the reference assembly and the UCSC reference name for automatically pulling annotations (default "hg38").

For example:

{
  "reference": "/path/to/hg38.no_alt.fa.gz",
  "ucsc_ref_name": "hg38"
}

config/samples.tsv

This is a tab-separated-values (TSV) file defining input sources. TSVs can be edited in a text editor (ensure tabs are the tab character and not spaces) or spreadsheet programs like Excel and LibreOffice. SV-Pop can support any number of input sources, which may be individual callers ()

Field Description
NAME Input source name
SAMPLE Sample name or "DEFAULT"
TYPE Input file type (format or tool)
DATA Path to variant data
VERSION Version of the program that generated the tool
PARAMS Additional parameters for the input parser
COMMENT Free-form comment

NAME: Name can be any identifier, such as "pbsv", "pav-hifi", "ebert2021". This identifier will be used to request variants from a specific source.

SAMPLE: This is typically DEFAULT, which will be used to locate any sample (requires "{sample}" in the DATA path). SV-Pop can also accept a TSV with one line per sample (e.g. NAME = "pbsv" for two lines with a unique sample ID on each line).

TYPE: Determines how the input is parsed.

Supported types are:

  1. bed: DATA paths for pre-formatted BED file (BED 6+, #CHROM, POS, END, ID, SVTYPE, SVLEN, + other optional fields). IDs must be unique.
  2. deepvariant: DATA is a path to a DeepVariant VCF.
  3. dipcall: DATA is a path to a DipCall VCF.
  4. dvpepper: DATA is a path to a PEPPER-Margin-DeepVariant VCF.
  5. clair3: DATA is a path to a Clair3 VCF.
  6. cutesv: DATA is a path to a CuteSV VCF.
  7. delly: DATA is a path to a Delly2 VCF.
  8. pavbed: DATA is a path to a bed directory in PAV (results/{sample}/bed in a PAV run directory).
  9. pavbedhap: Each haplotype is an independent callset. DATA is a path to the PAV run directory (where results is found). Each PAV sample has two samplesin SV-Pop, one with "-h1" and one with "-h2" appended to the sample name. For example, for PAV sample "HG00733", SV-Pop would have samples "HG00733-h1" and "HG00733-h2", but not "HG00733".
    • Optional sample_pattern="..." in the PARAMS column allows flexible matching for sample names. For example, if PAV was run on sample "SAMPLE1" with two assemblers, hifiasm (sample name "ha_SAMPLE1") and Verkko (sample name "vk_SAMPLE1"), SV-Pop allows these to be separated into distinct entries while retaining the original sample name. In this example, one entry in the sample table would have sample_pattern=ha_{sample} (for example NAME="pav-ha") and the other sample_pattern=vk_{sample} (for example NAME="pav-vk"). This allows PAV to be run on a collection of assemblies and split into distinct entries in SV-Pop without mangling the sample name.
  10. gatk: DATA is a path to a GATK VCF.
    • Retrieves FORMAT fields GT, GQ, DP, and AD
  11. longshot: DATA is a path to a longshot VCF.
    • Retrieves FORMAT fields GT, GQ, and DP
  12. pbsv: DATA is a path to pbsv output VCFs. Requires wildcard "{vartype}" in the path which is filled with "sv" for SVs or "dup" for duplication calls.
  13. sniffles: DATA is a path to a Sniffles VCF.
  14. sniffles2: DATA is a path to a Sniffles2 VCF.
  15. svim: DATA is a path to a SVIM VCF.
  16. svimasm: DATA is a path to a SVIM-asm VCF.
  17. vcf: DATA is a path to an arbitrary VCF.
    • Use PARAM field keywords "info" and "format" to specify INFO and FORMAT fields to include in the BED file as variant annotations (e.g. "info=AF;format=GT,GQ").

DATA: Path to input. See supported types for a description of what DATA should point to.

VERSION: Can be used by parsers to modify how variants are parsed. May be set for any input for documentation purposes. some input types, like pavbedhap, will alter where it looks in the variant call output for input files. This string may be enclosed with double-quotes to prevent Excel from interpreting it like an integer (i.e. Excel likes to change 3.0.0 to 3, but "3.0.0" would be left as is and SV-Pop will strip the quotes).

PARAMS: Additional parameters:

  1. info: List of INFO fields to pull from the VCF
  2. format: List of FORMAT fields to pull from the VCF
  3. pass: List of FILTER column values to allow. The FILTER column must match at least one element in this set. If no list of filters is given (i.e. bare "pass" in the config line), then all variants are accepted. Note that "." in the VCF FILTER column is an implicit "PASS" (used by variant callers that do not have a PASS column).
  4. fill_del [default False]: If "True" (or bare "fill_del" value), then retrieve missing DEL sequences if they are not in the VCF (i.e. symbolic ALT with no INFO/SEQ). This allows intersects and merging using sequence match criteria on these events. Note that large erroneous DELs will require excessive memory to store sequences (often many times larger than the host genome).
  5. filter_gt [default True]: Filter variants on the GT column. For each record, if all GT alleles are "0" and/or ".", the variant record is dropped. Some callers write "./." for all GT records, and setting "filter_gt=False" will prevent all records from being dropped. If this is used in combination with a multi-sample VCF, all variant records will be kept in all samples (identical callsets across samples).
  6. cnv_deldup [default True]: Translate CNV records to either DEL or DUP based on the CN field and add them to the DUP and DEL callsets. The original CNV calls are still available as CNVs (i.e. "sv_cnv"). To disable adding CNVs to DEL and DUP, set "cnv_deldup=False".
  7. strict_sample [default False]: Require the VCF sample name to match the sample name being processed by SV-Pop even if there is only one sample column in the VCF.
  8. min_svlen: Minimum variant size for indels and SVs (supported by pavbedhap and VCF parsers). Value is inclusive.
  9. max_svlen: Maximum variant size for indels and SVs (supported by pavbedhap and VCF parsers). Value is inclusive.
  10. fill_ref [default False]: If the input VCF has N's in the REF column and variant calls are not resolved by symbolic ALTs (i.e. ALT="" with INFO columns describing the variant), then the reference alleles must be filled in. Set this parameter to "True" to tell SV-Pop to fill in N's in the REF column before resolving variants.

Running SV-Pop

To run SV-Pop, first install SV-Pop to a directory (called the "PIPELINE_DIR" in this guide). Then, change to a working directory that will contain results (called "WORKING_DIR"). Place configuration files in this WORKING_DIR location (config JSON and samples TSV, see above). Lastly, link Snakefile and profiles to the WORKING_DIR:

cd WORKING_DIR

ln -s /path/to/PIPELINE_DIR/Snakefile ./
ln -s /path/to/PIPELINE_DIR/profiles ./

To run, execute snakemake directly. It can be run locally or distributed over a cluster. SV-Pop has profiles setup for local and for Slurm. To exeucute over Slurm, snakemake-executor-plugin-slurm must be installed with snakemake.

Local: snakemake --profile profiles/local -c CORES TARGET

Slurm: snakemake --profile profiles/slurm --executor slurm -j JOBS TARGET

TARGET is the SV-Pop target (file names) to generate. See TARGETS.md for a list of target output files.

CORES is the maximum number of cores to consume (local) and JOBS is the maximum number of consecutive jobs to schedule (slurm). As jobs complete, the next jobs are scheduled.

Wildcards

The above path contains definitions for several wildcards:

results/variant/{sourcetype}/{sourcename}/{sample}/{filter}/{svset}/bed/{vartype}_{svtype}.bed.gz
  1. sourcetype: Type of input. "caller" (single caller/sample), "sampleset" (merge across samples), or "callerset" ( merge across callers).
  2. sourcename: Name of the source.
    • sourcetype caller: Matches NAME in config/samples.tsv
    • sourcetype sampleset or callerset: Configured sampleset or callerset merge (defined in config/config.json)
  3. sample: Name of the sample
    • sampleset: Name of a list of samples defined in config/samples.tsv in section samplelist.
  4. filter: An early filter applied before any annotation or merging. These filters are BED files of regions that should not be included (any variant intersect is dropped). "all" indicates no filtering.
  5. svset: A powerful filter for subsetting variants (e.g. "notr" for no tandem repeats). Can be used to filter by annotations inside the variant BED (e.g. variant length) or outside (e.g. intersection with annotated regions). This is a customizable feature. "all" indicates no filtering.
  6. vartype: Type of variant (snv, sv, indel, dup, rgn, or sub).
  7. svtype: Subtype of variant (snv, ins, del, inv, dup, rgn, sub).

Annotations will have custom parameter wildcards that determine their behavior.

Annotations

The pipeline does many annotations including intersecting with UCSC (see below), running TRF and RepeatMasker on inserted or deleted SV sequence, reference mapping location for SV insertions (finding SV donor sites for duplications), and homopolymer run intersections.

  • UCSC tracks: GRC patches, centromeres, gaps, AGP, chromosome band, RefSeq (CDS, ncRNA, intron, upstream/downstream flank), TRF, segmental duplications (SD), RepeatMasker, CpG islands, ENCODE histone marks, and ORegAnno.

Example:

results/variant/caller/pav/HG00733/all/all/anno/refseq/refseq-count_sv_ins.tsv.gz

This file requests RefSeq intersections for PAV HG00733 SV insertions. The table counts the number of bases affected within coding regions, UTRs, introns, and ncRNA exons.

Merging and intersecting variants

The pipeline can merge variants in two ways:

  1. Callerset: Merge variants from different callers for the same sample.
  2. Sampleset: Merge variants from different samples.

Variants can also be intersected without merging, which deterimines which variants from two sources are alike and which are different.

See MERGE.md for information on merging and intersecting.

Cite

Cite the current version in new publications (Ebert 2021):

Ebert P, Audano PA, Zhu Q, Rodriguez-Martin B, Porubsky D, Bonder MJ, Sulovari A, Ebler J, Zhou W, Serra Mari R, Yilmaz F, Zhao X, Hsieh P, Lee J, Kumar S, Lin J, Rausch T, Chen Y, Ren J, Santamarina M, Höps W, Ashraf H, Chuang NT, Yang X, Munson KM, Lewis AP, Fairley S, Tallon LJ, Clarke WE, Basile AO, Byrska-Bishop M, Corvelo A, Evani US, Lu TY, Chaisson MJP, Chen J, Li C, Brand H, Wenger AM, Ghareghani M, Harvey WT, Raeder B, Hasenfeld P, Regier AA, Abel HJ, Hall IM, Flicek P, Stegle O, Gerstein MB, Tubio JMC, Mu Z, Li YI, Shi X, Hastie AR, Ye K, Chong Z, Sanders AD, Zody MC, Talkowski ME, Mills RE, Devine SE, Lee C, Korbel JO, Marschall T, Eichler EE. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science. 2021 Apr 2;372(6537):eabf7117. doi: 10.1126/science.abf7117. Epub 2021 Feb 25. PMID: 33632895; PMCID: PMC8026704.

The pipeline was originally published in 2019 (Audano 2019):

Audano PA, Sulovari A, Graves-Lindsay TA, Cantsilieris S, Sorensen M, Welch AE, Dougherty ML, Nelson BJ, Shah A, Dutcher SK, Warren WC, Magrini V, McGrath SD, Li YI, Wilson RK, Eichler EE. Characterizing the Major Structural Variant Alleles of the Human Genome. Cell. 2019 Jan 24;176(3):663-675.e19. doi: 10.1016/j.cell.2018.12.019. Epub 2019 Jan 17. PMID: 30661756; PMCID: PMC6438697.

This is NOT the same as "SV-Pop: population-based structural variant analysis and visualization" (Ravenhall et al. 2019. BMC Bioinformatics). It was named before that paper came out.