
Takes Manta VCFs or BEDPE file formats and annotates the structural variants


5im1z/MantaSVAnnotator

 
 


MANTA_vcf2bedpe

The VCF-to-BEDPE workflow is designed to prepare Manta calls for SV annotation. During the preparation, a few predetermined filtering steps are applied.

1. Manta calls shorter than 50 bp are removed.

2. Manta calls labeled as 'IMPRECISE' by Manta are removed. These calls lack additional metadata, such as homology, that affects downstream analysis, and they may also lack a precise breakpoint location.

3. Manta calls that align to any chromosome other than 1-22, X, and Y are removed.

The removed Manta calls are written to the same output directory in a minimally processed file, in case further investigation is desired. This file is named after the sample, with a {%.removed_calls} ending.
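The three filters above can be sketched in base R on a toy table. This is only an illustration, not the code in MANTA_vcf2bedpe.R: the column names (`chrom`, `start`, `end`, `imprecise`) and the example rows are assumptions, size is taken as `end - start`, and interchromosomal calls (which have no single span) are not handled.

```r
# Toy stand-in for parsed Manta records; column names are hypothetical.
calls <- data.frame(
  chrom     = c("chr1", "chr2", "chrUn_KI270742v1", "chrX", "chr7"),
  start     = c(1000, 5000, 200, 30000, 100),
  end       = c(1030, 9000, 5000, 45000, 2100),
  imprecise = c(FALSE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Accept chromosome names with or without a "chr" prefix.
allowed   <- c(1:22, "X", "Y")
off_chrom <- !(sub("^chr", "", calls$chrom) %in% allowed)

# Filter 1: calls shorter than 50 bp.
too_short <- (calls$end - calls$start) < 50

# A call is dropped if it fails any of the three filters.
drop <- too_short | calls$imprecise | off_chrom

kept    <- calls[!drop, ]  # would go on to the BEDPE output
removed <- calls[drop, ]   # would go to the removed-calls file
```

Keeping `removed` as its own table mirrors the idea of writing filtered-out calls to a separate file instead of discarding them.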

This code is an R implementation of the svtools vcf2bedpe function, which can be used instead. Note that the filters applied differ between the two.

Example Usage

Rscript MANTA_vcf2bedpe.R -i <Required:path to vcf file> -o <Required:output_directory_path/>

MantaSVAnnotator

Takes a Manta BEDPE file produced either by the MANTA_vcf2bedpe function or by svtools. Annotates each breakpoint to determine whether it falls in a GENCODE-identified region. This function also outputs the genes present in the TAD in which the SV occurs.
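As a rough illustration of breakpoint annotation, the sketch below looks up which gene intervals contain each breakpoint of an SV. It is not the logic of Manta_SV_Annotator_2.R: the `genes_at` helper, the column names, and the gene IDs are all hypothetical, and the TAD lookup is left out.

```r
# Toy gene table; IDs and coordinates are made up for illustration.
genes <- data.frame(
  chrom   = c("chr1", "chr1", "chr2"),
  start   = c(100, 5000, 100),
  end     = c(2000, 9000, 900),
  gene_id = c("GENE_A", "GENE_B", "GENE_C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: IDs of genes whose interval contains a position.
genes_at <- function(chrom, pos, genes) {
  hit <- genes$chrom == chrom & genes$start <= pos & genes$end >= pos
  genes$gene_id[hit]
}

# Toy BEDPE-like table with one position per breakpoint.
sv <- data.frame(
  chrom1 = c("chr1", "chr1"), pos1 = c(1500, 3000),
  chrom2 = c("chr1", "chr2"), pos2 = c(6000, 500),
  stringsAsFactors = FALSE
)

# Collapse multiple hits into one ";"-separated string per breakpoint;
# an empty string means the breakpoint is intergenic in this toy table.
annotate <- function(chrom, pos) {
  mapply(function(ch, p) paste(genes_at(ch, p, genes), collapse = ";"),
         chrom, pos, USE.NAMES = FALSE)
}

sv$genes1 <- annotate(sv$chrom1, sv$pos1)
sv$genes2 <- annotate(sv$chrom2, sv$pos2)
```

A linear scan per breakpoint is fine for a toy table; for genome-scale gene lists an interval-overlap package would likely be a better fit.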

Uses fuzzy filtering against gnomAD germline SVs to identify somatic events. {%.sv.annotated.bedpe} contains both germline and somatic annotated events. {%.somatic_only_sv.annotated.bedpe} contains all SV annotations that were not within 100 bp of a perfect match in the gnomAD germline SV reference, or that have a span greater than 1000 bp.
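One way to picture the fuzzy germline filter is a tolerance test on both breakpoints. The sketch below is an assumption about how such a test could look, not the script's actual rule: the `near_germline` helper and the column names are hypothetical, a match requires the same chromosome and both ends within 100 bp, and the 1000 bp span criterion is deliberately left out.

```r
# Toy germline reference and toy SV calls; coordinates are made up.
germline <- data.frame(
  chrom = c("chr1", "chr3"),
  start = c(10000, 500),
  end   = c(20000, 800),
  stringsAsFactors = FALSE
)
sv <- data.frame(
  chrom = c("chr1", "chr1", "chr3"),
  start = c(10050, 10500, 480),
  end   = c(19960, 20000, 820),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if any germline SV on the same chromosome
# has both its start and its end within `tol` bp of the query SV.
near_germline <- function(chrom, start, end, germline, tol = 100) {
  any(germline$chrom == chrom &
        abs(germline$start - start) <= tol &
        abs(germline$end - end) <= tol)
}

sv$germline_like <- mapply(near_germline, sv$chrom, sv$start, sv$end,
                           MoreArgs = list(germline = germline),
                           USE.NAMES = FALSE)

# Calls without a nearby germline match are treated as somatic here.
somatic_only <- sv[!sv$germline_like, ]
```

Requiring both ends to agree keeps a call from being flagged as germline just because one breakpoint happens to coincide with a known variant.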

Use Manta_SV_Annotator_2 (REQUIRED inputs are input file path, output directory, gene annotation file, and germline annotation file).

Use hg38_ensembl_genelocations_formatted.txt as the gene annotation file.

Use hg38_ensembl_exonlocations_formatted.txt.zip as the exon annotation file (it needs to be unzipped first).

Use gnomad_germline_hg38all.txt as the germline annotation file.

Example usage

Rscript Manta_SV_Annotator_2.R \
  -i <Required:input input_filepath> \
  -o <Required:output directory_path/> \
  -r <Required:gene annotation_filepath> \
  -e <Required:exon annotation_filepath> \
  -g <Required:germline reference_filepath> \
  -c <Optional:cores (default = 1)>

Reference files

Ensembl gene and exon locations were downloaded from BioMart (http://nov2020.archive.ensembl.org/biomart). Formatting restricted the table to gene boundaries, chromosome, and gene ID to reduce file size.

gnomAD germline SV reference files were downloaded from the gnomAD project public database (https://gnomad.broadinstitute.org/downloads, SV 2.1 (controls) sites BED). They were then restricted to variants that passed gnomAD's filtering process. The rtracklayer (https://bioconductor.org/packages/release/bioc/html/rtracklayer.html) implementation of liftOver was used to translate the hg19 positions to hg38. Insertions had their endpoints adjusted to reflect the size of the insertion relative to the reference genome.
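The PASS restriction and the insertion endpoint adjustment could look roughly like the sketch below. It makes several assumptions that are not confirmed by this README: the column names (`svtype`, `svlen`, `filter`) are hypothetical, the rows are invented, the adjustment is taken to mean `end = start + svlen`, and the liftOver step is omitted because it needs an external chain file.

```r
# Toy stand-in for a gnomAD SV sites table; values are made up.
gnomad <- data.frame(
  chrom  = c("chr1", "chr1", "chr2"),
  start  = c(1000, 5000, 700),
  end    = c(1001, 8000, 701),
  svtype = c("INS", "DEL", "INS"),
  svlen  = c(300, 3000, 52),
  filter = c("PASS", "PASS", "LOW_CALL_RATE"),
  stringsAsFactors = FALSE
)

# Keep only sites that passed filtering.
passed <- gnomad[gnomad$filter == "PASS", ]

# Assumed adjustment: stretch each insertion so that its end reflects
# the inserted length; other SV types keep their original end.
is_ins <- passed$svtype == "INS"
passed$end[is_ins] <- passed$start[is_ins] + passed$svlen[is_ins]
```

Giving insertions a span comparable to their size lets them be compared with called SVs by interval, which a single-base reference coordinate would not allow.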

File pipeline

FILE_NAME.somaticSV.vcf --> [MANTA_vcf2bedpe.R] --> FILE_NAME.somaticSV.vcf.bedpe + FILE_NAME.somaticSV.vcf_removed_calls
FILE_NAME.somaticSV.vcf.bedpe --> [Manta_SV_Annotator_2.R] --> FILE_NAME.somaticSV.somatic_only_sv.annotated.bedpe + FILE_NAME.somaticSV.sv.annotated.bedpe
