Somatic mutations occur on specific haplotypes. Short-read sequencing obscures the original haplotype structure, so important information about the cis/trans relationships of somatic events is lost. This work uses linked-read WGS data to leverage germline haplotype structure and infer the haplotypic context of somatic mutations.
This repository contains submodules. Use this command when you clone:
git clone --recurse-submodules https://github.com/ding-lab/SomaticHaplotype
If you have already cloned this repository without using --recurse-submodules, you can retroactively initialize and update the submodules recursively using:
git submodule update --init --recursive
Source: https://git-scm.com/book/en/v2/Git-Tools-Submodules
To create and activate the SomaticHaplotype conda environment with the correct Python version and Python modules, run:
cd SomaticHaplotype
bash external_downloads/set_up_SomaticHaplotype_environment.sh
python SomaticHaplotype.py [module] [output directory] [output prefix]
python SomaticHaplotype.py --help
usage: SomaticHaplotype.py [-h] [--bam BAM] [--vcf VCF] [--vcf_id VCF_ID]
                           [--range RANGE] [--pb1 PB1] [--pb2 PB2] [--sum SUM]
                           [--maf MAF] [--sombx SOMBX] [--variant VARIANT]
                           [--ibd IBD] [--hbd HBD] [--dem DEM] [--version]
                           module output_directory output_prefix

positional arguments:
  module             Module the program should run. Could be one of phaseblock,
                     summarize, extend, somatic, or ancestry.
  output_directory   Absolute or relative path to output directory
  output_prefix      Prefix for file names in output directory. Warning:
                     existing files in output_directory with same prefix will
                     be overwritten.

optional arguments:
  -h, --help         show this help message and exit
  --bam BAM          Path to bam file
  --vcf VCF          Path to VCF file
  --vcf_id VCF_ID    Sample ID from VCF file
  --range RANGE      Genomic range chr:start-stop, chr, chr:start, chr:-stop
  --pb1 PB1          Path to first phase block file
  --pb2 PB2          Path to second phase block file
  --sum SUM          Path to existing summary file
  --maf MAF          Path to sample-specific somatic MAF (assumes all variants
                     are associated with single sample)
  --sombx SOMBX      Path to file containing barcodes supporting somatic MAF
                     variants extracted from BAM
  --variant VARIANT  Path to file containing newline-separated variant IDs,
                     format CHROM:POS:REF:ALT (ALT is comma separated list of
                     each ALT variant)
  --ibd IBD          Path to file reporting IBD (identical-by-descent)
                     segments, reported in Refined-IBD format
  --hbd HBD          Path to file reporting HBD (homozygous-by-descent)
                     segments, reported in Refined-IBD format
  --dem DEM          Demographic information about reference population used
                     in IBD analysis. Tab-separated columns:
                     sample/pop/super_pop/sex
  --version          show program's version number and exit
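
The --range and --variant options in the help text accept compact string formats. As a rough illustration only (this is not the parser used inside SomaticHaplotype.py), the Python sketch below shows one way those strings could be interpreted. The names GenomicRange, VariantID, parse_range, and parse_variant_id are hypothetical, and treating a missing start or stop as an open end of the chromosome is an assumption, as is the absence of colons in chromosome names.

```python
from typing import NamedTuple, Optional, Tuple


class GenomicRange(NamedTuple):
    chrom: str
    start: Optional[int]  # None: assumed to mean "from the start of the chromosome"
    stop: Optional[int]   # None: assumed to mean "to the end of the chromosome"


class VariantID(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alts: Tuple[str, ...]


def parse_range(text: str) -> GenomicRange:
    """Parse chr:start-stop, chr, chr:start, or chr:-stop (hypothetical helper)."""
    chrom, _, span = text.strip().partition(":")
    if not chrom:
        raise ValueError(f"missing chromosome in range: {text!r}")
    if not span:
        # Bare chromosome name: no coordinates given.
        return GenomicRange(chrom, None, None)
    start_text, _, stop_text = span.partition("-")
    start = int(start_text) if start_text else None
    stop = int(stop_text) if stop_text else None
    return GenomicRange(chrom, start, stop)


def parse_variant_id(text: str) -> VariantID:
    """Parse CHROM:POS:REF:ALT, where ALT is a comma-separated list (hypothetical helper)."""
    chrom, pos, ref, alt = text.strip().split(":")
    return VariantID(chrom, int(pos), ref, tuple(alt.split(",")))


if __name__ == "__main__":
    for example in ("chr1:1000-2000", "chr1", "chr1:1000", "chr1:-2000"):
        print(example, parse_range(example))
    print(parse_variant_id("chr1:12345:A:G,T"))
```

A file passed to --variant could then be read line by line and each non-empty line handed to parse_variant_id, again under the assumptions stated above.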
| module | arguments |
|---|---|
| phaseblock | --bam, --vcf, --vcf_id, --range |
| summarize | --pb1 |
| extend | --sum, --pb1, --pb2, --range |
| somatic | --pb1, --range, --maf xor --variant, --sum |
| ancestry | --pb1, --vcf, --vcf_id, --range, --ibd, --hbd, --dem |
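
The table can also be read as a checklist before launching a module. The following sketch assumes that every argument listed for a module is required and that "xor" means exactly one of --maf or --variant must be given; neither assumption is confirmed by the help text. MODULE_ARGUMENTS and build_command are hypothetical helper names, and the file paths in the usage lines are placeholders.

```python
from typing import Dict, List

# Arguments per module, copied from the table (assumed to be required).
MODULE_ARGUMENTS: Dict[str, List[str]] = {
    "phaseblock": ["bam", "vcf", "vcf_id", "range"],
    "summarize": ["pb1"],
    "extend": ["sum", "pb1", "pb2", "range"],
    "somatic": ["pb1", "range", "sum"],  # plus exactly one of maf / variant
    "ancestry": ["pb1", "vcf", "vcf_id", "range", "ibd", "hbd", "dem"],
}


def build_command(module: str, output_directory: str, output_prefix: str,
                  **options: str) -> List[str]:
    """Check options against the table, then build an argument list."""
    if module not in MODULE_ARGUMENTS:
        raise ValueError(f"unknown module: {module}")
    if module == "somatic" and ("maf" in options) == ("variant" in options):
        raise ValueError("somatic expects exactly one of --maf or --variant")
    missing = [name for name in MODULE_ARGUMENTS[module] if name not in options]
    if missing:
        flags = ", ".join(f"--{name}" for name in missing)
        raise ValueError(f"{module} is missing: {flags}")
    command = ["python", "SomaticHaplotype.py"]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    # Positional arguments follow the order in the usage line.
    command += [module, output_directory, output_prefix]
    return command


if __name__ == "__main__":
    # Placeholder path; substitute a real phase block file.
    command = build_command("summarize", "output", "sample1",
                            pb1="sample1_phase_blocks.tsv")
    print(" ".join(command))
```

The returned list can be passed to subprocess.run if the check succeeds; building a list rather than a single string avoids shell quoting problems with paths that contain spaces.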
Download the test data archive (test_data.tar.gz) from figshare (size: 1.3 GB) and extract it with this command:
tar xvf test_data.tar.gz
The following script runs SomaticHaplotype.py on data in test_data/ and creates output in test_output/.
bash run_SomaticHaplotype_on_test_data.sh
We are in the process of sharing controlled access to lrWGS and WGS data from this study. (updated 2023-08-14)