Novel-X detects and genotypes novel sequence insertions in 10X sequencing datasets using non-trivial read alignment signatures and barcode information.
To start working with Novel-X, clone this repository recursively:
git clone --recursive git@github.com:1dayac/Novel-X.git
If you cloned the repository non-recursively, Novel-X will not work. To fix this, run the following command from the Novel-X folder:
git submodule update --init --recursive
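A quick way to see whether the clone was recursive is to check that every submodule folder listed in .gitmodules exists and is non-empty. The following is a minimal sketch, not part of Novel-X; the helper name missing_submodules is hypothetical, and it assumes .gitmodules uses the usual "path = ..." lines.

```python
import re
from pathlib import Path


def missing_submodules(repo_root):
    """Return submodule paths from .gitmodules that are absent or empty.

    Hypothetical helper for illustration; not shipped with Novel-X.
    """
    repo_root = Path(repo_root)
    gitmodules = repo_root / ".gitmodules"
    if not gitmodules.is_file():
        # No .gitmodules means there is nothing to check.
        return []
    # Collect the value of every "path = <dir>" line.
    paths = re.findall(
        r"^[ \t]*path[ \t]*=[ \t]*(.+?)[ \t]*$",
        gitmodules.read_text(),
        flags=re.MULTILINE,
    )
    missing = []
    for rel in paths:
        sub = repo_root / rel
        # An uninitialised submodule is either missing or an empty folder.
        if not sub.is_dir() or not any(sub.iterdir()):
            missing.append(rel)
    return missing
```

Calling missing_submodules(".") from the Novel-X folder should return an empty list after a recursive clone; any path it reports indicates that the submodule update command above still needs to be run.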
Novel-X is a pipeline built on the Snakemake workflow management system. It consists of multiple steps and requires several external programs.
First, the following software should be installed (the version numbers used for testing are shown in parentheses, but other versions should also work):
- LongRanger (version 2.15) - Download Page
- Velvet (commit 9adf09f) - GitHub Page - an older but still useful assembler that makes minimal assumptions about the data. Note that we use a k-mer length of 63 when assembling 10X data, so Velvet should be compiled with the command
make 'MAXKMERLENGTH=63'
For more information, refer to the Velvet manual.
- BlastN (version 2.2.31) - Download Page
- Samtools (version 1.7) - Project Page
- SPAdes (version 3.13) - Project Page
- Quast (version 4.4) - Project Page
All these programs (except LongRanger) can be installed with conda. We provide a conda-env.yml file that installs them with the following command:
conda env create -f conda-env.yml
Paths to executables that are not in $PATH should be provided in the path_to_executables_config.json file.
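To find out which tools need an entry in that file, you can check which executables are visible on $PATH. This is a minimal sketch under stated assumptions: the executable names in REQUIRED_TOOLS are the names these programs are commonly installed under, not names confirmed by Novel-X, and find_missing_tools is a hypothetical helper. The sketch only reports missing tools and does not write path_to_executables_config.json, whose layout is not described here.

```python
import shutil

# Assumed executable names (not taken from the Novel-X sources).
REQUIRED_TOOLS = [
    "longranger",
    "velveth",
    "velvetg",
    "blastn",
    "samtools",
    "spades.py",
    "quast.py",
]


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on $PATH."""
    # shutil.which returns None when no matching executable is found.
    return [tool for tool in tools if which(tool) is None]
```

Any name returned by find_missing_tools() is a candidate for an explicit path in the config file.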
Python dependencies are listed in the requirements.txt file. They can be installed with the following command:
pip install -r requirements.txt
Inside the bxtools folder, run the following commands (estimated execution time is around 2 minutes):
./configure
make
make install
We tested our tool on CentOS Linux 7, but it should work on any modern Unix-like system.
Then you are ready to go.
Novel-X is run with the novel-x.py script, which has two modes:
- run - run the pipeline from scratch
- restart - if a previous run did not finish for some reason, you can try to resume it with the novel-x.py restart command.
A typical command to start Novel-X is
python novel-x.py run --bam my_bam.bam --genome my_genome.fasta --outdir my_dir
Optional arguments are:
- --lr20 - needed if you run the pipeline on a bam file produced by the LongRanger 2.0 pipeline
- --nt - optional filtering of non-human sequences from the orphan contigs
We added two option groups to handle different data types and their properties (molecule length, intra-molecule coverage, etc.).
Data option group:
- --10x - for 10X Genomics data [Default]
- --tellseq - for Tell-Seq data
- --stlfr - for stLFR data
Tell-Seq and stLFR data should be converted to a LongRanger-compatible bam. For stLFR data, use this pipeline. For Tell-Seq data, refer to the Tell-Seq paper.
Coverage group:
- --high-coverage - best for 60X coverage and higher [Default]
- --low-coverage - best for 20X-40X coverage
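The options above can be combined programmatically when launching many samples. The following is a minimal sketch of a command builder; build_novelx_command is a hypothetical helper, not part of Novel-X. It uses only the flags listed above and assumes that --nt takes a path to a BLAST database folder, as in the full example command in this README.

```python
def build_novelx_command(bam, genome, outdir, data="10x", coverage="high",
                         lr20=False, nt_db=None, script="novel-x.py"):
    """Build the argument list for a 'novel-x.py run' invocation.

    Hypothetical helper for illustration only.
    """
    if data not in {"10x", "tellseq", "stlfr"}:
        raise ValueError(f"unknown data type: {data}")
    if coverage not in {"high", "low"}:
        raise ValueError(f"unknown coverage mode: {coverage}")
    cmd = [
        "python", script, "run",
        "--bam", bam,
        "--genome", genome,
        "--outdir", outdir,
        f"--{data}",               # data option group
        f"--{coverage}-coverage",  # coverage option group
    ]
    if lr20:
        cmd.append("--lr20")
    if nt_db is not None:
        # Assumption: --nt is followed by the database path.
        cmd += ["--nt", nt_db]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True). Building a list rather than a shell string avoids quoting problems with paths that contain spaces.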
You can display the help message by typing:
python novel-x.py run --help
or
python novel-x.py restart --help
Novel-X writes its results to a vcf-file. If your bam-file is named HM2KYBBXX_NA18509.bam, the resulting vcf-file will be named HM2KYBBXX_NA18509.vcf and stored inside the outdir folder.
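For scripting around the pipeline, the expected output location can be derived from the input name. This is a minimal sketch of the naming rule described above; expected_vcf_path is a hypothetical helper, and it assumes the bam file name has a single .bam extension.

```python
from pathlib import Path


def expected_vcf_path(bam, outdir):
    """Return the path where the result vcf-file is expected.

    Hypothetical helper: swaps the final extension of the bam
    file name for .vcf and places the file inside outdir.
    """
    return Path(outdir) / (Path(bam).stem + ".vcf")
```

For example, expected_vcf_path("/data/HM2KYBBXX_NA18509.bam", "my_dir") points at my_dir/HM2KYBBXX_NA18509.vcf.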
Run from the start:
python ~/Novel-X/novel-x.py run --bam /athena/ihlab/scratch/dmm2017/70_samples_data/HLF3WBBXX_NA12006_longranger.bam -t 8 -m 200 --nt /athena/ihlab/scratch/dmm2017/blast_database/ --genome /athena/ihlab/scratch/dmm2017/hg38/hg38.fa --outdir /athena/ihlab/scratch/dmm2017/70_samples/novelx_NA12006
Restart from the last stage:
python ~/Novel-X/novel-x.py restart --outdir novelx_NA12006
There is currently a problem at the filter_target_contig stage: it can exit with a non-zero exit code. We recommend commenting out the following line before using the restart option:
parallel --jobs {THREADS} filter_target_contigs ::: {input.contigs}/*
We placed a toy dataset in the demo folder to test that the software is installed correctly. You can run the command:
python ~/Novel-X/novel-x.py run --bam ~/Novel-X/demo/demo.bam -t 1 -m 20 --genome ~/Novel-X/demo/demo.fasta --outdir out
This command takes about 15 minutes to finish on our hardware. It produces a vcf-file with a single vcf record.
chr1_25500000_25535000 29503 . T TGTATTGTGTGTATGAGGGTTGTGTGCTGTGTGTTGTGTATATATTGTATGTGTTATGTGTATGTATGTCGTATGAGTGTATTCTGTATATGTGTTTTGTGTGGTCTATTATGTATGTGGCATGTGTTGTGTATGTGTGTTGTGTGTGATGTGTTGTATGTGTGTTGTGCATATATGTTGTTTCTGTGTATGTATGTTATGTGTATGTGTATGTTGTGTTGTATGTATGGGTTGTGCCTATGTGCTGTGTTGTGTGCTGCATGCATGTTTGTGTGGTGTGTGTATTTAGGTTGTGTGCTATTTATGTGTCTATATTGTATGTGTTGTATGTGTGTTGTATGTATGTGTAGTGTATGTGTGTTGTGTGTGATGTGTATATGTGGTGTGTGTATGTCTGTTATGTGTATGTATGAGTGTATGTGTGTTGTGTGTGTTGTGTATATGTGTTGTGTGTGTTGTGTATGTGTGTTGTGAGTTGTGTATATGTGGTGTGAGTTGTGTTGTGTCATGTATGTGTGCATTGTGTATAGGTGTTGCATGTGTGTTGTGTTGTGTGTATGTGTTGTGTTGTGTATATGTGGTATGTGAATGTGTATGTTGTATGTTGTGTTGTATGTATATGTGTTATGTATATGTGATGTGTGTGTTGTGTATATGCTGGGTGTGTGTGTACATGTGTGTATGTGTGTTGTATGTATGTGTGTATGCATGTGTGTTGCGTATATGTGGTATGTGTGCATGTGTGTTGTCATGTGTATGTGTGTTGTGTATATGTGTGTGTTGTGTATATGTGTTGTGTGTATGTGTATCATGTTGTGTGTATGTGTTATGTTGTGTATATGTGGTGTGTGAATGTGTGTTGTGTGTATGTGTATGTTGTCTGTTTTGTGTGTGTATACGTGGTGTGTGTGTGTTGTGTTGTGTATATGTGTTGTGTGTGTTGCGTGTATGTGTTGTGTGTT . PASS DP=100 NODE_1_length_4180_cov_43.887399 2776 276 347 1306
Output may slightly differ based on your software versions.
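To compare your demo output with the record above without matching the long sequence by eye, you can extract the position and the length of the inserted sequence. This is a minimal sketch; parse_insertion_record is a hypothetical helper. It assumes the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, then remaining fields) and whitespace-separated columns.

```python
def parse_insertion_record(line):
    """Parse one VCF data line describing an insertion.

    Hypothetical helper for illustration only.
    """
    fields = line.split()
    chrom, pos, _id, ref, alt, _qual, flt = fields[:7]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "filter": flt,
        # ALT repeats the REF base, so the inserted sequence is the
        # difference in length between ALT and REF.
        "inserted_length": len(alt) - len(ref),
        # Everything after FILTER is kept unparsed.
        "extra": fields[7:],
    }
```

Header lines starting with # should be skipped before calling the function. Small differences in inserted_length between runs are consistent with the note above about software versions.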
The preprint "Novel sequence insertion detection using Linked-Reads" is available at https://www.biorxiv.org/content/10.1101/551028v1.
Feel free to send any inquiries to meleshko.dmitrii@gmail.com