This page describes how to assemble unitigs from high-coverage human samples sequenced with Illumina 100bp reads, and to construct the FM-index for a collection of human samples. The following procedure is used to assemble and index ~300 samples from Simons Genome Diversity Project (SGDP).
-
Download and unpack precompiled binaries:
wget -O- http://sf.net/projects/biobin/files/unclassified/sgdp-1.1_x64-linux.tar.bz2/download \ | bzip2 -dc | tar xvf -
This will create a directory
sgdp-1.0_x64-linux
. For simplicity, we assume you have put the directory on the$PATH
. -
Generate Makefiles:
fermi2.pl unitig -t8 -p foo "trimadap foo.fq.gz" > foo.mak
This requests 8 threads at the maximum and to write the results with prefix
foo
.trimadap
trims typical adapters used in Illumina resequencing. It is part of seqtk. If the input is BAM, you may replace with:fermi2.pl unitig -t8 -p foo "htscmd bam2fq foo.bam | trimadap" > foo.mak
If you are running this pipeline the first time, it is recommended to have a look at
foo.mak
. It is simple. -
Assemble unitigs:
make -f foo.mak
For 100bp human data at ~30X coverage, it usually take 2.5-3.5 days in real time with peak memory around 85GB. File
foo.mag.gz
contains the final unitigs. You can delete the rest of files, though it is usually a good idea to keep*.log
files for timing and trouble shooting. -
Append
*.mag.gz
in the working directory to an existing indexbar.fmr
:fermi2.pl mag2fmr -i bar.fmr *.mag.gz > all.mak make -f all.mak k8 sgdp-1.0_x64-linux/fermi2.js log2tbl *.fmr.log > all.tbl
This appends each set of unitigs to the previous FM-index. It takes about 30 minutes to add one sample. The last
.fmr
file is the final FM-index you should keep. If you do not have an existing index, simply drop the-i bar.fmr
option.all.tbl
keeps the sample information. -
Variant calling:
bwa index ref.fa; samtools faidx ref.fa bwa mem -B9 -O16 -L10 -t8 ref.fa foo.mag.gz | gzip -1 > foo.sam.gz samtools view -uS foo.sam.gz | samtools sort - foo htsbox pileup -cq40 -Q5 -f ref.fa foo.bam > foo.vcf
where
htsbox
is available via github.