containX is a prototype implementation of an algorithm that decides which contained reads can be dropped during overlap graph sparsification. Reads that are substrings of longer reads are typically referred to as contained reads. The string graph model filters out contained reads during graph construction, and commonly used long-read assemblers typically consider them redundant. However, removing all contained reads can lead to coverage gaps, especially in diploid and polyploid genomes and in metagenomes (see example below). Here we have implemented novel heuristics to distinguish redundant from non-redundant contained reads.
Clone the source code from the master branch.
git clone https://github.com/at-cg/containX.git
To compile, the software requires a C++ compiler with C++11 and OpenMP support, both of which are available by default in GCC >= 4.8.
cd containX
make
Expect the containX executable in your folder.
In the current algorithm, we assume that there are no sequencing errors (e.g., reads have been error-corrected). Future versions of the code will permit a small error rate. You will need a fastq file (say reads.fastq) to begin. Prior to using containX, use minimap2 (Li 2018) to compute read overlaps. Also use the hifiasm read overlapper (Cheng et al. 2021) to identify reads that are sampled from a non-repetitive region of a genome and have a heterozygous SNP. Minimap2 can be downloaded from here. A modified version of the hifiasm code can be obtained from here. Note that the modified code is available through the branch hifiasm_dev_debug. Use the following commands to run the pipeline (you may need to adjust the thread count).
minimap2 -t 32 -w 101 -k 27 -g 500 -B 8 -O 8,48 -E 4,2 -cx ava-ont reads.fastq reads.fastq > overlaps.paf
hifiasm --dbg-het-cnt -o hifiasmoutput -t 32 reads.fastq
cat hifiasmoutput.het_cnt.log | tr -d ">" | awk '{if ($2 > 0) {print $1}}' > hifiasmoutput.readids.txt
containX -t 32 -p hifiasmoutput.readids.txt -n nonRedundantContainedReads.txt reads.fastq overlaps.paf
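The read-ID filtering step above can be tried in isolation on a tiny made-up log. This is only a sketch: the sample lines below are hypothetical, and the assumed layout (a ">"-prefixed read ID followed by a whitespace-separated count) is inferred from the filter command itself, not from hifiasm documentation.

```shell
# Work in a scratch directory so no real files are touched.
hetdir=$(mktemp -d)

# Hypothetical het-count log: ">readID count" per line (layout is an assumption).
printf '>read1 3\n>read2 0\n>read3 1\n' > "$hetdir/sample.het_cnt.log"

# Same filter as in the pipeline: strip ">" and keep IDs whose second column is positive.
tr -d ">" < "$hetdir/sample.het_cnt.log" | awk '{if ($2 > 0) {print $1}}' > "$hetdir/sample.readids.txt"

cat "$hetdir/sample.readids.txt"
```

With this sample input, only read1 and read3 are kept, since read2 has a count of zero.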
The steps are the same as above, except that the hifiasm step can be skipped. Users are welcome to run containX on the simple examples provided in the data folder.
minimap2 -t 32 -w 101 -k 27 -g 500 -B 8 -O 8,48 -E 4,2 -cx ava-ont reads.fastq reads.fastq > overlaps.paf
containX -t 32 -n nonRedundantContainedReads.txt reads.fastq overlaps.paf
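To get a rough feel for what a contained read looks like in the overlap file, the sketch below lists reads whose full length is covered by an overlap with another read. It relies only on the standard PAF column layout (query name, length, start, end, strand, then target name, length, start, end). The sample records are hypothetical, and this simple test is an illustration under the error-free assumption, not the criterion containX itself applies.

```shell
# Scratch directory with a hypothetical three-record PAF file (tab-separated).
pafdir=$(mktemp -d)
printf 'readA\t1000\t0\t1000\t+\treadB\t5000\t2000\t3000\t1000\t1000\t60\n'  > "$pafdir/sample.paf"
printf 'readB\t5000\t4000\t5000\t+\treadC\t6000\t0\t1000\t1000\t1000\t60\n' >> "$pafdir/sample.paf"
printf 'readD\t800\t0\t800\t-\treadC\t6000\t100\t900\t800\t800\t60\n'       >> "$pafdir/sample.paf"

# A read is reported when the overlap spans it end to end:
#   query side:  start == 0 and end == query length
#   target side: start == 0 and end == target length
# Self-overlaps (same name on both sides) are ignored.
awk -F'\t' '$1 != $6 {
    if ($3 == 0 && $4 == $2) print $1
    if ($8 == 0 && $9 == $7) print $6
}' "$pafdir/sample.paf" | sort -u > "$pafdir/contained_candidates.txt"

cat "$pafdir/contained_candidates.txt"
```

Here readA and readD are reported, while the readB/readC record is a suffix-prefix overlap and produces nothing. Two reads of identical sequence would each be reported as contained in the other, so a real tool needs a tie-breaking rule.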
- "Coverage-preserving sparsification of overlap graphs for long-read assembly". Bioinformatics, 2023.