diff --git a/CHANGES.md b/CHANGES.md index 2d907665..9b0be8b7 100644 --- a/CHANGES.md +++ b/CHANGES.md @@ -1,7 +1,11 @@ # Strobealign Changelog -## development version +## v0.13.0 (2024-03-04) +* #394: Added option `--aemb` (abundance estimation for metagenomic binning), + which makes strobealign output a table with estimated abundance values for + each contig (instead of SAM or PAF). This was contributed by Shaojun Pan + (@psj1997). * #386: Parallelize indexing even more by using @alugowski’s [poolSTL](https://github.com/alugowski/) `pluggable_sort`. Indexing a human reference (measured on CHM13) now takes only ~45 s on a diff --git a/CMakeLists.txt b/CMakeLists.txt index 988b1235..92b07c18 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -1,6 +1,6 @@ cmake_minimum_required(VERSION 3.16) -project(strobealign VERSION 0.12.0) +project(strobealign VERSION 0.13.0) include(FetchContent) option(ENABLE_AVX "Enable AVX2 support" OFF) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 29f5ba25..af7ba6fe 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -24,7 +24,7 @@ If needed, run `make` with `VERBOSE=1` to get more logging output. After CMake has been run, you can use this one-liner to compile strobealign and run the tests: ``` -make -j -C build && tests/run.sh +make -s -j -C build && tests/run.sh ``` Whenever you make changes that could potentially affect mapping results, you can diff --git a/README.md b/README.md index 5d736fa4..ab3fee37 100644 --- a/README.md +++ b/README.md @@ -77,47 +77,68 @@ as a developer. ### Python bindings -Experimental and incomplete Python bindings can be installed with +Experimental Python bindings can be installed with `pip install .`. The only documentation for the moment are the tests in `tests/*.py`. ## Usage +To align paired-reads against a reference FASTA and produce a sorted BAM file, +using eight threads: ``` -strobealign ref.fa reads.fq > output.sam # Single-end reads -strobealign ref.fa reads1.fq reads2.fq > output.sam # Paired-end reads - -strobealign -x ref.fa reads.fq > output.paf # Single-end reads mapping only (PAF) -strobealign -x ref.fa reads1.fq reads2.fq > output.paf # Paired-end reads mapping only (PAF) +strobealign -t 8 ref.fa reads.1.fastq.gz reads.2.fastq.gz | samtools sort -o sorted.bam ``` -To use interleaved files, use the `--interleaved` flag: - +For single-end reads: ``` -strobealign ref.fa reads.fq --interleaved > output.sam # Single and/or paired-end reads +strobealign -t 8 ref.fa reads.fastq.gz | samtools sort -o sorted.bam ``` -To report secondary alignments, set parameter `-N [INT]` for a maximum of `[INT]` secondary alignments. - -The above commands are suitable for interactive use and test runs. -For normal use, avoid creating SAM files on disk as they get very large compared -to their compressed BAM counterparts. Instead, either pipe strobealign’s output -into `samtools view` to create unsorted BAM files: +For mixed reads (the input file can contain both single and paired-end reads): ``` -strobealign ref.fa reads.1.fastq.gz reads.2.fastq.gz | samtools view -o mapped.bam +strobealign -t 8 ref.fa --interleaved reads.fastq.gz | samtools sort -o sorted.bam ``` -Or use `samtools sort` to create a sorted BAM file: + +In alignment mode, strobealign produces SAM output. By piping the output +directly into `samtools`, the above commands avoid creating potentially large +intermediate SAM files and also reduce disk I/O. + +To produce unsorted BAM, use `samtools view` instead of `samtools sort`. + + +### Mapping-only mode + +The command-line option `-x` switches strobealign into mapping-only mode, +in which it will output [PAF](https://github.com/lh3/miniasm/blob/master/PAF.md) +files instead of SAM. For example: ``` -strobealign ref.fa reads.1.fastq.gz reads.2.fastq.gz | samtools sort -o sorted.bam +strobealign -x -t 8 ref.fa reads.1.fastq.gz reads.2.fastq.gz | igzip > output.paf.gz ``` -This is usually faster than doing the two steps separately because fewer -intermediate files are created. +`igzip` is a faster version of gzip that is part of +[ISA-L](https://github.com/intel/isa-l/). +If it is not available, replace it with `pigz` or regular `gzip` in the +command. + + +## Abundance estimation mode (for metagenomic binning) + +The command-line option `--aemb` switches strobealign into abundance estimation +mode, intended for metagenomic binning. +In this mode, strobealign outputs a single table with abundance values in +tab-separated value format instead of SAM or PAF. -To output the estimated abundance of every contig, the format of output file is: contig_id \t abundance_value: +Paired-end example: ``` -strobealign ref.fa reads.fq --aemb > abundance.txt # Single-end reads -strobealign ref.fa reads1.fq reads2.fq --aemb > abundance.txt # Paired-end reads +strobealign -t 8 --aemb ref.fa reads.1.fastq.gz reads.2.fastq.gz > abundances.tsv ``` +The output table contains one row for each contig of the reference. +The first column is the reference/contig id and the second its abundance. + +The abundance is the number of bases mapped to a contig, divided by the length +of the contig. Reads mapping to *n* different locations are weighted 1/*n*. + +Further columns may be added to this table in future versions of strobealign. + ## Command-line options @@ -127,12 +148,11 @@ options. Some important ones are: * `-r`: Mean read length. If given, this overrides the read length estimated from the input file(s). This is usually only required in combination with `--create-index`, see [index files](#index-files). -* `-t N`, `--threads=N`: Use N threads. This mainly applies to the mapping step - as the indexing step is only partially parallelized. +* `-t N`, `--threads=N`: Use N threads (both for mapping and indexing). * `--eqx`: Emit `=` and `X` CIGAR operations instead of `M`. * `-x`: Only map reads, do not do no base-level alignment. This switches the output format from SAM to [PAF](https://github.com/lh3/miniasm/blob/master/PAF.md). -* `--aemb`: Output the estimated abundance value of every contig, the format of output file is: contig_id \t abundance_value. +* `--aemb`: Output estimated abundance value of each contig, see section above. * `--rg-id=ID`: Add RG tag to each SAM record. * `--rg=TAG:VALUE`: Add read group metadata to the SAM header. This can be specified multiple times. Example: `--rg-id=1 --rg=SM:mysamle --rg=LB:mylibrary`. @@ -172,11 +192,11 @@ To create an index, use the `--create-index` option. Since strobealign needs to know the read length, either provide it with read file(s) as if you wanted to map them: - strobealign --create-index ref.fa reads.1.fastq.gz reads.2.fastq.gz + strobealign --create-index -t 8 ref.fa reads.1.fastq.gz reads.2.fastq.gz Or set the read length explicitly with `-r`: - strobealign --create-index ref.fa -r 150 + strobealign --create-index -t 8 ref.fa -r 150 This creates a file named `ref.fa.rX.sti` containing the strobemer index, where `X` is the canonical read length that the index is optimized for (see @@ -184,7 +204,7 @@ above). To use the index when mapping, provide option `--use-index` when doing the actual mapping: - strobealign --use-index ref.fa reads.1.fastq.gz reads.2.fastq.gz | samtools ... + strobealign --use-index -t 8 ref.fa reads.1.fastq.gz reads.2.fastq.gz | samtools ... - Note that the `.sti` files are usually tied to a specific strobealign version. That is, when upgrading strobealign, the `.sti` files need to be regenerated. diff --git a/setup.py b/setup.py index 7dbdbc17..12d8eaef 100644 --- a/setup.py +++ b/setup.py @@ -5,7 +5,7 @@ name="strobealign", description="Python bindings for strobealign", license="MIT", - version="0.12.0", + version="0.13.0", packages=["strobealign"], package_dir={"": "src/python"}, cmake_install_dir="src/python",