From 4671799b707506af47f8d3a4b180f613a73aea33 Mon Sep 17 00:00:00 2001
From: Andrey Prjibelski
Date: Fri, 31 May 2024 01:02:51 +0200
Subject: [PATCH] move spaligner readme, add citations and links

---
 docs/binspreader.md | 20 +++++-
 docs/getting-started.md | 18 ++---
 docs/installation.md | 4 +-
 docs/pathracer.md | 21 ++++--
 .../pipeline.jpg => docs/spaligner.jpg | Bin
 .../spaligner/README.md => docs/spaligner.md | 66 +++++++++++++-----
 docs/standalone.md | 13 ++--
 mkdocs.yml | 1 +
 8 files changed, 100 insertions(+), 43 deletions(-)
 rename src/projects/spaligner/pipeline.jpg => docs/spaligner.jpg (100%)
 rename src/projects/spaligner/README.md => docs/spaligner.md (82%)

diff --git a/docs/binspreader.md b/docs/binspreader.md
index 6b23a4be77..2d0adbb53b 100644
--- a/docs/binspreader.md
+++ b/docs/binspreader.md
@@ -16,6 +16,16 @@ source of information for refining. Optionally, BinSPreader can be provided with
 multiple Hi-C and/or paired-end libraries.
 The [BinSPreader protocol](https://star-protocols.cell.com/protocols/2802) contains more detailed instructions on installing and running BinSPreader.

+## Compilation
+
+To compile BinSPreader, run
+
+```
+./spades_compile -SPADES_ENABLE_PROJECTS=binspreader
+```
+
+After the compilation is complete, `binspreader` executable will be located in the `bin/` folder.
+
 ## Command line options

 Required positional arguments:
@@ -69,7 +79,7 @@ binspreader [OPTION...]
 Labels correction regularization parameter for labeled data (default: 0.6)

-### Output
+## Output

 BinSPreader stores all output files in the output directory ` ` set by the user.
 - `/binning.tsv` contains refined binning in `.tsv` format
@@ -83,3 +93,11 @@ In addition
 - `/bin_label_1.fastq, /bin_label_2.fastq` read set for bin labeled by `bin_label` (if `--reads` was used)
 - `/pe_links.tsv` list of paired-end links between assembly graph edges with weights (if `--debug` was used)
 - `/graph_links.tsv` list of graph links between assembly graph edges with weights (if `--debug` was used)
+
+
+## References
+
+If you are using **BinSPreader** in your research, please cite:
+
+[Tolstoganov et al., 2022](https://www.cell.com/iscience/pdf/S2589-0042(22)01042-2.pdf) and
+[Ochkalova et al., 2023](https://www.sciencedirect.com/science/article/pii/S2666166723003842).
\ No newline at end of file
diff --git a/docs/getting-started.md b/docs/getting-started.md
index a3f650c2d7..cbad1dce9e 100644
--- a/docs/getting-started.md
+++ b/docs/getting-started.md
@@ -89,20 +89,20 @@ bin/spades.py --rnaviral -1 left.fastq.gz -2 right.fastq.gz -o output_folder

 ## Standalone SPAdes tools

-- `spades-kmercount` - k-mer counting;
+- [`spades-kmercount`](standalone.md#k-mer-counter) - k-mer counting;

-- `spades-read-filter` - read filtering using k-mer coverage;
+- [`spades-read-filter`](standalone.md#k-mer-coverage-read-filter) - read filtering using k-mer coverage;

-- `spades-kmer-estimating` - estimating number of unique k-mers;
+- [`spades-kmer-estimating`](standalone.md#k-mer-cardinality-estimating) - estimating number of unique k-mers;

-- `spades-gbuilder` - assembly graph construction;
+- [`spades-gbuilder`](standalone.md#graph-construction) - assembly graph construction;

-- `spades-gsimplifier` - assembly graph simplification;
+- [`spades-gsimplifier`](standalone.md#graph-simplification) - assembly graph simplification;

-- `spalgner` - alignment of long reads to assembly graph;
+- [`spaligner`](spaligner.md) - alignment of long reads to assembly graph;

-- `spades-gmapper` - specific alignment of long reads to assembly graph used in hybrid assembly pipeline;
+- [`spades-gmapper`](standalone.md#hybridspades-aligner) - specific alignment of long reads to assembly graph used in hybrid assembly pipeline;

-- `binspreader` - refinement of metagenome-assembled genomes;
+- [`binspreader`](binspreader.md) - refinement of metagenome-assembled genomes;

-- `pathracer` - alignment of profile HMMs to assembly graph.
+- [`pathracer`](pathracer.md) - alignment of profile HMMs to assembly graph.
diff --git a/docs/installation.md b/docs/installation.md
index 91a5a5f3ef..f4961cb4bc 100644
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -88,7 +88,7 @@ for example:

 which will install SPAdes into `/usr/local/bin`.

-After installation you will get the same files (listed above) in `./bin` directory (or `/bin` if you specified PREFIX). We also suggest adding `bin` directory to the `PATH` variable.
+After installation, you will get the same files (listed above) in `./bin` directory (or `/bin` if you specified PREFIX). We also suggest adding `bin` directory to the `PATH` variable.

 ## Building additional tools
 SPAdes toolkit includes a number of standalone tools that are built using core
@@ -106,7 +106,7 @@ subset of SPAdes components. The components are:
  - [`spades_tools`](standalone.md)
  - [`binspareader`](binspreader.md)
  - [`pathracer`](pathracer.md)
- - [`spaligner`](standalone.md#spaligner)
+ - [`spaligner`](spaligner.md)

 By default, only SPAdes and SPAdes tools are enabled (so `-DSPADES_ENABLE_PROJECTS="spades;spades_tools"` is the default). Alternatively,
diff --git a/docs/pathracer.md b/docs/pathracer.md
index 6738ce8a62..77330bff4d 100644
--- a/docs/pathracer.md
+++ b/docs/pathracer.md
@@ -24,6 +24,16 @@ Both tool use extended pHMM model allowing frame shifts:
 but for `pathracer-seq-fs` this extension is crucial: for aligning amino-acid pHMMs without allowing indels in the nucleotide space six frame translation + `hmmsearch` from **HMMer** package is more than enough.
+## Compilation
+
+To compile PathRacer, run
+
+```
+./spades_compile -SPADES_ENABLE_PROJECTS=pathracer
+```
+
+After the compilation is complete, `pathracer` executable will be located in the `bin/` folder.
+
 ## Input
 Currently, the tool supports only _de Bruijn_ graphs in GFA format as produced by **SPAdes** or compatible assembler in this matter (e.g., **MEGAHIT**). Contact us if you need some other format support. Input sequences are supposed to be in FASTA/FASTQ format.
@@ -192,12 +202,9 @@ pathracer bac.hmm synth_strain_gbuilder.gfa --queries 16S_rRNA -m 250 --top 1000
 ```

 ## References
-If you are using **PathRacer** in your research, please cite:
-A. Shlemov and A. Korobeynikov. PathRacer: racing profile HMM paths on assembly
-graph. In _Proceedings of International Conference on Algorithms for Computational Biology,
-AlCoB 2019. Berkeley, California, USA, May 28–30, 2019,_ volume 11488 LNCS, pages
-80–94, 2019.
-
+
+If you are using **PathRacer** in your research, please cite:
+
+[Shlemov and Korobeynikov, 2019](https://link.springer.com/chapter/10.1007/978-3-030-18174-1_6)

 In case of any problems running **PathRacer** please contact [SPAdes support](https://github.com/ablab/spades/issues) attaching the log file.
-Your suggestions are also very welcome!
diff --git a/src/projects/spaligner/pipeline.jpg b/docs/spaligner.jpg
similarity index 100%
rename from src/projects/spaligner/pipeline.jpg
rename to docs/spaligner.jpg
diff --git a/src/projects/spaligner/README.md b/docs/spaligner.md
similarity index 82%
rename from src/projects/spaligner/README.md
rename to docs/spaligner.md
index 390b5a7fab..950b44ec81 100644
--- a/src/projects/spaligner/README.md
+++ b/docs/spaligner.md
@@ -1,31 +1,57 @@
-# SPAligner
+# SPAligner: long read to graph aligner
+
+SPAligner is a tool for fast and accurate alignment of nucleotide sequences to assembly graphs.
+It takes a file with sequences (in fasta/fastq format) and an assembly graph in GFA format and outputs long read
+to graph alignment in various formats (such as tsv, fasta and [GPA](https://github.com/ocxtal/gpa "GPA-format spec")).
+
+
+## Compilation
+
+To compile SPAligner, run
+
+```
+./spades_compile -SPADES_ENABLE_PROJECTS=spaligner
+```
+
+After the compilation is complete, `spaligner` executable will be located in the `bin/` folder.

-Tool for fast and accurate alignment of nucleotide sequences (s.a. long reads, coding sequences, etc.) to assembly graphs.

 ## Running SPAligner

-    spaligner spaligner_config.yaml \ # config file
+Synopsis:
+
+    spaligner spaligner_config.yaml \ # config file
     -d pacbio \ # data type: pacbio, nanopore
-    -g assembly_graph.gfa \ # gfa-file with assembly graph
-    -k 77 \ # graph K-mer size
-    -s pacbio_reads.fastq.gz \ # sequences to align in fasta/fastq formats
-    -t 8 # number of threads, 8 by default
+    -g assembly_graph.gfa \ # assembly graph
+    -k 77 \ # graph k-mer size
+    -s pacbio_reads.fastq.gz \ # input sequences / reads
+    -t 8 # number of threads

-By default, spaligner_config.yaml will be installed into /usr/share/spaligner/ or can be found in assembler/projects/spaligner/.
+By default, `spaligner_config.yaml` can be found in `src/projects/spaligner/`.

-Alignments will be saved to spaligner_result/alignment.tsv by default.
+Alignments will be saved to `spaligner_result/alignment.tsv` by default.
-## Compilation
+### Command line options
+
+`-d `
+    long reads type: `nanopore` or `pacbio`
+
+`-s `
+    file with sequences in FASTA or FASTQ formats (can be gzipped)

-    git clone https://github.com/ablab/spades.git
-    cd spades/assembler/
-    mkdir build && cd build && cmake ../src
-    make spaligner
+`-g `
+    file with an assembly graph in GFA format

-Now to run SPAligner move to folder `assembler/` and execute
+`-k `
+    k-mer length that was used for graph construction
+
+`-t `
+    number of threads (default: 8)
+
+`-o, --outdir `
+    output directory to use (default: `spaligner_result/`)

-    build/bin/spaligner

 ## Output

@@ -102,7 +128,7 @@ If a sequence was not fully aligned, SPAligner tries to prolong the longest alig
 Overview of the alignment of the nucleotide query sequence *S* (orange bar) to assembly graph *G*. Assembly graph edges are considered directed left-to-right (explicit edge orientation was omitted to improve the clarity).

-![pipeline](pipeline.jpg)
+![pipeline](spaligner.jpg)

 1. **Anchor search.** Anchors (regions of high similarity) between the query and the edge labels are identified with [BWA-MEM](http://bio-bwa.sourceforge.net/).
 2. **Anchor filtering.** Anchors shorter than *K*, assembly graph *K*-mer size,(anchors 2, 6, 11), anchors “in the middle” of long edge (anchor 7) or ambiguous anchors (anchor 10 mostly covered by anchor 9, both anchors 4 and 5) are discarded.
@@ -146,6 +172,10 @@ Increase of `max_gs_states`, `max_restorable_length`, `queue_limit`, `iteration_
 Turning off restore_ends or run_dijkstra in nucleotide sequence alignment mode leads to shorter alignments, but considerable speed-up.

-## Contacts
+## References
+
+If you are using **SPAligner** in your research, please cite:
+
+[Dvorkina et al., 2020](https://link.springer.com/article/10.1186/s12859-020-03590-7)

 For any questions or suggestions please do not hesitate to contact Tatiana Dvorkina .
diff --git a/docs/standalone.md b/docs/standalone.md
index 55cd3e5495..3e83181503 100644
--- a/docs/standalone.md
+++ b/docs/standalone.md
@@ -169,12 +169,17 @@ Additional options are:
     original graph


-## Long read to graph alignment
+## hybridSPAdes aligner
+
+_Not to be confused with [SPAligner](spaligner.md)._

-### hybridSPAdes aligner
 A tool `spades-gmapper ` gives the opportunity to extract long read alignments generated with hybridSPAdes pipeline options. It has three mandatory options: dataset description file in [YAML format](running.md#specifying-multiple-libraries-with-yaml-data-set-file), graph file in GFA format and an output file name.

+While `spades-gmapper` is a solution for those who work on hybridSPAdes assembly and
+want to get exactly its intermediate results, [SPAligner](spaligner.md) is an end-product application for sequence-to-graph alignment with tunable parameters and output types.
+
+
 Synopsis: `spades-gmapper [-k ] [-t ] [-tmpdir ]`

 Additional options are:
@@ -188,13 +193,11 @@ Additional options are:
 `-tmpdir `
     scratch directory to use

-While `spades-gmapper` is a solution for those who work on hybridSPAdes assembly and want to get exactly its intermediate results, [SPAligner](standalone.md#spaligner) is an end-product application for sequence-to-graph alignment with tunable parameters and output types.

 ### SPAligner
 A tool for fast and accurate alignment of nucleotide sequences to assembly graphs. It takes file with sequences (in fasta/fastq format) and assembly in GFA format and outputs long read to graph alignment in various formats (such as tsv, fasta and [GPA](https://github.com/ocxtal/gpa "GPA-format spec")).
-Synopsis: `spaligner src/projects/spaligner_config.yaml -d -s -g -k [-t ] [-o ]`

 Parameters are:
@@ -216,8 +219,6 @@ Parameters are:
 `-o, --outdir `
     output directory to use (default: spaligner_result/)

-For more information on parameters and options please refer to the main SPAligner manual (assembler/src/projects/spaligner/README.md).
-
 Also if you want to align protein sequences please refer to our [pre-release version](https://github.com/ablab/spades/releases/tag/spaligner-paper).

 Note that in order you use SPAligner one needs either to use pre-built binaries or compile SPAdes from sources using the additional `-DSPADES_ENABLE_PROJECTS=spaligner` option.
diff --git a/mkdocs.yml b/mkdocs.yml
index fd2ef72c32..39c7fef5b0 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -14,6 +14,7 @@ nav:
   - Transcriptome assembly: rna.md
   - Binning refining: binspreader.md
   - HMM mapping on assembly graph: pathracer.md
+  - Sequence to graph alignment: spaligner.md
   - SPAdes tools: standalone.md
   - Citation: citation.md
   - Feedback: feedback.md