move spaligner readme, add citations and links

ablab · May 30, 2024 · 4671799
1 parent 67fa317
commit 4671799
Showing 8 changed files with 100 additions and 43 deletions.
diff --git a/docs/binspreader.md b/docs/binspreader.md
@@ -16,6 +16,16 @@ source of information for refining. Optionally, BinSPreader can be provided with
 multiple Hi-C and/or paired-end libraries. The [BinSPreader protocol](https://star-protocols.cell.com/protocols/2802) contains more detailed
 instructions on installing and running BinSPreader.
 
+## Compilation
+
+To compile BinSPreader, run
+
+```
+./spades_compile -SPADES_ENABLE_PROJECTS=binspreader
+```
+
+After the compilation is complete, the `binspreader` executable will be located in the `bin/` folder.
+
 ## Command line options
 
 Required positional arguments: 
@@ -69,7 +79,7 @@ binspreader <graph (in GFA)> <binning (in .tsv)> <output directory> [OPTION...]
  Labels correction regularization parameter for labeled data (default: 0.6)
 
 
-### Output
+## Output
 BinSPreader stores all output files in the output directory `<output_dir> ` set by the user.
 
 - `<output_dir>/binning.tsv` contains refined binning in `.tsv` format
@@ -83,3 +93,11 @@ In addition
 - `<output_dir>/bin_label_1.fastq, <output_dir>/bin_label_2.fastq` read set for bin labeled by `bin_label` (if `--reads` was used)
 - `<output_dir>/pe_links.tsv` list of paired-end links between assembly graph edges with weights (if `--debug` was used)
 - `<output_dir>/graph_links.tsv` list of graph links between assembly graph edges with weights (if `--debug` was used)
+
+
+## References
+
+If you are using **BinSPreader** in your research, please cite:
+
+[Tolstoganov et al., 2022](https://www.cell.com/iscience/pdf/S2589-0042(22)01042-2.pdf) and
+[Ochkalova et al., 2023](https://www.sciencedirect.com/science/article/pii/S2666166723003842). 
diff --git a/docs/getting-started.md b/docs/getting-started.md
@@ -89,20 +89,20 @@ bin/spades.py --rnaviral -1 left.fastq.gz -2 right.fastq.gz -o output_folder
 
 ## Standalone SPAdes tools
 
-- `spades-kmercount` - k-mer counting;
+- [`spades-kmercount`](standalone.md#k-mer-counter) - k-mer counting;
 
-- `spades-read-filter` - read filtering using k-mer coverage;
+- [`spades-read-filter`](standalone.md#k-mer-coverage-read-filter) - read filtering using k-mer coverage;
 
-- `spades-kmer-estimating` - estimating number of unique k-mers;
+- [`spades-kmer-estimating`](standalone.md#k-mer-cardinality-estimating) - estimating number of unique k-mers;
 
-- `spades-gbuilder` - assembly graph construction;
+- [`spades-gbuilder`](standalone.md#graph-construction) - assembly graph construction;
 
-- `spades-gsimplifier` - assembly graph simplification;
+- [`spades-gsimplifier`](standalone.md#graph-simplification) - assembly graph simplification;
 
-- `spalgner` - alignment of long reads to assembly graph;
+- [`spaligner`](spaligner.md) - alignment of long reads to assembly graph;
 
-- `spades-gmapper` - specific alignment of long reads to assembly graph used in hybrid assembly pipeline;
+- [`spades-gmapper`](standalone.md#long-read-to-graph-alignment) - specific alignment of long reads to assembly graph used in hybrid assembly pipeline;
 
-- `binspreader` - refinement of metagenome-assembled genomes;
+- [`binspreader`](binspreader.md) - refinement of metagenome-assembled genomes;
 
-- `pathracer` - alignment of profile HMMs to assembly graph.
+- [`pathracer`](pathracer.md) - alignment of profile HMMs to assembly graph.
diff --git a/docs/installation.md b/docs/installation.md
@@ -88,7 +88,7 @@ for example:
 
 which will install SPAdes into `/usr/local/bin`.
 
-After installation you will get the same files (listed above) in `./bin` directory (or `<destination_dir>/bin` if you specified PREFIX). We also suggest adding `bin` directory to the `PATH` variable.
+After installation, you will get the same files (listed above) in `./bin` directory (or `<destination_dir>/bin` if you specified PREFIX). We also suggest adding `bin` directory to the `PATH` variable.
 
 ## Building additional tools
 SPAdes toolkit includes a number of standalone tools that are built using core
@@ -106,7 +106,7 @@ subset of SPAdes components. The components are:
  - [`spades_tools`](standalone.md)
  - [`binspareader`](binspreader.md)
  - [`pathracer`](pathracer.md)
- - [`spaligner`](standalone.md#spaligner)
+ - [`spaligner`](spaligner.md)
 
 By default, only SPAdes and SPAdes tools are enabled (so
 `-DSPADES_ENABLE_PROJECTS="spades;spades_tools"` is the default). Alternatively,

diff --git a/docs/pathracer.md b/docs/pathracer.md
@@ -24,6 +24,16 @@ Both tool use extended pHMM model allowing frame shifts:
 but for `pathracer-seq-fs` this extension is crucial: for aligning amino-acid pHMMs without allowing indels in the nucleotide space
 six frame translation + `hmmsearch` from **HMMer** package is more than enough.
 
+## Compilation
+
+To compile PathRacer, run
+
+```
+./spades_compile -SPADES_ENABLE_PROJECTS=pathracer
+```
+
+After the compilation is complete, the `pathracer` executable will be located in the `bin/` folder.
+
 ## Input
 Currently, the tool supports only _de Bruijn_ graphs in GFA format as produced by **SPAdes** or compatible assembler in this matter (e.g., **MEGAHIT**).
 Contact us if you need some other format support. Input sequences are supposed to be in FASTA/FASTQ format.
@@ -192,12 +202,9 @@ pathracer bac.hmm synth_strain_gbuilder.gfa --queries 16S_rRNA -m 250 --top 1000
 ```
 
 ## References
-If you are using **PathRacer** in your research, please cite: 
-A. Shlemov and A. Korobeynikov. PathRacer: racing profile HMM paths on assembly
-graph. In _Proceedings of International Conference on Algorithms for Computational Biology,
-AlCoB 2019. Berkeley, California, USA, May 28&ndash;30, 2019,_ volume 11488 LNCS, pages
-80&ndash;94, 2019. 
-<https://link.springer.com/chapter/10.1007/978-3-030-18174-1_6>
+
+If you are using **PathRacer** in your research, please cite:
+
+[Shlemov and Korobeynikov, 2019](https://link.springer.com/chapter/10.1007/978-3-030-18174-1_6)
 
 In case of any problems running **PathRacer** please contact [SPAdes support](https://github.com/ablab/spades/issues) attaching the log file.
-Your suggestions are also very welcome!
diff --git a/src/projects/spaligner/pipeline.jpg b/docs/spaligner.jpg
diff --git a/src/projects/spaligner/README.md b/docs/spaligner.md
@@ -1,31 +1,57 @@
-# SPAligner
+# SPAligner: long read to graph aligner
+
+SPAligner is a tool for fast and accurate alignment of nucleotide sequences to assembly graphs.
+It takes a file with sequences (in FASTA/FASTQ format) and an assembly graph in GFA format and outputs long-read-to-graph
+alignments in various formats (such as TSV, FASTA and [GPA](https://github.com/ocxtal/gpa "GPA-format spec")).
+
+
+## Compilation
+
+To compile SPAligner, run
+
+```
+./spades_compile -SPADES_ENABLE_PROJECTS=spaligner
+```
+
+After the compilation is complete, the `spaligner` executable will be located in the `bin/` folder.
 
-Tool for fast and accurate alignment of nucleotide sequences (s.a. long reads, coding sequences, etc.) to assembly graphs. 
 
 ## Running SPAligner
 
- spaligner spaligner_config.yaml \ # config file 
+Synopsis: 
+
+ spaligner spaligner_config.yaml \ # config file
   -d pacbio \ # data type: pacbio, nanopore
-  -g assembly_graph.gfa \ # gfa-file with assembly graph 
-  -k 77 \ # graph K-mer size
-  -s pacbio_reads.fastq.gz \ # sequences to align in fasta/fastq formats
-  -t 8 # number of threads, 8 by default
+  -g assembly_graph.gfa \ # assembly graph 
+  -k 77 \ # graph k-mer size
+  -s pacbio_reads.fastq.gz \ # input sequences / reads
+  -t 8 # number of threads
 
-By default, spaligner_config.yaml will be installed into /usr/share/spaligner/ or can be found in assembler/projects/spaligner/.
+By default, `spaligner_config.yaml` can be found in `src/projects/spaligner/`.
 
-Alignments will be saved to spaligner_result/alignment.tsv by default.
+Alignments will be saved to `spaligner_result/alignment.tsv` by default.
 
 
-## Compilation
+### Command line options
+
+`-d <type> `
+ long reads type: `nanopore` or `pacbio`
+
+`-s <filename> `
+ file with sequences in FASTA or FASTQ formats (can be gzipped)
 
- git clone https://github.com/ablab/spades.git
- cd spades/assembler/
- mkdir build && cd build && cmake ../src
- make spaligner
+`-g <filename> `
+ file with an assembly graph in GFA format
 
-Now to run SPAligner move to folder `assembler/` and execute
+`-k <int> `
+ k-mer length that was used for graph construction
+
+`-t <int> `
+ number of threads (default: 8)
+
+`-o, --outdir <dir> `
+ output directory to use (default: `spaligner_result/`)
 
- build/bin/spaligner
 
 ## Output
 
@@ -102,7 +128,7 @@ If a sequence was not fully aligned, SPAligner tries to prolong the longest alig
 
 Overview of the alignment of the nucleotide query sequence *S* (orange bar) to assembly graph *G*. Assembly graph edges are considered directed left-to-right (explicit edge orientation was omitted to improve the clarity).
 
-![pipeline](pipeline.jpg)
+![pipeline](spaligner.jpg)
 
 1. **Anchor search.** Anchors (regions of high similarity) between the query and the edge labels are identified with [BWA-MEM](http://bio-bwa.sourceforge.net/). 
 2. **Anchor filtering.** Anchors shorter than *K*, assembly graph *K*-mer size,(anchors 2, 6, 11), anchors “in the middle” of long edge (anchor 7) or ambiguous anchors (anchor 10 mostly covered by anchor 9, both anchors 4 and 5) are discarded.
@@ -146,6 +172,10 @@ Increase of `max_gs_states`, `max_restorable_length`, `queue_limit`, `iteration_
 Turning off restore_ends or run_dijkstra in nucleotide sequence alignment mode leads to shorter alignments, but considerable speed-up.
 
 
-## Contacts
+## References
+
+If you are using **SPAligner** in your research, please cite:
+
+[Dvorkina et al., 2020](https://link.springer.com/article/10.1186/s12859-020-03590-7)
 
 For any questions or suggestions please do not hesitate to contact Tatiana Dvorkina <tedvorkina@gmail.com>.
diff --git a/docs/standalone.md b/docs/standalone.md
@@ -169,12 +169,17 @@ Additional options are:
  original graph
 
 
-## Long read to graph alignment
 
+## hybridSPAdes aligner
+
+_Not to be confused with [SPAligner](spaligner.md)._
 
-### hybridSPAdes aligner
 A tool `spades-gmapper ` gives the opportunity to extract long read alignments generated with hybridSPAdes pipeline options. It has three mandatory options: dataset description file in [YAML format](running.md#specifying-multiple-libraries-with-yaml-data-set-file), graph file in GFA format and an output file name.
 
+While `spades-gmapper` is a solution for those who work on hybridSPAdes assembly and
+want to get exactly its intermediate results, [SPAligner](spaligner.md) is an end-product application for sequence-to-graph alignment with tunable parameters and output types.
+
+
 Synopsis: `spades-gmapper <dataset description (in YAML)> <graph (in GFA)> <output filename> [-k <value>] [-t <value>] [-tmpdir <dir>]`
 
 Additional options are:
@@ -188,13 +193,11 @@ Additional options are:
 `-tmpdir <dir_name> `
  scratch directory to use
 
-While `spades-gmapper` is a solution for those who work on hybridSPAdes assembly and want to get exactly its intermediate results, [SPAligner](standalone.md#spaligner) is an end-product application for sequence-to-graph alignment with tunable parameters and output types.
 
 
 ### SPAligner
 A tool for fast and accurate alignment of nucleotide sequences to assembly graphs. It takes file with sequences (in fasta/fastq format) and assembly in GFA format and outputs long read to graph alignment in various formats (such as tsv, fasta and [GPA](https://github.com/ocxtal/gpa "GPA-format spec")).
 
-Synopsis: `spaligner src/projects/spaligner_config.yaml -d <value> -s <value> -g <value> -k <value> [-t <value>] [-o <value>]`
 
 Parameters are:
 
@@ -216,8 +219,6 @@ Parameters are:
 `-o, --outdir <dir> `
  output directory to use (default: spaligner_result/)
 
-For more information on parameters and options please refer to the main SPAligner manual (assembler/src/projects/spaligner/README.md).
-
 Also if you want to align protein sequences please refer to our [pre-release version](https://github.com/ablab/spades/releases/tag/spaligner-paper).
 
 Note that in order you use SPAligner one needs either to use pre-built binaries or compile SPAdes from sources using the additional `-DSPADES_ENABLE_PROJECTS=spaligner` option.
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -14,6 +14,7 @@ nav:
  - Transcriptome assembly: rna.md
  - Binning refining: binspreader.md
  - HMM mapping on assembly graph: pathracer.md
+ - Sequence to graph alignment: spaligner.md
  - SPAdes tools: standalone.md
  - Citation: citation.md
  - Feedback: feedback.md