ablab · asl · Apr 2, 2024 · Feb 19, 2024 · Feb 29, 2024 · Feb 29, 2024
diff --git a/NEWREADME.md b/NEWREADME.md
@@ -0,0 +1,79 @@
+# About SPAdes
+
+SPAdes is an assembly toolkit containing various assembly pipelines.
+
+- [Complete SPAdes user manual]()
+
+- [SPAdes download page](https://github.com/ablab/spades/releases/)
+
+- [Latest SPAdes publication](https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/cpbi.102)
+
+
+# Quick start
+
+- The complete user manual can be found [here](). The information below is provided merely for your convenience and should not be considered a substitute for the user guide.
+
+- SPAdes is an assembler for second-generation sequencing data (Illumina or IonTorrent). PacBio and Nanopore reads are supported *only* as supplementary data. SPAdes can assemble genomes, metagenomes, transcriptomes, viral genomes, etc.
+
+- Download SPAdes binaries for [Linux](https://github.com/ablab/spades/releases/download/v3.15.5/SPAdes-3.15.5-Linux.tar.gz) or [MacOS](https://github.com/ablab/spades/releases/download/v3.15.5/SPAdes-3.15.5-Darwin.tar.gz). You can also compile SPAdes from [source](https://github.com/ablab/spades/releases/download/v3.15.5/SPAdes-3.15.5.tar.gz) (requires g++ 9.0+, cmake 3.16+, zlib and libbz2). SPAdes requires only Python 3.8+ to be installed.
+
+- Test your SPAdes installation by running:
+
+```
+    bin/spades.py --test
+```
+
+- A single paired-end library (separate files, gzipped):
+
+```
+    bin/spades.py -1 left.fastq.gz -2 right.fastq.gz -o output_folder
+```
+
+- A single paired-end library (interlaced reads):
+
+```
+    bin/spades.py --12 interlaced.fastq -o output_folder
+```
+
+- Two paired-end libraries (separate files):
+
+```
+    bin/spades.py --pe1-1 1_left.fastq --pe1-2 1_right.fastq --pe2-1 2_left.fastq --pe2-2 2_right.fastq -o output_folder
+```
+
+- IonTorrent data:
+```
+    bin/spades.py --iontorrent -s it_reads.fastq -o output_folder
+```
+
+- A paired-end library coupled with long PacBio reads:
+
+```
+    bin/spades.py -1 left.fastq.gz -2 right.fastq.gz --pacbio pb.fastq -o output_folder
+```
+
+- Available assembly modes: `--isolate`, `--sc`, `--plasmid`, `--meta`, `--metaplasmid`, `--metaviral`, `--rna`, `--rnaviral`, `--corona`, `--bio`.
+
+
+# Citation
+If you use SPAdes in your research, please cite [our latest paper](https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/cpbi.102).
+
+If you perform hybrid assembly using PacBio or Nanopore reads, you may also cite [Antipov et al., 2015](http://bioinformatics.oxfordjournals.org/content/early/2015/11/20/bioinformatics.btv688.short). If you use multiple paired-end and/or mate-pair libraries, you may additionally cite the papers describing the SPAdes repeat resolution algorithms: [Prjibelski et al., 2014](http://bioinformatics.oxfordjournals.org/content/30/12/i293.short) and [Vasilinetc et al., 2015](http://bioinformatics.oxfordjournals.org/content/31/20/3262.abstract).
+
+If you use other pipelines, please cite the following papers:
+
+-   metaSPAdes: [Nurk et al., 2017](https://genome.cshlp.org/content/27/5/824.short).
+-   plasmidSPAdes: [Antipov et al., 2016](https://academic.oup.com/bioinformatics/article/32/22/3380/2525610).
+-   metaplasmidSPAdes / plasmidVerify: [Antipov et al., 2019](https://genome.cshlp.org/content/29/6/961.short)
+-   metaviralSPAdes / viralVerify: [Antipov et al., 2020](https://academic.oup.com/bioinformatics/article-abstract/36/14/4126/5837667)
+-   rnaSPAdes: [Bushmanova et al., 2019](https://academic.oup.com/gigascience/article/8/9/giz100/5559527).
+-   biosyntheticSPAdes: [Meleshko et al., 2019](https://genome.cshlp.org/content/early/2019/06/03/gr.243477.118?top=1).
+-   coronaSPAdes paper is currently available at [bioRxiv](https://www.biorxiv.org/content/10.1101/2020.07.28.224584v1.abstract).
+
+You may also include older papers [Nurk, Bankevich et al., 2013](http://link.springer.com/chapter/10.1007%2F978-3-642-37195-0_13) or [Bankevich, Nurk et al., 2012](http://online.liebertpub.com/doi/abs/10.1089/cmb.2012.0021), especially if you assemble single-cell data.
+
+
+# Feedback and bug reports
+
+Please leave your comments and bug reports at [our GitHub repository tracker](https://github.com/ablab/spades/issues). If you have any trouble running SPAdes, please attach `params.txt` and `spades.log` from the output folder.
+
diff --git a/docs/citation.md b/docs/citation.md
@@ -0,0 +1,17 @@
+# Citation
+If you use SPAdes in your research, please cite [our latest paper](https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/cpbi.102).
+
+If you perform hybrid assembly using PacBio or Nanopore reads, you may also cite [Antipov et al., 2015](http://bioinformatics.oxfordjournals.org/content/early/2015/11/20/bioinformatics.btv688.short). If you use multiple paired-end and/or mate-pair libraries, you may additionally cite the papers describing the SPAdes repeat resolution algorithms: [Prjibelski et al., 2014](http://bioinformatics.oxfordjournals.org/content/30/12/i293.short) and [Vasilinetc et al., 2015](http://bioinformatics.oxfordjournals.org/content/31/20/3262.abstract).
+
+If you use other pipelines, please cite the following papers:
+
+-   metaSPAdes: [Nurk et al., 2017](https://genome.cshlp.org/content/27/5/824.short).
+-   plasmidSPAdes: [Antipov et al., 2016](https://academic.oup.com/bioinformatics/article/32/22/3380/2525610).
+-   metaplasmidSPAdes / plasmidVerify: [Antipov et al., 2019](https://genome.cshlp.org/content/29/6/961.short)
+-   metaviralSPAdes / viralVerify: [Antipov et al., 2020](https://academic.oup.com/bioinformatics/article-abstract/36/14/4126/5837667)
+-   rnaSPAdes: [Bushmanova et al., 2019](https://academic.oup.com/gigascience/article/8/9/giz100/5559527).
+-   biosyntheticSPAdes: [Meleshko et al., 2019](https://genome.cshlp.org/content/early/2019/06/03/gr.243477.118?top=1).
+-   coronaSPAdes paper is currently available at [bioRxiv](https://www.biorxiv.org/content/10.1101/2020.07.28.224584v1.abstract).
+
+You may also include older papers [Nurk, Bankevich et al., 2013](http://link.springer.com/chapter/10.1007%2F978-3-642-37195-0_13) or [Bankevich, Nurk et al., 2012](http://online.liebertpub.com/doi/abs/10.1089/cmb.2012.0021), especially if you assemble single-cell data.
+
diff --git a/docs/datatypes.md b/docs/datatypes.md
@@ -0,0 +1,76 @@
+# Tips on SPAdes parameters
+
+## Assembling IonTorrent reads
+
+Only FASTQ or BAM files are supported as input.
+
+Selecting k-mer lengths is non-trivial for IonTorrent. If the dataset is more or less conventional (good coverage, moderate or low GC, etc.), then use our [recommendation for long reads](datatypes.md#assembling-long-illumina-paired-reads) (e.g. assemble using k-mer lengths 21,33,55,77,99,127). However, due to the increased error rate, some changes to the k-mer lengths (e.g. selecting shorter ones) may be required. For example, if you ran SPAdes with k-mer lengths 21,33,55,77 and then decided to assemble the same data set using more iterations and larger values of K, you can run SPAdes once again, specifying the same output folder and the following options: `--restart-from k77 -k 21,33,55,77,99,127 --mismatch-correction -o <previous_output_dir>`. Do not forget to copy the contigs and scaffolds from the previous run. We are planning to tackle the issue of selecting k-mer lengths for IonTorrent reads in upcoming versions.
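+
+As a sketch of the restart described above, assuming the previous run wrote to a folder named `spades_output` (the folder name and the `*_k77.fasta` copy names are placeholders chosen for illustration, not SPAdes conventions):
+
+``` bash
+
+    # keep the results of the first run (k = 21,33,55,77); the restarted run may overwrite them
+    cp spades_output/contigs.fasta contigs_k77.fasta
+    cp spades_output/scaffolds.fasta scaffolds_k77.fasta
+
+    # continue from the k77 iteration with the extended list of k-mer lengths
+    spades.py --restart-from k77 -k 21,33,55,77,99,127 --mismatch-correction -o spades_output
+```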
+
+With the Hi-Q enzyme you may not need error correction at all. However, we suggest assembling your data both with and without error correction and selecting the better result.
+
+For non-trivial datasets (e.g. with high GC, low or uneven coverage) we suggest enabling single-cell mode (the `--sc` option) and using k-mer lengths of 21,33,55.
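+
+A possible command line for such a dataset, combining only options that appear elsewhere in this documentation (`it_reads.fastq` and `spades_output` are placeholder names):
+
+``` bash
+
+    spades.py --iontorrent --sc -k 21,33,55 -s it_reads.fastq -o spades_output
+```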
+
+## Assembling long Illumina paired reads
+
+Recent advances in DNA sequencing technology have led to a rapid increase in read length. Nowadays, it is a common situation to have a data set consisting of 2x150 or 2x250 paired-end reads produced by Illumina MiSeq or HiSeq2500. However, the use of longer reads alone will not automatically improve assembly quality. An assembler that can properly take advantage of them is needed.
+
+SPAdes' use of iterative k-mer lengths makes it possible to benefit from the full potential of long paired-end reads. Currently the assembler options have to be set manually, but we plan to incorporate automatic calculation of the necessary options soon.
+
+Please note that in addition to the read length, the insert length also matters a lot. It is not recommended to sequence a 300bp fragment with a pair of 250bp reads. We suggest using 350-500 bp fragments with 2x150 reads and 550-700 bp fragments with 2x250 reads.
+
+### Multi-cell data set with read length 2x150 bp
+
+Do not turn off SPAdes error correction (BayesHammer module), which is included in SPAdes default pipeline.
+
+If you have enough coverage (50x+), then you may want to try to set k-mer lengths of 21, 33, 55, 77 (selected by default for reads with length 150bp).
+
+Make sure you run the assembler with the `--careful` option to minimize the number of mismatches in the final contigs.
+
+We recommend checking the SPAdes log file at the end of each iteration to monitor the average coverage of the contigs.
+
+For reads corrected prior to running the assembler:
+
+``` bash
+
+    spades.py -k 21,33,55,77 --careful --only-assembler <your reads> -o spades_output
+```
+
+To correct and assemble the reads:
+
+``` bash
+
+    spades.py -k 21,33,55,77 --careful <your reads> -o spades_output
+```
+
+### Multi-cell data set with read lengths 2x250 bp
+
+Do not turn off SPAdes error correction (BayesHammer module), which is included in SPAdes default pipeline.
+
+By default we suggest increasing k-mer lengths in increments of 22 until the k-mer length reaches 127. The exact length of the k-mer depends on the coverage: a k-mer length of 127 corresponds to 50x k-mer coverage and higher. For read length 250bp SPAdes automatically chooses K values equal to 21, 33, 55, 77, 99, 127.
+
+Make sure you run the assembler with the `--careful` option to minimize the number of mismatches in the final contigs.
+
+We recommend checking the SPAdes log file at the end of each iteration to monitor the average coverage of the contigs.
+
+For reads corrected prior to running the assembler:
+
+``` bash
+
+    spades.py -k 21,33,55,77,99,127 --careful --only-assembler <your reads> -o spades_output
+```
+
+To correct and assemble the reads:
+
+``` bash
+
+    spades.py -k 21,33,55,77,99,127 --careful <your reads> -o spades_output
+```
+
+### Single-cell data set with read lengths 2 x 150 or 2 x 250
+
+The default k-mer lengths are recommended. For single-cell data sets SPAdes selects k-mer sizes 21, 33 and 55.
+
+However, it might be tricky to fully utilize the advantages of long reads you have. Consider contacting us for more information and to discuss assembly strategy.
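+
+For a first attempt with the default k-mer lengths, a single-cell run can be sketched as follows (`<your reads>` stands for your input options, as in the examples above; `spades_output` is a placeholder):
+
+``` bash
+
+    spades.py --sc <your reads> -o spades_output
+```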
+
+
+
diff --git a/docs/feedback.md b/docs/feedback.md
@@ -0,0 +1,5 @@
+# Feedback and bug reports
+
+Your comments, bug reports, and suggestions are very welcome. They will help us to further improve SPAdes. If you have any trouble running SPAdes, please send us `params.txt` and `spades.log` from the output folder.
+
+You can leave your comments and bug reports at [our GitHub repository tracker](https://github.com/ablab/spades/issues).
diff --git a/docs/getting-started.md b/docs/getting-started.md
@@ -0,0 +1,46 @@
+# Quick start
+
+- SPAdes is an assembler for second-generation sequencing data (Illumina or IonTorrent). PacBio and Nanopore reads are supported *only* as supplementary data. SPAdes can assemble genomes, metagenomes, transcriptomes, viral genomes, etc.
+
+- Download SPAdes binaries for [Linux](https://github.com/ablab/spades/releases/download/v3.15.5/SPAdes-3.15.5-Linux.tar.gz) or [MacOS](https://github.com/ablab/spades/releases/download/v3.15.5/SPAdes-3.15.5-Darwin.tar.gz). You can also compile SPAdes from [source](https://github.com/ablab/spades/releases/download/v3.15.5/SPAdes-3.15.5.tar.gz) (requires g++ 9.0+, cmake 3.16+, zlib and libbz2). SPAdes requires only Python 3.8+ to be installed.
+
+- Test your SPAdes installation by running:
+
+```
+    bin/spades.py --test
+```
+
+- A single paired-end library (separate files, gzipped):
+
+```
+    bin/spades.py -1 left.fastq.gz -2 right.fastq.gz -o output_folder
+```
+
+- A single paired-end library (interlaced reads):
+
+```
+    bin/spades.py --12 interlaced.fastq -o output_folder
+```
+
+- Two paired-end libraries (separate files):
+
+```
+    bin/spades.py --pe1-1 1_left.fastq --pe1-2 1_right.fastq --pe2-1 2_left.fastq --pe2-2 2_right.fastq -o output_folder
+```
+
+- IonTorrent data:
+```
+    bin/spades.py --iontorrent -s it_reads.fastq -o output_folder
+```
+
+- A paired-end library coupled with long PacBio reads:
+
+```
+    bin/spades.py -1 left.fastq.gz -2 right.fastq.gz --pacbio pb.fastq -o output_folder
+```
+
+- Available assembly modes: `--isolate`, `--sc`, `--plasmid`, `--meta`, `--metaplasmid`, `--metaviral`, `--rna`, `--rnaviral`, `--corona`, `--bio`.
+
+
+
+
diff --git a/docs/hmm.md b/docs/hmm.md
@@ -0,0 +1,21 @@
+# HMM-guided mode
+The majority of SPAdes assembly modes (normal multicell, single-cell, rnaviral, meta and of course biosynthetic) also support HMM-guided mode as implemented in biosyntheticSPAdes. A detailed description can be found in the [biosyntheticSPAdes paper](https://genome.cshlp.org/content/early/2019/06/03/gr.243477.118), but in short: amino acid profile HMMs are aligned to the edges of the assembly graph. The subgraphs containing the set of matches ("domains") are then extracted, and all possible paths through the domains that are supported both by paired-end data (via scaffolds) and by graph topology are obtained (putative biosynthetic gene clusters).
+
+HMM-guided mode can be enabled by providing a set of HMMs via the `--custom-hmms` option. In HMM-guided mode the set of contigs and scaffolds (see the [SPAdes output](output.md#spades-output) section for more information) is kept intact; the additional [biosyntheticSPAdes output](output.md#biosyntheticspades-output) represents the output of HMM-guided assembly.
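+
+A minimal sketch of an HMM-guided run. It assumes that `--custom-hmms` accepts a path to a directory of profile HMM files; the exact argument form is an assumption here, so check `spades.py --help`. `my_hmms/` and the read file names are placeholders.
+
+``` bash
+
+    spades.py --custom-hmms my_hmms/ -1 left.fastq.gz -2 right.fastq.gz -o output_folder
+```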
+
+Note that normal biosyntheticSPAdes mode (via the `--bio` option) is a bit different from HMM-guided mode: besides using a special set of profile HMMs representing a family of NRPS/PKS domains, it also includes a set of assembly graph simplification and processing settings aimed at fuller recovery of biosynthetic gene clusters.
+
+## coronaSPAdes mode
+
+Given the increased interest in coronavirus research, we developed a coronavirus assembly mode for the SPAdes assembler (also known as coronaSPAdes). It allows assembling full-length Coronaviridae genomes from transcriptomic and metatranscriptomic data. Algorithmically, coronaSPAdes is an rnaviralSPAdes that uses the set of HMMs from the [Pfam SARS-CoV-2 2.0](ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam_SARS-CoV-2_2.0/) set as well as additional HMMs as outlined by [(Phan et al, 2019)](https://doi.org/10.1093/ve/vey035). coronaSPAdes can be run via a dedicated `coronaspades.py` script. See the [coronaSPAdes preprint](https://www.biorxiv.org/content/10.1101/2020.07.28.224584v1) for more information about rnaviralSPAdes, coronaSPAdes and HMM-guided mode. Output for any HMM-related mode (`--bio`, `--corona`, or `--custom-hmms` flags) is the same as biosyntheticSPAdes' output.
+
+
+## wastewaterSPAdes mode
+
+SARS-CoV-2 wastewater samples are extensively collected and studied because they allow quantitative assessment of viral load in surrounding populations. We developed wastewaterSPAdes, which solves the SARS-CoV-2 deconvolution problem using the assembly graph structure.
+To use wastewaterSPAdes, you'll need to:
+
+- Pass the `--sewage` flag to `coronaspades.py`.
+- Provide the SARS-CoV-2 reference genome as trusted contigs.
+
+Results of wastewaterSPAdes are stored in the `lineages.csv` file. The first column contains the strain name, and the second column contains the estimated abundance of this strain in the sample.
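+
+Putting the two requirements together, a run might look like the sketch below. The read files, the reference file name and the output folder are placeholders; `--trusted-contigs` is the option described in the hybrid assembly section.
+
+``` bash
+
+    coronaspades.py --sewage -1 left.fastq.gz -2 right.fastq.gz --trusted-contigs sars_cov_2_reference.fasta -o output_folder
+```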
diff --git a/docs/hybrid.md b/docs/hybrid.md
@@ -0,0 +1,18 @@
+# Hybrid assembly
+
+## PacBio and Oxford Nanopore reads
+
+SPAdes can take as an input an unlimited number of PacBio and Oxford Nanopore libraries.
+
+PacBio CLR and Oxford Nanopore reads are used for hybrid assemblies (e.g. with Illumina or IonTorrent). There is no need to pre-correct this kind of data. SPAdes will use PacBio CLR and Oxford Nanopore reads for gap closure and repeat resolution.
+
+For PacBio you just need to have filtered subreads in FASTQ/FASTA format. Provide these filtered subreads using the `--pacbio` option. Oxford Nanopore reads are provided with the `--nanopore` option.
+
+PacBio CCS/Reads of Insert reads or pre-corrected (using third-party software) PacBio CLR / Oxford Nanopore reads can be simply provided as single reads to SPAdes.
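+
+For illustration, the two cases above could be run as follows (all file names and the output folder are placeholders):
+
+``` bash
+
+    # uncorrected Oxford Nanopore reads as supplementary data for an Illumina paired-end library
+    spades.py -1 left.fastq.gz -2 right.fastq.gz --nanopore ont.fastq -o output_folder
+
+    # PacBio CCS (or pre-corrected long) reads passed as ordinary single reads
+    spades.py -1 left.fastq.gz -2 right.fastq.gz -s ccs.fastq -o output_folder
+```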
+
+## Additional contigs
+
+If you have contigs of the same genome generated by other assembler(s) and wish to merge them into the SPAdes assembly, you can specify additional contigs using `--trusted-contigs` or `--untrusted-contigs`. The first option is used when high-quality contigs are available. These contigs will be used for graph construction, gap closure and repeat resolution. The second option is used for less reliable contigs that may have more errors, or for contigs of unknown quality. These contigs will be used only for gap closure and repeat resolution. The number of additional contigs is unlimited.
+
+Note that SPAdes does not perform assembly using genomes of closely related species. Only contigs of the same genome should be specified.
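+
+A sketch of both variants, assuming the additional contigs are stored in FASTA files (all file names and the output folder are placeholders):
+
+``` bash
+
+    # high-quality contigs: used for graph construction, gap closure and repeat resolution
+    spades.py -1 left.fastq.gz -2 right.fastq.gz --trusted-contigs other_assembler_contigs.fasta -o output_folder
+
+    # contigs of unknown quality: used only for gap closure and repeat resolution
+    spades.py -1 left.fastq.gz -2 right.fastq.gz --untrusted-contigs draft_contigs.fasta -o output_folder
+```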
+
diff --git a/docs/index.md b/docs/index.md
@@ -0,0 +1,5 @@
+# About SPAdes
+
+SPAdes - St. Petersburg genome assembler - is an assembly toolkit containing various assembly pipelines. This manual will help you to install and run SPAdes. SPAdes version 3.15.5 was released under GPLv2 on July 14th, 2022 and can be downloaded from <https://github.com/ablab/spades>. 
+
+The latest SPAdes paper describing various pipelines in a protocol format is available [here](https://currentprotocols.onlinelibrary.wiley.com/doi/abs/10.1002/cpbi.102).