From ea720c20f9b7245e390be8bafcdbfb137b19e335 Mon Sep 17 00:00:00 2001
From: Andrey Prjibelski
Date: Wed, 3 Apr 2024 12:06:15 +0300
Subject: [PATCH] typos and grammar in the manual

---
 docs/datatypes.md       | 12 ++++++------
 docs/getting-started.md |  4 ++--
 docs/hmm.md             |  5 +++--
 docs/hybrid.md          |  4 ++--
 docs/input.md           |  7 +++----
 docs/installation.md    |  2 +-
 docs/output.md          | 20 +++++++++++---------
 docs/rna.md             | 13 ++++++-------
 docs/running.md         | 26 +++++++++++---------------
 docs/standalone.md      | 24 ++++++++++--------------
 mkdocs.yml              |  2 +-
 11 files changed, 56 insertions(+), 63 deletions(-)

diff --git a/docs/datatypes.md b/docs/datatypes.md
index c0b997bcc..4d4da86ab 100644
--- a/docs/datatypes.md
+++ b/docs/datatypes.md
@@ -4,11 +4,11 @@ Only FASTQ or BAM files are supported as input. -The selection of k-mer length is non-trivial for IonTorrent. If the dataset is more or less conventional (good coverage, moderate or low GC, etc), then use our [recommendation for long reads](datatypes.md#assembling-long-illumina-paired-reads) (e.g. assemble using k-mer lengths 21,33,55,77,99,127). However, due to increased error rate some changes of k-mer lengths (e.g. selection of shorter ones) may be required. For example, if you ran SPAdes with k-mer lengths 21,33,55,77 and then decided to assemble the same data set using more iterations and larger values of K, you can run SPAdes once again specifying the same output folder and the following options: `--restart-from k77 -k 21,33,55,77,99,127 --mismatch-correction -o `. Do not forget to copy contigs and scaffolds from the previous run. We are planning to tackle issue of selecting k-mer lengths for IonTorrent reads in next versions. +The selection of k-mer length is non-trivial for IonTorrent. If the dataset is more or less conventional (good coverage, moderate or low GC, etc), then use our [recommendation for long reads](datatypes.md#assembling-long-illumina-paired-reads) (e.g. assemble using k-mer lengths 21,33,55,77,99,127). 
However, due to increased error rate some changes of k-mer lengths (e.g. selection of shorter ones) may be required. For example, if you ran SPAdes with k-mer lengths 21,33,55,77 and then decided to assemble the same data set using more iterations and larger values of K, you can run SPAdes once again specifying the same output folder and the following options: `--restart-from k77 -k 21,33,55,77,99,127 --mismatch-correction -o `. Do not forget to copy contigs and scaffolds from the previous run. You may need no error correction for Hi-Q enzyme at all. However, we suggest trying to assemble your data with and without error correction and select the best variant. -For non-trivial datasets (e.g. with high GC, low or uneven coverage) we suggest to enable single-cell mode (setting `--sc` option) and use k-mer lengths of 21,33,55. +For non-trivial datasets (e.g. with high GC, low or uneven coverage) we suggest enabling single-cell mode (setting `--sc` option) and use k-mer lengths of 21,33,55. ## Assembling long Illumina paired reads @@ -24,7 +24,7 @@ Do not turn off SPAdes error correction (BayesHammer module), which is included If you have enough coverage (50x+), then you may want to try to set k-mer lengths of 21, 33, 55, 77 (selected by default for reads with length 150bp). -Make sure you run assembler with the `--careful` option to minimize number of mismatches in the final contigs. +Make sure you run assembler with the `--careful` option to minimize the number of mismatches in the final contigs. We recommend that you check the SPAdes log file at the end of the each iteration to control the average coverage of the contigs. @@ -46,11 +46,11 @@ To correct and assemble the reads: Do not turn off SPAdes error correction (BayesHammer module), which is included in SPAdes default pipeline. -By default we suggest to increase k-mer lengths in increments of 22 until the k-mer length reaches 127. 
The exact length of the k-mer depends on the coverage: k-mer length of 127 corresponds to 50x k-mer coverage and higher. For read length 250bp SPAdes automatically chooses K values equal to 21, 33, 55, 77, 99, 127. +By default we suggest increasing k-mer lengths in increments of 22 until the k-mer length reaches 127. The exact length of the k-mer depends on the coverage: k-mer length of 127 corresponds to 50x k-mer coverage and higher. For read length 250bp SPAdes automatically chooses K values equal to 21, 33, 55, 77, 99, 127. -Make sure you run assembler with `--careful` option to minimize number of mismatches in the final contigs. +Make sure you run assembler with `--careful` option to minimize the number of mismatches in the final contigs. -We recommend you to check the SPAdes log file at the end of the each iteration to control the average coverage of the contigs. +We recommend you to check the SPAdes log file at the end of each iteration to control the average coverage of the contigs. For reads corrected prior to running the assembler: diff --git a/docs/getting-started.md b/docs/getting-started.md index 2d5db2927..12e5a7460 100644 --- a/docs/getting-started.md +++ b/docs/getting-started.md @@ -1,10 +1,10 @@ # Quick start -- SPAdes is an assembler for second-generation sequencing data (Illumina or IonTorrent). PacBio and Nanopore reads are supported *only* as supplementary data. SPAdes can assemble genomes, metagenomes, transcriptomes, viral geonmes etc. +- SPAdes is an assembler for second-generation sequencing data (Illumina or IonTorrent). PacBio and Nanopore reads are supported *only* as supplementary data. SPAdes can assemble genomes, metagenomes, transcriptomes, viral genomes etc. - Download SPAdes binaries for [Linux](https://github.com/ablab/spades/releases/download/v3.15.5/SPAdes-3.15.5-Linux.tar.gz) or [MacOS](https://github.com/ablab/spades/releases/download/v3.15.5/SPAdes-3.15.5-Darwin.tar.gz). 
You can also compile SPAdes from [source](https://github.com/ablab/spades/releases/download/v3.15.5/SPAdes-3.15.5.tar.gz) (requires g++ 9.0+, cmake 3.16+, zlib and libbz2). SPAdes requires only Python 3.8+ to be installed. -- Test your SPAdes intallation by running +- Test your SPAdes installation by running ``` bin/spades.py --test diff --git a/docs/hmm.md b/docs/hmm.md index ef048ad7e..0f68b07ef 100644 --- a/docs/hmm.md +++ b/docs/hmm.md @@ -12,10 +12,11 @@ Given an increased interest in coronavirus research we developed a coronavirus a ## wastewaterSPAdes mode -SARS-CoV-2 wastewater samples are extensively collected and studied because it allows quantitative assessment of viral load in surrounding populations. We developed wastewaterSPAdes that solves SARS-CoV-2 deconvolution problem using assembly graph structure. +SARS-CoV-2 wastewater samples are extensively collected and studied because it allows quantitative assessment of viral load in surrounding populations. We developed wastewaterSPAdes that solves the SARS-CoV-2 deconvolution problem using assembly graph structure. To use wastewaterSPAdes, you'll need to: - Set `--sewage` flag to the `coronaspades.py`. - Provide the SARS-CoV-2 reference genome as trusted contigs. -Results of wastewaterSPAdes are stored in `lineages.csv` file. First column contains strain name, and second column contains estimated abundance of this strain in the sample. +Results of wastewaterSPAdes are stored in `lineages.csv` file. First column contains the strain name, and second column contains estimated abundance of this strain in the sample. + diff --git a/docs/hybrid.md b/docs/hybrid.md index b0b9251eb..87757637c 100644 --- a/docs/hybrid.md +++ b/docs/hybrid.md @@ -6,13 +6,13 @@ SPAdes can take as an input an unlimited number of PacBio and Oxford Nanopore li PacBio CLR and Oxford Nanopore reads are used for hybrid assemblies (e.g. with Illumina or IonTorrent). There is no need to pre-correct this kind of data. 
SPAdes will use PacBio CLR and Oxford Nanopore reads for gap closure and repeat resolution. -For PacBio you just need to have filtered subreads in FASTQ/FASTA format. Provide these filtered subreads using `--pacbio` option. Oxford Nanopore reads are provided with `--nanopore` option. +For PacBio you just need to have filtered subreads in FASTQ/FASTA format. Provide these filtered subreads using the `--pacbio` option. Oxford Nanopore reads are provided with `--nanopore` option. PacBio CCS/Reads of Insert reads or pre-corrected (using third-party software) PacBio CLR / Oxford Nanopore reads can be simply provided as single reads to SPAdes. ## Additional contigs -In case you have contigs of the same genome generated by other assembler(s) and you wish to merge them into SPAdes assembly, you can specify additional contigs using `--trusted-contigs` or `--untrusted-contigs`. First option is used when high quality contigs are available. These contigs will be used for graph construction, gap closure and repeat resolution. Second option is used for less reliable contigs that may have more errors or contigs of unknown quality. These contigs will be used only for gap closure and repeat resolution. The number of additional contigs is unlimited. +In case you have contigs of the same genome generated by another assembler(s) and you wish to merge them into SPAdes assembly, you can specify additional contigs using `--trusted-contigs` or `--untrusted-contigs`. First option is used when high quality contigs are available. These contigs will be used for graph construction, gap closure and repeat resolution. Second option is used for less reliable contigs that may have more errors or contigs of unknown quality. These contigs will be used only for gap closure and repeat resolution. The number of additional contigs is unlimited. Note, that SPAdes does not perform assembly using genomes of closely-related species. Only contigs of the same genome should be specified. 
diff --git a/docs/input.md b/docs/input.md index 98ba4f9a7..f75313fba 100644 --- a/docs/input.md +++ b/docs/input.md @@ -10,7 +10,7 @@ To run SPAdes you need at least one library of the following types: Illumina and IonTorrent libraries should not be assembled together. All other types of input data are compatible. SPAdes should not be used if only PacBio CLR, Oxford Nanopore, Sanger reads or additional contigs are available. -SPAdes supports mate-pair only assembly. However, we recommend to use only high-quality mate-pair libraries in this case (e.g. that do not have a paired-end part). We tested mate-pair only pipeline using Illumina Nextera mate-pairs. See more [here](running.md#specifying-multiple-libraries). +SPAdes supports mate-pair only assembly. However, we recommend to use only high-quality mate-pair libraries in this case (e.g. that do not have a paired-end part). We tested the mate-pair-only pipeline using Illumina Nextera mate-pairs. See more [here](running.md#specifying-multiple-libraries). Notes: @@ -43,10 +43,9 @@ In an unlikely case some of the reads from your mate-pair (or high-quality mate- ## Unpaired (single-read) libraries -By using command line interface, you can specify up to nine different single-read libraries. To input more libraries, you can use [YAML data set file](running.md#specifying-multiple-libraries-with-yaml-data-set-file). +By using the command line interface, you can specify up to nine different single-read libraries. To input more libraries, you can use [YAML data set file](running.md#specifying-multiple-libraries-with-yaml-data-set-file). -Single librairies are assumed to have high quality and a reasonable coverage. For example, you can provide PacBio CCS reads as a single-read library. +Single libraries are assumed to have high quality and reasonable coverage. For example, you can provide PacBio CCS reads as a single-read library. 
Note, that you should not specify PacBio CLR, Sanger reads or additional contigs as single-read libraries, each of them has a separate [option](running.md#input-data). - diff --git a/docs/installation.md b/docs/installation.md index 60a6f66a8..5f628eec5 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -113,7 +113,7 @@ If you added `bin` folder from SPAdes installation directory to the `PATH` varia spades.py --test ``` -For the simplicity we further assume that `bin` folder from SPAdes installation directory is added to the `PATH` variable. +For simplicity we further assume that the `bin` folder from SPAdes installation directory is added to the `PATH` variable. If the installation is successful, you will find the following information at the end of the log: diff --git a/docs/output.md b/docs/output.md index cb6ed8fa7..c60bcc373 100644 --- a/docs/output.md +++ b/docs/output.md @@ -16,7 +16,7 @@ Contigs/scaffolds names in SPAdes output FASTA files have the following format: `>NODE_3_length_237403_cov_243.207` Here `3` is the number of the contig/scaffold, `237403` is the sequence length in nucleotides and `243.207` is the k-mer coverage for the last (largest) k value used. Note that the k-mer coverage is always lower than the read (per-base) coverage. -In general, SPAdes uses two techniques for joining contigs into scaffolds. First one relies on read pairs and tries to estimate the size of the gap separating contigs. The second one relies on the assembly graph: e.g. if two contigs are separated by a complex tandem repeat, that cannot be resolved exactly, contigs are joined into scaffold with a fixed gap size of 100 bp. Contigs produced by SPAdes do not contain N symbols. +In general, SPAdes uses two techniques for joining contigs into scaffolds. First one relies on read pairs and tries to estimate the size of the gap separating contigs. The second one relies on the assembly graph: e.g. 
if two contigs are separated by a complex tandem repeat that cannot be resolved exactly, contigs are joined into a scaffold with a fixed gap size of 100 bp. Contigs produced by SPAdes do not contain N symbols. ## Assembly graph formats @@ -83,10 +83,10 @@ For all plasmidSPAdes' contig names in `contigs.fasta`, `scaffolds.fasta` and `a ## metaplasmidSPAdes and metaviralSPAdes output -The repeat resolution and extrachromosomal element detection in metaplasmidSPAdes/metaviralSPAdes is run independently for different coverage cutoffs values (see [paper](https://genome.cshlp.org/content/29/6/961.short) for details). In order to distinguish contigs with putative plasmids detected at different cutoff levels we extend the contig name in FASTA file with cutoff value used for this particular contig (in format `_cutoff_N`). This is why, in the contrast to regular SPAdes pipeline, there might be a contig with `NODE_1_` prefix for each cutoff with potential plasmids detected. In following example, there were detected two potential viruses using cutoff 0, one virus was detected with cutoff 5 and one with cutoff 10. -Also, we add a suffix that shows the structure of the suspective extrachromosomal element. -For metaplasmid mode we output only circular putative plasmids. -For metaviral mode we also output linear putative viruses and linear viruses with simple repeats ('9'-shaped components in the assembly graph) sequences. +The repeat resolution and extrachromosomal element detection in metaplasmidSPAdes/metaviralSPAdes is run independently for different coverage cutoffs values (see [paper](https://genome.cshlp.org/content/29/6/961.short) for details). In order to distinguish contigs with putative plasmids detected at different cutoff levels we extend the contig name in FASTA file with cutoff value used for this particular contig (in format `_cutoff_N`). 
This is why, in the contrast to regular SPAdes pipeline, there might be a contig with `NODE_1_` prefix for each cutoff with potential plasmids detected. In the following example, there were detected two potential viruses using cutoff 0, one virus was detected with cutoff 5 and one with cutoff 10. We also add a suffix that shows the structure of the suspective extrachromosomal element. + +In the metaplasmid mode SPAdes outputs only circular putative plasmids. +In the metaviral mode SPAdes also outputs linear putative viruses and linear viruses with simple repeats ('9'-shaped components in the assembly graph) sequences. ``` >NODE_1_length_40003_cov_13.48_cutoff_0_type_circular @@ -98,15 +98,17 @@ For metaviral mode we also output linear putative viruses and linear viruses wit ## biosyntheticSPAdes output biosyntheticSPAdes outputs four files of interest: -- scaffolds.fasta – contains DNA sequences from putative biosynthetic gene clusters (BGC). Since each sample may contain multiple BGCs and biosyntheticSPAdes can output several putative DNA sequences for eash cluster, for each contig name we append suffix `_cluster_X_candidate_Y`, where X is the id of the BGC and Y is the id of the candidate from the BGC. -- raw_scaffolds.fasta – SPAdes scaffolds generated without domain-graph related algorithms. Very close to regular scaffolds.fasta file. -- hmm_statistics.txt – contains statistics about BGC composition in the sample. First, it outputs number of domain hits in the sample. Then, for each BGC candidate we output domain order with positions on the corresponding DNA sequence from scaffolds.fasta. -- domain_graph.dot – contains domain graph structure, that can be used to assess complexity of the sample and structure of BGCs. For more information about domain graph construction, please refer to the paper. +- scaffolds.fasta – contains DNA sequences from putative biosynthetic gene clusters (BGC). 
Since each sample may contain multiple BGCs and biosyntheticSPAdes can output several putative DNA sequences for each cluster, for each contig name we append suffix `_cluster_X_candidate_Y`, where X is the id of the BGC and Y is the id of the candidate from the BGC. +- raw_scaffolds.fasta - SPAdes scaffolds generated without domain-graph related algorithms. Very close to the regular scaffolds.fasta file. +- hmm_statistics.txt - contains statistics about BGC composition in the sample. First, it outputs the number of domain hits in the sample. Then, for each BGC candidate we output domain order with positions on the corresponding DNA sequence from scaffolds.fasta. +- domain_graph.dot - contains domain graph structure that can be used to assess complexity of the sample and structure of BGCs. For more information about domain graph construction, please refer to the paper. + ## rnaSPades output See [rnaSPAdes section](rna.md#rnaspades-output). + ## Genome assembly evaluation [QUAST](https://quast.sourceforge.net/) may be used to generate summary statistics (N50, maximum contig length, GC %, \# genes found in a reference list or with built-in gene finding tools, etc.) for a single assembly. It may also be used to compare statistics for multiple assemblies of the same data set (e.g., SPAdes run with different parameters, or several different assemblers). diff --git a/docs/rna.md b/docs/rna.md index a918b21ff..0218f7d71 100644 --- a/docs/rna.md +++ b/docs/rna.md @@ -13,7 +13,7 @@ rnaSPAdes take as an input at least one paired-end or single-end library. For hy ## Assembling multiple RNA-Seq libraries In case you have sequenced several RNA-Seq libraries using the same protocol from different tissues / conditions, and the goal as to assemble a total transcriptome, we suggest to provide all files as a single library (see [SPAdes input options](running.md#input-data)). 
Note, that sequencing using the same protocol implies that the resulting reads have the same length, insert size and strand-specificity. Transcript quantification for each sample can be done afterwards by separately mapping reads from each library to the assembled transcripts. -When assembling multiple strand-specific libraries, only the first one will be used to determine strand of each transcript. Thus, we suggest not to mix data with different strand-specificity. +When assembling multiple strand-specific libraries, only the first one will be used to determine the strand of each transcript. Thus, we suggest not to mix data with different strand-specificity. ## rnaSPAdes-specific options @@ -23,7 +23,7 @@ When assembling multiple strand-specific libraries, only the first one will be u rnaSPAdes supports strand-specific RNA-Seq datasets. You can set strand-specific type using the following option: `--ss ` - Use ` = rf` when first read in pair corresponds to reverse gene strand (antisense data, e.g. obtained via dUTP protocol) and ` = fr` otherwise (forward). + Use ` = rf` when first read in a pair corresponds to reverse gene strand (antisense data, e.g. obtained via dUTP protocol) and ` = fr` otherwise (forward). Note, that strand-specificity is not related and should not be confused with FR and RF orientation of paired reads. RNA-Seq paired-end reads typically have forward-reverse orientation (--> <--), which is assumed by default and no additional options are needed (see [SPAdes input options](running.md#input-data)). @@ -41,17 +41,16 @@ rnaSPAdes outputs one main FASTA file named transcripts.fasta. The corresponding In addition rnaSPAdes outputs transcripts with different level of filtration into /: - `hard_filtered_transcripts.fasta` - includes only long and reliable transcripts with rather high expression. -- `soft_filtered_transcripts.fasta` - includes short and low-expressed transcipts, likely to contain junk sequences. 
+- `soft_filtered_transcripts.fasta` - includes short and low-expressed transcripts, likely to contain junk sequences. -We reccomend to use main `transcripts.fasta` file in case you don't have any specific needs for you projects. +We recommend using the main `transcripts.fasta` file in case you don't have any specific needs for your projects. Contigs/scaffolds names in rnaSPAdes output FASTA files have the following format: `>NODE_97_length_6237_cov_11.9819_g8_i2` Similarly to SPAdes, 97 is the number of the contig, 6237 is its sequence length in nucleotides and 11.9819 is the k-mer coverage. Note that the k-mer coverage is always lower than the read (per-base) coverage. -g8_i2 correspond to the gene number 8 and isoform number 2 within this gene. Transcripts with the same gene number are presumably received from same or somewhat similar (e.g. paralogous) genes. Note, that the prediction is based on the presence of shared sequences in the transcripts and is very approximate. +g8_i2 corresponds to the gene number 8 and isoform number 2 within this gene. Transcripts with the same gene number are presumably received from same or somewhat similar (e.g. paralogous) genes. Note, that the prediction is based on the presence of shared sequences in the transcripts and is very approximate. ## Assembly evaluation -[rnaQUAST](https://github.com/ablab/rnaquast) may be used for transcriptome assembly quality assessment for model organisms when reference genome and gene database are available. rnaQUAST also includes [BUSCO](https://busco.ezlab.org/) and [GeneMarkS-T](http://topaz.gatech.edu/GeneMark/) tools for _de novo_ evaluation. - +[rnaQUAST](https://github.com/ablab/rnaquast) may be used for transcriptome assembly quality assessment for model organisms when a reference genome and a gene annotation are available. rnaQUAST also includes [BUSCO](https://busco.ezlab.org/) and [GeneMarkS-T](http://topaz.gatech.edu/GeneMark/) tools for _de novo_ evaluation. 
diff --git a/docs/running.md b/docs/running.md index 1fcab53d5..889a21a74 100644 --- a/docs/running.md +++ b/docs/running.md @@ -7,24 +7,23 @@ To run SPAdes from the command line, type spades.py [options] -o ``` -Note that we assume that `bin` forder from SPAdes installation directory is added to the `PATH` variable (provide full path to SPAdes executable otherwise: `/bin/spades.py`). +Note that we assume that the `bin` folder from SPAdes installation directory is added to the `PATH` variable (provide full path to SPAdes executable otherwise: `/bin/spades.py`). ## Basic options and modes `-o ` Specify the output directory. Required option. - `--isolate ` This flag is highly recommended for high-coverage isolate and multi-cell Illumina data; improves the assembly quality and running time. - We also recommend to trim your reads prior to the assembly. + We also recommend trimming your reads prior to the assembly. This option is not compatible with `--only-error-correction` or `--careful` options. `--sc ` This flag is required for MDA (single-cell) data. `--meta ` (same as `metaspades.py`) - This flag is recommended when assembling metagenomic data sets (runs metaSPAdes, see [paper](https://genome.cshlp.org/content/27/5/824.short) for more details). Currently metaSPAdes supports only a **_single_** short-read library which has to be **_paired-end_** (we hope to remove this restriction soon). In addition, you can provide long reads (e.g. using `--pacbio` or `--nanopore` options), but hybrid assembly for metagenomes remains an experimental pipeline and optimal performance is not guaranteed. It does not support [careful mode](running.md#pipeline-options) (mismatch correction is not available). In addition, you cannot specify coverage cutoff for metaSPAdes. Note that metaSPAdes might be very sensitive to presence of the technical sequences remaining in the data (most notably adapter readthroughs), please run quality control and pre-process your data accordingly. 
+ This flag is recommended when assembling metagenomic data sets (runs metaSPAdes, see [paper](https://genome.cshlp.org/content/27/5/824.short) for more details). Currently metaSPAdes supports only a **_single_** short-read library which has to be **_paired-end_** (we hope to remove this restriction soon). In addition, you can provide long reads (e.g. using `--pacbio` or `--nanopore` options), but hybrid assembly for metagenomes remains an experimental pipeline and optimal performance is not guaranteed. It does not support [careful mode](running.md#pipeline-options) (mismatch correction is not available). In addition, you cannot specify coverage cutoff for metaSPAdes. Note that metaSPAdes might be very sensitive to the presence of the technical sequences remaining in the data (most notably adapter readthroughs), please run quality control and pre-process your data accordingly. `--plasmid ` (same as `plasmidspades.py`) This flag is required when assembling only plasmids from WGS data sets (runs plasmidSPAdes, see [paper](https://academic.oup.com/bioinformatics/article/32/22/3380/2525610) for the algorithm details). Note, that plasmidSPAdes is not compatible with single-cell mode (`--sc`). Additionally, we do not recommend to run plasmidSPAdes on more than one library. @@ -35,18 +34,15 @@ See [plasmidSPAdes output section](output.md#plasmidspades-output) for details. `--metaviral ` (same as `metaviralspades.py`) -These options works specially for extracting extrachromosomal elements from metagenomic assemblies. They run similar pipelines that slightly differ in the simplification step; another difference is that for metaviral mode we output linear putative extrachromosomal contigs and for metaplasmid mode we do not. +These options work specially for extracting extrachromosomal elements from metagenomic assemblies. 
They run similar pipelines that slightly differ in the simplification step; another difference is that for metaviral mode we output linear putative extrachromosomal contigs and for metaplasmid mode we do not. See [metaplasmid paper](https://genome.cshlp.org/content/29/6/961.short) and [metaviral paper](https://academic.oup.com/bioinformatics/article-abstract/36/14/4126/5837667) for the algorithms details. See [metaplasmidSPAdes/metaviralSPAdes section](output.md#metaplasmidspades-and-metaviralspades-output) for details see. - - -Additionally for plasmidSPAdes, metaplasmidSPAdes and metaviralSPAdes we recommend to additionally verify resulting contigs with [viralVerify tool](https://github.com/ablab/viralVerify). - +Additionally for plasmidSPAdes, metaplasmidSPAdes and metaviralSPAdes we recommend verifying resulting contigs with [viralVerify tool](https://github.com/ablab/viralVerify). `--bio ` - This flag is required when assembling only non-ribosomal and polyketide gene clusters from WGS data sets (runs biosyntheticSPAdes, see [paper](https://genome.cshlp.org/content/early/2019/06/03/gr.243477.118?top=1) for the algorithm details). biosyntheticSPAdes is supposed to work on isolate or metagenomic WGS dataset. Note, that biosyntheticSPAdes is not compatible with any other modes. See [biosyntheticSPAdes output section](output.md#biosyntheticspades-output) for details. + This flag is required when assembling only non-ribosomal and polyketide gene clusters from WGS data sets (runs biosyntheticSPAdes, see [paper](https://genome.cshlp.org/content/early/2019/06/03/gr.243477.118?top=1) for the algorithm details). biosyntheticSPAdes is supposed to work on isolated or metagenomic WGS dataset. Note, that biosyntheticSPAdes is not compatible with any other modes. See [biosyntheticSPAdes output section](output.md#biosyntheticspades-output) for details. `--rna ` (same as `rnaspades.py`) This flag should be used when assembling RNA-Seq data sets (runs rnaSPAdes). 
To learn more, see [rnaSPAdes manual](rna.md). @@ -78,7 +74,7 @@ Additionally for plasmidSPAdes, metaplasmidSPAdes and metaviralSPAdes we recomme Runs assembly module only. `--careful` - Tries to reduce the number of mismatches and short indels. Also runs MismatchCorrector - a post processing tool, which uses [BWA](http://bio-bwa.sourceforge.net) tool (comes with SPAdes). This option is recommended only for assembly of small genomes. We strongly recommend not to use it for large and medium-size eukaryotic genomes. Note, that this options is is not supported by metaSPAdes and rnaSPAdes. + Tries to reduce the number of mismatches and short indels. Also runs MismatchCorrector - a post processing tool, which uses [BWA](http://bio-bwa.sourceforge.net) tool (comes with SPAdes). This option is recommended only for assembly of small genomes. We strongly recommend not to use it for large and medium-size eukaryotic genomes. Note that this option is not supported by metaSPAdes and rnaSPAdes. `--continue` Continues SPAdes run from the specified output folder starting from the last available check-point. Check-points are made after: @@ -107,7 +103,7 @@ Since all files will be overwritten, do not forget to copy your assembly from th Note: - this option is NOT mandatory for using `--restart-from` and `--continue` options, but may speed them up; -- making checkpoints may take more time and significant amount of disc space. +- making checkpoints may take more time and a significant amount of disk space. `--disable-gzip-output` Forces read error correction module not to compress the corrected reads. If this options is not set, corrected reads will be in `*.fastq.gz` format. @@ -221,12 +217,12 @@ High-quality MP data can be used for mate-pair only assembly. #### Other input `--assembly-graph ` - File with assembly graph. Could only be used in plasmid, metaplasmid, metaviral and biosynthetic mode. 
The primary purpose of this option to run these pipelines on already constructed and simplified assembly graph this way skipping a large part of SPAdes pipeline. Original reads the graph was constructed from need to be specified as well. Exact k-mer length (via `-k` option) should be provided. Note that the output would be different as compared to standalone runs of these pipelines as they setup graph simplification options as well. + File with assembly graph. Could only be used in plasmid, metaplasmid, metaviral and biosynthetic mode. The primary purpose of this option is to run these pipelines on already constructed and simplified assembly graphs, thus skipping a large part of SPAdes pipeline. Original reads the graph was constructed from need to be specified as well. Exact k-mer length (via `-k` option) should be provided. Note that the output would be different as compared to standalone runs of these pipelines as they set up graph simplification options as well. ### Specifying multiple libraries with YAML data set file -An alternative way to specify an input data set for SPAdes is to create a [YAML](http://www.yaml.org/) data set file. By using a YAML file you can provide an unlimited number of paired-end, mate-pair and unpaired libraries. Basically, YAML data set file is a text file, in which input libraries are provided as a comma-separated list in square brackets. Each library is provided in braces as a comma-separated list of attributes. The following attributes are available: +An alternative way to specify an input data set for SPAdes is to create a [YAML](http://www.yaml.org/) data set file. By using a YAML file you can provide an unlimited number of paired-end, mate-pair and unpaired libraries. Basically, a YAML data set file is a text file, in which input libraries are provided as a comma-separated list in square brackets. Each library is provided in braces as a comma-separated list of attributes. 
The following attributes are available: - orientation ("fr", "rf", "ff") - type ("paired-end", "mate-pairs", "hq-mate-pairs", "single", "pacbio", "nanopore", "sanger", "trusted-contigs", "untrusted-contigs") @@ -323,7 +319,7 @@ Notes: Number of threads. The default value is 16. `-m ` (or `--memory `) - Set memory limit in Gb. SPAdes terminates if it reaches this limit. The default value is 250 Gb. Actual amount of consumed RAM will be below this limit. Make sure this value is correct for the given machine. SPAdes uses the limit value to automatically determine the sizes of various buffers, etc. + Set memory limit in Gb. SPAdes terminates if it reaches this limit. The default value is 250 Gb. Actual amount of RAM consumed will be below this limit. Make sure this value is correct for the given machine. SPAdes uses the limit value to automatically determine the sizes of various buffers, etc. `--tmp-dir ` Set directory for temporary files from read error correction. The default value is `/corrected/tmp` diff --git a/docs/standalone.md b/docs/standalone.md index 47dd4a0d2..d67321f75 100644 --- a/docs/standalone.md +++ b/docs/standalone.md @@ -5,17 +5,13 @@ To provide input data to SPAdes k-mer counting tool `spades-kmercounter ` you may just specify files in [SPAdes-supported formats](running.md#spades-input) without any flags (after all options) or provide dataset description file in [YAML format](running.md#specifying-multiple-libraries-with-yaml-data-set-file). -Output: `/final_kmers` - unordered set of kmers in binary format. Kmers from both forward a -nd reverse-complementary reads are taken into account. +Output: `/final_kmers` - unordered set of kmers in binary format. Kmers from both forward and reverse-complementary reads are taken into account. -Output format: All kmers are written sequentially without any separators. Each kmer takes the same nu -mber of bits. One kmer of length K takes 2*K bits. Kmers are aligned by 64 bits. 
For example, one kme -r with length=21 takes 8 bytes, with length=33 takes 16 bytes, and with length=55 takes 16 bytes. Eac -h nucleotide is coded with 2 bits: 00 - A, 01 - C, 10 - G, 11 - T. +Output format: All kmers are written sequentially without any separators. Each k-mer takes the same number of bits. One k-mer of length K takes 2*K bits. Kmers are aligned by 64 bits. For example, one kmer with length=21 takes 8 bytes, with length=33 takes 16 bytes, and with length=55 takes 16 bytes. Each nucleotide is coded with 2 bits: 00 - A, 01 - C, 10 - G, 11 - T. Example: - For kmer: AGCTCT + For k-mer: AGCTCT Memory: 6 bits * 2 = 12, 64 bits (8 bytes) Let’s describe bytes: data[0] = AGCT -> 11 01 10 00 -> 0xd8 @@ -81,7 +77,7 @@ The options are: ## k-mer cardinality estimating -`spades-kmer-estimating ` is a tool for estimating approximate number of unique k-mers in the provided reads. Kmers from reverse-complementary reads aren"t taken into account for k-mer cardinality estimating. +`spades-kmer-estimating` is a tool for estimating the approximate number of unique k-mers in the provided reads. Kmers from reverse-complementary reads aren't taken into account for k-mer cardinality estimating. To provide input data to SPAdes k-mer cardinality estimating tool `spades-kmer-estimating ` you should provide dataset description file in [YAML format](running.md#specifying-multiple-libraries-with-yaml-data-set-file). @@ -172,11 +168,12 @@ Additional options are: while processing a subgraph -- file listing edges which are dead-ends in the original graph + ## Long read to graph alignment ### hybridSPAdes aligner -A tool `spades-gmapper ` gives opportunity to extract long read alignments generated with hybridSPAdes pipeline options. It has three mandatory options: dataset description file in [YAML format](running.md#specifying-multiple-libraries-with-yaml-data-set-file), graph file in GFA format and an output file name.
+A tool `spades-gmapper ` gives the opportunity to extract long read alignments generated with hybridSPAdes pipeline options. It has three mandatory options: dataset description file in [YAML format](running.md#specifying-multiple-libraries-with-yaml-data-set-file), graph file in GFA format and an output file name. Synopsis: `spades-gmapper [-k ] [-t ] [-tmpdir ]` @@ -194,7 +191,6 @@ Additional options are: While `spades-mapper` is a solution for those who work on hybridSPAdes assembly and want to get exactly its intermediate results, [SPAligner](standalone.md#spaligner) is an end-product application for sequence-to-graph alignment with tunable parameters and output types. - ### SPAligner A tool for fast and accurate alignment of nucleotide sequences to assembly graphs. It takes file with sequences (in fasta/fastq format) and assembly in GFA format and outputs long read to graph alignment in various formats (such as tsv, fasta and [GPA](https://github.com/ocxtal/gpa "GPA-format spec")). @@ -220,11 +216,11 @@ Parameters are: `-o, --outdir ` output directory to use (default: spaligner_result/) -For more information on parameters and options please refer to main SPAligner manual (assembler/src/projects/spaligner/README.md). +For more information on parameters and options please refer to the main SPAligner manual (assembler/src/projects/spaligner/README.md). Also if you want to align protein sequences please refer to our [pre-release version](https://github.com/ablab/spades/releases/tag/spaligner-paper). -Note that in order you use SPAligner one need either to use pre-built binaries or compiler SPAdes from sources using additional `-DSPADES_ENABLE_PROJECTS=spaligner` option. +Note that in order to use SPAligner one needs either to use pre-built binaries or compile SPAdes from sources using the additional `-DSPADES_ENABLE_PROJECTS=spaligner` option.
## Binning refining using assembly graphs @@ -251,7 +247,7 @@ Main options: Number of threads to use (default: 1/2 of available threads) `-m` - Allow multiple bin assignment (defalut: false) + Allow multiple bin assignment (default: false) `-Smax|-Smle` Simple maximum or maximum likelihood binning assignment strategy (default: max likelihood) @@ -275,7 +271,7 @@ Main options: Labels correction regularization parameter for labeled data (default: 0.6) -BinSPreader stores all output files in output directory ` ` set by the user. +BinSPreader stores all output files in the output directory ` ` set by the user. - `/binning.tsv` contains refined binning in `.tsv` format - `/bin_stats.tsv` contains various per-bin statistics diff --git a/mkdocs.yml b/mkdocs.yml index 20c9cd167..a61129c90 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -10,7 +10,7 @@ nav: - Command line options: running.md - SPAdes output: output.md - HMM-guided mode: hmm.md - - Trasncriptome assembly: rna.md + - Transcriptome assembly: rna.md - SPAdes tools: standalone.md - Citation: citation.md - Feedback: feedback.md
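
Note on the `final_kmers` layout described in the docs/standalone.md hunk: the rules stated there (2 bits per nucleotide, 2*K bits per k-mer, each record padded to a multiple of 64 bits) can be illustrated with a small Python sketch. This is not SPAdes code. The function names are hypothetical, and the placement of nucleotides beyond the first byte is an assumption extrapolated from the documented `data[0] = AGCT -> 0xd8` example, where the first nucleotide occupies the lowest two bits of the byte.

```python
# Illustrative sketch only; names are hypothetical and not part of SPAdes.
# 2-bit codes as given in the manual: 00 - A, 01 - C, 10 - G, 11 - T.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
LETTERS = "ACGT"


def kmer_record_size(k: int) -> int:
    """Bytes per k-mer: 2*k bits rounded up to a multiple of 64 bits."""
    words = (2 * k + 63) // 64
    return words * 8


def encode_kmer(kmer: str) -> bytes:
    """Pack a k-mer, four nucleotides per byte, lowest bits first.

    Assumption: bytes after data[0] follow the same pattern as the
    documented first byte; unused trailing bits are zero.
    """
    data = bytearray(kmer_record_size(len(kmer)))
    for i, base in enumerate(kmer):
        data[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(data)


def decode_kmer(data: bytes, k: int) -> str:
    """Inverse of encode_kmer for a single record of length k."""
    return "".join(
        LETTERS[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(k)
    )


if __name__ == "__main__":
    record = encode_kmer("AGCTCT")
    print(len(record), hex(record[0]))  # 8 0xd8
    print(decode_kmer(record, 6))       # AGCTCT
```

Under these assumptions the record sizes for K = 21, 33 and 55 come out as 8, 16 and 16 bytes, matching the figures quoted in the manual text.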