diff --git a/README.md b/README.md
index f62e9e1ec..659f9b682 100644
--- a/README.md
+++ b/README.md
@@ -18,7 +18,7 @@ Besides, SPAdes package includes supplementary tools for efficient k-mer countin
 
 - Complete user manual can be found [here](https://ablab.github.io/spades/). Information below is provided merely for your convenience and cannot be considered as the user guide.
 
-- SPAdes assembler supports:
+- SPAdes assembler supports:
     - Assembly of second-generation sequencing data (Illumina or IonTorrent);
     - PacBio and Nanopore reads that are used as supplementary data only.
 
diff --git a/docs/datatypes.md b/docs/datatypes.md
index 4d0163743..7ad6baf5e 100644
--- a/docs/datatypes.md
+++ b/docs/datatypes.md
@@ -4,7 +4,7 @@
 
 ### Isolated and multi-cell datasets
 
-When assembling multi-cell and bacterial isolated datasets with decent coverage (say 50x or higher), we strongly recommend to use `--isolate` option.
+When assembling standard eukaryotic and bacterial isolated datasets with decent coverage (50x or higher), we strongly recommend using the `--isolate` option.
 
 SPAdes is capable of detecting optimal k-mer sizes automatically. Thus, if the assembly went smoothly without any errors or warnings, there is nothing to worry about. For example, for read length 100bp the default k values are 21, 33, 55; for 150bp reads SPAdes uses k-mer sizes 21, 33, 55, 77; and for 250bp reads six iterations are used by default: 21, 33, 55, 77, 99, 127.
 
@@ -14,19 +14,17 @@ We strongly recommend *not* to change `-k` parameter unless you are clearly awar
 
 The default k-mer lengths are recommended. For single-cell data sets SPAdes always selects k-mer sizes 21, 33 and 55.
 
-It might be tricky to fully utilize the advantages of longer Illumina reads (e.g. 250bp). Consider contacting us for more information and to discuss assembly strategy.
+Do not hesitate to contact us for more information if you plan to assemble single-cell data with long Illumina reads (250 bp and longer).
 
 ## Assembling IonTorrent reads
 
 FASTQ or BAM files are supported as input in IonTorrent mode.
 
-The selection of k-mer sizes might be non-trivial for IonTorrent. If the dataset is more or less conventional (good coverage, moderate or low GC, etc), then you can try using larger k-mer lengths, e.g. 21, 33, 55, 77, 99, 127.
+The selection of k-mer sizes might be non-trivial for IonTorrent. If the dataset is more or less conventional (good coverage, moderate or low GC, etc), then you can try using larger k-mer lengths, e.g. 21, 33, 55, 77, 99, 127.
 
 However, due to increased error rate some changes of k-mer lengths (e.g. selection of shorter ones) may be required.
 
 For example, if you ran SPAdes with k-mer lengths 21,33,55,77 and then decided to assemble the same data set using more iterations and larger values of K, you can run SPAdes once again specifying the same output folder and the following options: `--restart-from k77 -k 21,33,55,77,99,127 -o `. Do not forget to copy contigs and scaffolds from the previous run.
 
-You may need no error correction for Hi-Q enzyme at all. However, we suggest trying to assemble your data with and without error correction and select the best variant.
-
-For non-trivial datasets (e.g. with high GC, low or uneven coverage) we suggest enabling single-cell mode (setting `--sc` option) and use k-mer lengths of 21, 33, 55.
-
+You may need no error correction for the Hi-Q sequencing kit at all. However, we suggest trying to assemble your data with and without error correction and selecting the best variant.
+For non-trivial datasets (e.g. with high GC, low or uneven coverage) we recommend enabling single-cell mode (setting the `--sc` option) and using k-mer lengths of 21, 33, 55.
diff --git a/docs/hybrid.md b/docs/hybrid.md
index 87757637c..48eac5628 100644
--- a/docs/hybrid.md
+++ b/docs/hybrid.md
@@ -2,17 +2,26 @@
 
 ## PacBio and Oxford Nanopore reads
 
-SPAdes can take as an input an unlimited number of PacBio and Oxford Nanopore libraries.
+SPAdes can take as input an unlimited number of PacBio and Oxford Nanopore libraries.
 
-PacBio CLR and Oxford Nanopore reads are used for hybrid assemblies (e.g. with Illumina or IonTorrent). There is no need to pre-correct this kind of data. SPAdes will use PacBio CLR and Oxford Nanopore reads for gap closure and repeat resolution.
+PacBio CLR and Oxford Nanopore reads are used for hybrid assemblies (e.g. with Illumina or IonTorrent).
+There is no need to pre-correct this kind of data.
+SPAdes will use PacBio CLR and Oxford Nanopore reads for gap closure and repeat resolution.
 
-For PacBio you just need to have filtered subreads in FASTQ/FASTA format. Provide these filtered subreads using the `--pacbio` option. Oxford Nanopore reads are provided with `--nanopore` option.
+For PacBio you need to provide filtered subreads in FASTQ/FASTA format via
+the `--pacbio` option. Oxford Nanopore reads are provided with the `--nanopore` option.
 
 PacBio CCS/Reads of Insert reads or pre-corrected (using third-party software) PacBio CLR / Oxford Nanopore reads can be simply provided as single reads to SPAdes.
 
 ## Additional contigs
 
-In case you have contigs of the same genome generated by another assembler(s) and you wish to merge them into SPAdes assembly, you can specify additional contigs using `--trusted-contigs` or `--untrusted-contigs`. First option is used when high quality contigs are available. These contigs will be used for graph construction, gap closure and repeat resolution. Second option is used for less reliable contigs that may have more errors or contigs of unknown quality. These contigs will be used only for gap closure and repeat resolution. The number of additional contigs is unlimited.
+In case you have contigs of the same genome generated by different assembler(s), and you wish to add them to the SPAdes assembly,
+you can specify additional contigs using `--trusted-contigs` or `--untrusted-contigs`.
+The first option is used when high-quality contigs are available. These contigs will be used for graph construction, gap closure and repeat resolution.
+The second option is used either for less reliable contigs, which may have more errors, or for contigs of unknown quality.
+Such contigs will be used only for gap closure and repeat resolution.
 
-Note, that SPAdes does not perform assembly using genomes of closely-related species. Only contigs of the same genome should be specified.
+Both options allow providing an unlimited number of contigs.
+
+Note that SPAdes **does not** perform assembly using genomes of closely-related species. Only contigs of the same genome should be provided.
 
diff --git a/docs/index.md b/docs/index.md
index 95436eb87..753caecfc 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -2,13 +2,13 @@
 
 ![SPAdes](spades.png){ align=right }
 
-SPAdes - St. Petersburg genome assembler - a versatile toolkit designed for assembling and analyzing sequencing data from
-Illumina and IonTorrent technologies. In addition, most of SPAdes pipelines support a hybrid mode allowing the use of
+SPAdes - St. Petersburg genome assembler - a versatile toolkit designed for assembling and analyzing sequencing data from
+Illumina and IonTorrent technologies. In addition, most of SPAdes pipelines support a hybrid mode allowing the use of
 long reads (PacBio and Oxford Nanopore) as supplementary data.
 
-SPAdes package provides pipelines for DNA assembly of isolates and single-cell bacteria, as well as of
-metagenomic and transcriptomic data. Additional modes allow to recover bacterial plasmids and RNA viruses,
-perform HMM-guided assembly and more. SPAdes package also includes supplementary tools for efficient
+SPAdes package provides pipelines for DNA assembly of isolates and single-cell bacteria, as well as of
+metagenomic and transcriptomic data. Additional modes allow to recover bacterial plasmids and RNA viruses,
+perform HMM-guided assembly and more. SPAdes package also includes supplementary tools for efficient
 k-mer counting and k-mer-based read filtering, assembly graph construction and simplification,
 sequence-to-graph alignment and metagenomic binning refinement.
 
diff --git a/docs/input.md b/docs/input.md
index 91f904754..bc9692c68 100644
--- a/docs/input.md
+++ b/docs/input.md
@@ -1,6 +1,9 @@
 # SPAdes basic input
 
-SPAdes takes as input paired-end reads, mate-pairs and single (unpaired) reads in FASTA and FASTQ (can be gzipped). For IonTorrent data SPAdes also supports unpaired reads in unmapped BAM format (like the one produced by Torrent Server). However, in order to run read error correction, reads should be in FASTQ or BAM format. Sanger, Oxford Nanopore and PacBio CLR reads can be provided in both formats since SPAdes does not run error correction for these types of data.
+SPAdes takes as input paired-end reads, mate-pairs and single (unpaired) reads in FASTA and FASTQ (can be gzipped).
+For IonTorrent data SPAdes also supports unpaired reads in unmapped BAM format (like the one produced by Torrent Server).
+However, in order to run read error correction, reads should be in FASTQ or BAM format.
+Sanger, Oxford Nanopore and PacBio CLR reads can be provided in both formats since SPAdes does not run error correction for these types of data.
 
 To run SPAdes you need at least one library of the following types:
 
@@ -8,9 +11,11 @@ To run SPAdes you need at least one library of the following types:
 - IonTorrent paired-end/high-quality mate-pairs/unpaired reads
 - PacBio CCS reads
 
-Illumina and IonTorrent libraries should not be assembled together. All other types of input data are compatible. SPAdes should not be used if only PacBio CLR, Oxford Nanopore, Sanger reads or additional contigs are available.
+Illumina and IonTorrent libraries should not be assembled together. All other types of input data are compatible.
+SPAdes should not be used if only PacBio CLR, Oxford Nanopore, Sanger reads or additional contigs are available.
 
-SPAdes supports mate-pair only assembly. However, we recommend to use only high-quality mate-pair libraries in this case (e.g. that do not have a paired-end part). We tested the mate-pair-only pipeline using Illumina Nextera mate-pairs. See more [here](running.md#specifying-multiple-libraries).
+SPAdes supports mate-pair only assembly. However, we recommend using only high-quality mate-pair libraries in this case
+(e.g. that do not have a paired-end part). We tested the mate-pair-only pipeline using Illumina Nextera mate-pairs. See more [here](running.md#specifying-multiple-libraries).
 
 Notes:
 
@@ -20,15 +25,19 @@ Notes:
 
 ## Paired read libraries
 
-By using command line interface, you can specify up to nine different paired-end libraries, up to nine mate-pair libraries and also up to nine high-quality mate-pair ones. If you wish to use more, you can use [YAML data set file](running.md#specifying-multiple-libraries-with-yaml-data-set-file). We further refer to paired-end and mate-pair libraries simply as to read-pair libraries.
+By using the command line interface, you can specify (1) paired-end, (2) standard mate-pair and (3) high-quality mate-pair libraries.
+You can provide up to nine libraries of each type. Libraries can be used in any combination, but we recommend not assembling low-quality mate-pairs alone.
+If you wish to use more libraries, you can use a [YAML data set file](running.md#specifying-multiple-libraries-with-yaml-data-set-file).
 
-By default, SPAdes assumes that paired-end and high-quality mate-pair reads have forward-reverse (fr) orientation and usual mate-pairs have reverse-forward (rf) orientation. However, different orientations can be set for any library by using SPAdes options.
+By default, SPAdes assumes that paired-end and high-quality mate-pair reads have forward-reverse (fr) orientation and usual mate-pairs have reverse-forward
+(rf) orientation. However, different orientations can be indicated for any library by using SPAdes options.
 
 To distinguish reads in pairs we refer to them as left and right reads. For forward-reverse orientation, the forward reads correspond to the left reads and the reverse reads, to the right. Similarly, in reverse-forward orientation left and right reads correspond to reverse and forward reads, respectively, etc.
 
-Each read-pair library can be stored in several files or several pairs of files. Paired reads can be organized in two different ways:
+Each paired-end or mate-pair library can be stored in several files or several pairs of files. Paired reads can be organized in two different ways:
 
-- In file pairs. In this case left and right reads are placed in different files and go in the same order in respective files.
+- In file pairs. In this case left and right reads are placed in different files and must go in the same order.
+I.e. for every left read at line X in the first file the corresponding right read from the pair must be at line X in the second file.
 - In interleaved files. In this case, the reads are interlaced, so that each right read goes after the corresponding paired left read.
 
 For example, Illumina produces paired-end reads in two files: `R1.fastq` and `R2.fastq`. If you choose to store reads in file pairs make sure that for every read from `R1.fastq` the corresponding paired read from `R2.fastq` is placed in the respective paired file on the same line number. If you choose to use interleaved files, every read from `R1.fastq` should be followed by the corresponding paired read from `R2.fastq`.
@@ -40,6 +49,8 @@ Note that non-empty files with the remaining unmerged left/right reads (separate
 
 In an unlikely case some of the reads from your mate-pair (or high-quality mate-pair) library are "merged", you should provide the resulting reads as a SEPARATE single-read library.
 
+See [examples](running.md#examples).
+
 ## Unpaired (single-read) libraries
 
 By using the command line interface, you can specify up to nine different single-read libraries. To input more libraries, you can use [YAML data set file](running.md#specifying-multiple-libraries-with-yaml-data-set-file).
@@ -48,3 +59,4 @@ Single libraries are assumed to have high quality and reasonable coverage. For e
 
 Note, that you should not specify PacBio CLR, Sanger reads or additional contigs as single-read libraries, each of them has a separate [option](running.md#input-data).
 
+See [examples](running.md#examples).
\ No newline at end of file
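
Illustration of the file-pair rule stated in `docs/input.md` (the right read must sit at the same position in the second file as its left mate in the first): the check below is a minimal Python sketch, not part of SPAdes. The helpers `read_names` and `first_mismatch` are hypothetical names chosen for this example. The sketch assumes strict 4-line FASTQ records and that mates share a read name once an optional `/1` or `/2` suffix and any text after the first whitespace are dropped; real data may follow other naming conventions.

```python
from itertools import zip_longest


def read_names(fastq_lines):
    """Yield the read name of every 4-line FASTQ record.

    Assumptions (not taken from SPAdes): records are exactly 4 lines long,
    and the name is the first whitespace-delimited token of the '@' header
    with an optional trailing /1 or /2 removed.
    """
    non_blank = (line for line in fastq_lines if line.strip())
    for index, line in enumerate(non_blank):
        if index % 4 != 0:
            continue  # sequence, '+' separator and quality lines
        header = line.strip()
        if not header.startswith("@"):
            raise ValueError(f"record line {index + 1}: expected '@' header")
        tokens = header[1:].split()
        name = tokens[0] if tokens else ""
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name


def first_mismatch(left_lines, right_lines):
    """Return the 1-based number of the first out-of-sync pair, or None."""
    pairs = zip_longest(read_names(left_lines), read_names(right_lines))
    for number, (left, right) in enumerate(pairs, start=1):
        # None means one file ran out of records before the other.
        if left is None or right is None or left != right:
            return number
    return None


# Tiny in-memory example; with real files, pass open file handles instead.
left = ["@read1/1", "ACGT", "+", "IIII", "@read2/1", "TTGA", "+", "IIII"]
right = ["@read1/2", "TGCA", "+", "IIII", "@read2/2", "CAAT", "+", "IIII"]
print(first_mismatch(left, right))  # None: the two files are in sync
```

Working on iterables of lines keeps the sketch independent of how the files are opened, so plain and gzipped input can both be fed to it by the caller.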