Refine docs, separate out binspreader
asl committed Apr 5, 2024
1 parent 5efe28a commit ee3a7bd
Showing 4 changed files with 195 additions and 130 deletions.
79 changes: 79 additions & 0 deletions docs/binspreader.md
@@ -0,0 +1,79 @@
# Binning refinement using assembly graphs

BinSPreader is a tool that attempts to refine metagenome-assembled genomes
(MAGs) obtained from existing tools. BinSPreader exploits the assembly graph
topology and other connectivity information, such as paired-end and Hi-C reads,
to refine the existing binning, correct binning errors, propagate binning from
longer contigs to shorter contigs, and infer contigs belonging to multiple bins.

The tool requires initial binning to refine, as well as an assembly graph as a
source of information for refining. Optionally, BinSPreader can be provided with
multiple Hi-C and/or paired-end libraries.

## Command line options

Required positional arguments:

- Assembly graph file in [GFA 1.0
format](https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md), with
scaffolds included as path lines. Alternatively, scaffold paths can be
provided separately using the `--paths` option in the `.paths` format accepted by
Bandage (see [Bandage
wiki](https://github.com/rrwick/Bandage/wiki/Graph-paths) for details).
- Binning output from an existing tool (in `.tsv` format)
- Output directory

### Synopsis
```bash
binspreader <graph (in GFA)> <binning (in .tsv)> <output directory> [OPTION...]
```

### Main options

`--paths`
Provide contig paths from a file, separately from the GFA

`--dataset`
Dataset in [YAML format](running.md#specifying-multiple-libraries-with-yaml-data-set-file) describing Hi-C and paired-end reads

`-t`
Number of threads to use (default: 1/2 of available threads)

`-m`
Allow multiple bin assignment (default: false)

`-Smax|-Smle`
Simple maximum or maximum likelihood binning assignment strategy (default: max likelihood)

`-Rcorr|-Rprop`
Select propagation or correction mode (default: correction)

`--cami`
Use CAMI bioboxes binning format

`--zero-bin`
Emit zero bin for unbinned sequences

`--tall-multi`
Use tall table for multiple binning result

`--bin-dist`
Estimate pairwise bin distance (could be slow on large graphs!)

`-la`
Labels correction regularization parameter for labeled data (default: 0.6)
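
As a rough illustration of how the synopsis and the options above fit together, the following sketch assembles a BinSPreader command line that uses propagation mode with multiple bin assignment on 8 threads. The file and directory names are placeholders invented for this example, and the script only prints the command instead of running it.

```bash
set -eu

# Placeholder inputs (hypothetical names chosen for this example).
graph="assembly_graph_with_scaffolds.gfa"   # GFA 1.0 graph with scaffold path lines
binning="initial_binning.tsv"               # binning produced by an existing tool
outdir="binspreader_out"                    # output directory

# Positional arguments come first, then options, as in the synopsis.
#   -t 8    use 8 threads
#   -Rprop  propagation mode instead of the default correction mode
#   -m      allow a contig to be assigned to multiple bins
cmd="binspreader $graph $binning $outdir -t 8 -Rprop -m"

# Print the assembled command so it can be reviewed before it is run.
echo "$cmd"
```

The placeholder names contain no whitespace, so building the command as a plain string is safe here; paths with spaces would need quoting.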


### Output
BinSPreader stores all output files in the output directory `<output_dir>` set by the user.

- `<output_dir>/binning.tsv` contains refined binning in `.tsv` format
- `<output_dir>/bin_stats.tsv` contains various per-bin statistics
- `<output_dir>/bin_weights.tsv` contains soft bin weights per contig
- `<output_dir>/edge_weights.tsv` contains soft bin weights per edge

In addition:

- `<output_dir>/bin_dist.tsv` contains the refined bin distance matrix (if `--bin-dist` was used)
- `<output_dir>/bin_label_1.fastq` and `<output_dir>/bin_label_2.fastq` contain the read set for the bin labeled `bin_label` (if `--reads` was used)
- `<output_dir>/pe_links.tsv` contains the list of paired-end links between assembly graph edges, with weights (if `--debug` was used)
- `<output_dir>/graph_links.tsv` contains the list of graph links between assembly graph edges, with weights (if `--debug` was used)
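
A quick way to sanity-check a refined binning is to count how many contigs ended up in each bin. The sketch below assumes that `binning.tsv` has two tab-separated columns (contig name, then bin label) and no header line; this layout is an assumption made for the example and is not stated above. The contig and bin names are invented, and the sample file is generated on the fly so that the snippet runs without BinSPreader.

```bash
set -eu

# Create a tiny stand-in for <output_dir>/binning.tsv (hypothetical content).
outdir=$(mktemp -d)
printf 'contig_1\tbin_A\ncontig_2\tbin_A\ncontig_3\tbin_B\n' > "$outdir/binning.tsv"

# Count contigs per bin: column 2 is assumed to hold the bin label.
counts=$(awk -F '\t' '{ n[$2]++ } END { for (b in n) print b "\t" n[b] }' \
    "$outdir/binning.tsv" | sort)

printf '%s\n' "$counts"

# Remove the temporary directory.
rm -r "$outdir"
```

If the real file carries a header line or extra columns, the `awk` program would need to skip or select them accordingly.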
178 changes: 113 additions & 65 deletions docs/running.md
@@ -3,56 +3,104 @@
To run SPAdes from the command line, type

``` bash
spades.py [options] -o <output_dir>
```

We assume that the `bin` folder of the SPAdes installation directory has been added to the `PATH` variable; otherwise, provide the full path to the SPAdes executable: `<spades installation dir>/bin/spades.py`.

## Basic options and modes
## Running modes
#### `--isolate`

`-o <output_dir> `
Specify the output directory. Required option.
This flag is highly recommended for high-coverage isolate and multi-cell
Illumina data; improves the assembly quality and running time. We also
recommend trimming your reads prior to the assembly. This option is not
compatible with `--only-error-correction` or `--careful` options.

#### `--sc`

This flag is required for MDA amplified (single-cell) data. Assumes highly
uneven coverage and presence of amplification artifacts.

`--isolate `
This flag is highly recommended for high-coverage isolate and multi-cell Illumina data; improves the assembly quality and running time.
We also recommend trimming your reads prior to the assembly.
This option is not compatible with `--only-error-correction` or `--careful` options.
#### `--meta` (same as `metaspades.py`)

`--sc `
This flag is required for MDA (single-cell) data.
This flag is recommended when assembling metagenomic data sets (runs metaSPAdes,
see [paper](https://genome.cshlp.org/content/27/5/824.short) for more
details). Currently metaSPAdes supports only a **_single_** short-read library
which has to be **_paired-end_** (we hope to remove this restriction soon). In
addition, you can provide long reads (e.g. using `--pacbio` or `--nanopore`
options), but hybrid assembly for metagenomes remains an experimental pipeline
and optimal performance is not guaranteed. It does not support [careful
mode](running.md#pipeline-options) (mismatch correction is not available). In
addition, you cannot specify a coverage cutoff for metaSPAdes. Note that
metaSPAdes may be very sensitive to technical sequences remaining in the data
(most notably adapter readthroughs); please run quality control and
pre-process your data accordingly.

`--meta ` (same as `metaspades.py`)
This flag is recommended when assembling metagenomic data sets (runs metaSPAdes, see [paper](https://genome.cshlp.org/content/27/5/824.short) for more details). Currently metaSPAdes supports only a **_single_** short-read library which has to be **_paired-end_** (we hope to remove this restriction soon). In addition, you can provide long reads (e.g. using `--pacbio` or `--nanopore` options), but hybrid assembly for metagenomes remains an experimental pipeline and optimal performance is not guaranteed. It does not support [careful mode](running.md#pipeline-options) (mismatch correction is not available). In addition, you cannot specify coverage cutoff for metaSPAdes. Note that metaSPAdes might be very sensitive to the presence of the technical sequences remaining in the data (most notably adapter readthroughs), please run quality control and pre-process your data accordingly.
#### `--plasmid` (same as `plasmidspades.py`)

`--plasmid ` (same as `plasmidspades.py`)
This flag is required when assembling only plasmids from WGS data sets (runs plasmidSPAdes, see [paper](https://academic.oup.com/bioinformatics/article/32/22/3380/2525610) for the algorithm details). Note, that plasmidSPAdes is not compatible with single-cell mode (`--sc`). Additionally, we do not recommend to run plasmidSPAdes on more than one library.
This flag enables plasmidSPAdes mode that assembles only
plasmids from WGS data sets (see
[paper](https://academic.oup.com/bioinformatics/article/32/22/3380/2525610) for
the algorithm details). Note that plasmidSPAdes is not compatible with
single-cell mode (`--sc`). Additionally, we do not recommend running
plasmidSPAdes on more than one library.

See [plasmidSPAdes output section](output.md#plasmidspades-output) for details.

`--metaplasmid ` (same as `metaplasmidspades.py` and `--meta` `--plasmid`) and
#### `--metaplasmid` and `--metaviral`
(`--metaplasmid` is the same as `metaplasmidspades.py` or `--meta` `--plasmid`; `--metaviral` is the same as `metaviralspades.py`)

These options are designed for extracting extrachromosomal elements from
metagenomic assemblies. They run similar pipelines that differ slightly in the
simplification step; in addition, metaviral mode outputs linear putative
extrachromosomal contigs, whereas metaplasmid mode does not.
See [metaplasmid paper](https://genome.cshlp.org/content/29/6/961.short) and
[metaviral
paper](https://academic.oup.com/bioinformatics/article-abstract/36/14/4126/5837667)
for algorithm details.

See [metaplasmidSPAdes/metaviralSPAdes
section](output.md#metaplasmidspades-and-metaviralspades-output) for details of
the output.

Additionally, for plasmidSPAdes, metaplasmidSPAdes and metaviralSPAdes we
recommend verifying the resulting contigs with the [viralVerify
tool](https://github.com/ablab/viralVerify).

#### `--bio`

`--metaviral ` (same as `metaviralspades.py`)
This flag enables biosyntheticSPAdes mode that assembles non-ribosomal and
polyketide gene clusters from WGS data sets (see
[paper](https://genome.cshlp.org/content/early/2019/06/03/gr.243477.118?top=1)
for the algorithm details). biosyntheticSPAdes is designed to work on isolate
or metagenomic WGS data sets. Note that biosyntheticSPAdes is not compatible with
any other modes. See [biosyntheticSPAdes output
section](output.md#biosyntheticspades-output) for details of the output.

These options work specially for extracting extrachromosomal elements from metagenomic assemblies. They run similar pipelines that slightly differ in the simplification step; another difference is that for metaviral mode we output linear putative extrachromosomal contigs and for metaplasmid mode we do not.
See [metaplasmid paper](https://genome.cshlp.org/content/29/6/961.short) and [metaviral paper](https://academic.oup.com/bioinformatics/article-abstract/36/14/4126/5837667) for the algorithms details.
#### `--rna` (same as `rnaspades.py`)

See [metaplasmidSPAdes/metaviralSPAdes section](output.md#metaplasmidspades-and-metaviralspades-output) for details see.
This flag should be used when assembling RNA-Seq data sets (runs rnaSPAdes). To
learn more, see the dedicated [rnaSPAdes manual](rna.md). Not compatible with
`--only-error-correction` or `--careful` options.

Additionally for plasmidSPAdes, metaplasmidSPAdes and metaviralSPAdes we recommend verifying resulting contigs with [viralVerify tool](https://github.com/ablab/viralVerify).
#### `--rnaviral` (same as `rnaviralspades.py`)
This flag should be used when assembling viral RNA-Seq data sets (runs rnaviralSPAdes).
Not compatible with `--only-error-correction` or `--careful` options.

`--bio `
This flag is required when assembling only non-ribosomal and polyketide gene clusters from WGS data sets (runs biosyntheticSPAdes, see [paper](https://genome.cshlp.org/content/early/2019/06/03/gr.243477.118?top=1) for the algorithm details). biosyntheticSPAdes is supposed to work on isolated or metagenomic WGS dataset. Note, that biosyntheticSPAdes is not compatible with any other modes. See [biosyntheticSPAdes output section](output.md#biosyntheticspades-output) for details.
#### `--corona` (same as `coronaspades.py`)
Enables the dedicated HMM-guided coronaviral assembly module. See the [HMM-guided
mode](hmm.md) page for details.

`--rna ` (same as `rnaspades.py`)
This flag should be used when assembling RNA-Seq data sets (runs rnaSPAdes). To learn more, see [rnaSPAdes manual](rna.md).
Not compatible with `--only-error-correction` or `--careful` options.
#### `--iontorrent`

`--rnaviral` (same as `rnaviralspades.py`)
This flag should be used when assembling viral RNA-Seq data sets (runs rnaviralSPAdes).
Not compatible with `--only-error-correction` or `--careful` options.
This flag is required when assembling IonTorrent data. Allows BAM files as
input. Carefully read [IonTorrent section](datatypes.md#assembling-iontorrent-reads)
before using this option.

`--iontorrent `
This flag is required when assembling IonTorrent data. Allows BAM files as input. Carefully read [IonTorrent section](datatypes.md#assembling-iontorrent-reads) before using this option.
## Basic options

`-o <output_dir>`
Specify the output directory. Required option.

`--test`
Runs SPAdes on the toy data set; see [installation](installation.md#verifying-your-installation) for details.
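
Several of the modes above exclude other options. As a minimal sketch, a wrapper script could reject some of those combinations before launching a long run. The function `check_mode_flags` is a hypothetical helper written for this example, not part of SPAdes, and it covers only a subset of the rules stated above: `--isolate`, `--rna` and `--rnaviral` with `--careful` or `--only-error-correction`, `--meta` with `--careful`, and `--plasmid` with `--sc`.

```bash
# Hypothetical pre-flight check; returns 0 if no listed conflict is found, 1 otherwise.
# Plain POSIX shell has no local variables, so the names below are global.
check_mode_flags() {
    mode="" careful=0 only_ec=0 sc=0 plasmid=0
    for arg in "$@"; do
        case "$arg" in
            --isolate|--rna|--rnaviral|--meta) mode="$arg" ;;
            --careful) careful=1 ;;
            --only-error-correction) only_ec=1 ;;
            --sc) sc=1 ;;
            --plasmid) plasmid=1 ;;
        esac
    done

    case "$mode" in
        --isolate|--rna|--rnaviral)
            if [ "$careful" -eq 1 ] || [ "$only_ec" -eq 1 ]; then
                echo "error: $mode cannot be combined with --careful or --only-error-correction" >&2
                return 1
            fi
            ;;
        --meta)
            if [ "$careful" -eq 1 ]; then
                echo "error: --meta does not support --careful" >&2
                return 1
            fi
            ;;
    esac

    if [ "$plasmid" -eq 1 ] && [ "$sc" -eq 1 ]; then
        echo "error: --plasmid is not compatible with --sc" >&2
        return 1
    fi
    return 0
}

# Example: an isolate run with one paired-end library (placeholder file names).
if check_mode_flags --isolate --pe1-1 reads_1.fastq --pe1-2 reads_2.fastq -o spades_output; then
    echo "flags look compatible"
fi
```

Arguments that are not mode or pipeline flags, such as file names, are simply ignored by the check.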
@@ -337,98 +385,98 @@ Notes:
To test the toy data set, you can also run the following command from the SPAdes `bin` directory:

``` bash
spades.py --pe1-1 ../share/spades/test_dataset/ecoli_1K_1.fq.gz \
--pe1-2 ../share/spades/test_dataset/ecoli_1K_2.fq.gz \
-o spades_test
```

If you have your library separated into several pairs of files, for example:

``` plain
lib1_forward_1.fastq
lib1_reverse_1.fastq
lib1_forward_2.fastq
lib1_reverse_2.fastq
```

make sure that corresponding files are given in the same order:

``` bash
spades.py --pe1-1 lib1_forward_1.fastq --pe1-2 lib1_reverse_1.fastq \
--pe1-1 lib1_forward_2.fastq --pe1-2 lib1_reverse_2.fastq \
-o spades_output
```
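
When a library is split into many numbered file pairs, the matching order can be generated rather than typed by hand. The following sketch builds the same command as above in a loop, so that each forward file is always followed by its reverse partner. It assumes the `lib1_forward_N.fastq` / `lib1_reverse_N.fastq` naming used in this example and only prints the resulting command.

```bash
set -eu

# Collect one --pe1-1/--pe1-2 pair per chunk, keeping forward and reverse files
# in the same order. The file names contain no whitespace, so plain string
# concatenation is sufficient here.
args=""
for i in 1 2; do
    args="$args --pe1-1 lib1_forward_$i.fastq --pe1-2 lib1_reverse_$i.fastq"
done

cmd="spades.py$args -o spades_output"

# Print the assembled command for review.
echo "$cmd"
```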

Files with interlaced paired-end reads or files with unpaired reads can be specified in any order with one file per option, for example:

``` bash
spades.py --pe1-12 lib1_1.fastq --pe1-12 lib1_2.fastq \
--pe1-s lib1_unpaired_1.fastq --pe1-s lib1_unpaired_2.fastq \
-o spades_output
```

If you have several paired-end and mate-pair libraries, for example:

paired-end library 1

``` plain
lib_pe1_left.fastq
lib_pe1_right.fastq
```

mate-pair library 1

``` plain
lib_mp1_left.fastq
lib_mp1_right.fastq
```

mate-pair library 2

``` plain
lib_mp2_left.fastq
lib_mp2_right.fastq
```

make sure that files corresponding to each library are grouped together:

``` bash
spades.py --pe1-1 lib_pe1_left.fastq --pe1-2 lib_pe1_right.fastq \
--mp1-1 lib_mp1_left.fastq --mp1-2 lib_mp1_right.fastq \
--mp2-1 lib_mp2_left.fastq --mp2-2 lib_mp2_right.fastq \
-o spades_output
```

If you have IonTorrent unpaired reads, PacBio CLR reads, and additional reliable contigs:

``` plain
it_reads.fastq
pacbio_clr.fastq
contigs.fasta
```

run SPAdes with the following command:

``` bash
spades.py --iontorrent -s it_reads.fastq \
--pacbio pacbio_clr.fastq --trusted-contigs contigs.fasta \
-o spades_output
```

If a single-read library is split into several files:

``` plain
unpaired1_1.fastq
unpaired1_2.fastq
unpaired1_3.fasta
```

specify them as one library:

``` bash
spades.py --s1 unpaired1_1.fastq \
--s1 unpaired1_2.fastq --s1 unpaired1_3.fasta \
-o spades_output
```

All options for specifying input data can be mixed if needed, but make sure that files for each library are grouped and files with left and right paired reads are listed in the same order.
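
One simple check that left and right files really belong together is to compare their read counts before starting the assembly. The sketch below assumes uncompressed FASTQ files with exactly four lines per record, which is an assumption of this example and not a SPAdes requirement. It creates two tiny stand-in files so that it runs on its own; `count_reads` is a hypothetical helper.

```bash
set -eu

# Create two small stand-in FASTQ files (invented reads, two records each).
tmp=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' > "$tmp/lib_pe1_left.fastq"
printf '@r1/2\nACGA\n+\nIIII\n@r2/2\nCTGA\n+\nIIII\n' > "$tmp/lib_pe1_right.fastq"

# Number of reads = number of lines / 4 (assumes four-line FASTQ records).
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

left=$(count_reads "$tmp/lib_pe1_left.fastq")
right=$(count_reads "$tmp/lib_pe1_right.fastq")

if [ "$left" -eq "$right" ]; then
    status="ok"
else
    status="mismatch"
fi
echo "left=$left right=$right status=$status"

# Remove the temporary directory.
rm -r "$tmp"
```

Equal counts do not prove that the reads are in the same order, but a mismatch is a reliable sign that the two files do not form a valid pair.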
