Refine docs, separate out binspreader

ablab · Apr 5, 2024 · ee3a7bd · ee3a7bd
1 parent 5efe28a
commit ee3a7bd

Showing 4 changed files with 195 additions and 130 deletions.
diff --git a/docs/binspreader.md b/docs/binspreader.md
@@ -0,0 +1,79 @@
+# Binning refinement using assembly graphs
+
+BinSPreader is a tool that attempts to refine metagenome-assembled genomes
+(MAGs) obtained from existing tools. BinSPreader exploits the assembly graph
+topology and other connectivity information, such as paired-end and Hi-C reads,
+to refine the existing binning, correct binning errors, propagate binning from
+longer contigs to shorter contigs, and infer contigs belonging to multiple bins.
+
+The tool requires initial binning to refine, as well as an assembly graph as a
+source of information for refining. Optionally, BinSPreader can be provided with
+multiple Hi-C and/or paired-end libraries.
+
+## Command line options
+
+Required positional arguments:
+
+- Assembly graph file in [GFA 1.0
+ format](https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md), with
+ scaffolds included as path lines. Alternatively, scaffold paths can be
+ provided separately using the `--paths` option in the `.paths` format accepted by
+ Bandage (see [Bandage
+ wiki](https://github.com/rrwick/Bandage/wiki/Graph-paths) for details).
+- Binning output from an existing tool (in `.tsv` format)
+
+### Synopsis
+```bash
+binspreader <graph (in GFA)> <binning (in .tsv)> <output directory> [OPTION...]
+```
+
+### Main options
+
+`--paths`
+ Provide contig paths from a file, separately from the GFA
+
+`--dataset`
+ Dataset in [YAML format](running.md#specifying-multiple-libraries-with-yaml-data-set-file) describing Hi-C and paired-end reads
+
+`-t`
+ Number of threads to use (default: 1/2 of available threads)
+
+`-m`
+ Allow multiple bin assignment (default: false)
+
+`-Smax|-Smle`
+ Simple maximum or maximum likelihood binning assignment strategy (default: maximum likelihood)
+
+`-Rcorr|-Rprop`
+ Select correction or propagation mode (default: correction)
+
+`--cami`
+ Use CAMI bioboxes binning format
+
+`--zero-bin`
+ Emit zero bin for unbinned sequences
+
+`--tall-multi`
+ Use tall table for multiple binning result
+
+`--bin-dist`
+ Estimate pairwise bin distance (can be slow on large graphs)
+
+`-la`
+ Label correction regularization parameter for labeled data (default: 0.6)
+
+
+### Output
+BinSPreader stores all output files in the output directory `<output_dir>` set by the user.
+
+- `<output_dir>/binning.tsv` contains refined binning in `.tsv` format
+- `<output_dir>/bin_stats.tsv` contains various per-bin statistics
+- `<output_dir>/bin_weights.tsv` contains soft bin weights per contig
+- `<output_dir>/edge_weights.tsv` contains soft bin weights per edge
+
+In addition:
+
+- `<output_dir>/bin_dist.tsv` contains the refined bin distance matrix (if `--bin-dist` was used)
+- `<output_dir>/bin_label_1.fastq`, `<output_dir>/bin_label_2.fastq` contain the read set for the bin labeled `bin_label` (if `--reads` was used)
+- `<output_dir>/pe_links.tsv` contains the list of paired-end links between assembly graph edges with weights (if `--debug` was used)
+- `<output_dir>/graph_links.tsv` contains the list of graph links between assembly graph edges with weights (if `--debug` was used)
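
A minimal dry-run sketch of how the BinSPreader options documented in `docs/binspreader.md` above might be combined. The file names and the thread count are hypothetical placeholders, and passing the thread count as `-t 8` is an assumption about the argument syntax. The script only prints the command; it does not run BinSPreader.

```bash
#!/usr/bin/env bash
# Hypothetical inputs: a GFA graph with scaffold paths, an initial binning, an output directory.
graph="assembly_graph_with_scaffolds.gfa"
binning="initial_binning.tsv"
outdir="binspreader_out"

# Positional arguments first, then options, following the documented synopsis.
cmd=(
  binspreader "$graph" "$binning" "$outdir"
  -t 8          # number of threads (value syntax is an assumption)
  -Rprop        # propagation mode instead of the default correction mode
  -m            # allow multiple bin assignment
  --tall-multi  # tall table for the multiple binning result
)

# Dry run: show the assembled command line instead of executing it.
echo "${cmd[*]}"
```

Keeping the command in an array preserves each argument as a separate word, so the same array could later be executed with `"${cmd[@]}"` once the inputs exist.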
diff --git a/docs/running.md b/docs/running.md
@@ -3,56 +3,104 @@
 To run SPAdes from the command line, type
 
 ``` bash
- spades.py [options] -o <output_dir>
+spades.py [options] -o <output_dir>
 ```
 
 Note that we assume that the `bin` folder from SPAdes installation directory is added to the `PATH` variable (provide full path to SPAdes executable otherwise: `<spades installation dir>/bin/spades.py`).
 
-## Basic options and modes
+## Running modes
+#### `--isolate`
 
-`-o <output_dir> `
- Specify the output directory. Required option.
+This flag is highly recommended for high-coverage isolate and multi-cell
+Illumina data; it improves the assembly quality and running time. We also
+recommend trimming your reads prior to the assembly. This option is not
+compatible with `--only-error-correction` or `--careful` options.
+
+#### `--sc`
+
+This flag is required for MDA-amplified (single-cell) data. Assumes highly
+uneven coverage and presence of amplification artifacts.
 
-`--isolate `
- This flag is highly recommended for high-coverage isolate and multi-cell Illumina data; improves the assembly quality and running time.
- We also recommend trimming your reads prior to the assembly.
- This option is not compatible with `--only-error-correction` or `--careful` options.
+#### `--meta` (same as `metaspades.py`)
 
-`--sc `
- This flag is required for MDA (single-cell) data.
+This flag is recommended when assembling metagenomic data sets (runs metaSPAdes,
+see [paper](https://genome.cshlp.org/content/27/5/824.short) for more
+details). Currently metaSPAdes supports only a **_single_** short-read library
+which has to be **_paired-end_** (we hope to remove this restriction soon). In
+addition, you can provide long reads (e.g. using `--pacbio` or `--nanopore`
+options), but hybrid assembly for metagenomes remains an experimental pipeline
+and optimal performance is not guaranteed. It does not support [careful
+mode](running.md#pipeline-options) (mismatch correction is not available). In
+addition, you cannot specify a coverage cutoff for metaSPAdes. Note that
+metaSPAdes might be very sensitive to the presence of technical sequences
+remaining in the data (most notably adapter readthroughs), so please run quality
+control and pre-process your data accordingly.
 
-`--meta ` (same as `metaspades.py`)
- This flag is recommended when assembling metagenomic data sets (runs metaSPAdes, see [paper](https://genome.cshlp.org/content/27/5/824.short) for more details). Currently metaSPAdes supports only a **_single_** short-read library which has to be **_paired-end_** (we hope to remove this restriction soon). In addition, you can provide long reads (e.g. using `--pacbio` or `--nanopore` options), but hybrid assembly for metagenomes remains an experimental pipeline and optimal performance is not guaranteed. It does not support [careful mode](running.md#pipeline-options) (mismatch correction is not available). In addition, you cannot specify coverage cutoff for metaSPAdes. Note that metaSPAdes might be very sensitive to the presence of the technical sequences remaining in the data (most notably adapter readthroughs), please run quality control and pre-process your data accordingly.
+#### `--plasmid` (same as `plasmidspades.py`)
 
-`--plasmid ` (same as `plasmidspades.py`)
- This flag is required when assembling only plasmids from WGS data sets (runs plasmidSPAdes, see [paper](https://academic.oup.com/bioinformatics/article/32/22/3380/2525610) for the algorithm details). Note, that plasmidSPAdes is not compatible with single-cell mode (`--sc`). Additionally, we do not recommend to run plasmidSPAdes on more than one library.
+This flag enables plasmidSPAdes mode that assembles only
+plasmids from WGS data sets (see
+[paper](https://academic.oup.com/bioinformatics/article/32/22/3380/2525610) for
+the algorithm details). Note that plasmidSPAdes is not compatible with
+single-cell mode (`--sc`). Additionally, we do not recommend running
+plasmidSPAdes on more than one library.
 
 See [plasmidSPAdes output section](output.md#plasmidspades-output) for details.
 
-`--metaplasmid ` (same as `metaplasmidspades.py` and `--meta` `--plasmid`) and
+#### `--metaplasmid` and `--metaviral`
+(`--metaplasmid` is the same as `metaplasmidspades.py` and `--meta --plasmid`; `--metaviral` is the same as `metaviralspades.py`)
+
+These options are designed specifically for extracting extrachromosomal elements from
+metagenomic assemblies. They run similar pipelines that slightly differ in the
+simplification step; another difference is that for metaviral mode we output
+linear putative extrachromosomal contigs and for metaplasmid mode we do not.
+See [metaplasmid paper](https://genome.cshlp.org/content/29/6/961.short) and
+[metaviral
+paper](https://academic.oup.com/bioinformatics/article-abstract/36/14/4126/5837667)
+for the algorithm details.
+
+See [metaplasmidSPAdes/metaviralSPAdes
+section](output.md#metaplasmidspades-and-metaviralspades-output) for details of
+the output.
+
+Additionally, for plasmidSPAdes, metaplasmidSPAdes and metaviralSPAdes we
+recommend verifying resulting contigs with [viralVerify
+tool](https://github.com/ablab/viralVerify).
+
+#### `--bio`
 
-`--metaviral ` (same as `metaviralspades.py`)
+This flag enables biosyntheticSPAdes mode that assembles non-ribosomal and
+polyketide gene clusters from WGS data sets (see
+[paper](https://genome.cshlp.org/content/early/2019/06/03/gr.243477.118?top=1)
+for the algorithm details). biosyntheticSPAdes is designed to work on isolate
+or metagenomic WGS data sets. Note that biosyntheticSPAdes is not compatible with
+any other modes. See [biosyntheticSPAdes output
+section](output.md#biosyntheticspades-output) for details of the output.
 
-These options work specially for extracting extrachromosomal elements from metagenomic assemblies. They run similar pipelines that slightly differ in the simplification step; another difference is that for metaviral mode we output linear putative extrachromosomal contigs and for metaplasmid mode we do not.
-See [metaplasmid paper](https://genome.cshlp.org/content/29/6/961.short) and [metaviral paper](https://academic.oup.com/bioinformatics/article-abstract/36/14/4126/5837667) for the algorithms details.
+#### `--rna` (same as `rnaspades.py`)
 
-See [metaplasmidSPAdes/metaviralSPAdes section](output.md#metaplasmidspades-and-metaviralspades-output) for details see.
+This flag should be used when assembling RNA-Seq data sets (runs rnaSPAdes). To
+learn more, see dedicated [rnaSPAdes manual](rna.md). Not compatible with
+`--only-error-correction` or `--careful` options.
 
-Additionally for plasmidSPAdes, metaplasmidSPAdes and metaviralSPAdes we recommend verifying resulting contigs with [viralVerify tool](https://github.com/ablab/viralVerify).
+#### `--rnaviral` (same as `rnaviralspades.py`)
+This flag should be used when assembling viral RNA-Seq data sets (runs rnaviralSPAdes).
+Not compatible with `--only-error-correction` or `--careful` options.
 
-`--bio `
- This flag is required when assembling only non-ribosomal and polyketide gene clusters from WGS data sets (runs biosyntheticSPAdes, see [paper](https://genome.cshlp.org/content/early/2019/06/03/gr.243477.118?top=1) for the algorithm details). biosyntheticSPAdes is supposed to work on isolated or metagenomic WGS dataset. Note, that biosyntheticSPAdes is not compatible with any other modes. See [biosyntheticSPAdes output section](output.md#biosyntheticspades-output) for details.
+#### `--corona` (same as `coronaspades.py`)
+Enables dedicated HMM-guided coronaviral assembly module. See [HMM-guided
+mode](hmm.md) page for details.
 
-`--rna ` (same as `rnaspades.py`)
- This flag should be used when assembling RNA-Seq data sets (runs rnaSPAdes). To learn more, see [rnaSPAdes manual](rna.md).
- Not compatible with `--only-error-correction` or `--careful` options.
+#### `--iontorrent`
 
-`--rnaviral` (same as `rnaviralspades.py`)
- This flag should be used when assembling viral RNA-Seq data sets (runs rnaviralSPAdes).
- Not compatible with `--only-error-correction` or `--careful` options.
+This flag is required when assembling IonTorrent data. Allows BAM files as
+input. Carefully read [IonTorrent section](datatypes.md#assembling-iontorrent-reads)
+before using this option.
 
-`--iontorrent `
- This flag is required when assembling IonTorrent data. Allows BAM files as input. Carefully read [IonTorrent section](datatypes.md#assembling-iontorrent-reads) before using this option.
+## Basic options
+
+`-o <output_dir>`
+ Specify the output directory. Required option.
 
 `--test`
  Runs SPAdes on the toy data set; see [installation](installation.md#verifying-your-installation) for details.
@@ -337,98 +385,98 @@ Notes:
 To test the toy data set, you can also run the following command from the SPAdes `bin` directory:
 
 ``` bash
- spades.py --pe1-1 ../share/spades/test_dataset/ecoli_1K_1.fq.gz \
- --pe1-2 ../share/spades/test_dataset/ecoli_1K_2.fq.gz -o spades_test
+spades.py --pe1-1 ../share/spades/test_dataset/ecoli_1K_1.fq.gz \
+  --pe1-2 ../share/spades/test_dataset/ecoli_1K_2.fq.gz \
+  -o spades_test
 ```
 
 If you have your library separated into several pairs of files, for example:
 
 ``` plain
- lib1_forward_1.fastq
- lib1_reverse_1.fastq
- lib1_forward_2.fastq
- lib1_reverse_2.fastq
+lib1_forward_1.fastq
+lib1_reverse_1.fastq
+lib1_forward_2.fastq
+lib1_reverse_2.fastq
 ```
 
 make sure that corresponding files are given in the same order:
 
 ``` bash
- spades.py --pe1-1 lib1_forward_1.fastq --pe1-2 lib1_reverse_1.fastq \
- --pe1-1 lib1_forward_2.fastq --pe1-2 lib1_reverse_2.fastq \
- -o spades_output
+spades.py --pe1-1 lib1_forward_1.fastq --pe1-2 lib1_reverse_1.fastq \
+  --pe1-1 lib1_forward_2.fastq --pe1-2 lib1_reverse_2.fastq \
+  -o spades_output
 ```
 
 Files with interlacing paired-end reads or files with unpaired reads can be specified in any order with one file per option, for example:
 
 ``` bash
- spades.py --pe1-12 lib1_1.fastq --pe1-12 lib1_2.fastq \
- --pe1-s lib1_unpaired_1.fastq --pe1-s lib1_unpaired_2.fastq \
- -o spades_output
+spades.py --pe1-12 lib1_1.fastq --pe1-12 lib1_2.fastq \
+  --pe1-s lib1_unpaired_1.fastq --pe1-s lib1_unpaired_2.fastq \
+  -o spades_output
 ```
 
 If you have several paired-end and mate-pair reads, for example:
 
 paired-end library 1
 
 ``` plain
- lib_pe1_left.fastq
- lib_pe1_right.fastq
+lib_pe1_left.fastq
+lib_pe1_right.fastq
 ```
 
 mate-pair library 1
 
 ``` plain
- lib_mp1_left.fastq
- lib_mp1_right.fastq
+lib_mp1_left.fastq
+lib_mp1_right.fastq
 ```
 
 mate-pair library 2
 
 ``` plain
- lib_mp2_left.fastq
- lib_mp2_right.fastq
+lib_mp2_left.fastq
+lib_mp2_right.fastq
 ```
 
 make sure that files corresponding to each library are grouped together:
 
 ``` bash
- spades.py --pe1-1 lib_pe1_left.fastq --pe1-2 lib_pe1_right.fastq \
- --mp1-1 lib_mp1_left.fastq --mp1-2 lib_mp1_right.fastq \
- --mp2-1 lib_mp2_left.fastq --mp2-2 lib_mp2_right.fastq \
- -o spades_output
+spades.py --pe1-1 lib_pe1_left.fastq --pe1-2 lib_pe1_right.fastq \
+  --mp1-1 lib_mp1_left.fastq --mp1-2 lib_mp1_right.fastq \
+  --mp2-1 lib_mp2_left.fastq --mp2-2 lib_mp2_right.fastq \
+  -o spades_output
 ```
 
 If you have IonTorrent unpaired reads, PacBio CLR and additional reliable contigs:
 
 ``` plain
- it_reads.fastq
- pacbio_clr.fastq
- contigs.fasta
+it_reads.fastq
+pacbio_clr.fastq
+contigs.fasta
 ```
 
 run SPAdes with the following command:
 
 ``` bash
- spades.py --iontorrent -s it_reads.fastq \
- --pacbio pacbio_clr.fastq --trusted-contigs contigs.fastq \
- -o spades_output
+spades.py --iontorrent -s it_reads.fastq \
+  --pacbio pacbio_clr.fastq --trusted-contigs contigs.fasta \
+  -o spades_output
 ```
 
 If a single-read library is split into several files:
 
 ``` plain
- unpaired1_1.fastq
- unpaired1_2.fastq
- unpaired1_3.fasta
+unpaired1_1.fastq
+unpaired1_2.fastq
+unpaired1_3.fastq
 ```
 
 specify them as one library:
 
 ``` bash
- spades.py --s1 unpaired1_1.fastq \
- --s1 unpaired1_2.fastq --s1 unpaired1_3.fastq \
- -o spades_output
+spades.py --s1 unpaired1_1.fastq \
+  --s1 unpaired1_2.fastq --s1 unpaired1_3.fastq \
+  -o spades_output
 ```
 
 All options for specifying input data can be mixed if needed, but make sure that files for each library are grouped and files with left and right paired reads are listed in the same order.
-
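
The grouping rule in the `docs/running.md` hunk above (each forward file listed together with its reverse mate, in the same order) can be sketched with a small shell helper. The function name `pe_library_args` is hypothetical and not part of SPAdes; the script prints the resulting command as a dry run and does not invoke `spades.py`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: emit --pe<N>-1/--pe<N>-2 options for one paired-end library,
# keeping every forward file next to its reverse mate.
# Usage: pe_library_args <library number> <forward_1> <reverse_1> [<forward_2> <reverse_2> ...]
pe_library_args() {
  local lib="$1"
  shift
  # An odd number of files means some forward file has no reverse mate.
  if (( $# % 2 != 0 )); then
    echo "expected forward/reverse file pairs" >&2
    return 1
  fi
  local args=()
  while (( $# > 0 )); do
    args+=("--pe${lib}-1" "$1" "--pe${lib}-2" "$2")
    shift 2
  done
  echo "${args[*]}"
}

# Dry run: print the command for a library split into two pairs of files.
echo "spades.py $(pe_library_args 1 \
  lib1_forward_1.fastq lib1_reverse_1.fastq \
  lib1_forward_2.fastq lib1_reverse_2.fastq) -o spades_output"
```

Generating the options pair by pair makes it hard to list forward and reverse files in mismatched order, which is the mistake the documentation warns about.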