THE README IS OUTDATED!
This pipeline takes paired-end metagenomic short reads, as generated by Illumina sequencing, as input. From this data, the pipeline aims to assemble high-quality single genomes.
- All samples are quality controlled separately to remove Illumina library adaptors, low-quality sequences and sequence ends, possible host (human) contamination, and spike-in contamination (PhiX). The cleaned reads are used in all subsequent steps.
- For each sample, a separate metagenomic assembly is performed using SPAdes in metagenomic mode. The reads are then mapped back to the resulting scaffolds, and the scaffolds are binned using MaxBin2.
- Additionally, a co-assembly is performed with MEGAHIT. A groupfile can be used to split the samples into separate groups for this co-assembly. The resulting contigs again serve as the reference for backmapping, followed by two separate binning approaches using MaxBin2 and MetaBAT2. In this approach, binning can also incorporate abundance differences across samples.
- Finally, the bins from the single-sample and groupwise co-assembly approaches are dereplicated with dRep to obtain single-genome bins of the highest possible quality with low redundancy.
- All samples are mapped again to the final bins to estimate bin abundance.
To execute the pipeline with all default settings, do:
nextflow -c nextflow.config run main.nf --folder /path/to/folder
Please note: the output will be written where the pipeline is executed, NOT where the input files are located.
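Because the output lands in the directory the pipeline is started from, one option is to launch it from a dedicated run directory. The following is a minimal sketch under stated assumptions: `PIPELINE_DIR`, `INPUT_DIR`, and `RUN_DIR` are hypothetical placeholders, and the command is only printed for review, not executed.

```shell
#!/bin/sh
# Hypothetical paths; adjust them to your setup.
PIPELINE_DIR="/path/to/pipeline"   # assumed checkout holding main.nf and nextflow.config
INPUT_DIR="/path/to/folder"        # folder with the input reads
RUN_DIR="./mags_run"               # hypothetical directory that will receive the output

# Create the run directory and change into it, because the output is
# written to the directory from which the pipeline is executed.
mkdir -p "$RUN_DIR"
cd "$RUN_DIR" || exit 1

# Same command as above, but pointing at the pipeline files by full path.
CMD="nextflow -c $PIPELINE_DIR/nextflow.config run $PIPELINE_DIR/main.nf --folder $INPUT_DIR"

# Printed for review rather than executed.
echo "$CMD"
```

Once the placeholders point to real locations, running the printed command from inside the run directory should keep the results separate from both the input files and the pipeline checkout.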
The pipeline uses several parameters to fine-tune the various pipeline stages. Some of these can be modified during pipeline execution:
- BBMap (v37.88): QC, Mapping to contigs.
- MEGAHIT (v1.1.2): Groupwise Co-Assemblies.
- SPAdes (v3.9.0): Single-Sample Assemblies.
- Samtools (v1.5): Conversion of SAM to BAM.
- MaxBin2 (v2.2.4): Binning of Contigs.
- MetaBAT2 (v2.12.1): Binning of Contigs.
- dRep (v2.0.5): Evaluation of Binned Contigs.
- CheckM (v1.0.11): Used by dRep.
Many of these tools have additional dependencies that are not listed here. If the tool works properly on its own, these are likely satisfied.
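A quick way to see whether the listed tools are reachable before starting a run is to look them up on the `PATH`. This is a rough sketch, not part of the pipeline: the helper `check_tool` is hypothetical, and the executable names are the ones these packages usually install, which is an assumption about what the pipeline actually calls.

```shell
#!/bin/sh
# Report whether a single executable is available on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}

# Usual executable names of the listed tools (assumed; the pipeline may call others).
for tool in bbmap.sh megahit spades.py samtools run_MaxBin.pl metabat2 dRep checkm; do
    check_tool "$tool"
done
```

This only confirms that an executable with that name can be found. It does not check the installed version or the tools' own dependencies.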
To remove human host/lab contamination, the database has to be prepared as described here.
dRep and CheckM use different Python versions (Python 2 and Python 3). Please follow the instructions provided on the dRep website to resolve this using pyenv.
The version numbers are the software versions used in development/testing of the pipeline.
This pipeline consists of the following components:
- main.nf (the actual workflow definition)
- nextflow.config (the top-level configuration file with generic options)
- config/rzcluster.config (the RZcluster specific configuration options)
- README.md (this file)
This pipeline was developed by M. Rühlemann.