Finish adding basic MALT parameters
jfy133 committed Feb 8, 2024
1 parent f9cad14 commit 14a1507
Showing 8 changed files with 111 additions and 45 deletions.
4 changes: 4 additions & 0 deletions CITATIONS.md
@@ -47,3 +47,7 @@
- [Kaiju](https://doi.org/10.1038/ncomms11257)

> Menzel, P., Ng, K. L., & Krogh, A. (2016). Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nature Communications, 7, 11257. https://doi.org/10.1038/ncomms11257
- [MALT](https://doi.org/10.1038/s41559-017-0446-6)

> Vågene, Å. J., Herbig, A., Campana, M. G., Robles García, N. M., Warinner, C., Sabin, S., Spyrou, M. A., Andrades Valtueña, A., Huson, D., Tuross, N., Bos, K. I., & Krause, J. (2018). Salmonella enterica genomes from victims of a major sixteenth-century epidemic in Mexico. Nature Ecology & Evolution, 2(3), 520–528. https://doi.org/10.1038/s41559-017-0446-6
8 changes: 8 additions & 0 deletions README.md
@@ -25,8 +25,16 @@
workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->

<!--
1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
-->

1. Prepares input FASTA files for building
2. Builds databases for:
- [DIAMOND](https://doi.org/10.1038/nmeth.3176)
- [Kaiju](https://doi.org/10.1038/ncomms11257)
- [MALT](https://doi.org/10.1038/s41559-017-0446-6)

## Usage

4 changes: 4 additions & 0 deletions conf/base.config
@@ -62,4 +62,8 @@ process {
withName:CUSTOM_DUMPSOFTWAREVERSIONS {
cache = false
}
withName:'KAIJU_MKFMI'{
memory = { check_max( 24.GB * task.attempt, 'memory' ) }

}
}
8 changes: 8 additions & 0 deletions conf/modules.config
@@ -35,6 +35,14 @@ process {
]
}

withName: 'CAT_CAT_DNA' {
ext.prefix = { "${meta.id}.fna" }
}

withName: 'CAT_CAT_AA' {
ext.prefix = { "${meta.id}.faa" }
}

withName: 'MALT_BUILD' {
ext.args = { "--sequenceType ${params.malt_sequencetype}" }

65 changes: 41 additions & 24 deletions docs/output.md
@@ -14,62 +14,79 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d

- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
- [DIAMOND](#diamond) - Database files for DIAMOND
- [Kaiju](#kaiju) - Database files for Kaiju
- [MALT](#malt) - Database files for MALT

### Diamond
### MultiQC

<details markdown="1">
<summary>Output files</summary>

- `diamond/`
- `<database>.dmnd`: DIAMOND dmnd database file
- `multiqc/`
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
- `multiqc_plots/`: directory containing static images from the report in various formats.

</details>

[DIAMOND](https://github.com/bbuchfink/diamond) is an accelerated, BLAST-compatible local sequence aligner used particularly for protein alignment.
[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

The `dmnd` file can be given to one of the DIAMOND alignment commands with `diamond blast<x/p> -d <your_database>.dmnd` etc.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see <http://multiqc.info>.

### Kaiju
### Pipeline information

<details markdown="1">
<summary>Output files</summary>

- `kaiju/`
- `<database_name>.fmi`: Kaiju FMI index file
- `pipeline_info/`
- Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
- Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
- Parameters used by the pipeline run: `params.json`.

</details>

[Kaiju](https://bioinformatics-centre.github.io/kaiju/) is a fast and sensitive taxonomic classifier for metagenomics that uses nucleotide-to-protein translation.
[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.

The `fmi` file can be given to Kaiju itself with `kaiju -f <your_database>.fmi` etc.
### Diamond

### MultiQC
[DIAMOND](https://github.com/bbuchfink/diamond) is an accelerated, BLAST-compatible local sequence aligner used particularly for protein alignment.

<details markdown="1">
<summary>Output files</summary>

- `multiqc/`
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
- `multiqc_plots/`: directory containing static images from the report in various formats.
- `diamond/`
- `<database>.dmnd`: DIAMOND dmnd database file

</details>

[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
The `dmnd` file can be given to one of the DIAMOND alignment commands with `diamond blast<x/p> -d <your_database>.dmnd` etc.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see <http://multiqc.info>.
### Kaiju

### Pipeline information
[Kaiju](https://bioinformatics-centre.github.io/kaiju/) is a fast and sensitive taxonomic classifier for metagenomics that uses nucleotide-to-protein translation.

<details markdown="1">
<summary>Output files</summary>

- `pipeline_info/`
- Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
- Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
- Parameters used by the pipeline run: `params.json`.
- `kaiju/`
- `<database_name>.fmi`: Kaiju FMI index file

</details>

[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
The `fmi` file can be given to Kaiju itself with `kaiju -f <your_database>.fmi` etc.

### MALT

[MALT](https://software-ab.cs.uni-tuebingen.de/download/malt) is a fast replacement for BLASTX, BLASTP and BLASTN, and provides both local and semi-global alignment capabilities.

<details markdown="1">
<summary>Output files</summary>

- `malt/`
- `malt_index/`: directory containing MALT index files

</details>

The `malt_index` directory can be given to MALT itself with `malt-run --index <your_database>/` etc.
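
As a rough illustration of the sentence above, the sketch below assembles a `malt-run` call against the `malt_index/` directory. The paths, the reads file name and the `BlastN`/`SemiGlobal` choices are assumptions made for illustration, not pipeline outputs or defaults. The command is only printed unless `DRY_RUN=0` is set.

```bash
#!/bin/sh
set -eu

# Hypothetical locations: adjust to your own output directory and read files.
INDEX_DIR="${INDEX_DIR:-results/malt/malt_index}"
READS="${READS:-sample.fastq.gz}"
OUT_DIR="${OUT_DIR:-malt_out}"

# BlastN is meant for an index built from DNA sequences; an index built
# with the Protein sequence type would need BlastX or BlastP instead.
set -- malt-run --mode BlastN --alignmentType SemiGlobal \
    --index "$INDEX_DIR" --inFile "$READS" --output "$OUT_DIR"

# Print the command by default; set DRY_RUN=0 to actually execute it.
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
else
    mkdir -p "$OUT_DIR"
    "$@"
fi
```

Keeping the command in the positional parameters means the same argument list is used for both the printed preview and the real call.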
7 changes: 4 additions & 3 deletions lib/WorkflowCreatetaxdb.groovy
@@ -60,6 +60,7 @@ class WorkflowCreatetaxdb {
"Tools used in the workflow included:",
params.build_diamond ? "DIAMOND (Buchfink et al. 2015)," : "",
params.build_kaiju ? "Kaiju (Menzel et al. 2016)," : "",
params.build_malt ? "MALT (Vågene et al. 2018)," : "",
"and MultiQC (Ewels et al. 2016)",
"."
].join(' ').trim()
@@ -72,9 +73,9 @@ class WorkflowCreatetaxdb {
// Can use ternary operators to dynamically construct based on conditions, e.g. params["run_xyz"] ? "<li>Author (2023) Pub name, Journal, DOI</li>" : "",
// Uncomment function in methodsDescriptionText to render in MultiQC report
def reference_text = [
params.build_diamond ? "<li>Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1), 59–60. <a href=\"https://doi.org/10.1038/nmeth.3176\">10.1038/nmeth.3176</a></li>" : "",
params.build_kaiju ? "<li>Menzel, P., Ng, K. L., & Krogh, A. (2016). Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nature Communications, 7, 11257. <a href=\"https://doi.org/10.1038/ncomms11257\">10.1038/ncomms11257</a></li>" : "",
"<li>Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics , 32(19), 3047–3048. doi: /10.1093/bioinformatics/btw354</li>"
params.build_diamond ? "<li>Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1), 59–60. <a href=\"https://doi.org/10.1038/nmeth.3176\">10.1038/nmeth.3176</a></li>" : "",
params.build_kaiju ? "<li>Menzel, P., Ng, K. L., & Krogh, A. (2016). Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nature Communications, 7, 11257. <a href=\"https://doi.org/10.1038/ncomms11257\">10.1038/ncomms11257</a></li>" : "",
params.build_malt ? "<li>Vågene, Å. J., Herbig, A., Campana, M. G., Robles García, N. M., Warinner, C., Sabin, S., Spyrou, M. A., Andrades Valtueña, A., Huson, D., Tuross, N., Bos, K. I., & Krause, J. (2018). Salmonella enterica genomes from victims of a major sixteenth-century epidemic in Mexico. Nature Ecology & Evolution, 2(3), 520–528. <a href=\"https://doi.org/10.1038/s41559-017-0446-6\">10.1038/s41559-017-0446-6</a></li>" : "", "<li>Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics , 32(19), 3047–3048. doi: /10.1093/bioinformatics/btw354</li>"
].join(' ').trim()

return reference_text
8 changes: 8 additions & 0 deletions nextflow_schema.json
@@ -99,6 +99,14 @@
"type": "boolean",
"fa_icon": "fas fa-toggle-on",
"description": "Turn on building of MALT database. Requires nucleotide FASTA file input."
},
"malt_sequencetype": {
"type": "string",
"default": "DNA",
"description": "Specify type of input sequence being given to MALT",
"enum": ["DNA", "Protein"],
"help_text": "Use to specify whether the reference sequences are DNA or Protein sequences. (For RNA sequences, use the DNA setting) - from [MALT manual](https://software-ab.cs.uni-tuebingen.de/download/malt/).\n\n> Modifies tool(s) parameter(s)\n> - malt-build: `--sequenceType` ",
"fa_icon": "fas fa-dna"
}
},
"fa_icon": "fas fa-database"
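
A minimal sketch of how the `malt_sequencetype` parameter defined above might be passed on the command line. The pipeline name `nf-core/createtaxdb`, the `docker` profile, the `--input`/`--outdir` options and the file names are assumptions based on common nf-core conventions rather than anything stated in this commit; `--build_malt`, `--dbname` and `--malt_sequencetype` are the parameter names that appear in the diff. The command is only printed unless `DRY_RUN=0` is set.

```bash
#!/bin/sh
set -eu

# The schema restricts the value to the enum ["DNA", "Protein"]; default is DNA.
SEQ_TYPE="${SEQ_TYPE:-DNA}"
case "$SEQ_TYPE" in
    DNA|Protein) ;;
    *) echo "malt_sequencetype must be DNA or Protein, got: $SEQ_TYPE" >&2; exit 1 ;;
esac

# Hypothetical pipeline name, profile, samplesheet and output directory.
set -- nextflow run nf-core/createtaxdb -profile docker \
    --input samplesheet.csv --outdir results \
    --dbname mydb --build_malt --malt_sequencetype "$SEQ_TYPE"

# Print the command by default; set DRY_RUN=0 to actually launch the run.
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
else
    "$@"
fi
```

In the workflow code of this commit, the `Protein` setting reads from the amino-acid FASTA channel, which appears to be created only when a Kaiju or DIAMOND build is also enabled, so `DNA` is the safer choice for a MALT-only run.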
52 changes: 34 additions & 18 deletions workflows/createtaxdb.nf
@@ -59,13 +59,14 @@ ch_multiqc_custom_methods_description = params.multiqc_methods_description ? fil
include { MULTIQC } from '../modules/nf-core/multiqc/main'
include { CUSTOM_DUMPSOFTWAREVERSIONS } from '../modules/nf-core/custom/dumpsoftwareversions/main'

include { CAT_CAT as CAT_CAT_DNA } from '../modules/nf-core/cat/cat/main'
include { CAT_CAT as CAT_CAT_AA } from '../modules/nf-core/cat/cat/main'
include { KAIJU_MKFMI } from '../modules/nf-core/kaiju/mkfmi/main'
include { DIAMOND_MAKEDB } from '../modules/nf-core/diamond/makedb/main'
include { MALT_BUILD } from '../modules/nf-core/malt/build/main'
include { PIGZ_COMPRESS } from '../modules/nf-core/pigz/compress/main'
include { UNZIP } from '../modules/nf-core/unzip/main'
include { CAT_CAT as CAT_CAT_DNA } from '../modules/nf-core/cat/cat/main'
include { CAT_CAT as CAT_CAT_AA } from '../modules/nf-core/cat/cat/main'
include { KAIJU_MKFMI } from '../modules/nf-core/kaiju/mkfmi/main'
include { DIAMOND_MAKEDB } from '../modules/nf-core/diamond/makedb/main'
include { MALT_BUILD } from '../modules/nf-core/malt/build/main'
include { PIGZ_COMPRESS as PIGZ_COMPRESS_DNA } from '../modules/nf-core/pigz/compress/main'
include { PIGZ_COMPRESS as PIGZ_COMPRESS_AA } from '../modules/nf-core/pigz/compress/main'
include { UNZIP } from '../modules/nf-core/unzip/main'
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RUN MAIN WORKFLOW
@@ -101,9 +102,9 @@ workflow CREATETAXDB {
unzipped: true
}

PIGZ_COMPRESS ( ch_dna_for_zipping.unzipped )
PIGZ_COMPRESS_DNA ( ch_dna_for_zipping.unzipped )

ch_prepped_dna_fastas = PIGZ_COMPRESS.out.archive.mix(ch_dna_for_zipping.zipped).groupTuple()
ch_prepped_dna_fastas = PIGZ_COMPRESS_DNA.out.archive.mix(ch_dna_for_zipping.zipped).groupTuple()

// Place in single file
ch_singleref_for_dna = CAT_CAT_DNA ( ch_prepped_dna_fastas )
@@ -118,17 +119,24 @@ workflow CREATETAXDB {
// idea: try just appending `_<tax_id_from_meta>` to end of each sequence header using a local sed module... it might be sufficient
if ( [params.build_kaiju, params.build_diamond].any() ) {

// Pull just AA sequences
ch_aa_refs_for_singleref = ch_input
.map{meta, fasta_dna, fasta_aa -> [[id: params.dbname], fasta_aa]}
.filter{meta, fasta_aa ->
fasta_aa
}
.groupTuple()
.map{meta, fasta_dna, fasta_aa -> [[id: params.dbname], fasta_aa]}
.filter{meta, fasta_aa ->
fasta_aa
}

ch_aa_for_zipping = ch_aa_refs_for_singleref
.branch {
meta, fasta ->
zipped: fasta.extension == 'gz'
unzipped: true
}

PIGZ_COMPRESS_AA ( ch_aa_for_zipping.unzipped )

ch_prepped_aa_fastas = PIGZ_COMPRESS_AA.out.archive.mix(ch_aa_for_zipping.zipped).groupTuple()

// TODO: BROKEN -> CATS UNZIPPED AND ZIPPED FASTAS (Also for DNA) - Place in a single file
ch_singleref_for_aa = CAT_CAT_AA ( ch_aa_refs_for_singleref )
ch_singleref_for_aa = CAT_CAT_AA ( ch_prepped_aa_fastas )
ch_versions = ch_versions.mix(CAT_CAT_AA.out.versions.first())
}

@@ -165,7 +173,15 @@ workflow CREATETAXDB {
ch_malt_mapdb = file(params.malt_mapdb)
}

MALT_BUILD (ch_prepped_dna_fastas.map{ meta, file -> file }, [], ch_malt_mapdb)
// Declare the channel before assigning it in the branches below
def ch_input_for_malt

if ( params.malt_sequencetype == 'Protein') {
ch_input_for_malt = ch_prepped_aa_fastas.map{ meta, file -> file }
} else {
ch_input_for_malt = ch_prepped_dna_fastas.map{ meta, file -> file }
}

MALT_BUILD (ch_input_for_malt, [], ch_malt_mapdb)
}

CUSTOM_DUMPSOFTWAREVERSIONS (