Merge pull request #22 from nf-core/add-malt
Add MALT build
jfy133 authored Feb 8, 2024
2 parents 8c5e8ad + bab1926 commit fa482c3
Showing 32 changed files with 694 additions and 58 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/download_pipeline.yml
@@ -64,4 +64,4 @@ jobs:
env:
NXF_SINGULARITY_CACHEDIR: ./
NXF_SINGULARITY_HOME_MOUNT: true
run: nextflow run ./${{ env.REPOTITLE_LOWERCASE }}/$( sed 's/\W/_/g' <<< ${{ env.REPO_BRANCH }}) -stub -profile test,singularity --outdir ./results
run: nextflow run ./${{ env.REPOTITLE_LOWERCASE }}/$( sed 's/\W/_/g' <<< ${{ env.REPO_BRANCH }}) -stub -profile test,singularity --outdir ./results
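The `sed 's/\W/_/g'` substitution in the `run:` step turns the branch name into a directory-safe string by replacing every non-word character with an underscore. A minimal Python sketch of the same substitution follows; the function name `sanitize_branch` and the second branch name are illustrative assumptions, not part of the workflow.

```python
import re


def sanitize_branch(branch: str) -> str:
    # Replace every character that is not a letter, digit, or underscore
    # with an underscore, mirroring the sed expression s/\W/_/g.
    # re.ASCII restricts "word character" to [A-Za-z0-9_]; sed's behaviour
    # for non-ASCII input depends on the locale, so this is an assumption.
    return re.sub(r"\W", "_", branch, flags=re.ASCII)


if __name__ == "__main__":
    print(sanitize_branch("add-malt"))         # add_malt
    print(sanitize_branch("feature/malt.v2"))  # feature_malt_v2
```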
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -11,6 +11,7 @@ Adds database building support for:

- DIAMOND (added by @jfy133)
- Kaiju (added by @jfy133)
- MALT (added by @jfy133)

### `Added`

4 changes: 4 additions & 0 deletions CITATIONS.md
@@ -47,3 +47,7 @@
- [Kaiju](https://doi.org/10.1038/ncomms11257)

> Menzel, P., Ng, K. L., & Krogh, A. (2016). Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nature Communications, 7, 11257. https://doi.org/10.1038/ncomms11257
- [MALT](https://doi.org/10.1038/s41559-017-0446-6)

> Vågene, Å. J., Herbig, A., Campana, M. G., Robles García, N. M., Warinner, C., Sabin, S., Spyrou, M. A., Andrades Valtueña, A., Huson, D., Tuross, N., Bos, K. I., & Krause, J. (2018). Salmonella enterica genomes from victims of a major sixteenth-century epidemic in Mexico. Nature Ecology & Evolution, 2(3), 520–528. https://doi.org/10.1038/s41559-017-0446-6
8 changes: 8 additions & 0 deletions README.md
@@ -29,8 +29,16 @@
workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->

<!--
1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
-->

1. Prepares input FASTA files for building
2. Builds databases for:
- [DIAMOND](https://doi.org/10.1038/nmeth.3176)
- [Kaiju](https://doi.org/10.1038/ncomms11257)
- [MALT](https://doi.org/10.1038/s41559-017-0446-6)

## Usage

2 changes: 2 additions & 0 deletions assets/schema_input.json
@@ -24,6 +24,7 @@
"anyOf": [
{
"type": "string",
"format": "file-path",
"pattern": "^\\S+\\.(fasta|fas|fa|fna)(\\.gz)?$"
},
{
@@ -40,6 +41,7 @@
"anyOf": [
{
"type": "string",
"format": "file-path",
"pattern": "^\\S+\\.(fasta|fas|fa|faa)(\\.gz)?$"
},
{
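The two `pattern` entries in `assets/schema_input.json` differ only in one extension: the first accepts `.fna`, the second `.faa`. The sketch below checks a few paths against both patterns after JSON unescaping. It assumes the first pattern applies to nucleotide FASTA and the second to protein FASTA, and that Python's `re` module behaves like the schema validator's regex engine for these expressions; the names `DNA_PATTERN`, `AA_PATTERN`, and `matches` are hypothetical.

```python
import re

# Patterns copied from the schema, with the JSON escaping removed.
DNA_PATTERN = re.compile(r"^\S+\.(fasta|fas|fa|fna)(\.gz)?$")
AA_PATTERN = re.compile(r"^\S+\.(fasta|fas|fa|faa)(\.gz)?$")


def matches(pattern: re.Pattern, path: str) -> bool:
    # True when the whole path satisfies the schema pattern.
    return pattern.match(path) is not None


if __name__ == "__main__":
    for path in ("genome.fna.gz", "proteins.faa", "my genome.fa", "reads.fastq"):
        print(path, matches(DNA_PATTERN, path), matches(AA_PATTERN, path))
```

Because of `\S+`, a path containing whitespace is rejected by both patterns, and the shared `fasta`, `fas`, and `fa` extensions are accepted by both.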
4 changes: 4 additions & 0 deletions conf/base.config
@@ -62,4 +62,8 @@ process {
withName:CUSTOM_DUMPSOFTWAREVERSIONS {
cache = false
}
withName:'KAIJU_MAKEDB'{
memory = { check_max( 24.GB * task.attempt, 'memory' ) }

}
}
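The `KAIJU_MAKEDB` memory directive scales the request with the retry attempt and passes it through `check_max`. As a rough sketch, assuming `check_max` simply clamps the request to the configured `max_memory`, the arithmetic looks like this in Python. The function name `capped_memory_gb` and the 128 GB cap are illustrative assumptions; the 6 GB cap is the `max_memory` value from the test config in this commit.

```python
def capped_memory_gb(base_gb: int, attempt: int, max_gb: int) -> int:
    # The request grows linearly with the retry attempt,
    # then is clamped to the configured maximum.
    return min(base_gb * attempt, max_gb)


if __name__ == "__main__":
    # Hypothetical 128 GB cap: the request grows on each retry.
    for attempt in (1, 2, 3):
        print(attempt, capped_memory_gb(24, attempt, max_gb=128))
    # 6 GB cap from the test profile: the request is clamped immediately.
    print(capped_memory_gb(24, 1, max_gb=6))
```

Under this assumption, the 24 GB base request never takes effect in the test profile, since the 6 GB limit is always smaller.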
13 changes: 13 additions & 0 deletions conf/modules.config
@@ -35,4 +35,17 @@ process {
]
}

withName: 'CAT_CAT_DNA' {
ext.prefix = { "${meta.id}.fna" }
}

withName: 'CAT_CAT_AA' {
ext.prefix = { "${meta.id}.faa" }
}

withName: 'MALT_BUILD' {
ext.args = { "--sequenceType ${params.malt_sequencetype}" }

}

}
4 changes: 3 additions & 1 deletion conf/test.config
@@ -23,10 +23,12 @@ params {

input = 'https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/samplesheets/test.csv'

build_kaiju = true
build_diamond = true
build_kaiju = true
build_malt = true

prot2taxid = 'https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/taxonomy/prot.accession2taxid.gz'
nodesdmp = 'https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/taxonomy/prot_nodes.dmp'
namesdmp = 'https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/taxonomy/prot_names.dmp'
malt_mapdb = 's3://ngi-igenomes/test-data/createtaxdb/taxonomy/megan-nucl-Feb2022.db.zip'
}
34 changes: 34 additions & 0 deletions conf/test_nothing.config
@@ -0,0 +1,34 @@
/*
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Nextflow config file for running minimal tests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Defines input files and everything required to run a fast and simple pipeline test.
Use as follows:
nextflow run nf-core/createtaxdb -profile test,<docker/singularity> --outdir <OUTDIR>
----------------------------------------------------------------------------------------
*/

params {
config_profile_name = 'Test profile'
config_profile_description = 'Minimal test dataset to check pipeline function'

// Limit resources so that this can run on GitHub Actions
max_cpus = 2
max_memory = '6.GB'
max_time = '6.h'

// Input data

input = 'https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/samplesheets/test.csv'

build_diamond = false
build_kaiju = false
build_malt = false

prot2taxid = 'https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/taxonomy/prot.accession2taxid.gz'
nodesdmp = 'https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/taxonomy/prot_nodes.dmp'
namesdmp = 'https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/taxonomy/prot_names.dmp'
malt_mapdb = 's3://ngi-igenomes/test-data/createtaxdb/taxonomy/megan-nucl-Feb2022.db.zip'
}
65 changes: 41 additions & 24 deletions docs/output.md
@@ -14,62 +14,79 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d

- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
- [DIAMOND](#diamond) - Database files for DIAMOND
- [Kaiju](#kaiju) - Database files for Kaiju
- [MALT](#malt) - Database files for MALT

### Diamond
### MultiQC

<details markdown="1">
<summary>Output files</summary>

- `diamond/`
- `<database>.dmnd`: DIAMOND dmnd database file
- `multiqc/`
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
- `multiqc_plots/`: directory containing static images from the report in various formats.

</details>

[DIAMOND](https://github.com/bbuchfink/diamond) is an accelerated, BLAST-compatible local sequence aligner used particularly for protein alignment.
[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

The `dmnd` file can be given to one of the DIAMOND alignment commands with `diamond blast<x/p> -d <your_database>.dmnd` etc.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see <http://multiqc.info>.

### Kaiju
### Pipeline information

<details markdown="1">
<summary>Output files</summary>

- `kaiju/`
- `<database_name>.fmi`: Kaiju FMI index file
- `pipeline_info/`
- Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
- Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
- Parameters used by the pipeline run: `params.json`.

</details>

[Kaiju](https://bioinformatics-centre.github.io/kaiju/) is a fast and sensitive taxonomic classifier for metagenomics that uses nucleotide-to-protein translations.
[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.

The `fmi` file can be given to Kaiju itself with `kaiju -f <your_database>.fmi` etc.
### Diamond

### MultiQC
[DIAMOND](https://github.com/bbuchfink/diamond) is an accelerated, BLAST-compatible local sequence aligner used particularly for protein alignment.

<details markdown="1">
<summary>Output files</summary>

- `multiqc/`
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
- `multiqc_plots/`: directory containing static images from the report in various formats.
- `diamond/`
- `<database>.dmnd`: DIAMOND dmnd database file

</details>

[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
The `dmnd` file can be given to one of the DIAMOND alignment commands with `diamond blast<x/p> -d <your_database>.dmnd` etc.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see <http://multiqc.info>.
### Kaiju

### Pipeline information
[Kaiju](https://bioinformatics-centre.github.io/kaiju/) is a fast and sensitive taxonomic classifier for metagenomics that uses nucleotide-to-protein translations.

<details markdown="1">
<summary>Output files</summary>

- `pipeline_info/`
- Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
- Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
- Parameters used by the pipeline run: `params.json`.
- `kaiju/`
- `<database_name>.fmi`: Kaiju FMI index file

</details>

[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
The `fmi` file can be given to Kaiju itself with `kaiju -f <your_database>.fmi` etc.

### MALT

[MALT](https://software-ab.cs.uni-tuebingen.de/download/malt) is a fast replacement for BLASTX, BLASTP and BLASTN, and provides both local and semi-global alignment capabilities.

<details markdown="1">
<summary>Output files</summary>

- `malt/`
- `malt_index/`: directory containing MALT index files

</details>

The `malt_index` directory can be given to MALT itself with `malt-run --index <your_database>/` etc.
7 changes: 4 additions & 3 deletions lib/WorkflowCreatetaxdb.groovy
@@ -60,6 +60,7 @@ class WorkflowCreatetaxdb {
"Tools used in the workflow included:",
params.build_diamond ? "DIAMOND (Buchfink et al. 2015)," : "",
params.build_kaiju ? "Kaiju (Menzel et al. 2016)," : "",
params.build_malt ? "MALT (Vågene et al. 2018)," : "",
"and MultiQC (Ewels et al. 2016)",
"."
].join(' ').trim()
@@ -72,9 +73,9 @@ class WorkflowCreatetaxdb {
// Can use ternary operators to dynamically construct based conditions, e.g. params["run_xyz"] ? "<li>Author (2023) Pub name, Journal, DOI</li>" : "",
// Uncomment function in methodsDescriptionText to render in MultiQC report
def reference_text = [
params.build_diamond ? "<li>Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1), 59–60. <a href=\"https://doi.org/10.1038/nmeth.3176\">10.1038/nmeth.3176</a></li>" : "",
params.build_kaiju ? "<li>Menzel, P., Ng, K. L., & Krogh, A. (2016). Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nature Communications, 7, 11257. <a href=\"https://doi.org/10.1038/ncomms11257\">10.1038/ncomms11257</a></li>" : "",
"<li>Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics , 32(19), 3047–3048. doi: /10.1093/bioinformatics/btw354</li>"
params.build_diamond ? "<li>Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1), 59–60. <a href=\"https://doi.org/10.1038/nmeth.3176\">10.1038/nmeth.3176</a></li>" : "",
params.build_kaiju ? "<li>Menzel, P., Ng, K. L., & Krogh, A. (2016). Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nature Communications, 7, 11257. <a href=\"https://doi.org/10.1038/ncomms11257\">10.1038/ncomms11257</a></li>" : "",
params.build_malt ? "<li>Vågene, Å. J., Herbig, A., Campana, M. G., Robles García, N. M., Warinner, C., Sabin, S., Spyrou, M. A., Andrades Valtueña, A., Huson, D., Tuross, N., Bos, K. I., & Krause, J. (2018). Salmonella enterica genomes from victims of a major sixteenth-century epidemic in Mexico. Nature Ecology & Evolution, 2(3), 520–528. <a href=\"https://doi.org/10.1038/s41559-017-0446-6\">10.1038/s41559-017-0446-6</a></li>" : "", "<li>Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics , 32(19), 3047–3048. doi: /10.1093/bioinformatics/btw354</li>"
].join(' ').trim()

return reference_text
12 changes: 11 additions & 1 deletion modules.json
@@ -27,13 +27,23 @@
},
"malt/build": {
"branch": "master",
"git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
"git_sha": "7d3bac628092d1aead36960c4b6ae41302a9f797",
"installed_by": ["modules"]
},
"multiqc": {
"branch": "master",
"git_sha": "8ec825f465b9c17f9d83000022995b4f7de6fe93",
"installed_by": ["modules"]
},
"pigz/compress": {
"branch": "master",
"git_sha": "0eab94fc1e48703c1b0a8704bd665f554905c39d",
"installed_by": ["modules"]
},
"unzip": {
"branch": "master",
"git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
"installed_by": ["modules"]
}
}
}
13 changes: 12 additions & 1 deletion modules/nf-core/malt/build/main.nf

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

83 changes: 83 additions & 0 deletions modules/nf-core/malt/build/tests/main.nf.test

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
