Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add first two modules (diamond and kaiju), missing docs and tests #14

Merged
merged 22 commits into from
Jan 5, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
74 changes: 59 additions & 15 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -1,43 +1,87 @@
name: nf-core CI
# This workflow runs the pipeline with the minimal test dataset to check that it completes without any syntax errors
name: nf-core CI
on:
push:
branches:
- dev
- "dev"
pull_request:
branches:
- "dev"
- "master"
release:
types: [published]
types:
- "published"

env:
NXF_ANSI_LOG: false
NFTEST_VER: "0.7.3"

concurrency:
group: "${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}"
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
define_nxf_versions:
name: Choose nextflow versions to test against depending on target branch
runs-on: ubuntu-latest
outputs:
matrix: ${{ steps.nxf_versions.outputs.matrix }}
steps:
- id: nxf_versions
run: |
if [[ "${{ github.event_name }}" == "pull_request" && "${{ github.base_ref }}" == "dev" && "${{ matrix.NXF_VER }}" != "latest-everything" ]]; then
echo matrix='["latest-everything"]' | tee -a $GITHUB_OUTPUT
else
echo matrix='["latest-everything", "23.10.0"]' | tee -a $GITHUB_OUTPUT
fi

test:
name: Run pipeline with test data
# Only run on push if this is the nf-core dev branch (merged PRs)
if: "${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/createtaxdb') }}"
name: nf-test
needs: define_nxf_versions
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
NXF_VER:
- "23.04.0"
- "latest-everything"
NXF_VER: ${{ fromJson(needs.define_nxf_versions.outputs.matrix) }}
tags:
- "test"
profile:
- "docker"

steps:
- name: Check out pipeline code
uses: actions/checkout@v4

- name: Check out test data
uses: actions/checkout@v3
with:
repository: nf-core/test-datasets
ref: createtaxdb
path: test-datasets/
fetch-depth: 1

- name: Install Nextflow
uses: nf-core/setup-nextflow@v1
with:
version: "${{ matrix.NXF_VER }}"

- name: Run pipeline with test data
# TODO nf-core: You can customise CI pipeline run tests as required
# For example: adding multiple test runs with different parameters
# Remember that you can parallelise this by using strategy.matrix
- name: Install nf-test
run: |
wget -qO- https://code.askimed.com/install/nf-test | bash -s $NFTEST_VER
sudo mv nf-test /usr/local/bin/

- name: Run nf-test
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results
nf-test test --tag ${{ matrix.tags }} --profile ${{ matrix.tags }},${{ matrix.profile }} --junitxml=test.xml

- name: Output log on failure
if: failure()
run: |
sudo apt install bat > /dev/null
batcat --decorations=always --color=always ${{ github.workspace }}/.nf-test/tests/*/output/pipeline_info/software_versions.yml

- name: Publish Test Report
uses: mikepenz/action-junit-report@v3
if: always() # always run even if the previous step fails
with:
report_paths: "*.xml"
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -6,3 +6,5 @@ results/
testing/
testing*
*.pyc
.nf-test*
test.xml
3 changes: 3 additions & 0 deletions .nf-core.yml
Original file line number Diff line number Diff line change
@@ -1 +1,4 @@
repository_type: pipeline
## TODO: re-activate once nf-test ci.yml structure updated
lint:
actions_ci: False
5 changes: 5 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

Initial release of nf-core/createtaxdb, created with the [nf-core](https://nf-co.re/) template.

Adds database building support for:

- DIAMOND (added by @jfy133)
- Kaiju (added by @jfy133)

### `Added`

### `Fixed`
Expand Down
8 changes: 8 additions & 0 deletions CITATIONS.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,3 +39,11 @@
- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

> Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

- [DIAMOND](https://doi.org/10.1038/nmeth.3176)

> Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1), 59–60. https://doi.org/10.1038/nmeth.3176

- [Kaiju](https://doi.org/10.1038/ncomms11257)

> Menzel, P., Ng, K. L., & Krogh, A. (2016). Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nature Communications, 7, 11257. https://doi.org/10.1038/ncomms11257
52 changes: 41 additions & 11 deletions assets/schema_input.json
Original file line number Diff line number Diff line change
Expand Up @@ -7,30 +7,60 @@
"items": {
"type": "object",
"properties": {
"sample": {
"id": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Sample name must be provided and cannot contain spaces"
"unique": true,
"errorMessage": "Sequence reference name must be provided and cannot contain spaces",
"meta": ["id"]
},
"fastq_1": {
"type": "string",
"pattern": "^\\S+\\.f(ast)?q\\.gz$",
"errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
"taxid": {
"type": "integer",
"unique": true,
"errorMessage": "Please provide a valid taxonomic ID in integer format",
"meta": ["taxid"]
},
"fasta_dna": {
"anyOf": [
{
"type": "string",
"pattern": "^\\S+\\.(fasta|fas|fa|fna)(\\.gz)?$"
},
{
"type": "string",
"maxLength": 0
}
],
"unique": true,
"errorMessage": "FASTA file for nucleotide sequence cannot contain spaces and must have a valid FASTA extension (fasta, fna, fa, fas, faa), optionally gzipped",
"exists": true,
"format": "file-path"
},
"fastq_2": {
"errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'",
"fasta_aa": {
"anyOf": [
{
"type": "string",
"pattern": "^\\S+\\.f(ast)?q\\.gz$"
"pattern": "^\\S+\\.(fasta|fas|fa|faa)(\\.gz)?$"
},
{
"type": "string",
"maxLength": 0
}
]
],
"unique": true,
"errorMessage": "FASTA file for amino acid reference sequence cannot contain spaces and must have a valid FASTA extension (fasta, fna, fa, fas, faa), optionally gzipped",
"exists": true,
"format": "file-path"
}
},
"required": ["sample", "fastq_1"]
"required": ["id", "taxid"],
"anyOf": [
{
"required": ["fasta_dna"]
},
{
"required": ["fasta_aa"]
}
]
}
}
3 changes: 3 additions & 0 deletions assets/test.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
id,taxid,fasta_dna,fasta_aa
Severe_acute_respiratory_syndrome_coronavirus_2,2697049,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/sarscov2.fasta,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/sarscov2.faa
Haemophilus_influenzae,727,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/haemophilus_influenzae.fna.gz,
12 changes: 0 additions & 12 deletions conf/modules.config
Original file line number Diff line number Diff line change
Expand Up @@ -18,18 +18,6 @@ process {
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]

withName: SAMPLESHEET_CHECK {
publishDir = [
path: { "${params.outdir}/pipeline_info" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: FASTQC {
ext.args = '--quiet'
}
jfy133 marked this conversation as resolved.
Show resolved Hide resolved

withName: CUSTOM_DUMPSOFTWAREVERSIONS {
publishDir = [
path: { "${params.outdir}/pipeline_info" },
Expand Down
13 changes: 8 additions & 5 deletions conf/test.config
Original file line number Diff line number Diff line change
Expand Up @@ -20,10 +20,13 @@ params {
max_time = '6.h'

// Input data
// TODO nf-core: Specify the paths to your test data on nf-core/test-datasets
// TODO nf-core: Give any required params for the test so that command line flags are not needed
input = 'https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv'

// Genome references
genome = 'R64-1-1'
input = 'https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/samplesheets/test.csv'

build_kaiju = true
build_diamond = true

prot2taxid = 'https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/taxonomy/prot.accession2taxid.gz'
nodesdmp = 'https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/taxonomy/prot_nodes.dmp'
namesdmp = 'https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/taxonomy/prot_names.dmp'
}
28 changes: 16 additions & 12 deletions docs/output.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,32 +12,36 @@ The directories listed below will be created in the results directory after the

The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

- [FastQC](#fastqc) - Raw read QC
- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution

### FastQC
### Diamond

<details markdown="1">
<summary>Output files</summary>

- `fastqc/`
- `*_fastqc.html`: FastQC report containing quality metrics.
- `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images.
- `diamond/`
- `<database>.dmnd`: DIAMOND dmnd database file

</details>

[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
[DIAMOND](https://github.com/bbuchfink/diamond) is a accelerated BLAST compatible local sequence aligner particularly used for protein alignment.

![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png)
The `dmnd` file can be given to one of the DIAMOND alignment commands with `diamond blast<x/p> -d <your_database>.dmnd` etc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be a cool if we could extract (dynamically) the nf-core pipelines from nf-co.re that require or use the databases of this module?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I exactly follow, but we already see this: on the modules page. For example if you search for DIAMOND here:

https://nf-co.re/modules

you see

image

That taxprofiler is using the DIAMOND_BLASTX module.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes thats exactly what I mean but then have it at in the readme of description of the pipeline.
Output of Kraken can be used in: taxprofiler, MAG, viralrecon, ...


![MultiQC - FastQC mean quality scores plot](images/mqc_fastqc_quality.png)
### Kaiju

![MultiQC - FastQC adapter content plot](images/mqc_fastqc_adapter.png)
<details markdown="1">
<summary>Output files</summary>

- `kaiju/`
- `<database_name>.fmi`: Kaiju FMI index file

</details>

[Kaiju](https://bioinformatics-centre.github.io/kaiju/) is a fast and sensitive taxonomic classification for metagenomics utilising nucletoide to protein translations.

:::note
The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality.
:::
The `fmi` file can be given to Kaiju itself with `kaiju -f <your_database>.fmi` etc.

### MultiQC

Expand Down
20 changes: 10 additions & 10 deletions lib/WorkflowCreatetaxdb.groovy
Original file line number Diff line number Diff line change
Expand Up @@ -15,9 +15,9 @@ class WorkflowCreatetaxdb {
genomeExistsError(params, log)


if (!params.fasta) {
Nextflow.error "Genome fasta file not specified with e.g. '--fasta genome.fa' or via a detectable config file."
}
// if (!params.fasta) {
// Nextflow.error "Genome fasta file not specified with e.g. '--fasta genome.fa' or via a detectable config file."
// }
}

//
Expand Down Expand Up @@ -58,8 +58,9 @@ class WorkflowCreatetaxdb {
// Uncomment function in methodsDescriptionText to render in MultiQC report
def citation_text = [
"Tools used in the workflow included:",
"FastQC (Andrews 2010),",
"MultiQC (Ewels et al. 2016)",
params.build_diamond ? "DIAMOND (Buchfink et al. 2015)," : "",
params.build_kaiju ? "Kaiju (Menzel et al. 2016)," : "",
"and MultiQC (Ewels et al. 2016)",
"."
].join(' ').trim()

Expand All @@ -68,11 +69,11 @@ class WorkflowCreatetaxdb {

public static String toolBibliographyText(params) {

// TODO Optionally add bibliographic entries to this list.
// Can use ternary operators to dynamically construct based conditions, e.g. params["run_xyz"] ? "<li>Author (2023) Pub name, Journal, DOI</li>" : "",
// Uncomment function in methodsDescriptionText to render in MultiQC report
def reference_text = [
"<li>Andrews S, (2010) FastQC, URL: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).</li>",
params.build_diamond ? "<li>Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1), 59–60. <a href=\"https://doi.org/10.1038/nmeth.3176\">10.1038/nmeth.3176</a></li>" : "",
params.build_kaiju ? "<li>Menzel, P., Ng, K. L., & Krogh, A. (2016). Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nature Communications, 7, 11257. <a href=\"https://doi.org/10.1038/ncomms11257\">10.1038/ncomms11257</a></li>" : "",
"<li>Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics , 32(19), 3047–3048. doi: /10.1093/bioinformatics/btw354</li>"
].join(' ').trim()

Expand All @@ -93,9 +94,8 @@ class WorkflowCreatetaxdb {
meta["tool_citations"] = ""
meta["tool_bibliography"] = ""

// TODO Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
//meta["tool_citations"] = toolCitationText(params).replaceAll(", \\.", ".").replaceAll("\\. \\.", ".").replaceAll(", \\.", ".")
//meta["tool_bibliography"] = toolBibliographyText(params)
meta["tool_citations"] = toolCitationText(params).replaceAll(", \\.", ".").replaceAll("\\. \\.", ".").replaceAll(", \\.", ".")
meta["tool_bibliography"] = toolBibliographyText(params)


def methods_text = mqc_methods_yaml.text
Expand Down
19 changes: 17 additions & 2 deletions modules.json
Original file line number Diff line number Diff line change
Expand Up @@ -5,14 +5,29 @@
"https://github.com/nf-core/modules.git": {
"modules": {
"nf-core": {
"cat/cat": {
"branch": "master",
"git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
"installed_by": ["modules"]
},
"custom/dumpsoftwareversions": {
"branch": "master",
"git_sha": "bba7e362e4afead70653f84d8700588ea28d0f9e",
"installed_by": ["modules"]
},
"fastqc": {
"diamond/makedb": {
"branch": "master",
"git_sha": "b29f6beb86d1d24d680277fb1a3f4de7b8b8a92c",
"installed_by": ["modules"]
},
"kaiju/mkfmi": {
"branch": "master",
"git_sha": "7365564c402cbd01e9407810730efd10039997a3",
"installed_by": ["modules"]
},
"malt/build": {
jfy133 marked this conversation as resolved.
Show resolved Hide resolved
"branch": "master",
"git_sha": "65ad3e0b9a4099592e1102e92e10455dc661cf53",
"git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
"installed_by": ["modules"]
},
"multiqc": {
Expand Down
7 changes: 7 additions & 0 deletions modules/nf-core/cat/cat/environment.yml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading