Skip to content

Commit

Permalink
Merge pull request #14 from nf-core/input-validation
Browse files Browse the repository at this point in the history
Add first two modules (diamond and kaiju), missing docs and tests
  • Loading branch information
jfy133 authored Jan 5, 2024
2 parents 4bc13ea + a317cbe commit 467459b
Show file tree
Hide file tree
Showing 44 changed files with 1,191 additions and 329 deletions.
74 changes: 59 additions & 15 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
@@ -1,43 +1,87 @@
name: nf-core CI
# This workflow runs the pipeline with the minimal test dataset to check that it completes without any syntax errors
name: nf-core CI
on:
push:
branches:
- dev
- "dev"
pull_request:
branches:
- "dev"
- "master"
release:
types: [published]
types:
- "published"

env:
NXF_ANSI_LOG: false
NFTEST_VER: "0.7.3"

concurrency:
group: "${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}"
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true

jobs:
define_nxf_versions:
name: Choose nextflow versions to test against depending on target branch
runs-on: ubuntu-latest
outputs:
matrix: ${{ steps.nxf_versions.outputs.matrix }}
steps:
- id: nxf_versions
run: |
if [[ "${{ github.event_name }}" == "pull_request" && "${{ github.base_ref }}" == "dev" && "${{ matrix.NXF_VER }}" != "latest-everything" ]]; then
echo matrix='["latest-everything"]' | tee -a $GITHUB_OUTPUT
else
echo matrix='["latest-everything", "23.10.0"]' | tee -a $GITHUB_OUTPUT
fi
test:
name: Run pipeline with test data
# Only run on push if this is the nf-core dev branch (merged PRs)
if: "${{ github.event_name != 'push' || (github.event_name == 'push' && github.repository == 'nf-core/createtaxdb') }}"
name: nf-test
needs: define_nxf_versions
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
NXF_VER:
- "23.04.0"
- "latest-everything"
NXF_VER: ${{ fromJson(needs.define_nxf_versions.outputs.matrix) }}
tags:
- "test"
profile:
- "docker"

steps:
- name: Check out pipeline code
uses: actions/checkout@v4

- name: Check out test data
uses: actions/checkout@v3
with:
repository: nf-core/test-datasets
ref: createtaxdb
path: test-datasets/
fetch-depth: 1

- name: Install Nextflow
uses: nf-core/setup-nextflow@v1
with:
version: "${{ matrix.NXF_VER }}"

- name: Run pipeline with test data
# TODO nf-core: You can customise CI pipeline run tests as required
# For example: adding multiple test runs with different parameters
# Remember that you can parallelise this by using strategy.matrix
- name: Install nf-test
run: |
wget -qO- https://code.askimed.com/install/nf-test | bash -s $NFTEST_VER
sudo mv nf-test /usr/local/bin/
- name: Run nf-test
run: |
nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results
nf-test test --tag ${{ matrix.tags }} --profile ${{ matrix.tags }},${{ matrix.profile }} --junitxml=test.xml
- name: Output log on failure
if: failure()
run: |
sudo apt install bat > /dev/null
batcat --decorations=always --color=always ${{ github.workspace }}/.nf-test/tests/*/output/pipeline_info/software_versions.yml
- name: Publish Test Report
uses: mikepenz/action-junit-report@v3
if: always() # always run even if the previous step fails
with:
report_paths: "*.xml"
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -6,3 +6,5 @@ results/
testing/
testing*
*.pyc
.nf-test*
test.xml
3 changes: 3 additions & 0 deletions .nf-core.yml
Original file line number Diff line number Diff line change
@@ -1 +1,4 @@
repository_type: pipeline
## TODO: re-activate once nf-test ci.yml structure updated
lint:
actions_ci: False
5 changes: 5 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

Initial release of nf-core/createtaxdb, created with the [nf-core](https://nf-co.re/) template.

Adds database building support for:

- DIAMOND (added by @jfy133)
- Kaiju (added by @jfy133)

### `Added`

### `Fixed`
Expand Down
8 changes: 8 additions & 0 deletions CITATIONS.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,3 +39,11 @@
- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

> Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.
- [DIAMOND](https://doi.org/10.1038/nmeth.3176)

> Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1), 59–60. https://doi.org/10.1038/nmeth.3176
- [Kaiju](https://doi.org/10.1038/ncomms11257)

> Menzel, P., Ng, K. L., & Krogh, A. (2016). Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nature Communications, 7, 11257. https://doi.org/10.1038/ncomms11257
52 changes: 41 additions & 11 deletions assets/schema_input.json
Original file line number Diff line number Diff line change
Expand Up @@ -7,30 +7,60 @@
"items": {
"type": "object",
"properties": {
"sample": {
"id": {
"type": "string",
"pattern": "^\\S+$",
"errorMessage": "Sample name must be provided and cannot contain spaces"
"unique": true,
"errorMessage": "Sequence reference name must be provided and cannot contain spaces",
"meta": ["id"]
},
"fastq_1": {
"type": "string",
"pattern": "^\\S+\\.f(ast)?q\\.gz$",
"errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
"taxid": {
"type": "integer",
"unique": true,
"errorMessage": "Please provide a valid taxonomic ID in integer format",
"meta": ["taxid"]
},
"fasta_dna": {
"anyOf": [
{
"type": "string",
"pattern": "^\\S+\\.(fasta|fas|fa|fna)(\\.gz)?$"
},
{
"type": "string",
"maxLength": 0
}
],
"unique": true,
"errorMessage": "FASTA file for nucleotide sequence cannot contain spaces and must have a valid FASTA extension (fasta, fna, fa, fas, faa), optionally gzipped",
"exists": true,
"format": "file-path"
},
"fastq_2": {
"errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'",
"fasta_aa": {
"anyOf": [
{
"type": "string",
"pattern": "^\\S+\\.f(ast)?q\\.gz$"
"pattern": "^\\S+\\.(fasta|fas|fa|faa)(\\.gz)?$"
},
{
"type": "string",
"maxLength": 0
}
]
],
"unique": true,
"errorMessage": "FASTA file for amino acid reference sequence cannot contain spaces and must have a valid FASTA extension (fasta, fna, fa, fas, faa), optionally gzipped",
"exists": true,
"format": "file-path"
}
},
"required": ["sample", "fastq_1"]
"required": ["id", "taxid"],
"anyOf": [
{
"required": ["fasta_dna"]
},
{
"required": ["fasta_aa"]
}
]
}
}
3 changes: 3 additions & 0 deletions assets/test.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
id,taxid,fasta_dna,fasta_aa
Severe_acute_respiratory_syndrome_coronavirus_2,2697049,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/sarscov2.fasta,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/sarscov2.faa
Haemophilus_influenzae,727,https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/fasta/haemophilus_influenzae.fna.gz,
12 changes: 0 additions & 12 deletions conf/modules.config
Original file line number Diff line number Diff line change
Expand Up @@ -18,18 +18,6 @@ process {
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]

withName: SAMPLESHEET_CHECK {
publishDir = [
path: { "${params.outdir}/pipeline_info" },
mode: params.publish_dir_mode,
saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
]
}

withName: FASTQC {
ext.args = '--quiet'
}

withName: CUSTOM_DUMPSOFTWAREVERSIONS {
publishDir = [
path: { "${params.outdir}/pipeline_info" },
Expand Down
13 changes: 8 additions & 5 deletions conf/test.config
Original file line number Diff line number Diff line change
Expand Up @@ -20,10 +20,13 @@ params {
max_time = '6.h'

// Input data
// TODO nf-core: Specify the paths to your test data on nf-core/test-datasets
// TODO nf-core: Give any required params for the test so that command line flags are not needed
input = 'https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv'

// Genome references
genome = 'R64-1-1'
input = 'https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/samplesheets/test.csv'

build_kaiju = true
build_diamond = true

prot2taxid = 'https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/taxonomy/prot.accession2taxid.gz'
nodesdmp = 'https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/taxonomy/prot_nodes.dmp'
namesdmp = 'https://raw.githubusercontent.com/nf-core/test-datasets/createtaxdb/data/taxonomy/prot_names.dmp'
}
28 changes: 16 additions & 12 deletions docs/output.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,32 +12,36 @@ The directories listed below will be created in the results directory after the

The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:

- [FastQC](#fastqc) - Raw read QC
- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution

### FastQC
### Diamond

<details markdown="1">
<summary>Output files</summary>

- `fastqc/`
- `*_fastqc.html`: FastQC report containing quality metrics.
- `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images.
- `diamond/`
- `<database>.dmnd`: DIAMOND dmnd database file

</details>

[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
[DIAMOND](https://github.com/bbuchfink/diamond) is a accelerated BLAST compatible local sequence aligner particularly used for protein alignment.

![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png)
The `dmnd` file can be given to one of the DIAMOND alignment commands with `diamond blast<x/p> -d <your_database>.dmnd` etc.

![MultiQC - FastQC mean quality scores plot](images/mqc_fastqc_quality.png)
### Kaiju

![MultiQC - FastQC adapter content plot](images/mqc_fastqc_adapter.png)
<details markdown="1">
<summary>Output files</summary>

- `kaiju/`
- `<database_name>.fmi`: Kaiju FMI index file

</details>

[Kaiju](https://bioinformatics-centre.github.io/kaiju/) is a fast and sensitive taxonomic classification for metagenomics utilising nucletoide to protein translations.

:::note
The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality.
:::
The `fmi` file can be given to Kaiju itself with `kaiju -f <your_database>.fmi` etc.

### MultiQC

Expand Down
20 changes: 10 additions & 10 deletions lib/WorkflowCreatetaxdb.groovy
Original file line number Diff line number Diff line change
Expand Up @@ -15,9 +15,9 @@ class WorkflowCreatetaxdb {
genomeExistsError(params, log)


if (!params.fasta) {
Nextflow.error "Genome fasta file not specified with e.g. '--fasta genome.fa' or via a detectable config file."
}
// if (!params.fasta) {
// Nextflow.error "Genome fasta file not specified with e.g. '--fasta genome.fa' or via a detectable config file."
// }
}

//
Expand Down Expand Up @@ -58,8 +58,9 @@ class WorkflowCreatetaxdb {
// Uncomment function in methodsDescriptionText to render in MultiQC report
def citation_text = [
"Tools used in the workflow included:",
"FastQC (Andrews 2010),",
"MultiQC (Ewels et al. 2016)",
params.build_diamond ? "DIAMOND (Buchfink et al. 2015)," : "",
params.build_kaiju ? "Kaiju (Menzel et al. 2016)," : "",
"and MultiQC (Ewels et al. 2016)",
"."
].join(' ').trim()

Expand All @@ -68,11 +69,11 @@ class WorkflowCreatetaxdb {

public static String toolBibliographyText(params) {

// TODO Optionally add bibliographic entries to this list.
// Can use ternary operators to dynamically construct based conditions, e.g. params["run_xyz"] ? "<li>Author (2023) Pub name, Journal, DOI</li>" : "",
// Uncomment function in methodsDescriptionText to render in MultiQC report
def reference_text = [
"<li>Andrews S, (2010) FastQC, URL: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).</li>",
params.build_diamond ? "<li>Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1), 59–60. <a href=\"https://doi.org/10.1038/nmeth.3176\">10.1038/nmeth.3176</a></li>" : "",
params.build_kaiju ? "<li>Menzel, P., Ng, K. L., & Krogh, A. (2016). Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nature Communications, 7, 11257. <a href=\"https://doi.org/10.1038/ncomms11257\">10.1038/ncomms11257</a></li>" : "",
"<li>Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics , 32(19), 3047–3048. doi: /10.1093/bioinformatics/btw354</li>"
].join(' ').trim()

Expand All @@ -93,9 +94,8 @@ class WorkflowCreatetaxdb {
meta["tool_citations"] = ""
meta["tool_bibliography"] = ""

// TODO Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
//meta["tool_citations"] = toolCitationText(params).replaceAll(", \\.", ".").replaceAll("\\. \\.", ".").replaceAll(", \\.", ".")
//meta["tool_bibliography"] = toolBibliographyText(params)
meta["tool_citations"] = toolCitationText(params).replaceAll(", \\.", ".").replaceAll("\\. \\.", ".").replaceAll(", \\.", ".")
meta["tool_bibliography"] = toolBibliographyText(params)


def methods_text = mqc_methods_yaml.text
Expand Down
19 changes: 17 additions & 2 deletions modules.json
Original file line number Diff line number Diff line change
Expand Up @@ -5,14 +5,29 @@
"https://github.com/nf-core/modules.git": {
"modules": {
"nf-core": {
"cat/cat": {
"branch": "master",
"git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
"installed_by": ["modules"]
},
"custom/dumpsoftwareversions": {
"branch": "master",
"git_sha": "bba7e362e4afead70653f84d8700588ea28d0f9e",
"installed_by": ["modules"]
},
"fastqc": {
"diamond/makedb": {
"branch": "master",
"git_sha": "b29f6beb86d1d24d680277fb1a3f4de7b8b8a92c",
"installed_by": ["modules"]
},
"kaiju/mkfmi": {
"branch": "master",
"git_sha": "7365564c402cbd01e9407810730efd10039997a3",
"installed_by": ["modules"]
},
"malt/build": {
"branch": "master",
"git_sha": "65ad3e0b9a4099592e1102e92e10455dc661cf53",
"git_sha": "3f5420aa22e00bd030a2556dfdffc9e164ec0ec5",
"installed_by": ["modules"]
},
"multiqc": {
Expand Down
7 changes: 7 additions & 0 deletions modules/nf-core/cat/cat/environment.yml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading

0 comments on commit 467459b

Please sign in to comment.