
Add first two modules (diamond and kaiju), missing docs and tests #14

Merged
merged 22 commits from input-validation into dev on Jan 5, 2024

Conversation

@jfy133 jfy133 (Member) commented Dec 5, 2023

This is more of a PoC to draft rough structure. Subject to change during development.

Note: input validation was initially not working correctly, as we only need to require one of either fasta_dna or fasta_aa (or both), but if neither was supplied the pipeline just ran with empty lists :( Now working thanks to @mirpedrol :D
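For reference, "require at least one of two parameters" can be expressed in `nextflow_schema.json` with JSON Schema's `anyOf` keyword; a minimal sketch (illustrative only, not necessarily the exact fix applied in this PR):

```json
{
  "anyOf": [
    { "required": ["fasta_dna"] },
    { "required": ["fasta_aa"] }
  ]
}
```

With this constraint, validation fails when neither parameter is supplied, and passes when either or both are given.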

Adds:

  • Input schema sheet draft
  • DIAMOND database building and nf-tests
  • Kaiju database building and nf-tests :)

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs?
  • If necessary, also make a PR on the nf-core/createtaxdb branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

This was linked to issues Dec 9, 2023
github-actions bot commented Dec 14, 2023

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit a317cbe

  • ✅ 160 tests passed
  • ❔ 1 test was ignored
  • ❗ 21 tests had warnings

❗ Test warnings:

  • readme - README contains the placeholder zenodo.XXXXXXX. This should be replaced with the zenodo doi (after the first release).
  • pipeline_todos - TODO string in nextflow.config: Specify your pipeline's command line flags
  • pipeline_todos - TODO string in main.nf: Remove this line if you don't need a FASTA file
  • pipeline_todos - TODO string in README.md: TODO nf-core:
  • pipeline_todos - TODO string in README.md: Include a figure that guides the user through the major workflow steps. Many nf-core
  • pipeline_todos - TODO string in README.md: Fill in short bullet-pointed list of the default steps in the pipeline
  • pipeline_todos - TODO string in README.md: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
  • pipeline_todos - TODO string in README.md: update the following command to include all required parameters for a minimal example
  • pipeline_todos - TODO string in README.md: If applicable, make list of people who have also contributed
  • pipeline_todos - TODO string in README.md: Add citation for pipeline after first release. Uncomment lines below and update Zenodo doi and badge at the top of this file.
  • pipeline_todos - TODO string in README.md: Add bibliography of tools and data used in your pipeline
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in usage.md: Add documentation about anything specific to running your pipeline. For general topics, please point to (and add to) the main nf-core website.
  • pipeline_todos - TODO string in output.md: Write this documentation describing your workflow's output
  • pipeline_todos - TODO string in awsfulltest.yml: You can customise AWS full pipeline tests as required
  • pipeline_todos - TODO string in WorkflowCreatetaxdb.groovy: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in WorkflowMain.groovy: Add Zenodo DOI for pipeline after first release
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes
  • pipeline_todos - TODO string in base.config: Customise requirements for specific processes.
  • pipeline_todos - TODO string in test_full.config: Specify the paths to your full test data ( on nf-core/test-datasets or directly in repositories, e.g. SRA)
  • pipeline_todos - TODO string in test_full.config: Give any required params for the test so that command line flags are not needed


Run details

  • nf-core/tools version 2.11.1
  • Run at 2024-01-05 16:36:02

@jfy133 jfy133 requested a review from mashehu December 14, 2023 12:14
@LilyAnderssonLee LilyAnderssonLee self-requested a review December 18, 2023 09:57
@Joon-Klaps Joon-Klaps (Contributor) left a comment

Excited to have this pipeline, so happy to contribute!
Small suggestions you might have missed


![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png)
The `dmnd` file can be given to one of the DIAMOND alignment commands with `diamond blast<x/p> -d <your_database>.dmnd` etc.
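For context, a typical round trip with DIAMOND looks like the following (a sketch with placeholder file names; `makedb`, `blastx`, and `blastp` are the documented DIAMOND subcommands):

```shell
# Build a DIAMOND protein database (writes my_database.dmnd)
diamond makedb --in proteins.fasta -d my_database

# Align DNA reads against it with blastx (use blastp for protein queries)
diamond blastx -d my_database.dmnd -q reads.fasta -o matches.tsv
```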
Contributor:

Wouldn't it be cool if we could dynamically extract from nf-co.re the nf-core pipelines that require or use the databases built by this module?

Member Author:

Not sure I exactly follow, but we can already see this on the modules page. For example, if you search for DIAMOND here:

https://nf-co.re/modules

you can see (screenshot of the modules search results omitted) that taxprofiler is using the DIAMOND_BLASTX module.

Contributor:

Yes, that's exactly what I mean, but then having it in the README or description of the pipeline, e.g.:
Output of Kraken can be used in: taxprofiler, MAG, viralrecon, ...

@Midnighter Midnighter left a comment

I know this is very early work. I'm fine with merging as is but I had two concerns that I propose to change in the long run:

  1. There are two separate input options for nodesdmp and namesdmp. I would expect a path/tar of a directory with those files (possibly containing more of the dump files).
  2. I would create one subworkflow per tool that the main createtaxdb workflow calls to.

@jfy133 jfy133 (Member Author) commented Jan 2, 2024

I know this is very early work. I'm fine with merging as is but I had two concerns that I propose to change in the long run:

Do you mind giving me an approval so I can merge this in and follow up depending on your feedback on my replies below? Then I can start doing more parallel PRs to add the rest.

  1. There are two separate input options for nodesdmp and namesdmp. I would expect a path/tar of a directory with those files (possibly containing more of the dump files).

Is this expectation based purely on NCBI taxdump files? I had kept them separate because people who want to make custom databases may include customised dump files, in which case I don't see why one would necessarily re-tar them...

That said, it is reasonable to expect someone may want to do that... I may consider adding it as an option (if one gives the taxdump as a tar it'll auto-extract), but I vaguely remember that plucking specific files from a directory is not directly trivial with Nextflow.

But I would make this a separate issue, as it is separate functionality (including maybe auto-downloading taxdump files, but I'm not sure yet how to do this properly, e.g. with the gtdbtk taxdump stuff etc.)

  2. I would create one subworkflow per tool that the main createtaxdb workflow calls to.

Yes, that's my plan for multi-module database construction commands (e.g. kraken2). Or do you have a motivation for doing this for single-build modules too?

@Midnighter

Do you mind if you give me an approval then I can merge this in and follow up depending on feedback of my status below?

I didn't approve due to the open comments. Do you plan to address them or see them as irrelevant?

Is expectation this based purely on ncbi taxdump files? I currently had kept it separate because if people want to make custom databases that may include customised dump files whereby I don't see why one would necessarily re-tar...

That said it is reasonable to expect someone may want to do that... I may consider adding it as an option (if one gives the taxdump as tar it'll auto extract - but I vaguely remember plucking specific files from a directory is not directly trivial with Nxf).

My expectation is based on taxonkit usage, yes. With a custom taxonomy I would just pass a path to a directory and then expect {dir}/nodes.dmp and {dir}/names.dmp to exist.
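The expected layout described above can be sketched like this (hypothetical; the file names follow the standard NCBI taxdump convention, and the files here are empty stand-ins):

```shell
# Simulate the expected single-directory layout for a custom taxonomy
mkdir -p taxdump
touch taxdump/nodes.dmp taxdump/names.dmp

# A pipeline could then resolve both dump files from one input path
test -f taxdump/nodes.dmp && test -f taxdump/names.dmp && echo "taxonomy layout ok"
```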

Yes that's my plan for multi-module database construction commands (e.g. kraken2), or do you have a motivation to do this for single build modules too?

No, I don't see a reason for single modules. I didn't read the file properly and somehow thought the input channel transformation was Kaiju-specific.

@jfy133 jfy133 (Member Author) commented Jan 2, 2024

Do you mind if you give me an approval then I can merge this in and follow up depending on feedback of my status below?

I didn't approve due to the open comments. Do you plan to address them or see them as irrelevant?

Hm, I thought I had addressed them 🤔, but I see now there is no commit. Maybe I didn't push...

Is expectation this based purely on ncbi taxdump files? I currently had kept it separate because if people want to make custom databases that may include customised dump files whereby I don't see why one would necessarily re-tar...

That said it is reasonable to expect someone may want to do that... I may consider adding it as an option (if one gives the taxdump as tar it'll auto extract - but I vaguely remember plucking specific files from a directory is not directly trivial with Nxf).

My expectation is based on taxonkit usage, yes. With a custom taxonomy I would just pass a path to a directory and then expect {dir}/nodes.dmp and {dir}/names.dmp to exist.

Hrm, ok. I'll maybe look in more detail at taxonkit and try to use that as a structure to follow. But I think I would still do that as a follow-up PR (because it should be quite straightforward to just change how those files are picked up and passed to the module).

Yes that's my plan for multi-module database construction commands (e.g. kraken2), or do you have a motivation to do this for single build modules too?

No, I don't see a reason for single modules. I didn't read the file properly and somehow thought the input channel transformation was Kaiju-specific.

Ah ok! Then I leave that bit as is for now at least.

@Midnighter Midnighter left a comment
Provided comments are addressed, this looks like a good start.

@jfy133 jfy133 (Member Author) commented Jan 5, 2024

OK thank you for the reviews @maxulysse @Joon-Klaps @Midnighter !

I've addressed all the suggestions now (except for the larger one from @Midnighter regarding the taxdump, but I'll do that as a follow-up), so I will merge so that others can start to get involved; any other unaddressed changes can be dealt with in a follow-up :)

@jfy133 jfy133 merged commit 467459b into dev Jan 5, 2024
7 checks passed
@jfy133 jfy133 deleted the input-validation branch January 5, 2024 16:38
Development

Successfully merging this pull request may close these issues.

  • Add Kaiju database build
  • Add DIAMOND database build
5 participants