Mapping samples to multiple reference fasta #181
I think this would be a nice addition, particularly for population (epi)genetics. We need to think about how this could work together with iGenomes, though.
I have done something similar in the nanoseq pipeline. Check out the format of the input samplesheet there. The important thing is to have a validation script for the samplesheet that checks the entries provided by users. I wrote a function to resolve the iGenomes per sample if provided. Note, you will also have to configure the channels to only build genome indices once if multiple samples share the same reference. It will be tricky but not impossible 🙂
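nf-core pipelines typically validate samplesheets with a small Python script before anything runs. As a hedged sketch of the kind of check described above (the column names, including the per-sample `genome` field, are assumptions for illustration, not the actual nanoseq script):

```python
import csv

REQUIRED = ["sample", "fastq_1", "fastq_2", "genome"]  # assumed column names


def validate_samplesheet(path):
    """Check required columns and that every sample names a reference."""
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
        if missing:
            raise ValueError(f"Missing columns: {missing}")
        rows = []
        # Data starts on line 2 of the file, after the header.
        for lineno, row in enumerate(reader, start=2):
            if not (row["sample"] or "").strip():
                raise ValueError(f"Line {lineno}: empty sample id")
            if not (row["genome"] or "").strip():
                raise ValueError(
                    f"Line {lineno}: no genome given for sample {row['sample']!r}"
                )
            rows.append(row)
    return rows
```

The validated rows can then be turned into the per-sample channels discussed below; failing fast here is what keeps the channel wiring simple later.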
this is inspired by the functionality in nf-core/nanoseq and nf-core/rnaseq The idea is to require a samplesheet to run the pipeline, which will allow for single/paired end auto-detection and mapping samples against different reference genomes. addresses nf-core#181
There is now a dsl2 branch with this functionality. Please give it a try 👍
NB: This was removed from the dsl2 branch so that we could work with centralised modules in nf-core/modules. However, I then wrote … Now it just needs implementing again.
Hey all, thanks so much for all your hard work on the DSL2 pipeline!

```groovy
input = Channel.of(
    ["homo_sapiens", [fastq_1, fastq_2]],
    ["homo_sapiens", [fastq_1, fastq_2]],
    ["mus_musculus", [fastq_1, fastq_2]]
)
refs = Channel.of(["homo_sapiens", path/to/ref], ["mus_musculus", path/to/ref])

input
    .combine(refs, by: 0) // "homo_sapiens", [fastq_1, fastq_2], path/to/ref
    .multiMap { it ->
        fastqs: it[0..1]
        refs: it[2]
    }
    .set { for_align }

aligner(for_align.fastqs, for_align.refs)
```

I've had success using this in pipelines that I've written for personal use.
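For readers less familiar with these operators: `combine(refs, by: 0)` pairs each item in one channel with every item in the other channel that shares the same first element, and `multiMap` then splits each joined tuple into named, index-aligned outputs. A rough plain-Python simulation of those semantics (illustrative only, not Nextflow; file names are made up):

```python
def combine_by_key(left, right):
    """Mimic Nextflow's combine(by: 0): join tuples on their first element."""
    out = []
    for key, value in left:
        for rkey, ref in right:
            if rkey == key:
                out.append((key, value, ref))
    return out


input_ch = [
    ("homo_sapiens", ["a_1.fq", "a_2.fq"]),
    ("homo_sapiens", ["b_1.fq", "b_2.fq"]),
    ("mus_musculus", ["c_1.fq", "c_2.fq"]),
]
refs_ch = [("homo_sapiens", "hs.fa"), ("mus_musculus", "mm.fa")]

joined = combine_by_key(input_ch, refs_ch)

# Split like multiMap: the two lists stay index-aligned, which is what
# "synchronized channels" means when both are fed to the aligner.
fastqs = [(key, reads) for key, reads, _ in joined]
refs = [ref for _, _, ref in joined]
```

The key point is that after the join, element *i* of `fastqs` always belongs with element *i* of `refs`, so the two-channel module input is safe to use.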
It would be great to get this issue going. For my population epigenomics project, I will have an individual genome for each sample, so I would prioritize providing paths to FASTA files in the samplesheet. Second, do I understand correctly that you would like to keep the nf-core modules untouched so they stay available for general use? @njspix do you have a full repository of the pipeline with your additions available, and would you mind sharing it?
Hey @mobilegenome thanks for your interest! Unfortunately I don't have a lot to contribute - most of the previous work on this (I think) was done with bespoke/custom modules, which we prefer to avoid (especially for core modules like alignment, etc). The core issue here is that the aligner modules take the fastq files as one channel and the index/reference in a separate channel. Nextflow ordinarily doesn't care much about the order of items in a channel, so it's difficult to sync up two channels (e.g. a 'fastq' channel with files from species a, b, and c; and a 'reference' channel with indexes for species a, b, and c). To overcome this issue, there are at least 2 ideas -
Phil wrote the … Solution (2) is really untested in a real-life pipeline, but would avoid having to patch a stock module. That's all I have - sorry I can't devote more time to this right now!
So is this:

…

or

…

The first, I think, is going to be fundamentally incompatible with nf-core modules for now (see nextflow-io/nextflow#2085). Nextflow doesn't join across metas, and there's just no way to link between tuples like you'd expect. The second option might be pretty easy. In either case it might just be easier to run the pipeline twice, but I understand the desire to do it in one go.
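To make the "Nextflow doesn't join across metas" point concrete, here is a plain-Python analogy (all names are illustrative): two channels of `(meta, file)` tuples can't be linked by comparing the meta maps themselves, because the maps typically carry different fields in each channel; you have to extract an explicit key.

```python
reads_ch = [
    ({"id": "s1", "single_end": False}, "s1.fq"),
    ({"id": "s2", "single_end": True}, "s2.fq"),
]
index_ch = [
    ({"id": "s2"}, "mm.idx"),  # meta differs: no "single_end" field here,
    ({"id": "s1"}, "hs.idx"),  # and the order doesn't match reads_ch either
]

# Whole-meta equality fails: {"id": "s1"} != {"id": "s1", "single_end": False},
# so a naive positional or whole-tuple join would mis-pair samples.
# Joining on an explicit key recovers the intended pairing:
index_by_id = {meta["id"]: idx for meta, idx in index_ch}
paired = [(meta, fq, index_by_id[meta["id"]]) for meta, fq in reads_ch]
```

This is the same reason the single-tuple `tuple val(meta), path(reads), path(index)` pattern discussed below is safer than two parallel `tuple val(meta), ...` inputs.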
Hey @emiller88 could you clarify why 1) is incompatible with nf-core modules? I'm not following you 100%.
It's a rabbit hole, I'll try to keep this short 😆 So usually there's a pattern like:

```groovy
input:
tuple val(meta), path(reads)
path index
```

So you'd think you could do:

```groovy
input:
tuple val(meta), path(reads)
tuple val(meta), path(index)
```

And it would give you a … So you really need:

```groovy
input:
tuple val(meta), path(reads), path(index)
```

To ensure you're matching the index with the proper reads. But it starts to get really messy quickly. I guess we could …
Ok - thank you so much for clarifying! It's a bit magical (and I don't mean that in a good way), but it allows you to use the pattern as-is by synchronizing the two input channels. In essence, you … Let me know if I'm missing the bus here! Thanks much for your work on this.
Ah, I missed that, that's pretty cool! I guess we can just add it to a meta map, but in the methylseq pipeline I don't actually see any other use of the reference besides alignment? I just searched for "fasta", "index", and "genome", and it only looks like it's used in the first few steps. So that would be pretty easy to maintain!
Yeah, it's a pretty nifty feature! I'm not sure about … It's a bit inelegant, but my first attempt would probably be something like this for each process requiring a reference:

…

If that's too clunky, I wonder if we could abstract the repeated code out into a function...
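One hedged guess at what such a per-process reference selection could look like, written as plain Python rather than Nextflow (the reference table, field names, and helper name are all assumptions, not anything from the methylseq codebase):

```python
# Hypothetical per-genome reference table; in a real pipeline this would come
# from iGenomes config or user-supplied paths.
REFERENCES = {
    "GRCh38": {"fasta": "GRCh38.fa", "index": "GRCh38.bt2"},
    "GRCm38": {"fasta": "GRCm38.fa", "index": "GRCm38.bt2"},
}


def with_reference(meta, reads, references=REFERENCES):
    """Attach the right fasta/index to a sample based on meta['genome'].

    Returns the flat (meta, reads, fasta, index) tuple shape the thread
    converged on, so each process receives its matching reference.
    """
    genome = meta.get("genome")
    if genome not in references:
        raise KeyError(f"No reference configured for genome {genome!r}")
    ref = references[genome]
    return (meta, reads, ref["fasta"], ref["index"])
```

Repeating a call like this before each reference-consuming process is the "clunky" part; wrapping it in one shared function (as suggested above) keeps the duplication down.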
Hi all, thanks for your work, and sorry for not getting back to your earlier answer, @njspix! Unfortunately, I couldn't spend any time on this so far. Regarding the use of the reference/index in steps following the alignment: it is also used in the …
Good to know, thanks very much!
Hi,

I was thinking of adapting this pipeline to take care of multiple reference genomes, so that each sample would be aligned to a different reference FASTA file. Any votes for such a feature?

The input, if given, would be a CSV file with sample_id, path_to_reference_fasta, path_to_sample_fastq.

Cheers,
Rahul
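As a rough sketch of how such a CSV could be consumed (column names taken from the proposal above; the grouping step reflects the earlier note that a genome index should only be built once when several samples share a reference):

```python
import csv
from collections import defaultdict


def samples_per_reference(samplesheet):
    """Group samples by reference fasta so each index is built only once."""
    groups = defaultdict(list)
    with open(samplesheet, newline="") as fh:
        for row in csv.DictReader(fh):
            groups[row["path_to_reference_fasta"]].append(
                (row["sample_id"], row["path_to_sample_fastq"])
            )
    return dict(groups)
```

Each key of the returned dict would drive one index-building task, with its value list fanning out into per-sample alignment jobs against that index.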