Potential problem with demultiplexing and proposed solution #613
Labels
enhancement
New feature or request
Comments
Hm, maybe it would be possible to:
Admittedly, I have no idea how the last two steps work; whether that is possible needs to be tested with channel operators. If not, my whole idea is of course not feasible.
Description of feature
The early steps in the pipeline run multiple samples in parallel (e.g., if there are 10 samples, fastqc gets run 10 times). This is the normal way of doing things in nf-core and I think it's great in lots of situations; however, I don't think it is an efficient way of handling demultiplexing (at least with Cutadapt). If you have 100 samples and two large fastq files, running Cutadapt 100 times might not be the most efficient use of memory. You would also create lots of redundant data: for each sample, the reads belonging to all the other samples get placed in unknown files, and you can quickly fill up storage space.
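To put rough numbers on the storage concern, here is a minimal sketch. It assumes every per-sample run writes all reads that do not match its own barcode to an unknown file; the function name and the read counts are made up for illustration.

```python
def unassigned_reads_written(total_reads, reads_per_sample):
    """Reads written to 'unknown' files when Cutadapt runs once per sample.

    Each run keeps only its own sample's reads, so it writes
    total_reads - own_reads to its unknown file.
    """
    return sum(total_reads - own for own in reads_per_sample)


total_reads = 1_000_000           # hypothetical input size
reads_per_sample = [10_000] * 100  # hypothetical: 100 samples, evenly split

# One run per sample: every run re-writes the other 99 samples' reads.
per_sample_runs = unassigned_reads_written(total_reads, reads_per_sample)

# One run for all samples: only reads matching no barcode are unassigned.
single_run = total_reads - sum(reads_per_sample)
```

Under these assumptions the per-sample approach writes 99 times the input to unknown files, while a single combined run writes only the reads that are truly unassigned.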
One lazy solution to the last problem is to just delete those unknown files, but I imagine people might want the option to look at the reads that didn't get assigned.
My solution is to run Cutadapt on all samples at the same time. I also have a module that creates the input files Cutadapt needs to be used this way. I use an mmv module to rename the output files so that they contain their sample names, then create a new sample sheet with the original sample sheet data plus the paths to the new fastq files. Lastly, I run the new sample sheet through nf-core's samplesheet_check module (I believe Ampliseq uses parse_input instead), which outputs a channel in the [ meta, fastqs ] format that's compatible with other nf-core modules (e.g., fastqc). It feels a bit hacky to create a new sample sheet in the pipeline and then re-use the samplesheet_check module, but that was the easiest solution I found, and I lack the Nextflow experience to come up with a better one.
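The two helper steps described above (preparing Cutadapt's input and writing the extended sample sheet) could look roughly like the sketch below. This is not the actual module: the column names `sample`, `barcode`, and `fastq_1`, the `demux` output directory, and the file naming are all assumptions for illustration, and it covers single-end reads only.

```python
import csv
import io


def barcodes_to_fasta(rows):
    """Build a barcode FASTA: one record per sample, named after the sample."""
    return "".join(f">{row['sample']}\n{row['barcode']}\n" for row in rows)


def add_fastq_paths(rows, outdir):
    """Return copies of the rows with the expected demultiplexed FASTQ path added."""
    return [{**row, "fastq_1": f"{outdir}/{row['sample']}.fastq.gz"} for row in rows]


def to_csv(rows):
    """Serialise the rows as CSV text, header first."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical sample sheet rows.
samples = [
    {"sample": "S1", "barcode": "ACGTAC"},
    {"sample": "S2", "barcode": "TGCATG"},
]

fasta = barcodes_to_fasta(samples)
new_sheet = to_csv(add_fastq_paths(samples, "demux"))

# The FASTA would then feed a single Cutadapt run over all samples, e.g.
#   cutadapt -g file:barcodes.fasta -o "demux/{name}.fastq.gz" reads.fastq.gz
# where Cutadapt replaces {name} with the name of the matching FASTA record.
```

If the FASTA record names are the sample names, the output files already carry them; whether a separate renaming step is still needed depends on how the barcodes are named in practice.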