Is there any way to further speed up dada? #1976

Open

constructivedio opened this issue Jun 27, 2024 · 3 comments

constructivedio commented Jun 27, 2024

Hi and thanks for providing this amazing tool.

I am currently running dada on a set of very deeply sequenced samples. Around 3-6M 300 bp NextSeq reads per sample remain after filtering.

What I'm currently doing:

  1. As you previously suggested here, I've sped up the learnErrors process by subsampling my samples (using 10% of the reads), and it worked; a rough sketch of that subsampling is included after this list.

  2. I then parallelise the sample inference step for each sample. I am using a 96-CPU machine, so I give each sample 8 CPUs, and I can see the jobs are using ~30-60 GB of memory each. However, this step has been running for about 16 hours and not even the forward reads have finished processing.
    # sample inference on a single sample's forward and reverse reads
    sam <- sample.names[1]
    ddF <- dada(filtFs, err=errF, multithread=TRUE)
    ddR <- dada(filtRs, err=errR, multithread=TRUE)
    merger <- mergePairs(ddF, filtFs, ddR, filtRs, verbose=TRUE)
    (filtFs and filtRs point to only one file each.)
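
For reference, a rough sketch of the step-1 subsampling (the file names and the n value are placeholders, and this assumes the ShortRead package that dada2 already imports; the subsampled files are only used for learnErrors, not for the sample inference itself):

    library(dada2)
    library(ShortRead)
    set.seed(100)

    # Draw roughly 10% of one filtered fastq (n is a placeholder; pick ~10% of
    # that sample's read count) and write it to a separate file.
    sampler <- FastqSampler("filtered/sample1_F_filt.fastq.gz", n = 5e5)
    sub <- yield(sampler)
    close(sampler)
    writeFastq(sub, "subsampled/sample1_F_sub.fastq.gz", compress = TRUE)

    # Learn the error model from the subsampled file(s) only; the full files
    # are still used for dada().
    errF <- learnErrors("subsampled/sample1_F_sub.fastq.gz", multithread = TRUE)

An alternative that avoids writing extra files is learnErrors's nbases argument, which caps how much data is read in for error-rate learning.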

The number of unique sequences differs from sample to sample, but the (unique_sequences/reads) ratio is always around 0.25:
Sample 1 - 4908635 reads in 1163375 unique sequences.
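
For context, those "reads in unique sequences" numbers can be reproduced by dereplicating a filtered file; a minimal check of the ratio for one file (the file name is a placeholder, and derepFastq holds the whole file in memory, so it is itself slow on samples this deep):

    # Dereplicate one filtered file and compute unique/total.
    drp <- derepFastq("filtered/sample1_F_filt.fastq.gz", verbose = TRUE)
    length(drp$uniques) / sum(drp$uniques)   # ~0.25 for these samples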

Adapters have been removed and reads have been trimmed using both fastp and dada2's filterAndTrim:
filterAndTrim(forward_files, filtFs, reverse_files, filtRs, trimRight=c(10,10), maxN=0, truncQ=2, maxEE=c(2,6), truncLen=c(230, 230), rm.phix=TRUE, compress=TRUE, verbose=TRUE, multithread=TRUE)

Do you have any suggestions on how to speed this up?

Thank you so much

benjjneb (Owner) commented:

Extremely deep samples are the most difficult for DADA2 performance-wise, so there is no silver-bullet fix. Some things that can help are being more stringent about filtering (this reduces the number of unique sequences in the data by removing reads more likely to contain errors) and truncating reads to be shorter (which reduces alignment time). My experience suggests that your data is tractable; I have run dada2 on data with 1-2M unique sequences, but it does take a while.
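
As an illustration only (the cutoffs below are assumptions, not values from this thread, and the right numbers depend on the amplicon length and the quality profiles), a more stringent version of the filterAndTrim call above might look like:

    # Tighter maxEE and shorter truncLen than the original call; make sure the
    # truncated forward and reverse reads still overlap enough to merge.
    filterAndTrim(forward_files, filtFs, reverse_files, filtRs,
                  trimRight=c(10,10), maxN=0, truncQ=2, maxEE=c(1,2),
                  truncLen=c(220,200), rm.phix=TRUE, compress=TRUE,
                  verbose=TRUE, multithread=TRUE)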

constructivedio (Author) commented Jun 27, 2024

Thanks so much for your prompt response! I’ll play a bit more with filtering.

Would splitting a sample's reads into subsamples, running dada on each subsample, and then summing the counts across the subsamples help, do you think? And if so, roughly how many reads should I aim for per subsample?
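
For concreteness, a minimal sketch of that split-and-sum idea on one sample's forward reads (the file names and three-way split are placeholders, and in practice each chunk's forward and reverse reads would be denoised and merged before tabulating; see the caveat about rare variants in the reply below):

    # Denoise each pre-split chunk separately, then sum counts for identical
    # sequences across chunks with mergeSequenceTables(repeats="sum").
    sub_files <- c("sample1_F_part1.fastq.gz", "sample1_F_part2.fastq.gz",
                   "sample1_F_part3.fastq.gz")
    dd_list <- lapply(sub_files, function(f) dada(f, err=errF, multithread=TRUE))
    seqtabs <- lapply(dd_list, function(dd) makeSequenceTable(setNames(list(dd), "sample1")))
    seqtab  <- mergeSequenceTables(tables=seqtabs, repeats="sum")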

Once again thanks for taking the time to help!

benjjneb (Owner) commented:

Would splitting a sample's reads into subsamples, running dada on each subsample, and then summing the counts across the subsamples help, do you think?

Yes, this would speed things up, but it isn't recommended, because you are reducing your ability to detect rare variants in each split. The better approach is to crank up the quality filtering -- throwing away lower-quality reads to reduce the overall read count (or, more importantly for DADA2 computation time, the number of unique sequences in the data) is a win-win.
