Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When processing full-length 16S data, the ASV table output by makeSequenceTable has an unusually low number of sequences in the sample #1975

Open
M-o-Y-u opened this issue Jun 21, 2024 · 11 comments

Comments

@M-o-Y-u
Copy link

M-o-Y-u commented Jun 21, 2024

I processed Pacbio full-length 16S sequencing data using DaDa2 v1.30.0
Followed the tutorial all the way to the use dada command to process the de-duplicated data
such as "dd <- dada(drp, err=err, multithread=TRUE)"
The dd obtained is similar to

$ms-1.fastq.gz
dada-class: object describing DADA2 denoising results
229 sequence variants were inferred from 9600 input unique sequences.
Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

But after I generate the ASV table using "seqtab <- makeSequenceTable(dd)"
Is it normal that the number of all sequences in the column ms-1.fastq.gz in the ASV table adds up to only 659? Could it be that out of 9600 unique sequences, only 659 can be clustered in the obtained ASVs?

Thank you for any replies!

@benjjneb
Copy link
Owner

I'm not really sure where the number 659 is coming from. Could you generate a table of read numbers through pipeline as shown in the dada2 tutorial section: https://benjjneb.github.io/dada2/tutorial.html#track-reads-through-the-pipeline

Also, what environment is being sampled?

@M-o-Y-u
Copy link
Author

M-o-Y-u commented Jun 22, 2024

Following the tutorial, I generated a table of read numbers through pipeline with the following results, but due to the use of single-ended sequencing without “mergers”, it appears from the data that there is a significant loss of sequence at the "dada" step.
image

I'm not sure if this is normal. All samples were from soil.
Thanks for your reply!

@benjjneb
Copy link
Owner

Can you clarify what the column "drped" represents?

@M-o-Y-u
Copy link
Author

M-o-Y-u commented Jun 22, 2024

Can you clarify what the column "drped" represents?

drp <- derepFastq(filt, verbose=TRUE)
De-duplication of filtered files with derepFastq。

@benjjneb
Copy link
Owner

It is stil unclear what that column means. The output written to the console from that command does not describe what is recorded in the table printed above.

@M-o-Y-u
Copy link
Author

M-o-Y-u commented Jun 22, 2024

I built the pipeline based on LRASms_fecal.Rmd in LRASManuscript-master.
image
image

Here's my script.
image

The command used to generate the table is

track <- cbind(out, sapply(drp, getN), sapply(dd, getN), rowSums(seqtab), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "drped", "denoised", "seqtab", "nonchim")

I think the drped column represents the number of unique sequences remaining after each sample has gone through sequence de-duplication?

Thank you for your reply!

@M-o-Y-u
Copy link
Author

M-o-Y-u commented Jun 22, 2024

image
I tried removing the derepFastq step and it seems that things don't seem to change.

@benjjneb
Copy link
Owner

The $drp column is important to understand, because it still doesn't make sense and may explain what is going on here. What is your getN function? If it's the one from the dada2 tutorial, then I don't understand how you could get those results, since the number of dereplicated reads and the number of filtered reads should be identical.

In addition, what is the output of

derepFastq("ms-1.fastq.gz") # should print summary to the console

@M-o-Y-u
Copy link
Author

M-o-Y-u commented Jun 24, 2024

My get(N) function is the same as the one in the tutorial
getN <- function(x) sum(getUniques(x))
The output of derepFastq("ms-1.fastq.gz") is
image

@benjjneb
Copy link
Owner

Unfortunately, based on the output of derepFastq("ms-1.fastq.gz"), your data is not suitable for being denoised by DADA2.

The key observation is that there are almost as many unique sequences as their are total reads -- that is, basically every read is a distinct sequence from all the other reads. DADA2 relies on repeated observations of real sequences to discriminate ASVs from noise. This data doesn't have enough repeated observations to make DADA2 appropriate.

Usually the reason for this in long-read data is high error rates. I see you are using PacBio HiFi data, but there is a lot of variation over the years and across instruments/chemistries in the error rate of the data. If it is too low, then there will be multiple errors in most reads, and the data won't be appropriate for denoising by DADA2. It is also possible that you are sequencing a hyper-diverse environment and there are legitimately few to no repeated observations of the same sequence. In my experience that is less common than the too-high-error-rate scenario though.

@M-o-Y-u
Copy link
Author

M-o-Y-u commented Jun 26, 2024

Thank you very much for your reply, it explains my confusion, have a nice life!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants