When processing full-length 16S data, the ASV table output by makeSequenceTable has an unusually low number of sequences in the sample #1975

M-o-Y-u · 2024-06-21T10:38:31Z

I processed Pacbio full-length 16S sequencing data using DaDa2 v1.30.0
Followed the tutorial all the way to the use dada command to process the de-duplicated data
such as "dd <- dada(drp, err=err, multithread=TRUE)"
The dd obtained is similar to

$ms-1.fastq.gz
dada-class: object describing DADA2 denoising results
229 sequence variants were inferred from 9600 input unique sequences.
Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16

But after I generate the ASV table using "seqtab <- makeSequenceTable(dd)"
Is it normal that the number of all sequences in the column ms-1.fastq.gz in the ASV table adds up to only 659? Could it be that out of 9600 unique sequences, only 659 can be clustered in the obtained ASVs?

Thank you for any replies!

The text was updated successfully, but these errors were encountered:

benjjneb · 2024-06-21T14:58:20Z

I'm not really sure where the number 659 is coming from. Could you generate a table of read numbers through pipeline as shown in the dada2 tutorial section: https://benjjneb.github.io/dada2/tutorial.html#track-reads-through-the-pipeline

Also, what environment is being sampled?

M-o-Y-u · 2024-06-22T02:12:04Z

Following the tutorial, I generated a table of read numbers through pipeline with the following results, but due to the use of single-ended sequencing without “mergers”, it appears from the data that there is a significant loss of sequence at the "dada" step.

I'm not sure if this is normal. All samples were from soil.
Thanks for your reply!

benjjneb · 2024-06-22T02:33:21Z

Can you clarify what the column "drped" represents?

M-o-Y-u · 2024-06-22T02:37:04Z

Can you clarify what the column "drped" represents?

drp <- derepFastq(filt, verbose=TRUE)
De-duplication of filtered files with derepFastq。

benjjneb · 2024-06-22T03:08:55Z

It is stil unclear what that column means. The output written to the console from that command does not describe what is recorded in the table printed above.

M-o-Y-u · 2024-06-22T03:32:13Z

I built the pipeline based on LRASms_fecal.Rmd in LRASManuscript-master.

Here's my script.

The command used to generate the table is

track <- cbind(out, sapply(drp, getN), sapply(dd, getN), rowSums(seqtab), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "drped", "denoised", "seqtab", "nonchim")

I think the drped column represents the number of unique sequences remaining after each sample has gone through sequence de-duplication?

Thank you for your reply！

M-o-Y-u · 2024-06-22T08:50:15Z

I tried removing the derepFastq step and it seems that things don't seem to change.

benjjneb · 2024-06-23T22:27:54Z

The $drp column is important to understand, because it still doesn't make sense and may explain what is going on here. What is your getN function? If it's the one from the dada2 tutorial, then I don't understand how you could get those results, since the number of dereplicated reads and the number of filtered reads should be identical.

In addition, what is the output of

derepFastq("ms-1.fastq.gz") # should print summary to the console

M-o-Y-u · 2024-06-24T00:50:49Z

My get(N) function is the same as the one in the tutorial
getN <- function(x) sum(getUniques(x))
The output of derepFastq("ms-1.fastq.gz") is

benjjneb · 2024-06-26T00:23:05Z

Unfortunately, based on the output of derepFastq("ms-1.fastq.gz"), your data is not suitable for being denoised by DADA2.

The key observation is that there are almost as many unique sequences as their are total reads -- that is, basically every read is a distinct sequence from all the other reads. DADA2 relies on repeated observations of real sequences to discriminate ASVs from noise. This data doesn't have enough repeated observations to make DADA2 appropriate.

Usually the reason for this in long-read data is high error rates. I see you are using PacBio HiFi data, but there is a lot of variation over the years and across instruments/chemistries in the error rate of the data. If it is too low, then there will be multiple errors in most reads, and the data won't be appropriate for denoising by DADA2. It is also possible that you are sequencing a hyper-diverse environment and there are legitimately few to no repeated observations of the same sequence. In my experience that is less common than the too-high-error-rate scenario though.

M-o-Y-u · 2024-06-26T00:57:53Z

Thank you very much for your reply, it explains my confusion, have a nice life!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

When processing full-length 16S data, the ASV table output by makeSequenceTable has an unusually low number of sequences in the sample #1975

When processing full-length 16S data, the ASV table output by makeSequenceTable has an unusually low number of sequences in the sample #1975

M-o-Y-u commented Jun 21, 2024 •

edited

Loading

benjjneb commented Jun 21, 2024

M-o-Y-u commented Jun 22, 2024 •

edited

Loading

benjjneb commented Jun 22, 2024

M-o-Y-u commented Jun 22, 2024

benjjneb commented Jun 22, 2024

M-o-Y-u commented Jun 22, 2024

M-o-Y-u commented Jun 22, 2024

benjjneb commented Jun 23, 2024

M-o-Y-u commented Jun 24, 2024

benjjneb commented Jun 26, 2024

M-o-Y-u commented Jun 26, 2024

When processing full-length 16S data, the ASV table output by makeSequenceTable has an unusually low number of sequences in the sample #1975

When processing full-length 16S data, the ASV table output by makeSequenceTable has an unusually low number of sequences in the sample #1975

Comments

M-o-Y-u commented Jun 21, 2024 • edited Loading

benjjneb commented Jun 21, 2024

M-o-Y-u commented Jun 22, 2024 • edited Loading

benjjneb commented Jun 22, 2024

M-o-Y-u commented Jun 22, 2024

benjjneb commented Jun 22, 2024

M-o-Y-u commented Jun 22, 2024

M-o-Y-u commented Jun 22, 2024

benjjneb commented Jun 23, 2024

M-o-Y-u commented Jun 24, 2024

benjjneb commented Jun 26, 2024

M-o-Y-u commented Jun 26, 2024

M-o-Y-u commented Jun 21, 2024 •

edited

Loading

M-o-Y-u commented Jun 22, 2024 •

edited

Loading