-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
When processing full-length 16S data, the ASV table output by makeSequenceTable has an unusually low number of sequences in the sample #1975
Comments
I'm not really sure where the number 659 is coming from. Could you generate a table of read numbers through pipeline as shown in the dada2 tutorial section: https://benjjneb.github.io/dada2/tutorial.html#track-reads-through-the-pipeline Also, what environment is being sampled? |
Can you clarify what the column "drped" represents? |
drp <- derepFastq(filt, verbose=TRUE) |
It is stil unclear what that column means. The output written to the console from that command does not describe what is recorded in the table printed above. |
The In addition, what is the output of
|
Unfortunately, based on the output of The key observation is that there are almost as many unique sequences as their are total reads -- that is, basically every read is a distinct sequence from all the other reads. DADA2 relies on repeated observations of real sequences to discriminate ASVs from noise. This data doesn't have enough repeated observations to make DADA2 appropriate. Usually the reason for this in long-read data is high error rates. I see you are using PacBio HiFi data, but there is a lot of variation over the years and across instruments/chemistries in the error rate of the data. If it is too low, then there will be multiple errors in most reads, and the data won't be appropriate for denoising by DADA2. It is also possible that you are sequencing a hyper-diverse environment and there are legitimately few to no repeated observations of the same sequence. In my experience that is less common than the too-high-error-rate scenario though. |
Thank you very much for your reply, it explains my confusion, have a nice life! |
I processed Pacbio full-length 16S sequencing data using DaDa2 v1.30.0
Followed the tutorial all the way to the use dada command to process the de-duplicated data
such as "dd <- dada(drp, err=err, multithread=TRUE)"
The dd obtained is similar to
$
ms-1.fastq.gz
dada-class: object describing DADA2 denoising results
229 sequence variants were inferred from 9600 input unique sequences.
Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
But after I generate the ASV table using "seqtab <- makeSequenceTable(dd)"
Is it normal that the number of all sequences in the column ms-1.fastq.gz in the ASV table adds up to only 659? Could it be that out of 9600 unique sequences, only 659 can be clustered in the obtained ASVs?
Thank you for any replies!
The text was updated successfully, but these errors were encountered: