PacBio Revio Kinnex data #1892

Open
cjfields opened this issue Feb 20, 2024 · 13 comments

@cjfields

cjfields commented Feb 20, 2024

We're just getting in our first standard 16S data from the PacBio Revio using the Kinnex kit, and will soon be getting 16S-ITS-23S data. I know the FASTQ data are now binned (similar to NovaSeq, I believe). Is anyone aware of issues with these? I'm mainly concerned about problems like the NovaSeq binning issues documented elsewhere.

I'm also wondering whether the kit itself (which concatenates the reads prior to sequencing, and then splits the resulting reads after sequencing based on PacBio adapters) may also affect the error profile.
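As a quick way to check whether a run is binned (a minimal base-R sketch, separate from the workflow; the quality string here is a toy example, not real Revio data): decode a Phred+33 quality line and count the distinct scores. Binned data will show only a handful of values.

```r
# Sketch: how many distinct Phred scores does a read carry?
# Binned data (NovaSeq, Revio) collapses to a few values; toy quality string below.
qline <- "II5I(5II"                          # Phred+33-encoded qualities
phred <- as.integer(charToRaw(qline)) - 33L  # ASCII code minus 33 = Q score
sort(unique(phred))                          # distinct Q scores present
```

Running this over the quality lines of a real FASTQ (e.g. via `ShortRead::readFastq`) shows the full set of bins the instrument emitted.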

@cjfields
Author

It does appear there are issues with learnErrors, at least when using our workflow. We run this script:

filts <- list.files('.', pattern="filtered.fastq.gz", full.names = TRUE)
sample.namesF <- sapply(strsplit(basename(filts), "_"), `[`, 1) # Assumes filename = samplename_XXX.fastq.gz
set.seed(100)

# Learn error rates
errs <- learnErrors(filts, 
    errorEstimationFunction=PacBioErrfun, 
    BAND_SIZE=32, 
    multithread=4)

pdf(paste0("R1",".err.pdf"))
plotErrors(errs, nominalQ=TRUE)
dev.off()

The error appears to be due to the binning. From our Nextflow output:

Command exit status:
  140

Command output:
  213565386 total bases in 147764 reads from 1 samples will be used for learning the error rates.                                   

Command error:
  The max qual score of 93 was not detected. Using standard error fitting.                                                          
  The max qual score of 93 was not detected. Using standard error fitting.                                                          
  The max qual score of 93 was not detected. Using standard error fitting.  

We're working through the workflow using R+dada2 and a small data set to manually diagnose it and see if we can create a workaround.

@cjfields
Author

cjfields commented Feb 21, 2024

It looks like the issue above was a workflow problem (the task timed out; hence exit status 140). However, we are seeing increased loss of sequence data compared to past runs, and, interestingly enough, there are binning issues similar to those seen in the NovaSeq runs. I suspect this might stem from differences in estimated error, as the problem occurs after filterAndTrim. The error profile also looks problematic, with a dip similar to the one seen in NovaSeq runs.

Attached are the results from three random samples out of a larger 174-sample run, using plotQualityProfile (you can ignore the 'aggregate' labels; these are individual samples). There is plenty of read data per sample. I'll see if I can dig up some standard Sequel IIe runs for comparison.

aggregate-qualities.12.pdf
aggregate-qualities.27.pdf
aggregate-qualities.77.pdf

This is the error profile:

[image: error-rate profile plot]

Note the dip.

EDIT: I am a bit concerned about the 'flatness' of that error frequency line compared to Illumina and older PacBio data, like from @benjjneb 's past runs, e.g. https://benjjneb.github.io/LRASManuscript/LRASms_fecal.html.

@cjfields
Author

We just tried a run using the alternative error-estimation function provided by @jonalim and tested by @hhollandmoritz here: #1307 (comment) (option 4). This seems to help, though the data are pretty sparse; we're working on newer runs with more nbases to see if that improves anything.

[image: error-rate profile with the alternative estimation function]

@cjfields
Author

cjfields commented Feb 23, 2024

This is an updated plot with nbases=1e10

[image: updated error-rate profile, nbases=1e10]

I should mention that the PacBio workflow steps for Kinnex still appear to use the original PacBioErrfun. We are downloading their Zymo mock samples to see if the pattern above holds:

https://downloads.pacbcloud.com/public/dataset/Kinnex-16S/DATA-ZymoATCCMix-32plex-Revio/2-Demux/fastq_files/

I should also mention that these data (and our in-house data) are from the Revio; I'll update the title.

@cjfields changed the title from "PacBio Kinnex data" to "PacBio Revio Kinnex data" Feb 23, 2024
@cjfields
Author

cjfields commented Feb 23, 2024

Error profiles for the mock community samples from here:

https://downloads.pacbcloud.com/public/dataset/Kinnex-16S/DATA-ZymoATCCMix-32plex-Revio/2-Demux/fastq_files/

The original PacBio model, PacBioErrfun

code:

errs <- learnErrors(dereps, 
    nbases = 1e9, 
    errorEstimationFunction=PacBioErrfun, 
    BAND_SIZE=32, 
    multithread=36,
    verbose=TRUE)

[image: error-rate profile, PacBioErrfun]

Updated version

Using the code from #1307 (comment) (option 4):

suppressPackageStartupMessages(library(dada2))
suppressPackageStartupMessages(library(dplyr))

# see https://github.com/benjjneb/dada2/issues/1307
# and https://github.com/benjjneb/dada2/issues/1892

# THIS IS EXPERIMENTAL
loessErrfun_mod4 <- function(trans) {
  qq <- as.numeric(colnames(trans))
  est <- matrix(0, nrow=0, ncol=length(qq))
  for(nti in c("A","C","G","T")) {
    for(ntj in c("A","C","G","T")) {
      if(nti != ntj) {
        errs <- trans[paste0(nti,"2",ntj),]
        tot <- colSums(trans[paste0(nti,"2",c("A","C","G","T")),])
        rlogp <- log10((errs+1)/tot)  # 1 pseudocount for each err, but if tot=0 will give NA
        rlogp[is.infinite(rlogp)] <- NA
        df <- data.frame(q=qq, errs=errs, tot=tot, rlogp=rlogp)
        
        # original
        # ###! mod.lo <- loess(rlogp ~ q, df, weights=errs) ###!
        # mod.lo <- loess(rlogp ~ q, df, weights=tot) ###!
        # #        mod.lo <- loess(rlogp ~ q, df)
        
        # jonalim's solution
        # https://github.com/benjjneb/dada2/issues/938
        mod.lo <- loess(rlogp ~ q, df, weights = log10(tot), degree = 1, span = 0.95)
        
        pred <- predict(mod.lo, qq)
        maxrli <- max(which(!is.na(pred)))
        minrli <- min(which(!is.na(pred)))
        pred[seq_along(pred)>maxrli] <- pred[[maxrli]]
        pred[seq_along(pred)<minrli] <- pred[[minrli]]
        est <- rbind(est, 10^pred)
      } # if(nti != ntj)
    } # for(ntj in c("A","C","G","T"))
  } # for(nti in c("A","C","G","T"))
  
  # HACKY
  MAX_ERROR_RATE <- 0.25
  MIN_ERROR_RATE <- 1e-7
  est[est>MAX_ERROR_RATE] <- MAX_ERROR_RATE
  est[est<MIN_ERROR_RATE] <- MIN_ERROR_RATE
  
  # enforce monotonicity
  # https://github.com/benjjneb/dada2/issues/791
  estorig <- est
  est <- est %>%
    data.frame() %>%
    # funs() is defunct in current dplyr; across() is the modern equivalent
    mutate(across(everything(),
                  ~ case_when(. < X40 ~ X40,
                              TRUE ~ .))) %>%
    as.matrix()
  rownames(est) <- rownames(estorig)
  colnames(est) <- colnames(estorig)
  
  # Expand the err matrix with the self-transition probs
  err <- rbind(1-colSums(est[1:3,]), est[1:3,],
               est[4,], 1-colSums(est[4:6,]), est[5:6,],
               est[7:8,], 1-colSums(est[7:9,]), est[9,],
               est[10:12,], 1-colSums(est[10:12,]))
  rownames(err) <- paste0(rep(c("A","C","G","T"), each=4), "2", c("A","C","G","T"))
  colnames(err) <- colnames(trans)
  # Return
  return(err)
}

filts <- list.files('.', pattern="filtered.fastq.gz", full.names = TRUE)
sample.namesF <- sapply(strsplit(basename(filts), "_"), `[`, 1) # Assumes filename = samplename_XXX.fastq.gz
set.seed(100)

dereps <- derepFastq(filts, n=100000, verbose=TRUE)

# Learn error rates
errs <- learnErrors(dereps, 
    nbases = 1e9,
    errorEstimationFunction=loessErrfun_mod4, 
    BAND_SIZE=32, 
    multithread=36,
    verbose=2)

[image: error-rate profile, loessErrfun_mod4]

Conclusion

At the moment we're going with the second one, but we're about to have a meeting with some PacBio folks who can hopefully give more insight. @benjjneb any thoughts on this?

@benjjneb
Owner

benjjneb commented Mar 1, 2024

Visually the fit with the updated version looks much better. Without a deep understanding, my surface read is that the updated code might be more effectively working across Q scores with very low representation in the data, and thus the improved visual fitting could be explained by a better match between the algorithm and this sort of binned-Q data.

We are actually looking at PacBio Kinnex/Revio for a current project so this is very timely. I haven't ever seen real data from the platform before, but googling around I just found this: https://downloads.pacbcloud.com/public/dataset/Kinnex-16S/

@cjfields
Author

cjfields commented Mar 1, 2024

> Visually the fit with the updated version looks much better. Without a deep understanding, my surface read is that the updated code might be more effectively working across Q scores with very low representation in the data, and thus the improved visual fitting could be explained by a better match between the algorithm and this sort of binned-Q data.

Thx, but the original implementation was from the Illumina NovaSeq binning tests by @jonalim (which I think came from someone in his group...). That this works across binned data from different technologies is interesting in its own right.

> We are actually looking at PacBio Kinnex/Revio for a current project so this is very timely. I haven't ever seen real data from the platform before, but googling around I just found this: https://downloads.pacbcloud.com/public/dataset/Kinnex-16S/

Yep, that's part of the data I had tried here:

#1892 (comment)

32-sample Zymo mock communities. Similar pattern when we ran it. I noticed they also have Kinnex/Sequel data.

@cjfields
Author

Small update: based on feedback from our PacBio bioinformatics rep, it sounds like SMRT Link can be configured to generate full quality scores for the Revio, though these apparently top out at Q40 regardless (unlike the Sequel's Q93). For now it requires a custom run sheet; the default is to bin all data.

@benjjneb
Owner

Thanks for the updates. We have a call with PacBio about Kinnex tomorrow, and I'm going to reach out to them about their quality score approach going forward too.

@cjfields
Author

@benjjneb to add to this, we redid a Kinnex run but set it to generate the full quality range. Despite PacBio's indication that these would top out at Q40, they go all the way to Q93 (using the Kinnex correction above):

[image: error-rate profile, full quality range]

We're doing a second run-through, switching back to the standard PacBio model, which may work better in this case.

@cjfields
Author

cjfields commented Apr 5, 2024

As a final addition: if you use the full quality scores with the original PacBio error function in DADA2, it seems to work well:

[image: error-rate profile, full quality scores with PacBioErrfun]

Note that the qualities go all the way to Q93 again. However, I noticed that the dada() step takes a tremendous amount of time: approximately 3-4 days using 36 cores and 300 GB of memory for a full SMRT Cell with ~200 samples.

@jonalim

jonalim commented May 2, 2024

I appreciate this thread a whole lot, @cjfields! I had also noticed the reduced quality range on our first Revio Kinnex run. Unrelated to this, our flow cell was underloaded and the median read quality was less than it should have been, so I was confused about what was going on.

I'm actually unconvinced that the error model a lot of people have been attributing to me (with loess weights = log10(tot), span = 0.95, degree = 1) is better than the original, because it often yields a fit that underestimates the error rate of Q37 (Illumina) / Q40 (PacBio) bases, but I'll save that for another post.

@cjfields
Author

cjfields commented May 2, 2024

@jonalim agreed, and I think this is where having a decent mock data set would be useful for assessing this. There are ones available from PacBio, which we have tested and which I linked to above.

Also, re: the median quality issue, is it possible the data were not filtered to 99.9% accuracy? We've seen something like this when we were mistakenly given default PacBio output (99% accuracy), which did not work.
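For reference, those accuracy cutoffs map onto Phred scores via Q = -10 * log10(1 - accuracy); a base-R one-liner (illustrative only, not part of any workflow above):

```r
# Convert read accuracy to the equivalent Phred Q score.
acc_to_q <- function(acc) -10 * log10(1 - acc)
acc_to_q(c(0.99, 0.999))  # default HiFi cutoff (Q20) vs. the 99.9% cutoff (Q30)
```

So reads kept at the 99% default can carry up to ten times the error rate of reads filtered at 99.9%, which would plausibly shift the median quality and the learned error model.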
