PacBio Revio Kinnex data #1892

Open
cjfields opened this issue Feb 20, 2024 · 13 comments

@cjfields

cjfields commented Feb 20, 2024

We're just getting in our first standard 16S data from the PacBio Revio using the Kinnex kit, and will soon be getting 16S-ITS-23S data. I know the FASTQ data are now binned (similar to NovaSeq, I believe). Is anyone aware of issues with these? I'm mainly concerned about problems like the NovaSeq binning issues documented elsewhere.

I'm also wondering whether the kit itself (which concatenates the reads prior to sequencing, and then splits the resulting reads after sequencing based on PacBio adapters) may also affect the error profile.
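As a quick way to check whether a run is binned (a minimal base-R sketch, separate from the workflow; the quality string here is a toy example, not real Revio data): decode a Phred+33 quality line and count the distinct scores. Binned data will show only a handful of values.

```r
# Sketch: how many distinct Phred scores does a read carry?
# Binned data (NovaSeq, Revio) collapses to a few values; toy quality string below.
qline <- "II5I(5II"                          # Phred+33-encoded qualities
phred <- as.integer(charToRaw(qline)) - 33L  # ASCII code minus 33 = Q score
sort(unique(phred))                          # distinct Q scores present
```

Running this over the quality lines of a real FASTQ (e.g. via `ShortRead::readFastq`) shows the full set of bins the instrument emitted.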

@cjfields
Author

It does appear there are issues with learnErrors, at least when using our workflow. We run this script:

filts <- list.files('.', pattern="filtered.fastq.gz", full.names = TRUE)
sample.namesF <- sapply(strsplit(basename(filts), "_"), `[`, 1) # Assumes filename = samplename_XXX.fastq.gz
set.seed(100)

# Learn error rates
errs <- learnErrors(filts, 
    errorEstimationFunction=PacBioErrfun, 
    BAND_SIZE=32, 
    multithread=4)

pdf(paste0("R1",".err.pdf"))
plotErrors(errs, nominalQ=TRUE)
dev.off()

The error appears to be due to the binning. From our Nextflow output:

Command exit status:
  140

Command output:
  213565386 total bases in 147764 reads from 1 samples will be used for learning the error rates.                                   

Command error:
  The max qual score of 93 was not detected. Using standard error fitting.                                                          
  The max qual score of 93 was not detected. Using standard error fitting.                                                          
  The max qual score of 93 was not detected. Using standard error fitting.  

We're working through the workflow using R+dada2 and a small data set to manually diagnose it and see if we can create a workaround.

@cjfields
Author

cjfields commented Feb 21, 2024

It looks like the issue above was a workflow problem (the task timed out; hence exit status 140). However, we are seeing increased loss of sequence data compared to past runs, and, interestingly enough, there are binning issues similar to those seen in the NovaSeq runs. I suspect this might stem from differences in estimated error, as the problem occurs after filterAndTrim. The error profile also looks problematic, with a dip similar to the one seen in NovaSeq runs.

Attached are the results from three random samples out of a larger 174-sample run, using plotQualityProfile (you can ignore the 'aggregate' labels; these are individual samples). There is plenty of read data per sample. I'll see if I can dig up some standard Sequel IIe runs for comparison.

aggregate-qualities.12.pdf
aggregate-qualities.27.pdf
aggregate-qualities.77.pdf

This is the error profile:

[image: error-rate profile plot]

Note the dip.

EDIT: I am a bit concerned about the 'flatness' of that error frequency line compared to Illumina and older PacBio data, like from @benjjneb 's past runs, e.g. https://benjjneb.github.io/LRASManuscript/LRASms_fecal.html.

@cjfields
Author

We just tried a run using the alternative error-estimation function provided by @jonalim and tested by @hhollandmoritz here: #1307 (comment) (option 4). This seems to help, though the data are pretty sparse; we're working on newer runs with more nbases to see if that improves anything.

[image: error-rate profile with the alternative estimation function]

@cjfields
Author

cjfields commented Feb 23, 2024

This is an updated plot with nbases=1e10

[image: updated error-rate profile, nbases=1e10]

I should mention that the PacBio workflow steps for Kinnex still appear to use the original PacBioErrfun. We are downloading their Zymo mock samples to see if the pattern above holds:

https://downloads.pacbcloud.com/public/dataset/Kinnex-16S/DATA-ZymoATCCMix-32plex-Revio/2-Demux/fastq_files/

I should also mention that these data (and our in-house data) are from the Revio; I'll update the title.

@cjfields changed the title from "PacBio Kinnex data" to "PacBio Revio Kinnex data" Feb 23, 2024
@cjfields
Author

cjfields commented Feb 23, 2024

Error profiles for the mock community samples from here:

https://downloads.pacbcloud.com/public/dataset/Kinnex-16S/DATA-ZymoATCCMix-32plex-Revio/2-Demux/fastq_files/

The original PacBio model, PacBioErrfun

code:

errs <- learnErrors(dereps, 
    nbases = 1e9, 
    errorEstimationFunction=PacBioErrfun, 
    BAND_SIZE=32, 
    multithread=36,
    verbose=TRUE)

[image: error-rate profile, PacBioErrfun]

Updated version

Using the code from #1307 (comment) (option 4):

suppressPackageStartupMessages(library(dada2))
suppressPackageStartupMessages(library(dplyr))

# see https://github.com/benjjneb/dada2/issues/1307
# and https://github.com/benjjneb/dada2/issues/1892

# THIS IS EXPERIMENTAL
loessErrfun_mod4 <- function(trans) {
  qq <- as.numeric(colnames(trans))
  est <- matrix(0, nrow=0, ncol=length(qq))
  for(nti in c("A","C","G","T")) {
    for(ntj in c("A","C","G","T")) {
      if(nti != ntj) {
        errs <- trans[paste0(nti,"2",ntj),]
        tot <- colSums(trans[paste0(nti,"2",c("A","C","G","T")),])
        rlogp <- log10((errs+1)/tot)  # 1 pseudocount for each err, but if tot=0 will give NA
        rlogp[is.infinite(rlogp)] <- NA
        df <- data.frame(q=qq, errs=errs, tot=tot, rlogp=rlogp)
        
        # original
        # ###! mod.lo <- loess(rlogp ~ q, df, weights=errs) ###!
        # mod.lo <- loess(rlogp ~ q, df, weights=tot) ###!
        # #        mod.lo <- loess(rlogp ~ q, df)
        
        # jonalim's solution
        # https://github.com/benjjneb/dada2/issues/938
        mod.lo <- loess(rlogp ~ q, df, weights = log10(tot), degree = 1, span = 0.95)
        
        pred <- predict(mod.lo, qq)
        maxrli <- max(which(!is.na(pred)))
        minrli <- min(which(!is.na(pred)))
        pred[seq_along(pred)>maxrli] <- pred[[maxrli]]
        pred[seq_along(pred)<minrli] <- pred[[minrli]]
        est <- rbind(est, 10^pred)
      } # if(nti != ntj)
    } # for(ntj in c("A","C","G","T"))
  } # for(nti in c("A","C","G","T"))
  
  # HACKY
  MAX_ERROR_RATE <- 0.25
  MIN_ERROR_RATE <- 1e-7
  est[est>MAX_ERROR_RATE] <- MAX_ERROR_RATE
  est[est<MIN_ERROR_RATE] <- MIN_ERROR_RATE
  
  # enforce monotonicity
  # https://github.com/benjjneb/dada2/issues/791
  estorig <- est
  est <- est %>%
    data.frame() %>%
    # funs() is defunct in current dplyr; across() is the modern equivalent
    mutate(across(everything(),
                  ~ case_when(. < X40 ~ X40,
                              TRUE ~ .))) %>%
    as.matrix()
  rownames(est) <- rownames(estorig)
  colnames(est) <- colnames(estorig)
  
  # Expand the err matrix with the self-transition probs
  err <- rbind(1-colSums(est[1:3,]), est[1:3,],
               est[4,], 1-colSums(est[4:6,]), est[5:6,],
               est[7:8,], 1-colSums(est[7:9,]), est[9,],
               est[10:12,], 1-colSums(est[10:12,]))
  rownames(err) <- paste0(rep(c("A","C","G","T"), each=4), "2", c("A","C","G","T"))
  colnames(err) <- colnames(trans)
  # Return
  return(err)
}

filts <- list.files('.', pattern="filtered.fastq.gz", full.names = TRUE)
sample.namesF <- sapply(strsplit(basename(filts), "_"), `[`, 1) # Assumes filename = samplename_XXX.fastq.gz
set.seed(100)

dereps <- derepFastq(filts, n=100000, verbose=TRUE)

# Learn error rates
errs <- learnErrors(dereps, 
    nbases = 1e9,
    errorEstimationFunction=loessErrfun_mod4, 
    BAND_SIZE=32, 
    multithread=36,
    verbose=2)

[image: error-rate profile, loessErrfun_mod4]

Conclusion

At the moment we're going with the second one, but we're about to have a meeting with some PacBio folks who can hopefully give more insight. @benjjneb any thoughts on this?

@benjjneb
Owner

benjjneb commented Mar 1, 2024

Visually the fit with the updated version looks much better. Without a deep understanding, my surface read is that the updated code might be more effectively working across Q scores with very low representation in the data, and thus the improved visual fitting could be explained by a better match between the algorithm and this sort of binned-Q data.

We are actually looking at PacBio Kinnex/Revio for a current project so this is very timely. I haven't ever seen real data from the platform before, but googling around I just found this: https://downloads.pacbcloud.com/public/dataset/Kinnex-16S/

@cjfields
Author

cjfields commented Mar 1, 2024

> Visually the fit with the updated version looks much better. Without a deep understanding, my surface read is that the updated code might be more effectively working across Q scores with very low representation in the data, and thus the improved visual fitting could be explained by a better match between the algorithm and this sort of binned-Q data.

Thx, but the original implementation was from the Illumina NovaSeq binning tests by @jonalim (which I think came from someone in his group...). That this works across binned data from different technologies is interesting in its own right.

> We are actually looking at PacBio Kinnex/Revio for a current project so this is very timely. I haven't ever seen real data from the platform before, but googling around I just found this: https://downloads.pacbcloud.com/public/dataset/Kinnex-16S/

Yep, that's part of the data I had tried here:

#1892 (comment)

32-sample Zymo mock communities. Similar pattern when we ran it. I noticed they also have Kinnex/Sequel data.

@cjfields
Author

Small update: based on feedback from our PacBio bioinformatics rep, it sounds like SMRT Link can be configured to generate full quality scores for the Revio, though these apparently top out at Q40 regardless (unlike the Sequel's Q93). For now it requires a custom run sheet; the default is to bin all data.

@benjjneb
Owner

Thanks for the updates. We have a call with PacBio about Kinnex tomorrow, and I'm going to reach out to them about their quality score approach going forward too.

@cjfields
Author

@benjjneb to add to this, we redid a Kinnex run but set it to generate the full quality range. Despite PacBio's indication that these would top out at Q40, they go all the way to Q93 (using the Kinnex correction above):

[image: error-rate profile, full quality range]

We're doing a second run-through, switching back to the standard PacBio model, which may work better in this case.

@cjfields
Author

cjfields commented Apr 5, 2024

As a final addition: if you use the full quality scores with the original PacBio error function in DADA2, it seems to work well:

[image: error-rate profile, full quality scores with PacBioErrfun]

Note that the qualities go all the way to Q93 again. However, I noticed that the dada() step takes a tremendous amount of time: approximately 3-4 days using 36 cores and 300 GB of memory for a full SMRT Cell with ~200 samples.

@jonalim

jonalim commented May 2, 2024

I appreciate this thread a whole lot, @cjfields! I had also noticed the reduced quality range on our first Revio Kinnex run. Unrelated to this, our flow cell was underloaded and the median read quality was less than it should have been, so I was confused about what was going on.

I'm actually unconvinced that the error model a lot of people have been attributing to me (with loess weights = log10(tot), span = 0.95, degree = 1) is better than the original, because it often yields a fit that underestimates the error rate of Q37 (Illumina) / Q40 (PacBio) bases, but I'll save that for another post.

@cjfields
Author

cjfields commented May 2, 2024

@jonalim agreed, and I think this is where having a decent mock data set would be useful for assessing this. There are ones available from PacBio, which we have tested and which I linked to above.

Also, re: the median quality issue, is it possible the data were not filtered to 99.9% accuracy? We've seen something like this when we were mistakenly given default PacBio output (99% accuracy), which did not work.
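For reference, those accuracy cutoffs map onto Phred scores via Q = -10 * log10(1 - accuracy); a base-R one-liner (illustrative only, not part of any workflow above):

```r
# Convert read accuracy to the equivalent Phred Q score.
acc_to_q <- function(acc) -10 * log10(1 - acc)
acc_to_q(c(0.99, 0.999))  # default HiFi cutoff (Q20) vs. the 99.9% cutoff (Q30)
```

So reads kept at the 99% default can carry up to ten times the error rate of reads filtered at 99.9%, which would plausibly shift the median quality and the learned error model.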
