
Issue merging reads across sequence runs and diagnosing sequence batch effects PacBio ITS Sequel II samples #1900

Open
dawsonfairbanks opened this issue Mar 3, 2024 · 1 comment

@dawsonfairbanks

Hi, thank you for the great tools! I am having some issues with merging reads across sequencing runs. I am working with PacBio ITS Sequel II data. I've run each sequencing run individually through the DADA2 workflow, using cutadapt to trim primers and the same QC parameters throughout. The per-run tables were merged with mergeSequenceTables(), and I then removed chimeras on the joined sequence table and assigned taxonomy. I am getting strange output where samples cluster by sequencing run. The samples vary quite a bit in number of reads and read quality. I have 700+ samples from 2 sequencing centers across 6 sequencing runs.
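For reference, this is roughly the merge-and-downstream sequence of calls, a minimal sketch in which the per-run RDS file names and the taxonomy reference FASTA are hypothetical placeholders:

```r
library(dada2)

# Per-run sequence tables, each produced by dada() + makeSequenceTable()
# on an individual run (file names are placeholders)
st1 <- readRDS("seqtab_run1.rds")
st2 <- readRDS("seqtab_run2.rds")
st3 <- readRDS("seqtab_run3.rds")

# Merge the per-run tables, then remove chimeras once on the joined table
st.all <- mergeSequenceTables(st1, st2, st3)
seqtab.nochim <- removeBimeraDenovo(st.all, method = "consensus",
                                    multithread = TRUE)

# Assign taxonomy against a reference database (placeholder file name)
tax <- assignTaxonomy(seqtab.nochim, "unite_reference.fasta",
                      multithread = TRUE)
```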

Trimming parameters:

```r
filter_trackReads <- filterAndTrim(path.cut, filtFs,
                                   truncQ = 2, minQ = 3,
                                   minLen = 600, maxLen = 3000,
                                   maxEE = 2,  # set to 2
                                   multithread = TRUE,
                                   compress = TRUE, rm.phix = TRUE)
```

Cumulative expected errors across sequencing runs: [screenshot]

Read counts by run: [boxplot]

Pre-filter histogram of sequence lengths: [histogram]

Example error plot from one sequencing run (pretty similar across runs): [plot]

I've tried a number of different filtering parameters and have gotten the same results, varying maxEE from 2 to 6 along with minQ and minLen. I am wondering if anyone else has seen significant batch effects in their samples with PacBio sequences. When I group all of the samples together and run them through the workflow, the batch effect significantly diminishes. Does anyone else have experience like this? I know it is not recommended to run all samples together, because the error models estimated by learnErrors() can differ between runs, but are there cases when it might be necessary? What parameters can I adjust or consider to mitigate this effect?
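For context, the recommended per-run approach fits a separate error model for each run and only merges afterwards; a minimal sketch (the per-run file lists `filts_run1`/`filts_run2` are hypothetical) might look like:

```r
library(dada2)

# Hypothetical per-run lists of filtered FASTQ paths
runs <- list(run1 = filts_run1, run2 = filts_run2)

# Learn a PacBio-specific error model per run, denoise each run with
# its own model, then merge the per-run sequence tables
seqtabs <- lapply(runs, function(filts) {
  err <- learnErrors(filts, errorEstimationFunction = PacBioErrfun,
                     BAND_SIZE = 32, multithread = TRUE)
  dd  <- dada(filts, err = err, BAND_SIZE = 32, multithread = TRUE)
  makeSequenceTable(dd)
})
st.all <- do.call(mergeSequenceTables, unname(seqtabs))
```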

@cjfields

cjfields commented Mar 18, 2024

@dawsonfairbanks do you know if the sequences are Q20 (99% base accuracy) or Q30 (99.9% base accuracy)? Looking at the error plot it looks more like 99%, but worth checking. For comparison, here is one from ours that is 99.9% (note the difference in the error frequency scale):

[error plot image]

PacBio's default is 99%, but DADA2 works best with 99.9%. Your sequencing vendors should be able to provide both FASTQ and the PacBio BAM files; the run metrics, including per-read quality, are normally only in the BAM.

The BAM file can be parsed with `samtools` (or similar) to check the read quality, which is stored in the `rq` tag; `samtools view -c -e '[rq] < 0.999' my.bam` would count the number of reads below the expected DADA2 cutoff. For example:

```shell
$ samtools view -@ $SLURM_NTASKS -c -e '[rq] < 0.999' testbam.hifi_reads.bam
0
$ samtools view -@ $SLURM_NTASKS -c -e '[rq] >= 0.999' testbam.hifi_reads.bam
2265195
```
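If some reads do fall below the cutoff, one possible follow-up (a sketch with placeholder file names, assuming samtools >= 1.12 for the `-e` filter expression) is to subset the BAM to >= Q30 reads and regenerate FASTQ for DADA2:

```shell
# Keep only reads with rq >= 0.999 (approx. Q30) in a new BAM
samtools view -b -e '[rq] >= 0.999' my.hifi_reads.bam -o my.hifi_reads.q30.bam

# Regenerate gzipped FASTQ from the filtered BAM for the DADA2 workflow
samtools fastq my.hifi_reads.q30.bam | gzip > my.hifi_reads.q30.fastq.gz
```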
